functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise linear approximation of the tanh, hardtanh(x) = −1 for x < −1, x for −1 ≤ x ≤ 1, and 1 for x > 1, for which the integrals in the replica formula can be evaluated faster than for the tanh.
In the linear and hardtanh cases, the non-parametric methods follow the tendency of the replica estimate when σ is varied, but appear to systematically over-estimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also multivariate Gaussian, and therefore entropies can be computed directly in closed form (exact in the plot legend). When using the Kolchinsky estimate in the linear case we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of the T_ℓ and augmenting the resulting variance with σ²_noise, as done in [44] (Kolchinsky et al. parametric in the plot legend). In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights (linear approx in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a similar color to exact). Lastly, in the case of the ReLU–ReLU network, we note that the non-parametric methods predict an entropy increasing like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and accurately features a slope equal to half that of the linear-network entropy (1/2 linear approx in the plot legend). While non-parametric estimators are invaluable tools able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.
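For the linear-Gaussian case mentioned above, the closed-form entropy of a layer is simply that of a multivariate Gaussian, H = ½ log det(2πe Σ). A minimal sketch, assuming a single linear layer T = W X + ξ with Gaussian input covariance Σ_X and isotropic noise of standard deviation σ_noise (names and sizes are illustrative, not taken from the paper):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a multivariate Gaussian with covariance `cov`."""
    return 0.5 * np.linalg.slogdet(2.0 * np.pi * np.e * cov)[1]

def linear_layer_entropy(W, cov_x, sigma_noise):
    """Exact entropy of T = W X + xi for Gaussian X and isotropic Gaussian noise xi."""
    cov_t = W @ cov_x @ W.T + sigma_noise**2 * np.eye(W.shape[0])
    return gaussian_entropy(cov_t)

# toy check: 100-dimensional Gaussian input, 50-dimensional linear layer
rng = np.random.default_rng(0)
W = rng.standard_normal((50, 100)) / np.sqrt(100)
print(linear_layer_entropy(W, np.eye(100), sigma_noise=0.1))
```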


we are actually minimizing the cross-entropy during training (i.e. the gradients are computed based on the cross-entropy cost; the squared error is only computed for visualization).
The first observation that we can make is that natural conjugate gradient, which adds second-order information to natural gradient, does perform better than natural gradient (and it seems to outperform SGD as well, in terms of time). In particular, NatCG-L is doing particularly well. This algorithm relies on an off-the-shelf solver (in this case COBYLA) to find both the right step size and the next conjugate direction, as we proposed in Section 4.3.6. This provides evidence supporting our hypothesis that relying on the Polak–Ribière formula when implementing natural conjugate gradient (even if we reset the direction often) is harmful in practice. To apply any of these formulas one needs to compute the inner products between vectors belonging to different tangent spaces. If we ignore this fact and assume the metric does not change from one step to another (as NatCG-F does), the assumption will hurt learning. The metric matrix stays about the same only if one takes smallish steps.
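For reference, the Polak–Ribière update that the paragraph argues against mixes gradients from successive steps. A minimal sketch of the direction update, assuming a fixed (step-independent) metric so plain Euclidean inner products are used, which is exactly the simplification criticized above:

```python
import numpy as np

def polak_ribiere_direction(grad_new, grad_old, dir_old):
    """Conjugate direction update d_{k+1} = -g_{k+1} + beta_PR * d_k.

    Using plain Euclidean inner products implicitly assumes the metric does not
    change between steps -- the assumption the text identifies as harmful.
    """
    beta = grad_new @ (grad_new - grad_old) / (grad_old @ grad_old)
    beta = max(beta, 0.0)  # common restart rule: reset the direction when beta < 0
    return -grad_new + beta * dir_old

g_old = np.array([1.0, -2.0]); g_new = np.array([0.5, -1.0]); d_old = -g_old
print(polak_ribiere_direction(g_new, g_old, d_old))
```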


FIG. 5. Configuration space of the 4 × 4 Ising model. The colors represent the density of states.
to the number of example configurations provided during training. Exceptional accuracy is achieved at 27 000 training examples (1800 per class). Each dataset contains less than 20% of configuration space (some energy classes are over-sampled to fill the 1800-example quota). We trained our neural network architecture on each of these three datasets. The neural network was able to classify all but a handful of Ising configurations, on average. On one dataset, it achieved an accuracy of 100%. In all cases of misclassification, the error was only ±1 energy level, indicating the neural network is just barely failing to classify such examples. All misclassified configurations had energies near zero. In this region there is considerable variation due to the degeneracy of the Ising model (apparent in Fig. 5), and therefore predictions based on a uniform number of training examples per class are slightly more challenging. At the extreme energies (±32), individual configurations are repeated many times to fill the quota of training examples. It is worth noting again that this neural network had access to less than 20% of configuration space, so it is clearly correctly inferring information about examples it has not yet seen.
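For concreteness, the energy classes referred to above are values of the standard nearest-neighbour Ising energy. A minimal sketch computing it for a 4 × 4 configuration, assuming periodic boundary conditions (an assumption the excerpt does not state):

```python
import numpy as np

def ising_energy(spins):
    """Energy E = -sum over nearest-neighbour pairs of s_i * s_j, for a square
    lattice of +/-1 spins with periodic boundary conditions (assumed here)."""
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -np.sum(spins * right) - np.sum(spins * down)

config = np.ones((4, 4), dtype=int)   # all spins aligned
print(ising_energy(config))           # -32, one of the extreme energy levels
```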


1 Introduction
The tremendous empirical success of deep neural networks (DNN) for many machine learning tasks such as image classification and object recognition (Krizhevsky et al., 2017) contrasts with their relatively poor theoretical understanding. One feature commonly attributed to DNN to explain their performance is their ability to build hierarchical representations of the data, able to capture relevant information in the data at different scales (Bengio et al., 2013; Tishby and Zaslavsky, 2015; Mallat, 2012). An important idea to create good sets of representations is to reduce redundancy and increase diversity in the representation, an idea that can be traced back to early investigations about learning (Barlow, 1959) and that has been implemented in a variety of methods such as independent component analysis (Hyvärinen, 2013) or feature selection (Peng et al., 2005). Explicitly encouraging diversity has been shown to improve the performance of ensemble learning models (Kuncheva and Whitaker, 2003; Dietterich, 2000), and techniques have been proposed to limit redundancy in DNN by pruning units or connections (Hassibi and Stork, 1993; LeCun et al., 1990; Mariet and Sra, 2016) or by explicitly encouraging diversity between units of each layer during training (Cogswell et al., 2015; Desjardins et al., 2015; Rodríguez et al., 2016; Luo, 2017).


In Figure 2, we report the performance of our model when trained with ground-truth trees as input. It is encouraging to see that our recurrent-recursive encoder improves performance over Transformer (FAN) [159] and LSTM, especially for long sequences. The best performance on this dataset is given by TreeLSTM [16], which has access to the whole sequence and does not encode sequences on-the-fly. TreeLSTM significantly outperforms the other models, especially in the generalization tasks, i.e., those tasks where sequence lengths are greater than 6. Note that TreeLSTM here utilizes extra information at test time, i.e., the ground-truth parse tree. On the other hand, from the optimization point of view, our model faces a much harder optimization problem than TreeLSTM, in exchange for being auto-regressive and having a learnable structure. TreeLSTM is a purely recursive network, so it takes approximately O(log n) steps from the leaf nodes to the final root encoding, which makes it easy for the gradients to flow back to every node. For our model, however, the gradients are not as easy to propagate: although the shortest path from a given node to the final node is also O(log n) (travelling through the shortcuts), the longest path can be O(n) (traversing all the leaf nodes).


2 Ent(ν),    (1.5)
where W is the Brownian motion, P is the space of probability measures and the regularizer Ent is the relative entropy with respect to the Lebesgue measure, see e.g. [17]. Moreover, the marginal law of the process (1.4) converges to its invariant measure. As analyzed in the recent paper [16], this result is essentially due to the fact that the function ν ↦ ∫ F(a) ν(da) is convex (indeed linear). In the present paper we wish to apply a similar regularization to the optimal control problem. In order to do that, we first recall the relaxed formulation of the control problem (1.1). Instead of controlling the process α, we will control the flow of laws (ν_t)_{t∈[0,T]}. Then the


travel behavior through the change of tolls or subsidies [118, 53]. VOT, as one important instance of MRS, can be used to measure the monetary gain of saved time after the improvement of a transportation system in a benefit-cost analysis [118, 119].
Recently, researchers started to use machine learning models to analyze individual decisions. Karlaftis and Vlahogianni (2011) [65] summarized 86 studies in six transportation fields in which DNNs were applied. Researchers used DNNs to predict travel mode choice [26], car ownership [101], travel accidents [149], travelers' decision rules [132], driving behaviors [60], trip distribution [89], and traffic flows [104, 82, 142]. DNNs are also used to complement smartphone-based surveys [143], improve survey efficiency [115], and impute survey data [35]. In the studies that focus on prediction accuracy, researchers often compare many classifiers, including DNNs, support vector machines (SVM), decision trees (DT), random forests (RF), and DCMs, typically finding that DNNs and RF perform better than the classical DCMs [106, 98, 113, 48, 26]. In other fields, researchers have also found superior performance of DNNs in prediction compared to all other machine learning (ML) classifiers [38, 72]. Besides high prediction power, DNNs are powerful due to their versatility, as they are able to accommodate various information formats such as images, videos, and text [76, 73, 61].


Table 1. Word Error Rate (%) for the 11 shows obtained using the GMM-HMM and DNN-HMM KATS systems. Average row: 41908, 22.4 (GMM-HMM), 17.1 (DNN-HMM).
Since 2012, deep learning has shown excellent results in many domains: image recognition, speech recognition, language modelling, parsing, information retrieval, speech synthesis, translation, autonomous cars, gaming, etc. In this article, we presented deep neural networks for speech recognition: different architectures and training procedures for acoustic and language models were visited. Using our speech recognition system, we compared GMM and DNN acoustic models. In the framework of broadcast news transcription, we showed that the DNN-HMM acoustic model decreases the word error rate dramatically compared to the classical GMM-HMM acoustic model (a significant 24% relative improvement).

This thesis focuses on computational strategies aimed at reducing CNN computational complexity and/or increasing accuracy on a specific task, i.e. image classification.
Our first goal is to increase the accuracy on a specific task using efficient transfer learning with a new method for feature selection based on information theory. We pursued a number of different strategies towards this goal, where the primary methodology focuses on analyzing the deep CNN in terms of probability and information theory, based on Chaddad et al. (2017, 2019), as described in 2.1. The general methodology is to model the neural network as a probabilistic Bayes network, where the information at any point in the network is modeled as a distribution conditional upon inputs and filtering operations. Conditional entropy is then used to identify class-informative features throughout the network for the task of image classification. Specifically, the output of a network layer is considered as a random variable 𝑌, defined by a distribution 𝑝(𝑌 |𝐶, 𝐹) conditioned on the object class 𝐶 and filter 𝐹. The conditional entropy 𝐻(𝑌 |𝐶, 𝐹), introduced as CENT, provides compact class-informative features used to achieve higher classification accuracy.
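A minimal sketch of the conditional-entropy idea described above: estimate H(Y | C, F) for one filter's activations by histogramming them separately within each class. The binning and the use of discrete entropy are illustrative assumptions, not the thesis' exact estimator:

```python
import numpy as np

def entropy(p):
    """Discrete entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(activations, labels, bins=32):
    """Estimate H(Y | C) for one filter: histogram the activations per class,
    then average the per-class entropies weighted by class frequency."""
    edges = np.histogram_bin_edges(activations, bins=bins)
    h = 0.0
    for c in np.unique(labels):
        a_c = activations[labels == c]
        counts, _ = np.histogram(a_c, bins=edges)
        h += (len(a_c) / len(activations)) * entropy(counts / counts.sum())
    return h

# toy usage: 1000 samples, 10 classes, one filter's scalar response per sample
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
activations = labels + 0.5 * rng.standard_normal(1000)  # class-informative filter
print(conditional_entropy(activations, labels))          # low H(Y|C) -> informative
```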


need to drop some phase information. Results are reported in Table 2.1. More discussion about phase information encoding is presented in section 2.7.7.
2.4.2. Automatic Music Transcription
In this section we present results for the automatic music transcription (AMT) task. The nature of an audio signal allows one to exploit complex operations as presented earlier in the paper. The experiments were performed on the MusicNet dataset [51]. For computational efficiency we resampled the input from the original 44.1 kHz down to 11 kHz using the algorithm described in [52]. This sampling rate is sufficient to recognize the frequencies present in the dataset while reducing the computational cost dramatically. We modeled each of the 84 notes present in the dataset with independent sigmoids (since notes can fire simultaneously). We initialized the bias of the last layer to −5 to reflect the distribution of silent/non-silent notes. As in the baseline, we performed experiments on the raw signal and on the frequency spectrum. For complex experiments with the raw signal, we considered its imaginary part to be zero. When using the spectrum input we used its complex representation (instead of only the magnitudes, as usual for AMT) for both the real and complex models. For the real model, we considered the real and imaginary components of the spectrum as separate channels.
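A minimal sketch of the output layer described above: 84 independent sigmoid outputs (multi-label, since notes can sound simultaneously) with the final bias initialized to −5. The hidden size and the use of PyTorch are illustrative assumptions:

```python
import torch
import torch.nn as nn

class NoteHead(nn.Module):
    """Multi-label output head: one independent sigmoid per note."""
    def __init__(self, hidden_dim=512, n_notes=84):
        super().__init__()
        self.out = nn.Linear(hidden_dim, n_notes)
        nn.init.constant_(self.out.bias, -5.0)  # most notes are silent most of the time

    def forward(self, features):
        return torch.sigmoid(self.out(features))

head = NoteHead()
probs = head(torch.randn(8, 512))   # batch of 8 feature vectors
loss = nn.functional.binary_cross_entropy(probs, torch.zeros(8, 84))
```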


For the audio-only speech separation category the use of deep learning techniques has also gained growing interest in recent years. Our work falls into this category of methods. Huang et al. (2014a) were the earliest to use a deep learning approach to modeling monaural speech separation. They combine a feed-forward and a recurrent network that are jointly optimized with a soft masking function. Concomitantly, in a closely-related work, Du et al. (2014) proposed a neural network to estimate the log power spectrum of the target speakers. Hershey et al. (2015) proposed a deep clustering approach to speech separation. The basic idea is to learn high-dimensional embeddings of the mixture signals; standard clustering techniques then use the obtained embeddings to separate the speech targets. The deep attractor network suggested by Chen et al. (2016) is an extension of the deep clustering approach. The network also creates so-called "attractors" to better cluster time-frequency points dominated by different speakers. The aforementioned approaches estimate only the magnitude of the STFTs and reconstruct the time-domain signal from it. Similarly to our work, other papers have recently proposed to integrate phase information within a speech separation system. The work by Erdogan et al. (2015), for instance, proposes to train a deep neural network with a phase-sensitive loss. Another noteworthy attempt has been described in Wang et al. (2018), where the neural network still estimates the magnitude of the spectrum, but the time-domain speech signals are retrieved directly. Further, another trend, instead of explicitly integrating phase information, performs speech separation directly in the time domain, as described in Venkataramani and Smaragdis (2018). Likewise, the TasNet architectures proposed in Luo and Mesgarani (2017) and Luo and Mesgarani (2018) accomplish speech separation using the mixed time signal as input. TasNet directly models the mixture waveform using an encoder-decoder framework, and performs the separation on the output of the encoder.


5 Generative Stochastic Networks
We introduce a novel training principle for generative probabilistic models that is an alternative to maximum likelihood. The proposed Generative Stochastic Networks (GSN) framework generalizes Denoising Auto-Encoders (DAE) and is based on learning the transition operator of a Markov chain whose stationary distribution estimates the data distribution. The transition distribution is a conditional distribution that generally involves a small move, so it has fewer dominant modes and is unimodal in the limit of small moves. This simplifies the learning problem, making it less like density estimation and more akin to supervised function approximation, with gradients that can be obtained by backprop. The theorems provided here give a probabilistic interpretation for denoising auto-encoders and generalize them; seen in the context of this framework, auto-encoders that learn with injected noise are a special case of GSNs and can be interpreted as generative models. The theorems also provide an interesting justification for dependency networks and generalized pseudolikelihood, and define an appropriate joint distribution and sampling mechanism even when the conditionals are not consistent. GSNs can be used with missing inputs and can be used to sample subsets of variables given the rest. Experiments validating these theoretical results are conducted on both synthetic datasets and image datasets. The experiments employ a particular architecture that mimics the Deep Boltzmann Machine Gibbs sampler but that allows training to proceed with backprop, through a recurrent neural network with noise injected inside, and without the need for layerwise pretraining.
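To make the Markov-chain view concrete, here is a minimal sketch of GSN-style sampling with a denoising auto-encoder as the transition operator: corrupt the current sample, reconstruct it, and repeat. The `corrupt` and `reconstruct` functions stand in for a trained model and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.5):
    """C(X_tilde | X): simple additive Gaussian corruption (an assumption here)."""
    return x + noise_std * rng.standard_normal(x.shape)

def reconstruct(x_tilde):
    """P_theta(X | X_tilde): stands in for a trained denoiser; here a toy shrinkage."""
    return 0.9 * x_tilde

def gsn_sample_chain(x0, n_steps=100):
    """Run the Markov chain whose stationary distribution estimates the data distribution."""
    samples, x = [x0], x0
    for _ in range(n_steps):
        x = reconstruct(corrupt(x))   # one transition: corrupt, then denoise
        samples.append(x)
    return samples

chain = gsn_sample_chain(np.zeros(10))
```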


This paper deals with a class of compact neural networks: deep networks in which all weight matrices are either diagonal or circulant matrices. To our knowledge, training such networks with a large number of layers had not been done before. We also endowed this kind of model with theoretical guarantees, hence enriching and refining previous theoretical work from the literature. More importantly, we showed that deep circulant networks outperform their competing structured alternatives, including the very recent general approach based on low-displacement-rank matrices. Our results suggest that stacking circulant layers with nonlinearities improves the convergence rate and the final accuracy of the network. Formally proving these statements constitutes the future direction of this work.
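Since a circulant matrix is diagonalized by the discrete Fourier transform, a circulant layer can be applied in O(n log n) with an FFT instead of storing a dense n × n matrix. A minimal sketch; the layer composition and names are illustrative, not the paper's exact architecture:

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column `c` with vector `x`,
    using the FFT identity C x = ifft(fft(c) * fft(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def diag_circ_layer(d, c, x):
    """One structured layer: diagonal scaling, circulant mixing, then a ReLU."""
    return np.maximum(0.0, circulant_matvec(c, d * x))

# sanity check against the dense circulant matrix
rng = np.random.default_rng(0)
n = 8
c, x = rng.standard_normal(n), rng.standard_normal(n)
dense = np.array([np.roll(c, j) for j in range(n)]).T   # column j = c shifted by j
assert np.allclose(dense @ x, circulant_matvec(c, x))
```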


† Université de Lorraine, LORIA, UMR 7503, Vandœuvre-lès-Nancy, F-54506, France
‡ CNRS, LORIA, UMR 7503, Vandœuvre-lès-Nancy, F-54506, France
E-mails: {aditya.nugraha, antoine.liutkus, emmanuel.vincent}@inria.fr
Abstract—This article addresses the problem of multichannel music separation. We propose a framework where the source spectra are estimated using deep neural networks and combined with spatial covariance matrices to encode the source spatial characteristics. The parameters are estimated in an iterative expectation-maximization fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework for the task of music separation on a large dataset. Experimental results show that the method we describe performs consistently well in separating singing voice and other instruments from realistic musical mixtures.
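For context, the multichannel Wiener filter mentioned in the abstract combines, for each time-frequency bin, the source power spectra v_j and spatial covariance matrices R_j. A minimal sketch of the filtering step for one bin; variable names and the toy inputs are illustrative:

```python
import numpy as np

def multichannel_wiener_filter(v, R, x):
    """Estimate each source image for one time-frequency bin.

    v: (J,) source power spectra, R: (J, I, I) spatial covariance matrices,
    x: (I,) mixture STFT vector over the I channels.
    """
    cov_sources = v[:, None, None] * R        # per-source spatial covariances
    cov_mix = cov_sources.sum(axis=0)         # mixture covariance
    inv_mix = np.linalg.inv(cov_mix)
    # Wiener gain W_j = v_j R_j (sum_k v_k R_k)^{-1}, applied to the mixture
    return np.array([cj @ inv_mix @ x for cj in cov_sources])

# toy usage: 2 sources, 2 channels
rng = np.random.default_rng(0)
v = np.array([1.0, 0.5])
R = np.stack([np.eye(2, dtype=complex)] * 2)
x = rng.standard_normal(2) + 1j * rng.standard_normal(2)
print(multichannel_wiener_filter(v, R, x))
```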

∏_{l=1}^{L} ‖W^l‖, which is an upper bound on the Lipschitz constant of the network, has also been proposed as a regularization that promotes robustness. It accounts for the overall Lipschitz regularity of the network and also acts as an overall control on the contraction power of the network, by coupling layers and allowing some weights to grow in some layers as long as weights in other layers get smaller to compensate. When Q_W = W W^⊤, its Frobenius norm and the gradient of the Lipschitz constant can be explicitly derived and integrated into the backpropagation scheme and chain rule of gradients in order to optimize the augmented loss during the training phase. However, for the spectral norm, approximation methods are necessary and the gradient has to be computed using numerical differentiation techniques. In Appendix B, we discuss several available approximation methods, and in the next section we propose to carry out experiments with these various regularization strategies and evaluate their respective impact on the robustness properties of the network.
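The standard way to approximate the spectral norm mentioned above is power iteration; a minimal sketch (the iteration count is an arbitrary choice):

```python
import numpy as np

def spectral_norm(W, n_iter=50):
    """Approximate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

W = np.random.default_rng(1).standard_normal((64, 32))
print(spectral_norm(W), np.linalg.norm(W, 2))   # the two values should closely agree
```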


4.1.1. Influence of the size of the DNN
In this paragraph, we consider a fixed number of 15 input frames (so the size of the input layer of our DNN is 15 × 30 = 450) and we compare the performances as a function of the number of hidden layers and the number of neurons per hidden layer. For simplicity, we have considered the same number of neurons for each hidden layer. In Figure 2 we display the recognition rate as a function of these parameters. Overall, DNNs perform better when the number of hidden layers and neurons is larger, but it can be seen that when the number of neurons is small, increasing the number of hidden layers does not necessarily improve the performance. The worst recognition rate (80%) is obtained for 2 hidden layers of 50 neurons, and the best performance (91.6%) is obtained with 5 hidden layers of 500 neurons. For reasons of space, we have not reproduced the confusion matrix associated with this configuration. Roughly speaking, all classes have a recognition rate greater than 80%, except the quiet street and the pedestrian street classes, which have recognition rates of 66.66% and 75%, respectively. These classes are mainly confused with the shop and market classes. For larger DNNs, we have not observed a major improvement. Indeed, for a DNN with 7 hidden layers and 1000 neurons, the recognition rate is 92.2%.
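A minimal sketch of the architecture grid described above: a fully connected network with a 450-dimensional input (15 frames × 30 features) and a configurable number of hidden layers and neurons per layer. The activation, the output size, and the framework choice are illustrative assumptions:

```python
import torch.nn as nn

def build_dnn(n_hidden_layers, n_neurons, n_inputs=450, n_classes=19):
    """Fully connected DNN with identical hidden layers, as in the grid search."""
    layers, in_dim = [], n_inputs
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_dim, n_neurons), nn.ReLU()]
        in_dim = n_neurons
    layers.append(nn.Linear(in_dim, n_classes))
    return nn.Sequential(*layers)

# e.g. the best configuration reported above: 5 hidden layers of 500 neurons
model = build_dnn(n_hidden_layers=5, n_neurons=500)
```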


I. INTRODUCTION
Deep neural networks are currently the most widely investigated architecture in Artificial Intelligence (AI) systems, with incredible achievements in image recognition, automatic translation, and the games of Go or Poker. Unfortunately, when operated on central or graphics processing units (CPUs or GPUs), they consume considerable energy, in particular due to the intensive data exchanges between processors and memory [1,2]. Neural networks using in-memory computing (iMC) with RRAM are widely proposed as a solution to the von Neumann bottleneck [1]. However, RRAMs are prone to variability [3], and using Error Correcting Codes (ECC) as in more standard memories would ruin the benefits of iMC. ECCs indeed require large decoding circuits [4], which would need to be replicated multiple times in the case of iMC. This last point is the key challenge that we have to face for reliable neural networks on large RRAM memory arrays. In this paper, an experimental RRAM array with a differential memory bit-cell (2T2R) based on HfO2 devices, including all peripherals and a differential sensing

In addition, we compare these representations along with both long short-term memory networks (LSTM) and convolutional neural networks (CNN) for prediction of five i…
