Entropy and mutual information in models of deep neural networks

functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise linear approximation of the tanh, with hardtanh(x) = −1 for x < −1, x for −1 < x < 1, and 1 for x > 1, for which the integrals in the replica formula can be evaluated faster than for the tanh. In the linear and hardtanh cases, the non-parametric methods follow the tendency of the replica estimate as σ is varied, but appear to systematically over-estimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also multivariate Gaussian, and therefore entropies can be computed directly in closed form (exact in the plot legend). When using the Kolchinsky estimate in the linear case we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of T^ℓ and augmenting the resulting variance by σ_noise^2, as done in [44] (Kolchinsky et al. parametric in the plot legend). In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights (linear approx in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a color similar to exact). Lastly, in the case of the ReLU-ReLU network, we note that the non-parametric methods predict an entropy that increases like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and correctly features a slope equal to half of the linear-network entropy (1/2 linear approx in the plot legend). While non-parametric estimators are invaluable tools able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.
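For the linear case mentioned above (the "exact" curves), the closed-form entropy follows from the fact that a linear layer with additive Gaussian noise maps a Gaussian input to a Gaussian output, whose differential entropy is ½ log det(2πe Σ). The sketch below illustrates this computation; the layer shapes and the noise level are arbitrary placeholders, not values taken from the paper.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a zero-mean multivariate Gaussian."""
    d = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def noisy_linear_layer_entropy(W, input_cov, sigma_noise):
    """Entropy of T = W x + eps, with x ~ N(0, input_cov) and eps ~ N(0, sigma_noise^2 I)."""
    cov_T = W @ input_cov @ W.T + sigma_noise**2 * np.eye(W.shape[0])
    return gaussian_entropy(cov_T)

# toy check: 100-dimensional Gaussian input, 50-unit linear layer
rng = np.random.default_rng(0)
W = rng.normal(scale=1 / np.sqrt(100), size=(50, 100))
print(noisy_linear_layer_entropy(W, np.eye(100), sigma_noise=0.1))
```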

On Recurrent and Deep Neural Networks

we are actually minimizing the cross-entropy during training (i.e. the gradients are computed based on the cross-entropy cost; the squared error is only computed for visualization). The first observation that we can make is that natural conjugate gradient, which adds second-order information to natural gradient, does perform better than natural gradient (and it seems to outperform SGD as well, in terms of time). In particular, NatCG-L does particularly well. This algorithm relies on an off-the-shelf solver (in this case COBYLA) to find both the right step size and the next conjugate direction, as we proposed in Section 4.3.6. This provides evidence supporting our hypothesis that relying on the Polak–Ribière formula when implementing natural conjugate gradient (even if we reset the direction often) is harmful in practice. To apply any of these formulas one needs to compute the inner products between vectors belonging to different tangent spaces. If we ignore this fact and assume the metric does not change from one step to another (as NatCG-F does), this assumption will hurt learning. The metric matrix stays approximately the same only if one takes sufficiently small steps.
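For concreteness, the Polak–Ribière update discussed here is sketched below in its standard Euclidean form; this is only an illustrative sketch, not the thesis's NatCG implementation, which would have to account for the metric changing between steps.

```python
import numpy as np

def polak_ribiere_beta(grad_new, grad_old):
    """Standard (Euclidean) Polak-Ribiere coefficient; the thesis's argument is
    that for natural conjugate gradient these inner products should instead use
    the Riemannian metric, which changes from one step to the next."""
    beta = grad_new @ (grad_new - grad_old) / (grad_old @ grad_old)
    return max(beta, 0.0)  # the common "PR+" variant resets negative betas to 0

def next_direction(grad_new, grad_old, prev_direction):
    """Conjugate search direction d_k = -g_k + beta_k * d_{k-1}."""
    return -grad_new + polak_ribiere_beta(grad_new, grad_old) * prev_direction

g_old, g_new = np.array([1.0, 0.5]), np.array([0.4, -0.2])
print(next_direction(g_new, g_old, prev_direction=-g_old))
```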

Deep neural networks for direct, featureless learning through observation: the case of two-dimensional spin models

[FIG. 5 caption: Configuration space of the 4 × 4 Ising model; the colors represent the density of states.] Classification accuracy is reported as a function of the number of example configurations provided during training. Exceptional accuracy is achieved at 27 000 training examples (1800 per class). Each dataset contains less than 20% of configuration space (some energy classes are over-sampled to fill the 1800-example quota). We trained our neural network architecture on each of these three datasets. On average, the neural network was able to classify all but a handful of Ising configurations; on one dataset it achieved an accuracy of 100%. In all cases, misclassified examples are off by only ±1 energy level, indicating the neural network is just barely failing to classify such examples. All misclassified configurations had energies near zero. In this region there is considerable variation due to the degeneracy of the Ising model (apparent in Fig. 5), and therefore predictions based on a uniform number of training examples per class are slightly more challenging. At the extreme energies (±32), individual configurations are repeated many times to fill the quota of training examples. It is worth noting again that this neural network had access to less than 20% of configuration space, so it is clearly inferring correct information about examples it has not yet seen.
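The energy classes referred to above come from the standard nearest-neighbour Ising Hamiltonian. A minimal sketch of computing the energy of a 4 × 4 configuration is given below, assuming periodic boundaries, coupling J = 1 and no external field; these conventions are assumptions, though they are consistent with the quoted ±32 energy extremes.

```python
import numpy as np

def ising_energy(spins):
    """Energy E = -sum_<ij> s_i s_j of a square lattice of +1/-1 spins,
    with periodic boundary conditions and coupling J = 1."""
    right = np.roll(spins, -1, axis=1)   # right neighbour of each site
    down = np.roll(spins, -1, axis=0)    # bottom neighbour of each site
    return -np.sum(spins * (right + down))

# 4x4 periodic lattice has 32 bonds, so energies lie between -32 and +32
all_up = np.ones((4, 4), dtype=int)
print(ising_energy(all_up))              # -32, the ferromagnetic ground state

rng = np.random.default_rng(0)
random_cfg = rng.choice([-1, 1], size=(4, 4))
print(ising_energy(random_cfg))
```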

Adaptive structured noise injection for shallow and deep neural networks

1 Introduction The tremendous empirical success of deep neural networks (DNN) for many machine learning tasks such as image classification and object recognition (Krizhevsky et al., 2017) contrasts with their relatively poor theoretical understanding. One feature commonly attributed to DNN to explain their performance is their ability to build hierarchical representations of the data, able to capture relevant information in the data at different scales (Bengio et al., 2013; Tishby and Zaslavsky, 2015; Mallat, 2012). An important idea to create good sets of representations is to reduce redundancy and increase diversity in the representation, an idea that can be traced back to early investigations about learning (Barlow, 1959) and that has been implemented in a variety of methods such as independent component analysis (Hyvärinen, 2013) or feature selection (Peng et al., 2005). Explicitly encouraging diversity has been shown to improve the performance of ensemble learning models (Kuncheva and Whitaker, 2003; Dietterich, 2000), and techniques have been proposed to limit redundancy in DNN by pruning units or connections (Hassibi and Stork, 1993; LeCun et al., 1990; Mariet and Sra, 2016) or by explicitly encouraging diversity between units of each layer during training (Cogswell et al., 2015; Desjardins et al., 2015; Rodríguez et al., 2016; Luo, 2017).

Deep neural networks for natural language processing and its acceleration

Figure 2 reports the performance of our model when trained with ground-truth trees as input. It is encouraging to see that our recurrent-recursive encoder improves performance over the Transformer (FAN) [159] and the LSTM, especially for long sequences. The best performance on this dataset is given by the TreeLSTM [16], which has access to the whole sequence and does not encode sequences on-the-fly. The TreeLSTM significantly outperforms the other models, especially in the generalization tasks, i.e., those tasks where sequence lengths are greater than 6. Note that the TreeLSTM here uses extra information at test time, namely the ground-truth parse tree. On the other hand, from an optimization point of view, our model faces a much harder optimization problem than the TreeLSTM, in exchange for being auto-regressive and having a learnable structure. The TreeLSTM is a purely recursive network, so it takes approximately O(log n) steps from the leaf nodes to the final root encoding, which makes it easy for gradients to flow back to every node. For our model, however, the gradients are not as easy to propagate: although the shortest path from a given node to the final node is also O(log n) (travelling through the shortcuts), the longest path can be O(n) (traversing all the leaf nodes).
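A toy illustration of the path-length argument above, under the simplifying assumption of a perfectly balanced composition tree versus a purely sequential chain (the real encoders are more involved):

```python
import math

def chain_path_length(n):
    """Gradient path length in a purely sequential encoder: the first token's
    contribution to the final state passes through every subsequent step."""
    return n - 1

def balanced_tree_path_length(n):
    """Gradient path length from any leaf to the root of a balanced binary
    composition tree, as in a recursive (TreeLSTM-style) encoder."""
    return math.ceil(math.log2(n))

for n in (8, 64, 512):
    print(n, chain_path_length(n), balanced_tree_path_length(n))
```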

Mean-field Langevin System, Optimal Control and Deep Neural Networks

2 Ent(ν), (1.5) where W is the Brownian motion, P is the space of probability measures and the regularizer Ent is the relative entropy with respect to the Lebesgue measure, see e.g. [17]. Moreover, the marginal law of the process (1.4) converges to its invariant measure. As analyzed in the recent paper [16], this result is basically due to the fact that the function ν ↦ ∫ F(a) ν(da) is convex (indeed linear). In the present paper we wish to apply a similar regularization to the optimal control problem. In order to do that we first recall the relaxed formulation of the control problem (1.1). Instead of controlling the process α, we will control the flow of laws (ν_t)_{t∈[0,T]}. Then the
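As background for the kind of dynamics referenced in (1.4), here is a minimal Euler–Maruyama sketch of an overdamped Langevin diffusion dθ_t = −∇F(θ_t) dt + σ dW_t, whose invariant measure is proportional to exp(−2F/σ²). The potential, step size and parameters are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def langevin_sample(grad_F, theta0, sigma=1.0, dt=1e-3, n_steps=10_000, seed=0):
    """Euler-Maruyama discretisation of d theta = -grad F(theta) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=theta.shape)
        theta = theta - grad_F(theta) * dt + sigma * np.sqrt(dt) * noise
    return theta

# quadratic potential F(theta) = |theta|^2 / 2: the invariant law is N(0, (sigma^2 / 2) I)
print(langevin_sample(lambda th: th, theta0=np.zeros(2), sigma=0.5))
```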

Deep neural networks for choice analysis

travel behavior through the change of tolls or subsidies [118, 53]. VOT, as one important instance of MRS, can be used to measure the monetary gain of saved time after the improvement of a transportation system in a benefit-cost analysis [118, 119]. Recently, researchers started to use machine learning models to analyze individual decisions. Karlaftis and Vlahogianni (2011) [65] summarized 86 studies in six transportation fields in which DNNs were applied. Researchers used DNNs to predict travel mode choice [26], car ownership [101], travel accidents [149], travelers' decision rules [132], driving behaviors [60], trip distribution [89], and traffic flows [104, 82, 142]. DNNs are also used to complement smartphone-based surveys [143], improve survey efficiency [115], and impute survey data [35]. In studies that focus on prediction accuracy, researchers often compare many classifiers, including DNNs, support vector machines (SVM), decision trees (DT), random forests (RF), and DCMs, typically finding that DNNs and RF perform better than the classical DCMs [106, 98, 113, 48, 26]. In other fields, researchers have also found the prediction performance of DNNs to be superior to that of all other machine learning (ML) classifiers [38, 72]. Besides high prediction power, DNNs are powerful due to their versatility, as they are able to accommodate various information formats such as images, videos, and text [76, 73, 61].

New Paradigm in Speech Recognition: Deep Neural Networks

Table 1 (average row: 41908 | 22.4 | 17.1). Word Error Rate (%) for the 11 shows obtained using the GMM-HMM and DNN-HMM KATS systems. Since 2012, deep learning has shown excellent results in many domains: image recognition, speech recognition, language modelling, parsing, information retrieval, speech synthesis, translation, autonomous cars, gaming, etc. In this article, we presented deep neural networks for speech recognition: different architectures and training procedures for acoustic and language models were visited. Using our speech recognition system, we compared GMM and DNN acoustic models. In the framework of broadcast news transcription, we showed that the DNN-HMM acoustic model decreases the word error rate dramatically compared to the classical GMM-HMM acoustic model (a significant 24% relative improvement).
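For reference, the quoted 24% relative improvement is consistent with the averages in Table 1, assuming 22.4% is the GMM-HMM word error rate and 17.1% the DNN-HMM word error rate: (22.4 − 17.1) / 22.4 ≈ 0.237, i.e. roughly a 24% relative reduction.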

2020 — Modeling information flow through deep convolutional neural networks

This thesis focuses on computational strategies aimed at reducing CNN computational complexity and/or increasing accuracy on a specific task, namely image classification. Our first goal is to increase accuracy on this task using efficient transfer learning with a new method for feature selection based on information theory. A number of different strategies were explored toward this goal, where the primary methodology focused on analyzing the deep CNN in terms of probability and information theory, based on Chaddad et al. (2017, 2019), as described in Section 2.1. The general methodology is to model the neural network as a probabilistic Bayes network, where information at any point in the network is modeled as a distribution conditional upon inputs and filtering operations. Conditional entropy is then used to identify class-informative features throughout the network for the task of image classification. Specifically, the output of a network layer is considered as a random variable Y, defined by a distribution p(Y|C,F) conditioned on the object class C and filter F. The conditional entropy H(Y|C,F), introduced as CENT, yields feature codes used to achieve higher classification accuracy.
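A minimal sketch of the kind of conditional-entropy computation described above, using histogram estimates over a discretised layer output; the binning scheme, toy data and variable names are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(y, c, n_bins=16):
    """Estimate H(Y | C) for a scalar feature y by discretising it into n_bins
    bins and averaging the per-class entropies, weighted by P(C = c)."""
    edges = np.histogram_bin_edges(y, bins=n_bins)
    h = 0.0
    for cls in np.unique(c):
        y_cls = y[c == cls]
        counts, _ = np.histogram(y_cls, bins=edges)
        h += (len(y_cls) / len(y)) * entropy(counts / counts.sum())
    return h

# toy example: a feature whose mean depends on the class has low H(Y|C)
rng = np.random.default_rng(0)
c = rng.integers(0, 3, size=5000)
y = c * 2.0 + rng.normal(scale=0.3, size=5000)
print(conditional_entropy(y, c))
```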

Applications of complex numbers to deep neural networks

need to drop some phase information. Results are reported in Table 2.1. More discussion about phase-information encoding is presented in Section 2.7.7. 2.4.2. Automatic Music Transcription In this section we present results for the automatic music transcription (AMT) task. The nature of an audio signal allows one to exploit complex operations as presented earlier in the paper. The experiments were performed on the MusicNet dataset [51]. For computational efficiency we resampled the input from the original 44.1 kHz to 11 kHz using the algorithm described in [52]. This sampling rate is sufficient to recognize the frequencies present in the dataset while dramatically reducing the computational cost. We modeled each of the 84 notes present in the dataset with independent sigmoids (since notes can fire simultaneously). We initialized the bias of the last layer to −5 to reflect the distribution of silent/non-silent notes. As in the baseline, we performed experiments on the raw signal and on the frequency spectrum. For complex experiments with the raw signal, we set its imaginary part to zero. When using the spectrum input we used its complex representation (instead of only the magnitudes, as is usual for AMT) for both the real and complex models. For the real model, we considered the real and imaginary components of the spectrum as separate channels.
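A minimal PyTorch sketch of the multi-label output head described above: 84 independent sigmoid outputs with the final bias initialised to −5, trained with a binary cross-entropy loss. The input feature size and the dummy batch are placeholder assumptions, not details from the thesis.

```python
import torch
import torch.nn as nn

N_NOTES = 84          # notes modelled with independent sigmoids
FEATURE_DIM = 512     # placeholder size of the penultimate representation

head = nn.Linear(FEATURE_DIM, N_NOTES)
nn.init.constant_(head.bias, -5.0)   # reflects the rarity of active notes

criterion = nn.BCEWithLogitsLoss()   # applies the sigmoid internally

features = torch.randn(8, FEATURE_DIM)   # dummy batch of encoder outputs
targets = torch.zeros(8, N_NOTES)        # multi-hot note labels
loss = criterion(head(features), targets)
loss.backward()
```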

Stabilizing and Enhancing Learning for Deep Complex and Real Neural Networks

For audio-only speech separation, the use of deep learning techniques has also gained growing interest in recent years. Our work falls into this category of methods. Huang et al. (2014a) were the earliest to use a deep learning approach for monaural speech separation. They combine a feed-forward and a recurrent network that are jointly optimized with a soft masking function. Concomitantly, in a closely related work, Du et al. (2014) proposed a neural network to estimate the log power spectrum of the target speakers. Hershey et al. (2015) proposed a deep clustering approach to speech separation. The basic idea is to learn a high-dimensional embedding of the mixture signals; standard clustering techniques then use the obtained embedding to separate the speech targets. The deep attractor network suggested by Chen et al. (2016) is an extension of the deep clustering approach. The network also creates so-called "attractors" to better cluster time-frequency points dominated by different speakers. The aforementioned approaches estimate only the magnitude of the STFTs and then reconstruct the time-domain signal. Similarly to our work, other papers have recently proposed to integrate phase information within a speech separation system. The work by Erdogan et al. (2015), for instance, proposes to train a deep neural network with a phase-sensitive loss. Another noteworthy attempt is described in Wang et al. (2018), where the neural network still estimates the magnitude of the spectrum, but the time-domain speech signals are retrieved directly. Further, another trend, instead of explicitly integrating phase information, performs speech separation directly in the time domain, as described in Venkataramani and Smaragdis (2018). Likewise, the TasNet architectures proposed in Luo and Mesgarani (2017) and Luo and Mesgarani (2018) accomplish speech separation using the mixed time signal as input. TasNet directly models the mixture waveform using an encoder-decoder framework, and performs the separation on the output of the encoder.

Auto-Encoders, Distributed Training and Information Representation in Deep Neural Networks

5 Generative Stochastic Networks We introduce a novel training principle for generative probabilistic models that is an alternative to maximum likelihood. The proposed Generative Stochastic Networks (GSN) framework generalizes Denoising Auto-Encoders (DAE) and is based on learning the transition operator of a Markov chain whose stationary distribution estimates the data distribution. The transition distribution is a conditional distribution that generally involves a small move, so it has fewer dominant modes and is unimodal in the limit of small moves. This simplifies the learning problem, making it less like density estimation and more akin to supervised function approximation, with gradients that can be obtained by backprop. The theorems presented here provide a probabilistic interpretation for denoising auto-encoders and generalize them; seen in the context of this framework, auto-encoders that learn with injected noise are a special case of GSNs and can be interpreted as generative models. The theorems also provide an interesting justification for dependency networks and generalized pseudolikelihood, and define an appropriate joint distribution and sampling mechanism even when the conditionals are not consistent. GSNs can be used with missing inputs and can be used to sample subsets of variables given the rest. Experiments validating these theoretical results are conducted on both synthetic datasets and image datasets. The experiments employ a particular architecture that mimics the Deep Boltzmann Machine Gibbs sampler but that allows training to proceed with backprop, through a recurrent neural network with noise injected inside, and without the need for layerwise pretraining.
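A minimal sketch of the alternate corrupt-then-reconstruct sampling loop that such a framework implies: repeatedly apply the corruption and the learned reconstruction (transition) operator so that the chain's samples approximate the data distribution. The corruption level and the toy reconstruction function are placeholders, not the architecture used in the thesis.

```python
import numpy as np

def gsn_chain(reconstruct, x0, noise_std=0.5, n_steps=100, seed=0):
    """Run a denoising-autoencoder Markov chain: corrupt, then reconstruct.
    `reconstruct` is assumed to be a trained denoising function x_tilde -> x_hat."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_tilde = x + rng.normal(scale=noise_std, size=x.shape)  # corruption C(x_tilde | x)
        x = reconstruct(x_tilde)                                 # learned transition P(x | x_tilde)
        samples.append(x.copy())
    return np.stack(samples)

# toy reconstruction operator: shrink towards the data mean (here, zero)
samples = gsn_chain(lambda z: 0.8 * z, x0=np.zeros(10))
print(samples.shape)
```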

On the Expressive Power of Deep Fully Circulant Neural Networks

This paper deals with a class of compact neural networks: deep networks in which all weight matrices are either diagonal or circulant matrices. To our knowledge, training such networks with a large number of layers had not been done before. We also endowed this kind of model with theoretical guarantees, hence enriching and refining previous theoretical work from the literature. More importantly, we showed that deep circulant networks outperform their competing structured alternatives, including the very recent general approach based on low-displacement-rank matrices. Our results suggest that stacking circulant layers with non-linearities improves the convergence rate and the final accuracy of the network. Formally proving these statements constitutes a direction for future work.
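For context, the appeal of circulant layers is that a circulant matrix is fully described by its first column and can be applied to a vector in O(n log n) time via the FFT. A minimal NumPy sketch of this standard identity follows; it is illustrative and not the paper's implementation.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by the vector x,
    using the diagonalisation of circulant matrices by the Fourier basis."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# check against an explicitly constructed circulant matrix
n = 8
rng = np.random.default_rng(0)
c, x = rng.normal(size=n), rng.normal(size=n)
C = np.column_stack([np.roll(c, k) for k in range(n)])  # column k is c shifted by k
assert np.allclose(C @ x, circulant_matvec(c, x))
```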

Multichannel Music Separation with Deep Neural Networks

† Université de Lorraine, LORIA, UMR 7503, Vandœuvre-lès-Nancy, F-54506, France ‡ CNRS, LORIA, UMR 7503, Vandœuvre-lès-Nancy, F-54506, France E-mails: {aditya.nugraha, antoine.liutkus, emmanuel.vincent}@inria.fr Abstract—This article addresses the problem of multichannel music separation. We propose a framework where the source spectra are estimated using deep neural networks and combined with spatial covariance matrices to encode the source spatial characteristics. The parameters are estimated in an iterative expectation-maximization fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework for the task of music separation on a large dataset. Experimental results show that the method we describe performs consistently well in separating singing voice and other instruments from realistic musical mixtures.
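As a reference point for the combination described in the abstract, here is a minimal sketch of a multichannel Wiener filter for one time–frequency bin: each source j is modelled by a scalar spectral variance v_j (e.g. produced by a DNN) and a spatial covariance matrix R_j, and the filter is W_j = v_j R_j (Σ_k v_k R_k)^{-1}. The shapes and values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def multichannel_wiener(v, R, x):
    """Separate one time-frequency bin of an I-channel mixture x.
    v: (J,) source spectral variances; R: (J, I, I) spatial covariance matrices."""
    mix_cov = sum(vj * Rj for vj, Rj in zip(v, R))   # covariance of the mixture
    return np.stack([vj * Rj @ np.linalg.solve(mix_cov, x) for vj, Rj in zip(v, R)])

# toy example: 2 sources, 2 channels, a single STFT bin
I, J = 2, 2
rng = np.random.default_rng(0)
R = np.stack([np.eye(I, dtype=complex) for _ in range(J)])
v = np.array([1.0, 0.25])
x = rng.normal(size=I) + 1j * rng.normal(size=I)
estimates = multichannel_wiener(v, R, x)   # (J, I) source image estimates
print(estimates.sum(axis=0))               # filters sum to identity, so this equals x
```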

Probabilistic Robustness Estimates for Deep Neural Networks

The product ∏_{l=1}^{L} ‖W_l‖, which is an upper bound on the Lipschitz constant of the network, has also been proposed as a regularizer that promotes robustness. It accounts for the overall Lipschitz regularity of the network and also acts as an overall control on the contraction power of the network by coupling layers, allowing some weights to grow in some layers as long as weights in other layers shrink to compensate. When Q_W = W W^⊤, its Frobenius norm and the gradient of the Lipschitz constant can be explicitly derived and integrated into the backpropagation scheme and the chain rule of gradients in order to optimize the augmented loss during the training phase. For the spectral norm, however, approximation methods are necessary and gradients have to be computed using numerical differentiation techniques. In Appendix B, we discuss several available approximation methods, and in the next section we carry out experiments with these various regularization strategies and evaluate their respective impact on the robustness properties of the network.
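One common approximation for the spectral norm ‖W‖₂ (the largest singular value) is power iteration; a minimal sketch follows, which is not necessarily the method discussed in the paper's Appendix B.

```python
import numpy as np

def spectral_norm(W, n_iter=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)   # converges to sigma_max(W)

W = np.random.default_rng(1).normal(size=(64, 128))
print(spectral_norm(W), np.linalg.norm(W, 2))  # the two values should agree closely
```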

Deep neural networks for audio scene recognition

4.1.1. Influence of the size of the DNN In this paragraph, we consider a fixed number of 15 input frames (so the size of the input layer of our DNN is 15 × 30 = 450) and we compare the performance as a function of the number of hidden layers and the number of neurons per hidden layer. For simplicity, we use the same number of neurons for each hidden layer. Figure 2 displays the recognition rate as a function of these parameters. Overall, DNNs perform better when the number of hidden layers and neurons is larger, but it can be seen that when the number of neurons is small, increasing the number of hidden layers does not necessarily improve the performance. The worst recognition rate (80%) is obtained for 2 hidden layers of 50 neurons and the best performance (91.6%) is obtained with 5 hidden layers of 500 neurons. For reasons of space, we have not reproduced the confusion matrix associated with this configuration. Roughly speaking, all classes have a recognition rate greater than 80%, except the quiet street and pedestrian street classes, which have recognition rates of 66.66% and 75%, respectively. These classes are mainly confused with the shop and market classes. For larger DNNs, we have not observed a major improvement. Indeed, for a DNN with 7 hidden layers and 1000 neurons, the recognition rate is 92.2%.
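A minimal PyTorch sketch of the kind of fully-connected architecture being compared here: 450 inputs, a configurable number of hidden layers and neurons, and one output per audio scene class. The class count and the ReLU activation are assumptions, not details taken from the paper.

```python
import torch.nn as nn

def make_dnn(n_hidden_layers, n_neurons, n_inputs=15 * 30, n_classes=19):
    """Fully-connected classifier with n_hidden_layers hidden layers of n_neurons each.
    n_classes is a placeholder count of audio scene classes."""
    layers, width = [], n_inputs
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(width, n_neurons), nn.ReLU()]
        width = n_neurons
    layers.append(nn.Linear(width, n_classes))   # class scores (softmax applied in the loss)
    return nn.Sequential(*layers)

best = make_dnn(n_hidden_layers=5, n_neurons=500)   # the best configuration reported above
print(best)
```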

Mutual information, Fisher information and population coding


In-Memory and Error-Immune Differential RRAM Implementation of Binarized Deep Neural Networks

I. INTRODUCTION Deep neural networks are currently the most widely investigated architecture in Artificial Intelligence (AI) systems, with incredible achievements in image recognition, automatic translation, and games such as Go or Poker. Unfortunately, when operated on central or graphics processing units (CPUs or GPUs), they consume considerable energy, in particular due to the intensive data exchanges between processors and memory [1,2]. Neural networks using in-memory computing (iMC) with RRAM are widely proposed as a solution to the von Neumann bottleneck [1]. However, RRAMs are prone to variability [3], and using Error Correcting Codes (ECC) as in more standard memories would ruin the benefits of iMC. ECCs indeed require large decoding circuits [4], which would need to be replicated multiple times in the case of iMC. This last point is the key challenge that we have to face for reliable neural networks on large RRAM memory arrays. In this paper, an experimental RRAM array with a differential memory bit-cell (2T2R) based on HfO2 devices, including all peripherals and a differential sensing

Unsupervised Layer-Wise Model Selection in Deep Neural Networks

and Sebag Michèle. Abstract. Deep Neural Networks (DNN) propose a new and efficient ML architecture based on the layer-wise building of several representation layers. A critical issue for DNNs remains model selection, e.g. selecting the number of neurons in each DNN layer. The hyper-parameter search space increases exponentially with the number of layers, making the popular grid-search approach used for finding good hyper-parameter values intractable. The question investigated in this paper is whether the unsupervised, layer-wise methodology used to train a DNN can be extended to model selection as well. The proposed approach, considering an unsupervised criterion, empirically examines whether model selection is a modular optimization problem that can be tackled in a layer-wise manner. Preliminary results on the MNIST data set suggest the answer is positive. Further, some unexpected results regarding the optimal size of layers depending on the training process are reported and discussed.

Clinical event prediction and understanding with deep neural networks

In addition, we compare these representations along with both long short-term memory networks (LSTM) and convolutional neural networks (CNN) for prediction of five i[r]
