small datasets

Top PDF documents for "small datasets":

Mixture of Markov trees for Bayesian network structure learning with small datasets in high dimensional space

where Pa_{S_i}(X_p) is the parent variable of X_p in the tree structure S_i. Several versions of the Markov tree mixture learning algorithm described in Algorithm 2 were proposed in [2, 4] as an alternative to classical methods for density estimation in the context of high-dimensional spaces and small datasets: mixtures of tree structures generated in a totally randomized fashion, and ensembles of optimal trees derived from bootstrap replicas of the dataset by the Chow and Liu algorithm [9] (i.e. bagging of Markov trees). In [4, 3], we also proposed three sub-quadratic heuristics to approximate the optimal tree and thus to construct a mixture of trees in sub-quadratic time. Our best heuristic (the inertial search heuristic) has complexity n log(n) log(n log(n)). These works yielded fruitful results for density estimation in terms of scalability and efficiency. However, the result of these methods is a mixture of several models and therefore cannot directly provide a single model that can be graphically visualized and interpreted.
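As background, a minimal sketch of the bagging-of-Markov-trees idea (Chow-Liu trees learned on bootstrap replicas), assuming discrete data in a NumPy array; this is an illustration, not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): bagging of Chow-Liu trees for
# density estimation over discrete variables, using NumPy, scikit-learn and NetworkX.
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def chow_liu_tree(data):
    """Return the maximum-weight spanning tree over pairwise mutual information."""
    n_vars = data.shape[1]
    g = nx.Graph()
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            g.add_edge(i, j, weight=mutual_info_score(data[:, i], data[:, j]))
    return nx.maximum_spanning_tree(g)

def bagged_markov_trees(data, n_trees=10, seed=None):
    """Learn an ensemble of Chow-Liu trees from bootstrap replicas of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        boot = data[rng.integers(0, len(data), size=len(data))]
        trees.append(chow_liu_tree(boot))
    return trees  # mixture weights are typically uniform (1/n_trees)
```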

Classification of Small Datasets: Why Using Class-Based Weighting Measures?

Abstract. In text classification, providing an efficient classifier even when the number of documents involved in the learning step is small remains an important issue. In this paper we evaluate the performance of traditional classification methods in order to better characterize their limitations in the learning phase when dealing with a small number of documents. We then propose a new way of weighting the features used for classification. These weighted features have been integrated into two well-known classifiers, Class-Feature-Centroid and Naïve Bayes, and evaluations have been performed on two real datasets. We have also investigated the influence of parameters such as the number of classes, documents, or words on the classification. Experiments have shown the efficiency of our proposal relative to state-of-the-art classification methods. Whether with very little data or with the small number of features that can be extracted from content-poor documents, we show that our approach performs well.
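The abstract does not spell out the proposed weighting. As a generic illustration of what a class-based feature weight looks like (frequency of a term inside a class, damped by how many classes contain the term), here is a minimal sketch; the formula and names are assumptions for illustration, not the paper's scheme:

```python
# Illustrative class-based weighting sketch (not the paper's formula): a term is
# weighted up if it is frequent inside a class and appears in few other classes.
import math
from collections import defaultdict

def class_based_weights(docs, labels):
    """docs: list of token lists; labels: list of class ids. Returns w[(term, class)]."""
    docs_per_class = defaultdict(int)
    df = defaultdict(int)                 # (term, class) -> document frequency
    classes_with_term = defaultdict(set)  # term -> set of classes containing it
    for tokens, c in zip(docs, labels):
        docs_per_class[c] += 1
        for t in set(tokens):
            df[(t, c)] += 1
            classes_with_term[t].add(c)
    n_classes = len(docs_per_class)
    return {(t, c): (df[(t, c)] / docs_per_class[c])
                    * math.log(n_classes / len(classes_with_term[t]))
            for (t, c) in df}
```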

L6DNet: Light 6 DoF Network for Robust and Precise Object Pose Estimation with Small Datasets

…accuracy grows with the number of patches extracted. This balance makes our method suitable for a wide range of applications: the flexibility it brings lets the user tune the extraction stride to better meet the application needs. To decrease inference time we retrained a light 2D detection algorithm (namely tiny YOLOv3 [34] with a darknet backbone) on the driller. As the training set is very small, the estimated bounding box is coarse but sufficiently precise to reduce the inference time reported in Table IV. As we can see, using a stride of 12 to 16 we can reach real-time inference while losing little accuracy. To speed up the voting step we chose to use only 3 clusters, selected to minimize their respective variance. We also report inference times and accuracy for different strides in Fig. 7.
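A minimal sketch of the described cluster-based vote aggregation (grouping per-patch pose hypotheses and keeping the lowest-variance clusters); the initial number of clusters and the use of KMeans are assumptions, not details from the paper:

```python
# Illustrative sketch (assumptions, not the paper's code): aggregate per-patch
# translation votes by clustering and keeping the tightest (lowest-variance) clusters.
import numpy as np
from sklearn.cluster import KMeans

def aggregate_votes(translations, n_clusters=8, n_keep=3):
    """translations: (N, 3) array of per-patch translation hypotheses."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(translations)
    variances = []
    for k in range(n_clusters):
        members = translations[km.labels_ == k]
        variances.append(members.var(axis=0).sum() if len(members) > 1 else np.inf)
    keep = np.argsort(variances)[:n_keep]      # the n_keep lowest-variance clusters
    mask = np.isin(km.labels_, keep)
    return translations[mask].mean(axis=0)     # fused translation estimate
```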

Sound event detection in remote health care - Small learning datasets and over constrained Gaussian Mixture Models

…the problem of how to obtain useful GMM-based PDF approximations, even when datasets are too small. Our approach is greatly simplified if we take a broad view of model regularization, under which Parzen models with Gaussian kernels are regarded as over-regularized GMMs, as explained in Section II. Signal segmentation is explained in Section III whereas, in Section IV, we gather experimental evidence that the trade-off between the model's degrees of freedom and the amount of data available for model adaptation may be the key to useful probabilistic classifiers, even with very small datasets. Finally, in Section V, we briefly analyze our claims as a contribution to improving remote healthcare applications.
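A minimal sketch of the two ends of that regularization spectrum on the same data, using scikit-learn; the toy dataset, component count and bandwidth are assumptions for illustration, not values from the paper:

```python
# Illustrative sketch: the same 1-D data fitted with a constrained GMM and with a
# Parzen window (Gaussian KDE), seen here as the over-regularized end of the spectrum.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 30), rng.normal(1, 1.0, 30)])[:, None]

# Over-constrained GMM: few components, spherical covariances with a variance floor.
gmm = GaussianMixture(n_components=2, covariance_type="spherical",
                      reg_covar=1e-1).fit(x)

# Parzen model: one Gaussian kernel per sample, a single bandwidth to choose.
parzen = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(x)

grid = np.linspace(-4, 4, 200)[:, None]
print(gmm.score_samples(grid)[:3], parzen.score_samples(grid)[:3])  # log-densities
```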

Biologically-plausible learning algorithms can scale to large datasets

1 Department of Molecular and Cellular Biology, Harvard University; 2 Center for Brains, Minds, and Machines, MIT. Abstract. The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this “weight transport problem” (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP's weight symmetry requirements and demonstrate learning capabilities comparable to those of BP on small datasets. However, a recent study by Bartunov et al. (2018) evaluated variants of target propagation (TP) and feedback alignment (FA) on the MNIST, CIFAR, and ImageNet datasets, and found that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on the ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018), and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.
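A minimal sketch of how the three feedback schemes differ in a toy two-layer network (NumPy only; layer sizes and learning rate are arbitrary assumptions, not the paper's setup):

```python
# Illustrative sketch (not the authors' code): backward pass where the feedback
# matrix is W^T (backprop), a fixed random matrix (feedback alignment), or
# sign(W) (sign-symmetry), which shares signs but not magnitudes with W.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 4
W1 = rng.normal(0, 0.1, (d_hid, d_in))
W2 = rng.normal(0, 0.1, (d_out, d_hid))
B_fa = rng.normal(0, 0.1, (d_out, d_hid))   # fixed random feedback (FA)

def step(W1, W2, x, y, feedback="sign", lr=0.01):
    h = np.maximum(0, W1 @ x)                # ReLU hidden layer
    e = W2 @ h - y                           # output error
    if feedback == "bp":
        B = W2.T                             # exact transpose (backprop)
    elif feedback == "sign":
        B = np.sign(W2).T                    # sign-symmetry feedback
    else:
        B = B_fa.T                           # feedback alignment
    dh = (B @ e) * (h > 0)                   # error sent back through feedback path
    return W1 - lr * np.outer(dh, x), W2 - lr * np.outer(e, h)

# Example update: W1, W2 = step(W1, W2, rng.normal(size=d_in), rng.normal(size=d_out))
```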

COMMET: comparing and combining multiple metagenomic datasets

Methods have been proposed to compare metagenomes without using any a priori knowledge. These de novo methods use global features such as GC content [12], genome size [13] or sequence signatures [14]. These methods face limitations, as they rely on rough, imprecise criteria and only compute a similarity distance: they do not extract the elements that are similar between samples. We believe it is possible to go further by comparing metagenomic samples at the read sequence level. This provides a more precise distance and, importantly, it provides the reads that are similar between datasets or specific to a single dataset, enabling their later analysis: assembly with better coverage or comparison with other metagenomic samples. Such comparisons may be performed using Blast [15]- or Blat [16]-like tools. Unfortunately, these methods do not scale up to large comparative metagenomic studies in which hundreds of millions of reads have to be compared to other hundreds of millions of reads. For instance, one can estimate that comparing a hundred metagenomes, each composed of a hundred million reads of length 100, would require centuries of CPU computation. The crAss approach [17] constructs a reference metagenome by cross-assembling the reads of all samples. It then maps the initial reads onto the resulting contigs and derives several measures based on the repartition of mapped reads. This method provides results of high quality. However, due to its assembly and mapping approach, it does not scale up to large metagenomic datasets. Simpler methods such as TriageTools [18] or Compareads [19] measure the sequence similarity of a read with a databank by counting the number of k-mers (words of length k) shared with the databank. Due to memory consumption, TriageTools cannot use k values larger than 15 and is thus limited to small datasets (a few hundred thousand reads of length 100). The Compareads tool scales up to large datasets with a small memory footprint and acceptable running time. However, applied to large metagenomic projects, this tool generates a large number of sizeable intermediate result files. In practice, applying Compareads to N datasets generates N²…
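A minimal sketch of the k-mer-sharing criterion these tools rely on, using exact sets rather than the Bloom-filter structures the real tools use; k and the threshold t are illustrative assumptions:

```python
# Illustrative sketch (not the COMMET/Compareads code): keep the reads of one
# sample that share at least `t` k-mers with another sample.
def kmers(seq, k=33):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_reads(sample_a, sample_b, k=33, t=2):
    """Return reads of sample_a sharing >= t k-mers with sample_b."""
    bank = set()
    for read in sample_b:                 # index all k-mers of sample_b
        bank |= kmers(read, k)
    return [r for r in sample_a
            if sum(km in bank for km in kmers(r, k)) >= t]

# Real tools replace the exact set with a Bloom filter to keep memory small.
```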

Latency-Based Anycast Geolocation: Algorithms, Software, and Datasets

[Fig. 1: synoptic of (a) the anycast measurement scenario and (b) anycast instance detection via latency measurements.] …on an anycast host can be assumed, soliciting responses on specific transport-layer ports (e.g., UDP 53 for DNS or TCP 80 for CDN) would likely only obtain service-specific responses (i.e., conditioned on the availability of that anycast service on the target under test). Conversely, ICMP-based latency measurements are not affected by this per-service bias. We further discuss the quality of latency samples in Sec. IV-B and the impact that latency noise has on iGreedy in Sec. VII-C. Finally, in terms of (v) overhead, we remark that the amount of probe traffic in our datasets is much lower than what is considered in recent studies, which range from 20k vantage points [11] to responses solicited from about 300k recursive DNS resolvers plus 60k Netalyzr data points [6]. In our framework, the algorithms employ as few as 1/100 of the Netalyzr (or 1/1000 of the recursive DNS) data points: while challenging, our results show that fairly complete enumeration and correct geolocation are achievable even with few latency samples.
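For context, a minimal sketch of the standard latency-to-distance bound that latency-based geolocation builds on; the fiber propagation speed of roughly 2/3 the speed of light is a common assumption, and this is background, not the iGreedy code:

```python
# Illustrative sketch: an RTT sample from a vantage point constrains the target
# inside a disc whose radius is bounded by signal propagation speed in fiber.
def max_distance_km(rtt_ms, fiber_speed_km_per_ms=200.0):
    """Upper bound on great-circle distance implied by a round-trip time."""
    return (rtt_ms / 2.0) * fiber_speed_km_per_ms

print(max_distance_km(10))   # a 10 ms RTT bounds the target within ~1000 km
```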

ICA-based sparse feature recovery from fMRI datasets

Abstract. Spatial Independent Component Analysis (ICA) is increasingly used in the context of functional Magnetic Resonance Imaging (fMRI) to study cognition and brain pathologies. Salient features present in some of the extracted Independent Components (ICs) can be interpreted as brain networks, but the segmentation of the corresponding regions from ICs is still ill-controlled. Here we propose a new ICA-based procedure for the extraction of sparse features from fMRI datasets. Specifically, we introduce a new thresholding procedure that controls the deviation from isotropy in the ICA mixing model. Unlike current heuristics, our procedure guarantees an exact, possibly conservative, level of specificity in feature detection. We evaluate the sensitivity and specificity of the method on synthetic and fMRI data and show that it outperforms state-of-the-art approaches.
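A minimal sketch of a generic spatial-ICA-plus-threshold pipeline, using a simple z-score cutoff rather than the isotropy-controlled threshold the paper introduces; the array layout and parameters are assumptions:

```python
# Illustrative sketch (generic, not the paper's procedure): spatial ICA on an
# (n_timepoints, n_voxels) fMRI matrix, then a per-component z-score threshold
# to obtain sparse spatial maps.
import numpy as np
from sklearn.decomposition import FastICA

def sparse_ic_maps(X, n_components=20, z_thresh=3.0):
    ica = FastICA(n_components=n_components)
    sources = ica.fit_transform(X.T).T           # (n_components, n_voxels) maps
    z = (sources - sources.mean(1, keepdims=True)) / sources.std(1, keepdims=True)
    return np.where(np.abs(z) > z_thresh, sources, 0.0)   # sparse maps
```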

Comparing functional connectivity based predictive models across datasets

…classifier with ℓ2 penalization. III. Experiments on multiple datasets. A. Datasets. We apply our classification pipeline to three rs-fMRI datasets. Models built from connectivity features predict various clinical outcomes (neuro-degenerative and neuro-psychiatric disorders, drug abuse impact). The first dataset is from the Center for Biomedical Research Excellence (COBRE) [18]. The pipeline predicts the schizophrenia diagnosis of the subjects. The second dataset is the Alzheimer's Disease Neuroimaging Initiative (ADNI) [19]. We discriminate the Alzheimer's Disease (AD) group from the Mild Cognitive Impairment (MCI) group. The third dataset is the Addiction Connectome Preprocessed Initiative (ACPI)¹, where we discriminate Mar…
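A minimal sketch of that kind of pipeline (correlation-based connectivity features fed to an ℓ2-penalized logistic regression); the ROI time-series layout and regularization strength are assumptions, not the paper's settings:

```python
# Illustrative sketch (not the paper's exact code): vectorized functional
# connectivity features classified with l2-penalized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def connectivity_features(timeseries_list):
    """timeseries_list: list of (n_timepoints, n_rois) arrays, one per subject."""
    feats = []
    for ts in timeseries_list:
        corr = np.corrcoef(ts.T)              # (n_rois, n_rois) correlation matrix
        iu = np.triu_indices_from(corr, k=1)  # keep the upper triangle only
        feats.append(corr[iu])
    return np.array(feats)

# X = connectivity_features(subject_timeseries); y = diagnostic labels
# clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
# print(cross_val_score(clf, X, y, cv=5).mean())
```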

Testing the Attraction Effect on Two Information Visualization Datasets

More extensive information on the attraction effect, the related literature, the motivations for this work, and the rationale for the methods employed can be found in the main article (Dimara et al., 2016). 2. Design Rationale. In the first experiment, "Gyms", reported in (Dimara et al., 2016), we found evidence for an attraction effect in scatterplots by replicating a standard experimental protocol. However, the datasets were limited to 2 or 3 alternatives, which is not realistic for a dataset people may want to visualize. The main reason for this limitation of previous work is that in numerical table representations it is hard to perform rapid attribute-to-attribute comparisons, and thus to recognize dominance relationships, between many alternatives.

Uniqueness Assessment of Human Mobility on Multi-Sensor Datasets

INSA-Lyon, UMR5205, F-69621, France. {antoine.boutet,sonia.benmokhtar,vincent.primault}@liris.cnrs.fr. Abstract. The widespread adoption of handheld devices (e.g., smartphones, tablets) makes mobility traces of users broadly available to third-party services. These traces are collected by means of various sensors embedded in the users' devices, including GPS, WiFi and GSM. We study in this paper the mobility of 300 users over a period of up to 31 months from the perspective of the above three types of data, with a focus on two cities, Lausanne (Switzerland) and Lyon (France). We found that users' mobility traces, whether collected using GPS, WiFi or GSM antennas, are highly unique. We show that on average only four spatio-temporal points from the WiFi, GSM and GPS traces are enough to uniquely identify 94% of the individuals, on both datasets. In addition, we show that using the temporal dimension (i.e., whether users move or are in a meaningful location such as their home or their workplace) drastically improves the capacity to uniquely identify them compared to exploiting only the spatial dimension (by 14% on average). In some cases, the temporal dimension alone can constitute a better mobility footprint than the spatial dimension to discriminate users. We further conduct a de-anonymisation attack to assess how mobility traces can be re-identified, and show that almost all users can be de-anonymised with a high success rate. Finally, we apply different location privacy protection mechanisms (LPPMs), including spatial filtering, temporal cloaking, adding spatial noise to mobility data, or using generalisation, and analyse the impact of these mechanisms on…
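A minimal sketch of the uniqueness measurement behind the "four points identify 94%" figure: sample p points from a user's trace and count how many users' traces contain them all. The trace representation is an assumption, not the study's data format:

```python
# Illustrative sketch (not the study's code): estimate how often p random
# spatio-temporal points from a user's trace single out that user.
import random

def uniqueness(traces, p=4, n_trials=1000, rng=random.Random(0)):
    """traces: dict user_id -> set of (cell_id, time_slot) points."""
    unique = 0
    for _ in range(n_trials):
        user = rng.choice(list(traces))
        pts = rng.sample(sorted(traces[user]), k=min(p, len(traces[user])))
        matches = [u for u, t in traces.items() if all(x in t for x in pts)]
        unique += (matches == [user])
    return unique / n_trials   # fraction of draws that pin down a single user
```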

Parallel Euclidean distance matrix computation on big datasets

…D consists of calculating each d_ij inside two nested loops, using BLAS-like routines or language-intrinsic routines (e.g., the dot product in Fortran). In practice, this standard algorithm has low performance even for small problem sizes. Unlike other distance matrices, the Euclidean distance matrix can be computed with a method based on matrix operations to achieve maximum performance. If we set…
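A minimal sketch of that matrix-operation formulation in NumPy, using the standard identity ||x_i − x_j||² = ||x_i||² + ||x_j||² − 2 x_i·x_j; this illustrates the idea, not the paper's parallel implementation:

```python
# Illustrative sketch: squared Euclidean distance matrix via one matrix product
# instead of two nested loops over pairs.
import numpy as np

def squared_distance_matrix(X):
    """X: (n, d) array of points; returns the (n, n) squared-distance matrix."""
    sq = np.sum(X * X, axis=1)                     # ||x_i||^2 for every row
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D2, 0.0)                     # clip tiny negative round-off
```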


Large scale statistical analysis of GEO datasets

…organization. Their excellent correlation (0.94) is not a surprise. Three datasets, BEC, DLB and MMD, have relatively good correlations with those of the above four groups (around 0.5), but no particular links with those groups, nor between themselves. The relative surprise comes from the weak correlations of PRS and the negative correlations of PVA. Both come from blood RNA samples, and they could have been expected to be close to the WBS, PLE, HPS, HAV group. That PRS and PVA are far from any other matrix can be explained by their inner heterogeneity. This is illustrated for PVA in Figure 2, where the values of the features ALPP and CA4 are represented: samples separate into 4 clusters, according to over- or underexpression of the two genes. As an example, if PVA is split into samples for which the value of ALPP is positive (overexpression) or negative (underexpression), and the row medians are calculated over the two submatrices as before, a correlation of −0.69 is found: thus one half of PVA has a strong negative correlation with the other half. Similar results are obtained for many other features. We considered that the heterogeneity of PVA and PRS did not qualify them for merging.

Benchmarking MRI Reconstruction Neural Networks on Large Public Datasets

3 AIM, CEA, CNRS, Université Paris-Saclay, Université Paris Diderot, Sorbonne Paris Cité. Abstract. Deep learning is starting to offer promising results for reconstruction in MRI. Many networks are being developed, but comparisons remain difficult because the frameworks used differ between studies, the networks are not properly re-trained, and the datasets used are not the same across comparisons. The recent release of a public dataset of raw k-space data, fastMRI, encouraged us to write a consistent benchmark of several deep neural networks for MR image reconstruction. This paper presents the results obtained for this benchmark, allowing the networks to be compared, and links to the open-source Keras implementations of all these networks. The main finding of this benchmark is that it is beneficial to perform more iterations between the image and the measurement spaces than to have a deeper per-space network.

Towards Algorithmic Analytics for Large-scale Datasets

Bayesian models are susceptible to the consequences of the curse of dimensionality [29]. These side effects of rich multi-variable phenotyping need to be tackled in an increasing number of modern neuroscience studies. Linear but flexible pattern-learning models have repeatedly yielded useful dimensionality reductions of high-dimensional subject descriptions and integration of different modalities of detailed measurements. In this spirit, CCA was used by Smith and colleagues to uncover population co-variation that links coupling measures of various brain networks and extensive phenotyping by a diversity of behavioral indicators [30]. Standard CCA can be viewed as reminiscent of classical statistics because this model is fitted based on MLE without deliberately imposing prior knowledge or bias that would guide parameter estimation. However, the same CCA method can be viewed as a prototypical approach suited for modern datasets because of in-built…
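A minimal sketch of the kind of CCA analysis described (brain connectivity features against behavioral indicators), using scikit-learn on random placeholder data rather than the cited study's measurements:

```python
# Illustrative sketch (not the cited study's pipeline): canonical correlation
# analysis between two views of the same subjects.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
brain = rng.normal(size=(100, 50))      # subjects x connectivity features
behavior = rng.normal(size=(100, 10))   # subjects x behavioral indicators

cca = CCA(n_components=3).fit(brain, behavior)
U, V = cca.transform(brain, behavior)   # paired canonical variates
print([np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(3)])  # canonical corrs.
```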

Evaluation of 23 gridded precipitation datasets across West Africa

The reanalysis P-datasets ERA-Interim, MERRA-2, JRA-55 and WFDEI performed better during the dry than during the wet season (Fig. 3). This agrees with previous results obtained over the CONUS (Beck et al., 2019). The authors explained that reanalysis P-datasets are better adapted to detecting large-scale stratiform systems, which are typical of the dry season, than unpredictable small-scale convective cells, which are typical of the wet season. On the contrary, only satellite-based P-datasets performed better during the wet than the dry season (Beck et al., 2019; Salles et al., 2019; Satgé et al., 2017a). Indeed, the irregular sampling of low-Earth-orbiting satellites and the limited number of overpasses hardly capture the short precipitation events that are typical of the dry season (Gebregiorgis and Hossain, 2013; Tian et al., 2009). Therefore, GSMaP-RT v.6 presented a better KGE value during the wet season than during the dry season (Fig. 3). The seasonal sensitivity of the other P-datasets, which incorporate satellite, reanalysis, or gauge-based information, shows a greater contrast because they combine different inputs.
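For reference, a minimal sketch of the Kling-Gupta Efficiency (KGE) score used to rank the P-datasets, in its standard Gupta et al. (2009) form; this is general background, not the paper's evaluation code:

```python
# Illustrative sketch: Kling-Gupta Efficiency of a simulated/estimated series
# against an observed reference series.
import numpy as np

def kge(sim, obs):
    r = np.corrcoef(sim, obs)[0, 1]          # linear correlation
    alpha = np.std(sim) / np.std(obs)        # variability ratio
    beta = np.mean(sim) / np.mean(obs)       # bias ratio
    return 1.0 - np.sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)
```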

Diffeomorphic Iterative Centroid Methods for Template Estimation on Large Datasets

The advantage of iterative methods like IC1 and IC2 is that the computation can be stopped at any step, resulting in a centroid built from part of the population. Thus, for large databases (composed of, for instance, 1000 subjects), it may not be necessary to include all subjects in the computation, since the weight of the last subjects will be very small. The iterative nature of IC1 and IC2 provides another interesting advantage, namely the possible online refinement of the centroid estimate as new subjects enter the dataset. This increases the possibility of interaction with the image analysis process. On the other hand, the recursive PW method has the advantage that it can be parallelized (still using a GPU implementation), although we did not implement this specific feature in the present work.
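A minimal sketch of the iterative-centroid idea in a plain Euclidean setting (not the diffeomorphic implementation), which makes explicit why late subjects contribute with very small weight:

```python
# Illustrative sketch (Euclidean analogy): each new subject moves the running
# centroid by a 1/k step, so the k-th subject has weight 1/k in the estimate.
def iterative_centroid(subjects):
    """subjects: sequence of same-shaped numeric arrays (e.g., NumPy arrays)."""
    centroid = subjects[0]
    for k, x in enumerate(subjects[1:], start=2):
        centroid = centroid + (x - centroid) / k   # online mean update
    return centroid
```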

Stability of ICA decomposition across within-subject EEG datasets.

For each subject, several clusters included ICs from all sessions, showing that, even when the recording sessions occurred on different days, ICA was able to identify recurring ICs.


Modelling aggregation on the large scale and regularity on the small scale in spatial point pattern datasets

…cesses have been derived via Palm theory, and their numerical evaluation can be obtained by approximations, cf. Andersen and Hahn (2015), while the spatial birth-death constructions are mathematically intractable. Another possibility is to consider a Gibbs point process with a well-chosen potential that incorporates inhibition at small scales and attraction at large scales. A famous example is the Lennard-Jones pair potential (Ruelle, 1969), and other specific potentials of this type can be found in Goldstein et al. (2015). Unfortunately, for Gibbs point processes the intensity and the pair correlation function are in general unknown, and simulation requires elaborate MCMC methods (Møller and Waagepetersen, 2004).
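For reference, the Lennard-Jones pair potential mentioned above has the standard form (repulsive at short range, attractive at longer range); this is general background, not a formula restated from the paper:

φ(r) = 4ε [ (σ/r)^12 − (σ/r)^6 ],  r > 0,

where ε > 0 sets the interaction strength and σ > 0 sets the distance scale at which the potential changes sign.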
