• Aucun résultat trouvé

Supplemental Methods File. Generation of a random number generator and a random sampling algorithm

N/A
N/A
Protected

Academic year: 2022

Partager "Supplemental Methods File. Generation of a random number generator and a random sampling algorithm"

Copied!
7
0
0

Texte intégral

(1)

Supplemental Methods File.

Generation of a random number generator and a random sampling algorithm

Uniform distribution of 1,000,000 random numbers is shown for the whole range in a histogram (Supplemental Figure 1) and dot-chart (Supplemental Figure 2). In addition, the algorithm does not show any gaps in its range. To show this we selected 10,000,000 random numbers and plotted their distribution in the range 0-300, using the bin window of 1. The plot shows that the algorithm does not make any gaps and all the numbers in the range 0-300 could be selected (Supplemental Figure 3).

The source code of the random sampling algorithm was used in this study to generate random numbers that were used in turn to select the lines from the input dataset. The user needs to specify at the entry how many lines are to be chosen by the program. We also introduced a limit that a single line of the input file could be selected only once. After compiling in Microsoft Visual C++, the user should type in the command line: the name of the exe file, followed by first the desired name of the output file and then the name of the input file, e.g. :

name_of_the_program.exe output_file.txt input_file.txt

After calling the exe file in such way, the user will be asked how many entry lines to select from the input file in a random manner.

As a final step, we wished to assess the robustness of comparisons of datasets of different sizes. Datasets of widely different sizes may be heteroscedastic, that is of unequal variance, which was linked to an overestimation of the difference between the datasets in some statistical analysis [1, 2]. However, random sub-sampling of the larger dataset may reduce the power of the test, as the size of the tested dataset becomes smaller. This latter

(2)

possibility was assessed using a simulation experiment, where two sets of 105 normally distributed numbers were picked randomly to have either identical means of 10, or different means of 10 and 11, using the R software (R Development Core Team, 2006. R:

A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. IURL http://www.R-project.org, R Software version 2.3.1).

The standard deviations were chosen to be the same and equal to 1 in all cases.

Then, the above-described random sampling algorithm was used to sub-sample these groups as performed in genomic analysis, to define subsets of either 104, 103, 102 or 101 values. A standard two-tailed t-test was used to compare selected sub-samples and the means and standard deviations of p-values were recorded. In either case of equal or non- equal means, the statistical comparison of datasets of identical sizes yielded similar p values as the comparisons of datasets of different sizes, when the dataset sizes were greater than or equal to 102(Supplemental Figure 4). However, the comparisons of sets of different means yielded erroneous conclusions (p-values >0.05) when at least one dataset of 101 values of was used. Thus, random sub-sampling does not decrease the power of the statistical test for datasets of sufficiently large sizes, e.g. of 100 values or more. In addition, random sampling of predicted binding affinities or ChIP-Seq tag occurrence did not change the variance of the subsets in comparison to the overall set of data (data not shown). Thus, we conclude that random sampling of genomic samples in itself will not introduce biases or decrease the robustness of statistically assays, or otherwise cause problems related to heteroscedacity in subsequent parametric statistical analysis.

1. Gamage J, Weerahandi S: Size performance of some tests in one-way anova.

Communications in Statistics - Simulation and Computation 1998, 27(3):625 - 640.

2. Wittkowski KM: Statistical analysis of unbalanced and incomplete designs - experiences with BMDP and SAS.Statistical software newsletter1991.

(3)

Supplemental Figures 1-4.

(4)

Supplemental figure 1. Uniform distribution created by the random number generator

Uniform distribution of random numbers is shown for the whole range (0-999,999) using a histogram and a bin size of 50,000.

(5)

Supplemental figure 2. Uniform distribution created by the random number generator

Uniform distribution of random numbers is shown for the whole range (0-999,999) using a dot-chart.

(6)

Supplemental figure 3. Uniform distribution created by the random number generator shown at the smaller scale

The distribution of 10,000,000 random numbers from 0-999,999 plotted for the range 0-300, using the bin size of 1.

(7)

A

B

Supplemental figure 4. Statistical properties of random sampling method

Two sets of 105 numbers were chosen to have a normal distribution and either identical means of 10, or different means of 10 and 11. The standard deviations were chosen to be the same and equal to 1. We next randomly sub-sampled these datasets to define subsets of either 104, 103, 102 or 101numbers. We performed statistical comparisons of the two datasets using two-tailed t-test either by comparing sampled datasets of the same size or by comparing the sampled sets with the dataset of original size. Each graph shows a distribution of the obtained p-values.A. Comparison of samples of equal means.B. Comparison of samples of non-equal means.

105vs.104 105vs.103 105vs.102 105vs.101

p value

Randomly selected datasets of different sizes

Diferent size random datasets

-0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5

10^5vs10^4 10^5vs10^3 10^5vs10^2 10^5vs10^1

p-value

105vs.104 105vs.103 105vs.102 105vs.101

p value

Randomly selected datasets of different sizes Same size random datasets

-0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5

10^4vs10^4 10^3vs10^3 10^2vs10^2 10^1vs10^1

p-value

104vs.104 103vs.103 102vs.102 101vs.101

p value

Randomly selected datasets of same size

p value

Randomly selected datasets of same size

104vs.104 103vs.103 102vs.102 101vs.101 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Références

Documents relatifs

Moreover, to justify that the successive digits of π (or any other given real number) in a given base b can be taken as random, it would be good to know that this sequence of digits

Especially, that will not be necessary because we have to avoid this problem by using sequences which are really samples of random variables and by study- ing the properties that

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

The EER value computed from DataSU is used to qualify the capacity of synthetic Keystroke dynam- ics data to be indistinguishable from real Keystroke dynamics data.. Thus, an EER of

Consequently, we show that the distribution of the number of 3-cuts in a random rooted 3-connected planar triangula- tion with n + 3 vertices is asymptotically normal with mean

Keywords: random matrices, norm of random matrices, approximation of co- variance matrices, compressed sensing, restricted isometry property, log-concave random vectors,

We use the finite Markov chain embedding technique to obtain the distribution of the number of cycles in the breakpoint graph of a random uniform signed permutation.. This further

We propose in this paper an original approach for the generation of synthetic keystroke data given samples from known users as a first step towards the generation of