Estimating and accounting for genotyping errors in RAD-seq experiments

(1)

Estimating and accounting for genotyping errors in RAD-seq

experiments

Supporting Information

Luisa Bresadola

1,*

, Vivian Link

1,2,*

, C. Alex Buerkle

3

, Christian Lexer

3

, and Daniel Wegmann

1,2

1

_{Department of Biology, University of Fribourg, Fribourg, Switzerland}

2

_{Swiss Institute of Bioinformatics, Fribourg, Switzerland}

3

_{Department of Botany, University of Wyoming, Laramie, WY, USA}

4

_{Department of Botany and Biodiversity Research, University of Vienna, Vienna, Austria}

*

_{These authors contributed equally}

Inferring Genotyping Error Rates

Error rate Model

Here we describe algorithms to infer per-allele genotyping error rates from hard genotype calls gil obtained for

individuals i = 1, . . . , I at loci l = 1, . . . , L. Let gil= 0, 1, 2 reflects the number of reference alleles and γil= 0, 1, 2

the corresponding true genotypes. Further, g = {g11, . . . , g1L, . . . , gIL} and γ = {γ11, . . . , γ1L, . . . , γIL}.

We assume a simple error model with the per allele genotyping error rates = {0, 1} such that

P(g|γ, 0, 1) =                      (1 − 0)2 if g = γ; γ = 0, 2 2 1+ (1 − 1)2 if g = γ = 1 2 0 if g = 0; γ = 2 or g = 2; γ = 0 20(1 − 0) if g = 1; γ = 0, 2 (1 − 1) if g = 0, 2; γ = 1 . (1)

Often, not all genotype calls are affected by the same genotyping error rates. For instance, genotyping error rates might vary between calls obtained with different sequencing depth or between samples of different batches such as from different libraries or sequencing experiments. We account for this by stratifying the error rates by different sets s = 1, . . . , S of genotypes and denote the corresponding error rates by 0(s) and 1(s), respectively. We further

denote by sil the set to which genotype call gil belongs. Finally, 0 = {0(1), . . . , 0(S)}, 1 = {1(1), . . . , 1(S)}

and = {0, 1}.

Inferring Genotyping Error Rates by Comparing to a Truth Set

We first describe an algorithm to infer per-allele genotyping error rates by comparing calls to a truth set, i.e. the case in which all γil are known. Under the assumption that genotyping errors are independent between loci and

individuals, the likelihood of the full data is given by P(g|γ, ) = I Y i=1 L Y l=1 P(gil|γil, ) where P(gil|γil, ) = P(gil|γil, 0(sil), 1(sil)) according to (1).

In order to obtain a maximum likelihood (ML) estimate of , we numerically maximize the likelihood function by finding such that

∂

∂log P(g|γ, ) = 0.

Consider a table with entries ng,γ(s) reflecting the counts of observing genotype g at a site with true genotype

γ for all calls of set s. Given this table, the log-likelihood function can be calculated as log P(n|) = 2 X g=0 2 X γ=0 ng,γ(s) log P(g|γ, ) = X s a0(s) log 0(s) + b0(s) log(1 − 0(s)) + . . .

(3)

where C is a constant independent of and a0(s) = 2n0,2(s) + n1,0(s) + n1,2(s) + 2n2,0(s) a1(s) = n0,1(s) + n2,1(s) b0(s) = 2n0,0(s) + n1,0(s) + n1,2(s) + 2n2,2(s) b1 = n0,1+ n2,1 c0 = n1,1

From this we get for 0(s)

∂ ∂0(s)log P(n|) = a 0(s) − b 1 − 0(s) = 0, and the MLE estimate

ˆ 0(s) = a0(s) a0(s) + b0(s) . For 1(s) we get ∂ ∂1(s)log P(n|) = −2[a 1(s) + b1(s) + 2c(s)]1(s)3+ 2[2a1(s) + b1(s) + 3c(s)]1(s)2− . . . . . . − [3a1(s) + b1(s) + 2c(s)]1(s) + a1(s),

which we set to zero using a numerical search algorithm to obtain the MLE estimate1ˆ(s).

Inferring Genotyping Error Rates from Sample Replicates

We next develop an algorithm to infer genotyping error rates from individuals for which genotypes were estimated from multiple, independent replicate data sets. Consider inferred genotypes gi = {g

(1) i , . . . , g

(ri)

i } for multiple

individuals i = 1, . . . , I, where g(j)_i = {g(j)_i1 , . . . , g_iL(j)} indicates the inferred genotypes of individual i at sites l = 1 . . . , L from the jth replicate. The likelihood of the full data is then given by

P(g|) = I Y i=1 L Y l=1 2 X g=0 P(γil= g) ri Y j=1 P(gil(j)|γil= g, 0(s (j) il ), 1(s (j) il )), (2)

where P(γil = g) = fig is the frequency of genotype g among all sites typed in individual i, s (j)

il the set to which

call g(j)_il _{belongs, and P(g}_il(j)|γil= g, 0(s (j) il ), 1(s

(j)

il )) is given by (1).

In order to find the MLE estimates of and f = {fi, . . . , fI}, fi = {fi0, fi1, fi2} we will employ an EM algorithm.

E-step. The expected complete data likelihood is given by Q(, f ; 0, f0) = _E[`c(, f )|g, 0, f0] = I X i=1 L X l=1 2 X g=0 P γil= g|gil, 00(s (j) il ), 0 1(s (j) il ), f 0  log fig+ ri X j=1 log Pg(j)_il |γil= g, 00(s (j) il ), 0 1(s (j) il )  , where we used g_il= {g(1)_il , . . . , g(ri)

il } and we have, according to Bayes formula,

P γil= g|gil, 00(s (j) il ), 0 1(s (j) il ), f 0 = f_ig0 Qri j=1P g_il(j)|γil= g, 00(s (j) il ), 0 1(s (j) il ) P2 h=0f 0 ih Qri j=1P g(j)_il |γil= h, 00(s (j) il ), 0 1(s (j) il ) . M-step. We have to maximize Q subject to the constraints

(4)

for all individuals i = 1, . . . , I, we form the Lagrangian L(, f , µ) = Q − I X i=0 µi 2 X g=0 fig− 1 ! , (4)

where µ = {µ1, . . . , µI} is the vector of Lagrangian multipliers. All f can be maximized analytically. We get the

following derivatives of the Lagrangian: ∂ ∂µi L = 2 X g=0 fig− 1 = 0 ∂ ∂fig L = 1 fig L X l=1 P γil= g|gil, 00(s (j) il ), 0 1(s (j) il ), f 0 − µi= 0.

From these we get

fig = 1 L L X l=1 P γil= g|gil, 00(s (j) il ), 0 1(s (j) il ), f 0 .

In contrast, we will need to maximize Q with respect to each 0(s) and 1(s) numerically. For this, we need to

maximize Q0(s) = I X i=1 L X l=1 X g∈{0,2} I(s(j)_il _{= s)P(γ}il= g|gil, 0 0(s), f 0₎ ri X j=1 log P(g(j)il |γil= g, 0(s)), Q1(s) = I X i=1 L X l=1 I(s(j)_il _{= s)P(γ}il= 1|gil, 1(s), f0) ri X j=1 log P(gil(j)|γil= 1, 1(s)),

where the sums run only over the genotypes of set s, as given by the indicator function I(s(j)_il = s) =    1 if s(j)_il = s, 0 otherwise.

Inferring Genotyping Error Rates from Population Samples

We finally develop an algorithm to infer genotyping error rates from multiple samples from each of multiple populations. Consider observed genotypes gpil from individual i = 1 . . . , np from population p = 1, . . . , P at locus

l = 1, . . . , L. Let us further denote by gpi = {gpi1, . . . , gpiL} and by gp = {gp1, . . . , gpnp}. The likelihood of the

full data given per allele genotyping error rates = {1, . . . , S} for different sets s = 1, . . . , S of genotype calls is

then given by P(g|) = P Y p=1 np Y i=1 L Y l=1 2 X g=0

P(γpil= g|fpl)P (gpil|γpil= g, 0(spil), 1(spil)) , (5)

where g = {g1, . . . , gP}, fpl is the frequency in population p of one of the two alleles at locus l, and under the

assumption of Hardy-Weinberg equailibrium in each population, P(gpil|fpl) = ₂ gpil fgpil pl (1 − fpl) 2−gpil₌          (1 − fpl)2 if gpil= 0, 2f (1 − fpl) if gpil= 1, f_pl2 if gpil= 2. (6) Finally, spil is the set of call gpil and P

g_il(j)|γil, 0(s), 1(s)

(5)

Maximum Likelihood Inference

In order to find the MLE estimates of and f = {f₁, . . . , f_P}, f_p = {fp1, . . . , fpL} we will employ an EM

algorithm.

E-step. The expected complete data likelihood is given by Q(, f ; 0, f0) = _E[`c(, f )|g, 0, f0] = P X p=1 np X i=1 L X l=1 2 X g=0

P γpil= g|gpil, 00(spil), 01(spil), fplg0

h

log fplg+ log P (gpil|γpil= g, 0(spil), 1(spil))

i , According to Bayes formula,

P γpil= g|gpil, 0(spil), 1(spil), fplg0 =

P (gpil|γpil= g, 0(spil), 1(spil)) fplg0

P2

h=0P (gpil|γpil= h, 0(spil), 1(spil)) f 0 plh

. M-step. We have to maximize Q subject to the constraints

2

X

g=0

fplg= fpl0+ fpl1+ fpl2= 1

for all populations p = 1, . . . , P , we form the Lagrangian L(, f , µ) = Q − P X p=1 µp 2 X g=0 fplg− 1 ! ,

where µ = {µ1, . . . , µP} is the vector of Lagrangian multipliers. All f can be maximized analytically. We get the

following derivatives of the Lagrangian: ∂ ∂µp L = 2 X g=0 fplg− 1 = 0 ∂ ∂fplg L = 1 fplg np X i=1 P(γpil= g|gpil, 0 0(spil), 01(spil), fplg0 ) − µp= 0.

From these we get

fplg= 1 np np X i=1

P(γpil= g|gpil, 00(spil), 01(spil), fplg0 ).

In contrast, we will need to maximize Q with respect to each snumerically. For this, we need to maximize

Qs = P X p=1 np X i=1 L X l=1 I(spil= s) 2 X g=0

P γpil = g|gpil, 00(spil), 01(spil), fplg0

h

log fplg+log P (gpil|γpil= g, 0(spil), 1(spil))

i , where the sums run only over the genotypes of set s, as given by the indicator function

I(spil= c) =    1 if spil= s, 0 otherwise. Bayesian Inference

The MLE estimator introduced above results in an underestimation of the error rates because the allele frequencies are estimated too close to the data, i.e. the MLE of the allele frequencies are biased. Importantly, this problem only disappears when using excessively many samples, but not when using more loci since an allele frequency must be estimated for every locus. To obtain meaningful estimates also for reasonable sample sizes, we develop a

(6)

P(g|, f ) = P(g|, f )P()P(f ) P(, f )

, (7)

assuming an exponential prior on P() =

C

Y

s=1

[P(0(s))P(1(s))]; 0(s), 1(s) ∼ Exponential(λ)

and the uninformative Jeffrey’s prior on P(f ) P(f ) = P Y p=1 L Y l=1 P(fpl); P(fpl) ∼ Beta(0.5, 0.5)

Since P(, f ) can not be evaluated analytically, we resort to an MCMC algorithm to generate samples from the posterior distribution P(g|, f ).

Supplementary Tables

Table S1: Samples included in the RAD vs GBS comparison. The RAD data is available through the SRA Biproject PRJNA528699. Compared genotypes are available at Zenodo through DOI 10.5281/zenodo.2604109.

F008 01 F008 15 F009 12 F021 06 F026 06 F030 03 F032 03 F033 05 F039 05 I373 05 F008 02 F008 16 F009 13 F021 07 F026 07 F030 04 F032 04 F033 06 F039 06 I373 06 F008 03 F008 17 F020 01 F022 01 F026 08 F030 05 F032 05 F033 07 F039 07 I373 07 F008 04 F009 01 F020 02 F022 02 F026 09 F030 06 F032 06 F036 01 I345 01 I396 01 F008 05 F009 02 F020 03 F022 03 F026 10 F030 07 F032 08 F036 02 I345 02 I396 02 F008 06 F009 03 F020 04 F022 04 F026 11 F031 01 F032 09 F036 03 I345 03 I396 03 F008 07 F009 04 F020 05 F022 05 F026 12 F031 02 F032 10 F036 04 I345 04 I396 04 F008 08 F009 05 F020 06 F022 06 F026 13 F031 03 F032 11 F036 05 I345 05 I396 05 F008 09 F009 06 F020 07 F022 07 F026 14 F031 04 F032 12 F036 06 I345 06 I396 06 F008 10 F009 07 F021 01 F026 01 F026 15 F031 05 F032 13 F036 07 I345 07 I396 07 F008 11 F009 08 F021 02 F026 02 F026 16 F031 06 F033 01 F039 01 I373 01 F008 12 F009 09 F021 03 F026 03 F026 17 F031 07 F033 02 F039 02 I373 02 F008 13 F009 10 F021 04 F026 04 F030 01 F032 01 F033 03 F039 03 I373 03 F008 14 F009 11 F021 05 F026 05 F030 02 F032 02 F033 04 F039 04 I373 04

(7)

Supplementary Figures

Figure S1: Accuracy of error rate estimates from two (bottom row), three (middle row) and five (top row) replicates per samples. Shown are the estimated error rates relative to the true error rates (red line) of 100 replicates for different samples sizes n (shown on top), the two error rates 0.1 and 0.01 (first and last two columns, respectively) and different numbers of unlinked loci. The horizontal dashed lines indicate Q2, the interval within which an estimate is less than a factor of two away from the true value.

0 5 10 15 20 25 30 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 Relativ e bias of ε0 Depth 50% 55% 60% 70% 80% 90% 95% 99% ε0=0.01 0 5 10 15 20 25 30 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 Relativ e bias of ε1 Depth 50% 55% 60% 70% 80% 90% 95% 99% ε1=0.01 0 5 10 15 20 25 30 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 Relativ e bias of ε0 Depth 50% 55% 60% 70% 80% 90% 95% 99% ε0=0.1 0 5 10 15 20 25 30 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 Relativ e bias of ε1 Depth 50% 55% 60% 70% 80% 90% 95% 99% ε1=0.1

Figure S2: Distribution of the relative bias when inferring error rates from a truth set generated with next-generation sequencing data with a specific mean depth. Each line indicates a quantile q such that a smaller bias was observed in a fraction q of 106 simulations of 104 homozygous and 104heterozygous loci. The left two panels

(8)

Figure S3: Distribution of the allelic depth observed in the RAD-seq data for genotypes found to be homozy-gous reference (left, 8,701 genotypes), heterozyhomozy-gous (middle, 6,022 genotypes) and homozyhomozy-gous alternative (right, 1,914 genotypes) genotypes in the GBS truth set exploiting familial relationships. Allelic depth distributions are normalized per total depth for plotting (i.e. they sum to 1 per top-left to lower-right diagonal).

Figure S4: Accuracy of error rate estimates under the population sample model if data deviates from Hardy-Weinberg proportion as quantified by the inbreeding coefficient F . Shown are the estimated error rates relative to the true error rates (red line) of 100 replicates for different samples sizes n (shown on top), the two error rates 0.1 and 0.01 (first and last two columns, respectively) and different numbers of unlinked loci. The horizontal dashed lines indicate Q2, the interval within which an estimate is less than a factor of two away from the true value.

Estimating and accounting for genotyping errors in RAD-seq experiments