
A Bayesian framework for high-throughput T cell receptor pairing


The MIT Faculty has made this article openly available.


Citation: Holec, Patrick V et al. "A Bayesian framework for high-throughput T cell receptor pairing." Bioinformatics 35, 8 (September 2018): 1318–1325 © 2018 The Author(s)

As Published: http://dx.doi.org/10.1093/bioinformatics/bty801

Publisher: Oxford University Press (OUP)

Version: Author's final manuscript

Citable link: https://hdl.handle.net/1721.1/126158

Terms of Use: Creative Commons Attribution-Noncommercial-Share Alike


Supplementary Figures

Patrick V. Holec, Joseph Berleant, Mark Bathe, and Michael Birnbaum


Figure S1. Robustness on varying degrees of simulated noise. Simulations were performed using the default simulated repertoire and sample (1000 clonotypes, 96 wells, 100 cells/well, power-law distribution, α = 2). Two types of experimental noise were considered. Chain deletion refers to the probability that each chain, independently, will fail to be observed during sequencing after placement in a well. This noise covers sources such as mRNA degradation, amplification failures, and sequencing subsampling. Clonal matches (A) and repertoire coverage (B) show high resilience to this noise in experimentally relevant regimes (est. 10-20%). The second type of noise is chain misplacement, which refers to the independent probability each chain migrates randomly


Figure S2. Performance of MAD-HYPE over clonal populations with varying dual clone probability. We define the dual clone probability p_dual such that, for any clone, the probability that it has two α chains is p_dual and the probability that it has two β chains is also p_dual. Whether a clone has two α chains is determined independently of whether it has two β chains, so the probability of a clone having two α and two β chains is p_dual^2. Note that here clonal matches refers to the percent of correct αβ pairings that were predicted, and that MAD-HYPE was not used to determine if clones contained three or more chains. Each dual clone contributed two chain pairs (if it contained two α or two β chains) or three chain pairs (if it contained two α and two β chains) to the total number of correct chain pairs. Repertoire coverage is defined here as the sum of clonal frequencies for which all corresponding chain pairs were identified. (A) and (C) show clonal matches and repertoire coverage without chain sharing. The performance of MAD-HYPE is not significantly affected by the presence of dual clones because it still effectively identifies all chain pairs within the frequency band for which it has strong resolving power (Figure S11C-D). (B) and (D) show clonal matches and repertoire coverage with chain sharing (i.e. in which α and β chains may appear in multiple distinct clones). Chain sharing probabilities were taken from Lee et al. (2017), which used an average of probabilities found in previous studies. MAD-HYPE performance remains high over a broad range of dual clone probabilities. However, in the presence of chain sharing, repertoire coverage decreases slightly while the clonal match percentage stays consistent. This indicates that for some dual clones only some of the corresponding chain pairs are identified. In all plots, the colored bar shows the median while the central black bar shows the mean over 10 simulations. All simulations were run with f_max = 5%, α = 2, repertoires of size 1000 clones, well partitions as in Figure S1 (two 48-well partitions of 50 cells/well and 1000 cells/well), and a 10% probability that a chain will fail to be sequenced each time it appears in a well.
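The counting rule in the Figure S2 caption can be turned into a small worked calculation. The sketch below is only an illustration of that rule, assuming the two dual-chain events are independent as stated; the function name `expected_chain_pairs` is hypothetical and is not part of MAD-HYPE.

```python
def expected_chain_pairs(p_dual: float) -> float:
    """Expected number of correct chain pairs contributed by one clone.

    Uses the counting rule from the caption: 1 pair for a single clone,
    2 pairs if exactly one chain type is doubled, 3 pairs if both are.
    """
    p_single = 1.0 - p_dual
    one_pair = p_single * p_single        # one α chain and one β chain
    two_pairs = 2.0 * p_dual * p_single   # two α chains or two β chains, not both
    three_pairs = p_dual * p_dual         # two α chains and two β chains
    return 1 * one_pair + 2 * two_pairs + 3 * three_pairs


for p in (0.0, 0.1, 0.33, 0.5):
    print(f"p_dual={p:.2f}  expected pairs per clone={expected_chain_pairs(p):.3f}")
```

Under these assumptions the expression simplifies to 1 + 2·p_dual, so the total number of correct chain pairs grows linearly with the dual clone probability.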


Figure S3. Performance of MAD-HYPE over clonal populations with varying skew in the distribution of clonal frequencies. Distributions of clonal frequencies follow a power law, P(f) ∝ f^(−α), and are defined by the power-law constant α and the maximum clonal frequency f_max. Higher α indicates clones are more concentrated in the low-frequency range. (A) and (C) show performance without chain sharing. The percentage of clones identified improves with increasing α because more clones end up in the frequency band for which MAD-HYPE effectively identifies clones with these cell/well values. (B) and (D) show performance with chain sharing probabilities taken from Lee et al. (2017), which averaged experimentally determined chain sharing probabilities from multiple prior studies. Although repertoire coverage was not impacted by α in (A) and (C), chain sharing reduced overall performance so that repertoire coverage decreased with higher α. The reason for this performance decrease is unclear. In all plots, the colored bar shows the median while the central black bar shows the mean over 10 simulations. All simulations were run with f_max = 1%, dual clone probability of 33%, repertoires of size 1000 clones, well partitions as in Figure S1 (two 48-well partitions of 50 cells/well and 1000 cells/well), and a 10% probability that a chain will fail to be sequenced each time it appears in a well.
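To make the role of α concrete, the following sketch draws frequencies from P(f) ∝ f^(−α) by inverse-transform sampling. It is not the authors' simulation code: the lower bound `f_min`, the function names, and the choice to leave the samples unnormalized are all assumptions made for illustration.

```python
import random


def sample_power_law(n, alpha, f_min, f_max, rng):
    """Draw n values from P(f) ∝ f^(-alpha) on [f_min, f_max], for alpha != 1.

    f_min is an assumed lower cutoff; the caption only specifies alpha and f_max.
    """
    k = 1.0 - alpha
    lo, hi = f_min ** k, f_max ** k
    # Invert the CDF F(f) = (f^k - lo) / (hi - lo) at a uniform random point.
    return [(lo + rng.random() * (hi - lo)) ** (1.0 / k) for _ in range(n)]


def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


rng = random.Random(0)
shallow = sample_power_law(1000, 1.5, 1e-5, 0.01, rng)
steep = sample_power_law(1000, 3.0, 1e-5, 0.01, rng)
print(median(shallow), median(steep))
```

With these illustrative settings the steeper exponent gives a lower median frequency, which matches the caption's statement that higher α concentrates clones in the low-frequency range.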


Figure S4. Performance on constant frequency distributions and sensitivity to priors. To demonstrate that the algorithm is insensitive to distribution type, we simulated repertoires with a constant clonal frequency, and used MAD-HYPE to deconvolute samples with 48 wells (A) and 96 wells (B). We note a largely linear log-log relationship between the number of cells/well and the clonal frequencies that are effectively resolved. This can be used as a map to identify which clonal frequencies a sample will successfully identify. The MAD-HYPE algorithm was robust to the prior on the distribution of frequencies in the sample. Varying the defining parameter (α) across two orders of magnitude resulted in almost no change in either clonal matches (C) or repertoire coverage (D). Colored bars are medians and central black bars are means.
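One way to see why a roughly linear log-log relationship is plausible is to look at well occupancy under simple random sampling. The sketch below is an illustrative calculation rather than the authors' analysis: it assumes each cell in a well is drawn independently from the repertoire, and both function names are hypothetical.

```python
import math


def well_occupancy(frequency, cells_per_well):
    """Probability that a well holds at least one cell of a clone."""
    return 1.0 - (1.0 - frequency) ** cells_per_well


def half_occupancy_frequency(cells_per_well):
    """Clonal frequency at which half of the wells are expected to contain the clone."""
    return 1.0 - 0.5 ** (1.0 / cells_per_well)


for cells in (10, 100, 1000, 10000):
    f_half = half_occupancy_frequency(cells)
    print(cells, f_half, math.log10(f_half))
```

Under this assumption the half-occupancy frequency is close to ln(2) divided by the number of cells per well, so it falls by about one decade for each tenfold increase in cells per well.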

[Figure S5 plot: fraction of clones identified (y-axis, 0.0 to 0.8) versus clone frequency (x-axis, log scale from 10^-4 to 10^-2), with curves for MAD-HYPE and ALPHABETR.]

Metric | ALPHABETR | MAD-HYPE
# clones | 656.2 ± 167.1 | 968.1 ± 20.0
% repertoire | 62.7 ± 15.1 | 55.4 ± 1.8

Figure S5. Plot of the fraction of clones identified at various frequencies, for simulated repertoires of 3000 clones, with 300 cells/well across a 96-well plate. A larger repertoire size was used here to cover a wide range of clonal frequencies. As noted in Lee et al. (2017), ALPHABETR successfully identifies high-frequency clones; however, MAD-HYPE identifies more low-frequency clones. This discrepancy causes ALPHABETR to sometimes outperform MAD-HYPE in repertoire coverage despite identifying fewer clones. We note that multicell-per-well algorithms are generally of interest for identifying high numbers of low-frequency clones, and that identification of high-frequency clones may be better suited to more conventional single-cell sequencing approaches. Results were calculated with an FDR cutoff of 5%. Plot and table show averages and standard deviations from 20 simulations. The plot was smoothed using a moving window average applied to the log-transformed frequency, so that each window includes frequencies within a factor of 1.5×.
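The smoothing step described in the caption can be sketched as follows. This is a minimal illustration, assuming the window extends a factor of 1.5 on either side of each point; the exact window definition used for the figure is not given, and `smooth_log_window` is a hypothetical name.

```python
import math


def smooth_log_window(freqs, values, factor=1.5):
    """Average values over points whose frequency lies within `factor` of each point."""
    half_width = math.log(factor)
    logs = [math.log(f) for f in freqs]
    smoothed = []
    for center in logs:
        # Keep every point whose log-frequency is within the window around `center`.
        window = [v for x, v in zip(logs, values) if abs(x - center) <= half_width]
        smoothed.append(sum(window) / len(window))
    return smoothed


freqs = [1e-4, 1.4e-4, 2e-4, 1e-3]      # illustrative clone frequencies
identified = [0.0, 1.0, 1.0, 0.5]       # illustrative fraction identified
print(smooth_log_window(freqs, identified))
```

Working in log space makes the window a constant ratio rather than a constant difference, which suits a frequency axis spanning several orders of magnitude.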

[Figure S6: panels A-F, grouped as top 100, top 250, and top 500 matches, for MAD-HYPE and ALPHABETR.]

Figure S6. False detection rates (FDR) for MAD-HYPE and ALPHABETR over varying chain attrition probabilities. Simulations were performed with a clonal repertoire of 1000 clones, 300 cells per well, chain sharing probabilities described by Lee et al. (2017), and no dual clones. In all plots, central black bars are means and colored bars are medians, computed over 10 simulations. MAD-HYPE outperforms ALPHABETR in all conditions, and exhibits negligible FDR for chain deletion probabilities at or below 20%. (A-B) FDR for varying chain deletion probability over the top 100 predicted matches for MAD-HYPE and ALPHABETR, respectively. (C-D) FDR for varying chain deletion probability over the top 250 predicted matches for MAD-HYPE and ALPHABETR, respectively. (E-F) FDR for varying chain deletion probability over the top 500 predicted matches for MAD-HYPE and ALPHABETR, respectively.
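For readers who want to reproduce this kind of metric on their own simulated data, the sketch below computes the fraction of incorrect pairs among the k highest-scoring predictions. It assumes predictions arrive as (pair, score) tuples with higher scores meaning higher confidence; that data layout and the name `fdr_at_top_k` are illustrative assumptions, not the interface of MAD-HYPE or ALPHABETR.

```python
def fdr_at_top_k(predictions, true_pairs, k):
    """Fraction of the k highest-scoring predicted pairs that are not true pairs."""
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    false_hits = sum(1 for pair, _score in ranked if pair not in true_pairs)
    return false_hits / len(ranked)


true_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
predictions = [
    (("a1", "b1"), 9.1),
    (("a2", "b2"), 7.4),
    (("a2", "b3"), 5.0),  # incorrect pairing
    (("a3", "b3"), 4.2),
]
print(fdr_at_top_k(predictions, true_pairs, 2))
print(fdr_at_top_k(predictions, true_pairs, 4))
```

Evaluating at several values of k, as in the top 100, 250, and 500 panels, shows how quickly false matches accumulate as lower-confidence predictions are admitted.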

[Figure S7 panel annotations, as recovered from the figure:
Legend: fit distribution; experimental data; ❌ marks matches under the FDR threshold; arrow indicates increasing match confidence.
Panels using 48 wells at 500 cells/well and 48 wells at 10,000 cells/well: 17721/17721 clonal matches, 95.7% repertoire coverage; 16514/17674 clonal matches, 68.4% repertoire coverage.
Panels using 48 wells at 50 cells/well and 48 wells at 1000 cells/well: 995 clonal matches, 99.3% repertoire coverage; 996 clonal matches, 99.4% repertoire coverage.
Fitted power-law constants shown: α = 2.12 and α = 2.30.
Additional annotation: 9617 matches, 97.0%.]

Figure S7. Experimental recommendations, based on sample type. Bolkhovskaya et al. found that peripheral blood T cell repertoires had varied distributions depending on age. Their observed parameters were used as seeds for simulated repertoires. Peripheral blood simulations were performed with 10,000 clonotypes each. (A) For simulations representing patients ages 9-25, repertoires were highly diverse with α = 2.21 and f_max = 0.25%. (B) For simulations representing patients ages 61-66, repertoires were expanded with α = 2.15 and f_max = 8.6%. Both repertoire populations were well-resolved using 48 wells with 500 cells/well, and 48 wells with 10,000 cells/well. (C) Tissue lymphocyte and (D) tumor-infiltrating lymphocyte data from Zheng et al. 2017 were used to fit power-law distributions. Both sample types were resolved using 48 wells at 50 cells/well and 48 wells at 1000 cells/well, with repertoires containing 1000 clonotypes, as observed in Sherwood et al. 2013.

[Figure S8 plot: computation time in seconds (y-axis, log scale) versus cells per well (x-axis, 0 to 2000), for MAD-HYPE and ALPHABETR.]

Figure S8. Compute times required for MAD-HYPE and ALPHABETR for varying cell counts per well. MAD-HYPE compute times are expected to scale approximately as O(n^2) in the total number of observed chains because each α and β chain pair is independently considered. ALPHABETR compute times are expected to scale as O(n^3) in the total number of observed chains because it internally uses the Hungarian algorithm to assign α and β chains. In addition, ALPHABETR times will scale as O(n^2) in the number of unique chains per well, due to its scoring heuristic (MAD-HYPE times do not depend on the number of chains per well, except as this affects the total number of observed chains). In all cases, MAD-HYPE analyzes an experiment significantly faster than ALPHABETR. Experiments over larger repertoires or with greater numbers of cells per well are expected to increase the time difference between the two algorithms. Larger simulations were not computationally feasible because of the time required to analyze with ALPHABETR. For comparison, Experiments 1 and 2 described in Howie et al. (2015) used 2,000 cells per well and 80,000 cells per well, respectively. All simulations were performed with a repertoire of 1000 clones, f_max = 5%, FDR cutoff of 5%, and the chain sharing probabilities proposed by Lee et al. (2017). Plotted values show compute times averaged over 15 simulations. Error bars show standard deviations in the compute time required. ALPHABETR compute times were calculated using the R implementation released by Lee et al. (2017).
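The quadratic scaling follows directly from visiting every candidate α-β pair once. The sketch below shows only that enumeration pattern, using a shared-well count as a stand-in statistic; it does not reproduce MAD-HYPE's Bayesian scoring, and the function name and data layout are assumptions.

```python
from itertools import product


def co_occurrence_counts(alpha_wells, beta_wells):
    """Count shared wells for every (α, β) candidate pair.

    alpha_wells and beta_wells map a chain identifier to the set of wells
    in which that chain was observed.
    """
    counts = {}
    # One independent evaluation per pair: len(alpha_wells) * len(beta_wells) in total.
    for (a, wells_a), (b, wells_b) in product(alpha_wells.items(), beta_wells.items()):
        counts[(a, b)] = len(wells_a & wells_b)
    return counts


alpha_wells = {"a1": {0, 1, 2}, "a2": {2, 3}}
beta_wells = {"b1": {0, 1, 2}, "b2": {3}, "b3": {4}}
counts = co_occurrence_counts(alpha_wells, beta_wells)
print(len(counts))
print(counts[("a1", "b1")], counts[("a2", "b2")], counts[("a1", "b3")])
```

Because each pair is evaluated independently of the others, this kind of loop is also straightforward to split across CPU cores.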


Figure S9. Well partition performance on simulated repertoires. To demonstrate the efficacy of well partitions in identifying clonotypes, simulations were performed with varying well counts and cells per well. Other parameters were set using defaults (1,000 clonotypes, power-law distribution, α = 2, 96 wells, 100 cells per well, 10% chain deletion rate). Four different settings for the number of wells were used: (A) 12 wells, (B) 24 wells, (C) 36 wells, and (D) 48 wells.
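A simulated experiment of this kind can be sketched in a few lines. The version below is a simplified illustration, not the authors' simulator: it assumes every clone has exactly one α and one β chain (both labeled by the clone index), applies chain deletion once per distinct chain per well, and ignores chain misplacement, dual clones, and chain sharing. The function name `simulate_wells` is hypothetical.

```python
import random


def simulate_wells(frequencies, n_wells, cells_per_well, deletion_prob, rng):
    """Return, per well, the sets of α and β chain ids observed after chain deletion."""
    clones = list(range(len(frequencies)))
    wells = []
    for _ in range(n_wells):
        # Sample the cells in this well according to clonal frequency.
        cells = rng.choices(clones, weights=frequencies, k=cells_per_well)
        alphas, betas = set(), set()
        for clone in sorted(set(cells)):
            # Each chain independently fails to be observed with probability deletion_prob.
            if rng.random() >= deletion_prob:
                alphas.add(clone)
            if rng.random() >= deletion_prob:
                betas.add(clone)
        wells.append((alphas, betas))
    return wells


rng = random.Random(0)
frequencies = [0.4, 0.3, 0.2, 0.1]  # illustrative clonal frequencies, not from the paper
wells = simulate_wells(frequencies, n_wells=12, cells_per_well=5, deletion_prob=0.1, rng=rng)
print(len(wells), wells[0])
```

Varying `n_wells` and `cells_per_well` in such a sketch mirrors the parameter sweep shown in panels (A) through (D).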

Dataset | AWS EC2 Instance | # CPU cores | Computation Time
10^3 clones | m5.2xlarge | 8 | ∼2 sec
10^4 clones | m5.2xlarge | 8 | ∼36 sec
10^5 clones | m5.2xlarge | 8 | ∼36 min
10^6 clones | m5.16xlarge | 96 | ∼3.3 hr

Table S1. Computational time for MAD-HYPE analysis on simulated datasets.

Experiment | # Unique α-chains | # Unique β-chains | AWS EC2 Instance | # CPU cores | Computation Time
Experiment 1 | 1.49 · 10^4 | 1.07 · 10^4 | m5.2xlarge | 8 | ∼10 min
Experiment 2 | 1.46 · 10^6 | 7.27 · 10^5 | m5.2xlarge | 96 | ∼24 hr


Lee, E. S., Thomas, P. G., Mold, J. E., and Yates, A. J. (2017). Identifying T cell receptors from high-throughput sequencing: dealing with promiscuity in TCRα and TCRβ pairing. PLoS Computational Biology, 13(1):e1005313.

Sherwood, A. M., Emerson, R. O., Scherer, D., Habermann, N., Buck, K., Staffa, J., Desmarais, C., Halama, N., Jaeger, D., Schirmacher, P., et al. (2013). Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer Immunology, Immunotherapy, 62(9):1453–1461.

Zheng, C., Zheng, L., Yoo, J.-K., Guo, H., Zhang, Y., Guo, X., Kang, B., Hu, R., Huang, J. Y., Zhang, Q., et al. (2017). Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Cell, 169(7):1342–1356.
