SP4 2005 Commissioned Research - Project 30
Development of decision support systems for sampling germplasm
CIRAD – Agropolis
UR75 Biometrics and Computer science unit
J.P. Jacquemoud-Collet jean-pierre.jacquemoud-collet@cirad.fr
Xavier Perrier (task leader) xavier.perrier@cirad.fr
UMR PIA
Claire Billot claire.billot@cirad.fr
Brigitte Courtois brigitte.courtois@cirad.fr
Monique Deu monique.deu@cirad.fr
Jean-François Rami jean-francois.rami@cirad.fr
IPGRI SGRP
Samy Gaiji s.gaiji@cgiar.org
Rajesh Sood r.sood@cgiar.org
WUR Biometris
Marco Bink marco.bink@wur.nl
Rationale and Objective
address the needs formulated by SP1 scientists
- localize genes involved in agronomic traits, using molecular markers as tags
- detection by association mapping, based on disequilibrium between linked loci
- on germplasm collections, to cover a large allelic diversity
but…
composite collections = complex pools of genetically differentiated objects:
wild ancestors, relatives, landraces, inbred lines, elite material …
accumulating various demographic/breeding events:
selection, genetic drift, bottleneck, founder effects …
→ Structured populations with disturbed balances of alleles generating spurious associations and disequilibria even between unlinked loci
Practical constraint: phenotyping is slow and expensive and can cover only a small part of the collection
Combining the two problems:
how can the necessary sampling be used to minimize disequilibria due to structure?
Two approaches to sampling in a collection (unlinked markers):
1. Starting from a diversity tree, extract an unstructured sub tree
2. Starting from disequilibria observed between loci, extract a sample minimizing these disequilibria
1. maximal length sub-tree algorithm
Structures = over-representation of some groups → redundancy between units
Deleting redundant units is expected to reduce the global disequilibrium (a posteriori control)
→ Search for a star-like subtree by successive pruning of redundant accessions, and of maximal length to maintain the allelic diversity
Algorithm
- build a tree with a convenient method
- estimate distances between accessions in the tree
- select the pair of accessions of minimal distance
- prune the accession with the smallest edge
- iterate on this subtree
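The pruning loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DARwin implementation: the tree is given as a hypothetical adjacency dictionary `{node: {neighbour: edge_length}}`, accessions are the leaves, and the function names are invented for this example.

```python
def path_lengths_from(adj, start):
    """Tree path length from `start` to every other node (iterative DFS)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def prune_leaf(adj, leaf):
    """Remove a leaf; if its parent is left with two edges, merge them."""
    (parent, _), = adj[leaf].items()
    del adj[leaf]
    del adj[parent][leaf]
    if len(adj[parent]) == 2:
        (a, wa), (b, wb) = adj[parent].items()
        del adj[parent]
        del adj[a][parent]
        del adj[b][parent]
        adj[a][b] = adj[b][a] = wa + wb


def max_length_subtree(adj, target_size):
    """Prune redundant leaves until `target_size` leaves remain."""
    if target_size < 2:
        raise ValueError("target_size must be at least 2")
    adj = {u: dict(nb) for u, nb in adj.items()}  # work on a copy
    while True:
        leaves = sorted(u for u in adj if len(adj[u]) == 1)
        if len(leaves) <= target_size:
            return leaves
        # pair of leaves at minimal tree distance
        best = None
        for a in leaves:
            dist = path_lengths_from(adj, a)
            for b in leaves:
                if a < b and (best is None or dist[b] < best[0]):
                    best = (dist[b], a, b)
        _, a, b = best
        # within that pair, prune the leaf with the shorter terminal edge
        edge = lambda x: next(iter(adj[x].values()))
        prune_leaf(adj, a if edge(a) <= edge(b) else b)


# Toy tree: leaves A, B on internal node X; leaves C, D on internal node Y.
tree = {
    "A": {"X": 1.0}, "B": {"X": 2.0},
    "C": {"Y": 5.0}, "D": {"Y": 6.0},
    "X": {"A": 1.0, "B": 2.0, "Y": 3.0},
    "Y": {"C": 5.0, "D": 6.0, "X": 3.0},
}
print(max_length_subtree(tree, 3))
```

Recomputing all leaf-to-leaf distances at each step keeps the sketch short; a real implementation would update them incrementally.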
2nd step
- Assumption of independence between markers does not hold for linked markers:
  a system of weights based on map distances, accounting for correlation among linked loci
- Joint analysis of several sets of variables of different nature (molecular, DNA, morphological...) and different type (ordinal, nominal, binary…)
  - global dissimilarity as a weighted sum of partial standardized dissimilarities
  - phenotypic diversity conditional on genetic diversity as inferred from molecular markers
→ Map-based weighting algorithm
→ Function wgtdaisy: extension of function daisy in S-Plus (or R)
→ Algorithm of agglomerative classification under topological constraints (Darwin)
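A global dissimilarity built as a weighted sum of standardized partial dissimilarities could look like the sketch below. It is not the wgtdaisy function. The standardization used here (dividing each partial matrix by its mean off-diagonal value) is an assumption made for illustration, since the slides do not specify one, and the function name is hypothetical.

```python
def global_dissimilarity(partials, weights):
    """Weighted sum of partial dissimilarity matrices.

    partials: list of square symmetric matrices (lists of lists), one per
              set of variables (e.g. molecular, morphological).
    weights:  one non-negative weight per matrix; rescaled to sum to 1.
    Assumption: each matrix is standardized by its mean off-diagonal value.
    """
    n = len(partials[0])
    total_w = float(sum(weights))
    out = [[0.0] * n for _ in range(n)]
    for d, w in zip(partials, weights):
        off = [d[i][j] for i in range(n) for j in range(i + 1, n)]
        scale = sum(off) / len(off)
        if scale == 0:
            raise ValueError("a partial matrix has no variation")
        for i in range(n):
            for j in range(n):
                out[i][j] += (w / total_w) * d[i][j] / scale
    return out


# Two partial matrices on three accessions, measured on very different scales.
d_molecular = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
d_morpho = [[0, 10, 10], [10, 0, 40], [10, 40, 0]]
combined = global_dissimilarity([d_molecular, d_morpho], [1, 1])
```

After standardization both matrices have a mean off-diagonal value of 1, so neither set of variables dominates the sum only because of its scale.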
1. maximal length sub-tree
2. Min SD sampling
Disequilibrium = LD ‘physical’ component (linkage) + SD ‘structural’ component (structure)
independent markers: only the structural component
→ a sample that minimizes the observed disequilibrium
stepwise algorithm removing at each step the accession that contributes most to the disequilibrium
Linkage disequilibrium (LD) measures (haplotypes or known phase)

> diallelic loci

Contingency table of loci A and B:

        B1     B2
A1      n11    n12    n1.
A2      n21    n22    n2.
        n.1    n.2    n

statistical equilibrium:
\[ n_{ij} = \frac{n_{i.}\, n_{.j}}{n} \iff \frac{n_{11}}{n_{12}} = \frac{n_{21}}{n_{22}} \iff n_{11} n_{22} - n_{12} n_{21} = 0 \]
\[ d = n_{11} n_{22} - n_{12} n_{21} \]
depends on allele frequencies
several LD measures →
\[ d_{max} = \begin{cases} \min(n_{1.} n_{.2},\; n_{2.} n_{.1}) & \text{if } d > 0 \\ \min(n_{1.} n_{.1},\; n_{2.} n_{.2}) & \text{if } d < 0 \end{cases} \qquad D' = \frac{d}{d_{max}} \qquad r^2 = \frac{d^2}{n_{1.}\, n_{2.}\, n_{.1}\, n_{.2}} \]

> multiallelic loci (for each pair of alleles i, j: 2x2 table i / non-i by j / non-j)
\[ D' = \sum_{i=1}^{K} \sum_{j=1}^{L} p_i\, p_j\, |D'_{ij}| \]

2. Min SD sampling
LD measures cannot be used directly for SD measures

Observed table:

        B1     B2
A1      14     16     30
A2      46     24     70
        60     40     100

d = 14 × 24 − 46 × 16 = −4 × 100

Linkage disequilibrium
. allelic frequencies are fixed

. equilibrium
        B1            B2
A1      14+4 = 18     16−4 = 12     30
A2      46−4 = 42     24+4 = 28     70
        60            40            100

. maximum
        B1     B2
A1      0      30     30
A2      60     10     70
        60     40     100

standardization: dmax = 1800

Structure disequilibrium
. allelic frequencies are sample dependent

. nearest equilibrium
        B1     B2
A1      14     16−7   23
A2      46     24     70
        60     33     93

. maximum
        B1     B2
A1      50     0      50
A2      0      50     50
        50     50     100

standardization: dmax = 2500
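For the diallelic case, d, D' and r² follow directly from the four cell counts. The short Python sketch below applies the formulas to the worked table (14, 16, 46, 24); the function name is hypothetical and all four margins are assumed to be non-zero.

```python
def ld_measures(n11, n12, n21, n22):
    """Return (d, d_max, D', r2) for a 2x2 table of haplotype counts.

    Assumes every row and column total is non-zero.
    """
    n1_, n2_ = n11 + n12, n21 + n22   # row totals n1., n2.
    n_1, n_2 = n11 + n21, n12 + n22   # column totals n.1, n.2
    d = n11 * n22 - n12 * n21
    if d > 0:
        d_max = min(n1_ * n_2, n2_ * n_1)
    else:
        d_max = min(n1_ * n_1, n2_ * n_2)
    d_prime = d / d_max if d_max else 0.0
    r2 = d * d / (n1_ * n2_ * n_1 * n_2)
    return d, d_max, d_prime, r2


d, d_max, d_prime, r2 = ld_measures(14, 16, 46, 24)
print(d, d_max)   # -400 1800
```

With the margins held fixed, the maximum is 1800, the value quoted for the linkage case.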
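Structure disequilibrium

- biallelic loci
\[ d = n_{11} n_{22} - n_{12} n_{21}, \qquad SD = \frac{d}{d_{max}} \]
Standardization: maximum reached for the table

n/2   0
0     n/2

\[ d_{max} = \frac{n^2}{4} \]

- multi allelic loci
Two loci I and J: sum on all 2x2 subtables of 2 alleles of I and 2 alleles of J
\[ d = \sum_{i} \sum_{i' \neq i} \sum_{j} \sum_{j' \neq j} \left( n_{ij}\, n_{i'j'} - n_{ij'}\, n_{i'j} \right) \]
Standardization: maximum reached for a diagonal table, e.g. with 3 alleles

n/3   0     0
0     n/3   0
0     0     n/3

\[ d_{max} = 3 \cdot \frac{n}{3} \cdot \frac{n}{3} = \frac{n^2}{3} \]
and in general
\[ d_{max} = \frac{1}{2}\, NA\, (NA - 1) \left( \frac{n}{NA} \right)^2 \quad \text{with } NA = \min(K, L) \]
depends on K and L, sensitive to rare alleles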
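As a quick numerical check of the standardization, the sketch below computes the biallelic SD and the general d_max for NA alleles. The function names are hypothetical, and only the biallelic SD is implemented.

```python
def sd_max(n, na):
    """Maximum d for a sample of size n and NA = min(K, L) alleles."""
    return 0.5 * na * (na - 1) * (n / na) ** 2


def sd_biallelic(n11, n12, n21, n22):
    """Structure disequilibrium SD = d / d_max with d_max = n^2 / 4."""
    n = n11 + n12 + n21 + n22
    d = n11 * n22 - n12 * n21
    return d / sd_max(n, 2)


print(sd_biallelic(14, 16, 46, 24))   # -0.16
```

For NA = 2 the general formula reduces to n²/4, and for NA = 3 to n²/3, matching the two tables used for standardization. Unlike the LD case, the maximum does not depend on the margins of the observed table, only on the sample size.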
… the algorithm
- for a pair of loci I and J, contribution of accession k:
\[ X_{IJ}^{(k)} = SD_{IJ}^{+k} - SD_{IJ}^{-k} \]
- score of accession k:
\[ Sc_k = \sum_{I} \sum_{J} w_{IJ}\, X_{IJ}^{(k)} \quad \text{with } w_{IJ} = SD_{IJ}^{2} \text{ to favour reduction of highest disequilibria} \]
- remove k such that Sc_k is maximum
- reiterate on the sample
Algorithm in O(NM²), reduced to O(M²+N)
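The stepwise removal can be sketched in Python as below. It is a naive version for illustration only, under several assumptions that are not in the slides: loci are biallelic and coded 0/1, each accession is a tuple of alleles (homozygous lines), the absolute value of SD is used, and every score is recomputed from scratch, so it does not reach the reduced complexity quoted above. The function names are hypothetical.

```python
from itertools import combinations


def abs_sd(sample, i, j):
    """|SD| between biallelic loci i and j (alleles coded 0/1)."""
    n = len(sample)
    n11 = sum(1 for g in sample if g[i] == 1 and g[j] == 1)
    n12 = sum(1 for g in sample if g[i] == 1 and g[j] == 0)
    n21 = sum(1 for g in sample if g[i] == 0 and g[j] == 1)
    n22 = sum(1 for g in sample if g[i] == 0 and g[j] == 0)
    return abs(n11 * n22 - n12 * n21) / (n * n / 4)


def min_sd_sample(genotypes, target_size):
    """Indices of the accessions kept after stepwise removal."""
    if target_size < 2:
        raise ValueError("target_size must be at least 2")
    kept = list(range(len(genotypes)))
    pairs = list(combinations(range(len(genotypes[0])), 2))
    while len(kept) > target_size:
        sample = [genotypes[k] for k in kept]
        sd_with = {p: abs_sd(sample, *p) for p in pairs}

        def score(pos):
            # Sc_k = sum over locus pairs of w_IJ * (SD with k - SD without k)
            rest = sample[:pos] + sample[pos + 1:]
            return sum(sd_with[p] ** 2 * (sd_with[p] - abs_sd(rest, *p))
                       for p in pairs)

        worst = max(range(len(kept)), key=score)
        del kept[worst]
    return kept


# Two loci in strong structure: mostly (1,1) and (0,0) accessions.
genotypes = [(1, 1)] * 3 + [(0, 0)] * 3 + [(1, 0), (0, 1)]
kept = min_sd_sample(genotypes, 7)
```

In this toy data set the accession removed first is one of the over-represented concordant types, and the two discordant accessions, which pull the table towards equilibrium, are kept.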
Tools and application
Sorghum core-collection: 205 accessions (homozygotes), 69 mapped RFLP markers (2 to 5 alleles), (quasi) independent
[Figure: diversity tree of the Sorghum core-collection; leaves labelled with accession identifiers]
DARWIN software (platform Microsoft VisualStudio 6.0)
a lab product freely distributed
- dissimilarities for 0/1, qualitative, quantitative and sequence data; missing data options
- transformations, weighted sum
- dissimilarity properties
- factorial analysis on distance matrix
- tree distance methods: hierarchical clustering, NJ, Scores…, bootstraps
- NJ under topological constraints
- influent unit detection
- draw tree with editing tools; bitmap or vector copy
- fit criterion, pruning, grafting, edge contraction, grouping…
- tree distances
- consensus methods, max agreement subtree
- maximal length subtree, minimum SD
- online documentation
[Figure: Min SD sampling; SD with decreasing sample size, compared with the mean and 95-percentile of random samples; sample size = 100 marked]
- copy the graphic to the clipboard
- record the matrix of disequilibrium between pairs of loci
Input:
- accessions × markers file
- accession identifier file
- display info on input data files
- force or exclude accessions
- select loci
- record accession status in the sample
- display SD distributions for random sampling: step and drawing number
- minimal allele frequency
- percentile level
INPUT, options
Sorghum core-collection
[Figure: diversity tree of the Sorghum core-collection; leaves labelled with accession identifiers]
Sample of 100 accessions (in red)
largely under-sampled groups
comparison between max subtree and min SD strategies
[Figure: Sorghum core-collection; SD (0 to 0.6) against sample size (15 to 205) for min SD, max subtree and random samples (means and 95-percentile); sample size 100 marked]
Sorghum data (GCP data)
• 660 accessions (200 accessions in the RFLP data)
• 20 SSR loci
- 8 dinucleotides
- 10 trinucleotides
- 2 tetranucleotides
Before allele pooling:
- 272 alleles
- 4 to 30 alleles / locus (mean = 13.6)
- high number of rare alleles
- 200 alleles below 5%
- 132 alleles below 1%
- 651 genotypes
After allele pooling:
- 72 alleles
- 2 to 5 alleles / locus (mean = 3.6)
- no alleles below 5%
- 616 genotypes
[Figure: histogram of alleles 216 to 252 at locus xtxp15, pooled into classes I to V]
max subtree versus SD min strategies
in red the 200 sampled accessions
China group
a logical result but …
200 0.458 0.190 0.178 0.067
alleles: 64 / 72
660 Sorghum accessions / 20 SSR
Min SD sampling: sample 200 / 660
660 Sorghum accessions / 20 SSR
Max subtree sampling: sample 200 / 660
200 0.458 0.190 0.283 0.151
alleles: 67 / 72
660 Sorghum accessions / 20 SSR
two-step procedure:
- max subtree: sample 550 / 660 (alleles: 72 / 72)
- min SD: sample 200 / 550 (alleles: 67 / 72)
200 0.418 0.178 0.146 0.073
in red the 200 sampled accessions
in green: excluded accessions
Further developments within the 2005 project
Project in progress!
Methods
Tree construction
# bootstrap for dissimilarities on linked markers
# standardization for weighted sum of partial dissimilarities
Measure of disequilibrium when the phase is unknown
# estimation of haplotype frequencies from the data
Free software implementing the Expectation-Maximization algorithm (Hill, 1974): Arlequin, Haplo…
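For two biallelic loci, the EM estimation of haplotype frequencies from unphased genotypes can be sketched as follows. This is a textbook-style illustration, not the code of Arlequin or any other package. It assumes random mating (Hardy-Weinberg equilibrium), and the input layout `n[a][b]` (number of individuals carrying `a` copies of allele A and `b` copies of allele B) is chosen for this example.

```python
def em_haplotype_freqs(n, iters=200, tol=1e-10):
    """EM estimate of the frequencies of haplotypes AB, Ab, aB, ab.

    n[a][b]: count of individuals with a copies of A and b copies of B
             (a, b in 0..2). Only double heterozygotes n[1][1] are ambiguous.
    Assumes random mating (HWE).
    """
    total = 2 * sum(sum(row) for row in n)
    # haplotypes that can be counted directly from unambiguous genotypes
    fixed = {
        "AB": 2 * n[2][2] + n[2][1] + n[1][2],
        "Ab": 2 * n[2][0] + n[2][1] + n[1][0],
        "aB": 2 * n[0][2] + n[1][2] + n[0][1],
        "ab": 2 * n[0][0] + n[1][0] + n[0][1],
    }
    p = {h: 0.25 for h in fixed}
    for _ in range(iters):
        # E-step: share of double heterozygotes that are AB/ab (vs Ab/aB)
        cis, trans = p["AB"] * p["ab"], p["Ab"] * p["aB"]
        share = cis / (cis + trans) if cis + trans > 0 else 0.5
        dh = n[1][1]
        # M-step: re-estimate frequencies from expected haplotype counts
        new = {
            "AB": (fixed["AB"] + dh * share) / total,
            "ab": (fixed["ab"] + dh * share) / total,
            "Ab": (fixed["Ab"] + dh * (1 - share)) / total,
            "aB": (fixed["aB"] + dh * (1 - share)) / total,
        }
        if max(abs(new[h] - p[h]) for h in p) < tol:
            return new
        p = new
    return p


# 10 aabb and 10 AABB individuals, no heterozygotes: nothing is ambiguous.
freqs = em_haplotype_freqs([[10, 0, 0], [0, 0, 0], [0, 0, 10]])
```

The estimated frequencies can then be turned into expected haplotype counts and fed to the disequilibrium measures defined for known phase.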
Validation
compare samples defined from SSR and RFLP on Sorghum data:
sampling on SSR and test on RFLP disequilibrium, and conversely
test these samples on LD reduction for closely linked markers (4 cM)
Tools
management of missing data
optimise algorithms
direct import of files in GCP format
validate documentation and user interface
distribute the software through the GCP network
Proposal for commissioned 2006 projects
many SP1 projects use SSR markers
Problem of hypervariability of SSR markers (cf. poster)
# allele pooling:
- for each locus, aggregation on statistical kernels
- minimizing the loss of mutual information between pairs of loci
- minimizing tree structure perturbations
# fuzzy code?
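The mutual information criterion for allele pooling can be illustrated with a short Python sketch. It only shows how the loss caused by a candidate pooling might be measured on a two-locus contingency table; the pooling strategy itself is still an open question in the project, and the function names and the toy table are hypothetical.

```python
from math import log


def mutual_information(table):
    """Mutual information (in nats) of a two-locus contingency table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij:
                mi += nij / n * log(nij * n / (rows[i] * cols[j]))
    return mi


def pool_rows(table, groups):
    """Pool alleles of the first locus; groups is a list of row-index lists."""
    width = len(table[0])
    return [[sum(table[i][j] for i in g) for j in range(width)]
            for g in groups]


# Locus I has 3 alleles (rows), locus J has 2 alleles (columns).
table = [[20, 0], [20, 0], [0, 40]]
full = mutual_information(table)
loss_a = full - mutual_information(pool_rows(table, [[0, 1], [2]]))
loss_b = full - mutual_information(pool_rows(table, [[0], [1, 2]]))
```

In this toy table, pooling the first two alleles loses no information about locus J, while pooling the last two does, so the first pooling would be preferred under this criterion.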
haplotype frequency estimations when phase is unknown
HWE assumption for EM algorithms does not hold
coupling disequilibrium estimations and haplotype frequency estimations
Bayesian approach?
polyploidy
[Figure: histogram of alleles 216 to 252 at locus xtxp15, pooled into classes I to V]
… but there are still some pending methodological questions for sampling and
2006 project
Chennai Workshop SP1 “Molecular Markers for Allele Mining”, Aug 22-26 2005
Software for efficient and effective sampling of germplasm
DARwinN
SP1 scientists desire support in the sampling of germplasm and in statistical analysis, especially association analysis.
New project aimed at supporting the SP1 scientists:
Proposal for 2006 (P.I. Marco Bink WUR)
Helpdesk to support SP1 projects
- in sampling of germplasm, using the tools developed in the current project
- but also in association mapping analyses that should follow the sampling step
• a contact person at WUR who
- either helps the SP1 researcher directly
- or directs the researcher to an appropriate other person for:
  short consultancies on data analysis strategies
  help on tools and software
  improvement or customization of tools…
- is responsible for proper feedback
• a website: manuals, examples, links to software, exchanges between SP1 researchers…
Involved institutions: WUR (+ CIRAD and other resource persons)
HOWEVER
- Helpdesk / websites already present at other SPs
- Association mapping courses also organised by CIMMYT (Nov ‘05, Nairobi)
ALSO
- Consultancy/support to SP1 scientists -> Doing the analysis for SP1 …..
- Proper Linkage Disequilibrium mapping software still lacking