Development of decision support systems for sampling germplasm

(1)


SP4 2005 Commissioned Research - Project 30

Development of decision support systems for sampling germplasm

CIRAD – Agropolis

UR75 Biometrics and Computer science unit

J.P. Jacquemoud-Collet jean-pierre.jacquemoud-collet@cirad.fr

Xavier Perrier (task leader) xavier.perrier@cirad.fr

UMR PIA

Claire Billot claire.billot@cirad.fr

Brigitte Courtois brigitte.courtois@cirad.fr

Monique Deu monique.deu@cirad.fr

Jean-François Rami jean-francois.rami@cirad.fr

IPGRI SGRP

Samy Gaiji s.gaiji@cgiar.org

Rajesh Sood r.sood@cgiar.org

WUR Biometris

Marco Bink marco.bink@wur.nl

(2)

Rationale and Objective

address the needs formulated by SP1 scientists

- localize genes involved in agronomic traits, using molecular markers as tags

- detection by association mapping

based on disequilibrium between linked loci

on germplasm collections to track a large allelic diversity

but…

composite collections = complex pools of genetically differentiated objects:

wild ancestors, relatives, landraces, inbred lines, elite material …

accumulating various demographic/breeding events:

selection, genetic drift, bottleneck, founder effects …

→ Structured populations with disturbed balances of alleles generating spurious associations and disequilibria even between unlinked loci

(3)


Rationale and Objective

Practical constraint: phenotyping is slow and expensive and can cover only a small part of the collection

Combining the two problems:

how to take advantage of the necessary sampling to minimize disequilibria due to structures?

Two approaches to sample in a collection / unlinked markers

1. Starting from a diversity tree, extract an unstructured sub tree

2. Starting from disequilibria observed between loci, extract a sample minimizing these disequilibria

(4)

1. maximal length sub-tree algorithm

Structures = over representation of some groups → redundancy between units

Deleting redundant units is expected to reduce the global disequilibrium (a posteriori control)

→

Search for a star-like subtree by successive pruning of redundant accessions

and of maximal length to maintain the allelic diversity

→ Algorithm

- build a tree with a convenient method

- estimate distances between accessions in the tree

- select the pair of accessions of minimal distance

- prune the accession with the smallest edge

- iterate on this subtree
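The pruning steps can be sketched as follows. This is a minimal illustration, not the DARwin implementation: it assumes the tree is given as an adjacency dictionary with edge lengths, that accessions are the leaves, and that "distance" means path length in the tree. The function names (`max_length_subtree`, `prune_leaf`) are hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Accessions are taken to be the nodes with a single neighbour."""
    return sorted(n for n, nb in tree.items() if len(nb) == 1)

def distances_from(tree, start):
    """Path length from `start` to every other node of the tree."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nb, length in tree[node].items():
            if nb not in dist:
                dist[nb] = dist[node] + length
                stack.append(nb)
    return dist

def prune_leaf(tree, leaf):
    """Remove a leaf; if its parent is left with two edges, merge them."""
    (parent,) = tree[leaf]
    del tree[leaf]
    del tree[parent][leaf]
    if len(tree[parent]) == 2:
        (a, la), (b, lb) = tree[parent].items()
        del tree[parent]
        del tree[a][parent]
        del tree[b][parent]
        tree[a][b] = tree[b][a] = la + lb

def max_length_subtree(tree, n_keep):
    """Prune the closest pair's shorter-edged accession until n_keep remain."""
    tree = {n: dict(nb) for n, nb in tree.items()}  # work on a copy
    while len(leaves(tree)) > n_keep:
        lv = leaves(tree)
        dist = {a: distances_from(tree, a) for a in lv}
        a, b = min(combinations(lv, 2), key=lambda p: dist[p[0]][p[1]])
        edge = lambda x: next(iter(tree[x].values()))  # terminal edge length
        prune_leaf(tree, a if edge(a) < edge(b) else b)
    return leaves(tree)

# Toy tree: A and B are close neighbours, B sits on the shortest edge.
toy = {
    "A": {"u": 1.0}, "B": {"u": 0.2}, "C": {"v": 2.0}, "D": {"v": 3.0},
    "u": {"A": 1.0, "B": 0.2, "v": 1.0},
    "v": {"u": 1.0, "C": 2.0, "D": 3.0},
}
kept = max_length_subtree(toy, 3)
```

On the toy tree the closest pair is (A, B), and B is pruned because its edge is the shortest. Recomputing all leaf distances at each step is deliberately naive and only suitable for small trees.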

2nd step

(5)


• The assumption of independence between markers does not hold for linked markers:

a system of weights based on map distances takes into account the correlation among linked loci

• Joint analysis of several sets of variables of different nature (molecular, DNA, morphological...) and different type (ordinal, nominal, binary…)

- global dissimilarity as a weighted sum of partial standardized dissimilarities

- phenotypic diversity conditionally to genetic diversity as inferred from molecular markers
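A weighted sum of partial standardized dissimilarities could look like the sketch below. The slides do not say how each partial matrix is standardized, so division by the largest entry is an assumption made here for illustration, as is the normalization of the weights. The names `standardize` and `global_dissimilarity` are hypothetical and are not the `wgtdaisy` interface.

```python
def standardize(matrix):
    """Scale a dissimilarity matrix by its largest entry (assumed convention)."""
    top = max(max(row) for row in matrix)
    return [[v / top if top else 0.0 for v in row] for row in matrix]

def global_dissimilarity(partials, weights):
    """Weighted mean of the standardized partial dissimilarity matrices."""
    total_w = sum(weights)
    scaled = [standardize(m) for m in partials]
    n = len(partials[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, scaled)) / total_w
         for j in range(n)]
        for i in range(n)
    ]

# Two partial matrices on three accessions, on very different scales.
molecular = [[0, 2, 4], [2, 0, 2], [4, 2, 0]]
morphological = [[0, 10, 5], [10, 0, 5], [5, 5, 0]]
combined = global_dissimilarity([molecular, morphological], [1, 1])
```

Standardizing first keeps a variable set measured on a large scale from dominating the sum; the weights then express the importance given to each set.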

→ Map-based weighting algorithm

→ Function wgtdaisy

extension of function daisy in S-Plus (or R)

→ Algorithm of agglomerative classification under topological constraints (Darwin)

1. maximal length sub-tree

(6)

2. Min SD sampling

Disequilibrium = LD ‘physical’ component (linkage) + SD ‘structural’ component (structure)

independent markers: only the structural component

→

a sample that minimizes the observed disequilibrium

stepwise algorithm removing at each step the accession

(7)

2. Min SD sampling

Linkage disequilibrium (LD) measures (haplotypes or known phase)

• diallelic loci

Contingency table of loci A and B:

          B1      B2      total
A1        n_11    n_12    n_1.
A2        n_21    n_22    n_2.
total     n_.1    n_.2    n

statistical equilibrium:

$$ n_{ij} = \frac{n_{i.}\, n_{.j}}{n} \iff \frac{n_{11}}{n_{21}} = \frac{n_{12}}{n_{22}} \iff n_{11} n_{22} - n_{12} n_{21} = 0 $$

$$ d = n_{11} n_{22} - n_{12} n_{21} $$

depends on allele frequencies

several LD measures →

$$ d_{\max} = \begin{cases} \min(n_{1.} n_{.2},\; n_{2.} n_{.1}) & \text{if } d > 0 \\ \min(n_{1.} n_{.1},\; n_{2.} n_{.2}) & \text{if } d < 0 \end{cases} \qquad D' = \frac{d}{d_{\max}} $$

$$ r^2 = \frac{d^2}{n_{1.}\, n_{2.}\, n_{.1}\, n_{.2}} $$

• multiallelic loci: each pair of alleles (i, j) is tabulated as i / non-i against j / non-j

$$ D' = \sum_{i=1}^{K} \sum_{j=1}^{L} p_i\, p_j\, |D'_{ij}| $$
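As a small worked example of the diallelic measures, the sketch below computes d, d_max, D' and r² from the four cell counts of a 2×2 table. The function name `ld_measures` and the returned dictionary layout are illustrative choices; returning 0 for D' when d = 0 is a convention assumed here.

```python
def ld_measures(n11, n12, n21, n22):
    """d, d_max, D' and r^2 for a 2x2 table of haplotype counts."""
    r1, r2 = n11 + n12, n21 + n22   # row totals n_1., n_2.
    c1, c2 = n11 + n21, n12 + n22   # column totals n_.1, n_.2
    d = n11 * n22 - n12 * n21
    if d > 0:
        d_max = min(r1 * c2, r2 * c1)
    elif d < 0:
        d_max = min(r1 * c1, r2 * c2)
    else:
        d_max = 0
    d_prime = d / d_max if d_max else 0.0   # assumed convention when d = 0
    denom = r1 * r2 * c1 * c2
    r_sq = d * d / denom if denom else 0.0
    return {"d": d, "d_max": d_max, "D'": d_prime, "r2": r_sq}

# Table with rows (24, 46) and (16, 14), n = 100.
observed = ld_measures(24, 46, 16, 14)
# Same margins, 4 units moved between cells: statistical equilibrium.
balanced = ld_measures(28, 42, 12, 18)
```

For the first table d is negative, so the second branch of d_max applies; the second table has the same margins and d = 0.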

(8)

2. Min SD sampling

LD measures cannot be used directly for SD measures

Observed table:

          B1    B2    total
A1        24    46     70
A2        16    14     30
total     40    60    100

d = 24 × 14 − 46 × 16 = −400, i.e. |d| = 4 × 100

Linkage disequilibrium

. allelic frequencies are fixed

. equilibrium (margins unchanged):

          B1      B2      total
A1        24+4    46−4     70
A2        16−4    14+4     30
total     40      60      100

. maximum (margins unchanged):

          B1    B2    total
A1        10    60     70
A2        30     0     30
total     40    60    100

standardization: d_max = 1800

Structure disequilibrium

. allelic frequencies are sample dependent

. nearest equilibrium:

          B1      B2    total
A1        24      46     70
A2        16−7    14     23
total     33      60     93

. maximum:

          B1    B2    total
A1        50     0     50
A2         0    50     50
total     50    50    100

standardization: d_max = 2500

(9)

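2. Min SD sampling

Structure disequilibrium

- biallelic loci

$$ d = n_{11} n_{22} - n_{12} n_{21}, \qquad SD = \frac{d}{d_{\max}} $$

Standardization: maximum for the table

n/2   0
0     n/2

$$ d_{\max} = \frac{n^2}{4} $$

- multi allelic loci

Two loci I and J: sum on all 2x2 subtables of 2 alleles of I and 2 alleles of J

$$ d = \sum_{i} \sum_{i' \neq i} \sum_{j} \sum_{j' \neq j} \left| n_{ij}\, n_{i'j'} - n_{ij'}\, n_{i'j} \right| $$

Standardization: maximum for the diagonal table, here with 3 alleles

n/3   0     0
0     n/3   0
0     0     n/3

$$ d_{\max} = 3 \cdot \frac{n}{3} \cdot \frac{n}{3} = \frac{n^2}{3} $$

and in general

$$ d_{\max} = \frac{NA\,(NA-1)}{2} \left( \frac{n}{NA} \right)^2 \quad \text{with } NA = \min(K, L) $$

depends on K and L, sensitive to rare alleles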
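The structure disequilibrium of a K × L table can be sketched as below. Two points are assumptions of this sketch rather than statements from the slides: each 2×2 subtable is counted once (unordered pairs of alleles), and the absolute value of each subtable determinant is summed. With these choices the diagonal tables reach exactly the d_max given by the standardization formula. The function name is hypothetical.

```python
from itertools import combinations

def structure_disequilibrium(table):
    """Return (d, d_max, SD) for a K x L table of counts between two loci."""
    k, l = len(table), len(table[0])
    n = sum(map(sum, table))
    na = min(k, l)
    if na < 2 or n == 0:
        raise ValueError("need at least two alleles per locus and n > 0")
    # Sum over every 2x2 subtable (alleles i, i2 of I and j, j2 of J).
    d = sum(
        abs(table[i][j] * table[i2][j2] - table[i][j2] * table[i2][j])
        for i, i2 in combinations(range(k), 2)
        for j, j2 in combinations(range(l), 2)
    )
    d_max = na * (na - 1) / 2 * (n / na) ** 2
    return d, d_max, d / d_max

biallelic_max = structure_disequilibrium([[50, 0], [0, 50]])
triallelic_max = structure_disequilibrium(
    [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
)
example = structure_disequilibrium([[24, 46], [16, 14]])
```

The first two tables are the maximal cases (SD = 1); the third gives |d| = 400 against d_max = 2500.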

(10)

2. Min SD sampling

…the algorithm

• for a pair of loci I and J, contribution of accession k:

$$ X_k^{(IJ)} = SD_{IJ}^{+k} - SD_{IJ}^{-k} $$

• score of accession k:

$$ Sc_k = \sum_{I} \sum_{J} w_{IJ}\, X_k^{(IJ)} $$

with $w_{IJ} = SD_{IJ}^2$ to favour reduction of highest disequilibria

• remove k such that $Sc_k$ is maximum

• reiterate on the sample

Algorithm in O(NM²), reduced to O(M²+N)
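A naive version of the stepwise removal can be sketched as follows. It is an illustration under stated assumptions: accessions are tuples of alleles (homozygous or phased data), SD is computed with absolute values over unordered 2×2 subtables, and the contribution of accession k is the SD of the current sample minus the SD of the sample without k. The function names are hypothetical, and the code recomputes every SD at each step instead of using the reduced-complexity update mentioned on the slide.

```python
from collections import Counter
from itertools import combinations

def sd_pair(rows, i, j):
    """Standardized structure disequilibrium between loci i and j."""
    counts = Counter((r[i], r[j]) for r in rows)
    a_i = sorted({a for a, _ in counts})
    a_j = sorted({b for _, b in counts})
    na = min(len(a_i), len(a_j))
    if na < 2:
        return 0.0  # a monomorphic locus carries no disequilibrium
    d = sum(
        abs(counts[a, b] * counts[a2, b2] - counts[a, b2] * counts[a2, b])
        for a, a2 in combinations(a_i, 2)
        for b, b2 in combinations(a_j, 2)
    )
    return d / (na * (na - 1) / 2 * (len(rows) / na) ** 2)

def min_sd_sample(genotypes, target_size):
    """Remove, one at a time, the accession with the highest score."""
    keep = list(range(len(genotypes)))
    pairs = list(combinations(range(len(genotypes[0])), 2))
    while len(keep) > target_size:
        rows = [genotypes[k] for k in keep]
        sd_all = {p: sd_pair(rows, *p) for p in pairs}

        def score(k):
            rest = [genotypes[m] for m in keep if m != k]
            # weight SD^2 times the drop in SD obtained by removing k
            return sum(
                sd_all[p] ** 2 * (sd_all[p] - sd_pair(rest, *p))
                for p in pairs
            )

        keep.remove(max(keep, key=score))
    return keep

# Five accessions, two biallelic loci; one (0, 0) genotype is redundant.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0)]
sample = min_sd_sample(data, 4)
```

In the toy data, removing either copy of (0, 0) brings the table to equilibrium; ties are broken in favour of the first accession.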

(11)


Tools and application

Sorghum core-collection: 205 accessions (homozygotes), 69 mapped RFLP markers (2 to 5 alleles), (quasi) independent

[Figure: tree of the sorghum core-collection accessions; leaf labels omitted]

(12)

DARWIN software (platform Microsoft VisualStudio 6.0)

a lab product freely distributed

- dissimilarities for 0/1, qualitative, quantitative, sequences; missing data options

- transformations, weighted sum

- dissimilarity properties

- factorial analysis on distance matrix

- tree distance methods: hierarchical clustering, NJ, Scores…, bootstraps

- NJ under topological constraints

- draw tree with editing tools; bitmap or vectorial copy

- fit criterion, pruning, grafting, edges contraction, grouping…

- influential unit detection

- tree distances

- consensus methods, max agreement subtree

- maximal length subtree, minimum SD

- on line documentation

(13)

Min SD sampling

[Figure: SD with decreasing sample size for min SD sampling, compared with the mean and 95-percentile of random samples; sample size = 100 marked]
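The random reference curves (mean and 95-percentile) can be approximated by repeated random draws, as in this sketch. Several points are assumptions: the summary statistic is taken to be the mean SD over all pairs of loci, SD uses absolute values over unordered 2×2 subtables, and the percentile is read from the sorted draws without interpolation. The function names and default parameters are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def sd_pair(rows, i, j):
    """Standardized structure disequilibrium between loci i and j."""
    counts = Counter((r[i], r[j]) for r in rows)
    a_i = sorted({a for a, _ in counts})
    a_j = sorted({b for _, b in counts})
    na = min(len(a_i), len(a_j))
    if na < 2:
        return 0.0
    d = sum(
        abs(counts[a, b] * counts[a2, b2] - counts[a, b2] * counts[a2, b])
        for a, a2 in combinations(a_i, 2)
        for b, b2 in combinations(a_j, 2)
    )
    return d / (na * (na - 1) / 2 * (len(rows) / na) ** 2)

def mean_sd(rows):
    """Mean SD over all pairs of loci (assumed summary statistic)."""
    vals = [sd_pair(rows, i, j)
            for i, j in combinations(range(len(rows[0])), 2)]
    return sum(vals) / len(vals)

def random_baseline(genotypes, size, draws=200, level=0.95, seed=0):
    """Mean and percentile of mean SD over random samples of a given size."""
    rng = random.Random(seed)
    vals = sorted(mean_sd(rng.sample(genotypes, size)) for _ in range(draws))
    idx = min(draws - 1, int(level * draws))
    return sum(vals) / draws, vals[idx]

# Two fully differentiated groups: strong structure between the two loci.
structured = [(0, 0)] * 10 + [(1, 1)] * 10
mean_value, p95_value = random_baseline(structured, 8, draws=20, seed=1)
```

A sample selected by an algorithm is of interest when its SD falls clearly below these random references for the same sample size.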

(14)

INPUT, options

Input:

- accessions × markers file

- accession identifier file

- display info on input datafiles

- force or exclude accessions

- select loci

- minimal allele frequency

- display SD distributions for random sampling: step and drawing number

- percentile level

- record accession status in the sample

- record matrix of disequilibrium between pairs of loci

- copy the graphic to the clipboard

(15)

Sorghum core-collection

Sample of 100 accessions (in red), largely under-sampled groups

[Figure: tree of the sorghum core-collection accessions with the sampled accessions highlighted; leaf labels omitted]

(16)

(17)


comparison between max subtree and min SD strategies

[Figure: Sorghum core-collection, SD (0 to 0.6) against sample size (15 to 205) for the min SD and max subtree strategies, with the random means and 95-percentile; sample size 100 marked]

(18)

max subtree versus SD min strategies

Sorghum data (GCP data)

• 660 accessions

(200 accessions in the RFLP data)

• 20 SSR loci

- 8 dinucleotides

- 10 trinucleotides

- 2 tetranucleotides

Before allele pooling:

- 272 alleles

- 4 to 30 alleles / locus (mean = 13.6)

- high number of rare alleles: 200 alleles below 5%, 132 alleles below 1%

- 651 genotypes

allele pooling →

- 72 alleles

- 2 to 5 alleles / locus (mean = 3.6)

- no alleles below 5%

- 616 genotypes

[Figure: allele distribution at locus xtxp15 (allele sizes 216 to 252), with pooled classes I to V]

(19)

660 Sorghum accessions / 20 SSR

Min SD sampling: sample 200 / 660

in red the 200 sampled accessions

China group

a logical result but …

alleles: 64 / 72

[Figure values at sample size 200: 0.458, 0.190, 0.178, 0.067]

(20)

660 Sorghum accessions / 20 SSR

Max subtree sampling: sample 200 / 660

alleles: 67 / 72

[Figure values at sample size 200: 0.458, 0.190, 0.283, 0.151]
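Counts such as "alleles: 67 / 72" compare the alleles present in the sample with those present in the whole collection. A minimal way to compute such a count is sketched below, assuming one allele per locus and per accession and no missing data; the function name `allele_coverage` is hypothetical.

```python
def allele_coverage(genotypes, sample_idx):
    """Return (alleles in the sample, alleles in the collection), all loci."""
    n_loci = len(genotypes[0])
    total = sum(len({g[l] for g in genotypes}) for l in range(n_loci))
    kept = sum(
        len({genotypes[k][l] for k in sample_idx}) for l in range(n_loci)
    )
    return kept, total

# Four accessions, two loci with three alleles each.
collection = [(1, 1), (1, 2), (2, 3), (3, 3)]
coverage = allele_coverage(collection, [0, 1])
```

Here the sample made of the first two accessions keeps 3 of the 6 alleles.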

(21)

660 Sorghum accessions / 20 SSR

two-step procedure:

- max sub tree: sample 550 / 660 (alleles: 72 / 72)

- min SD: sample 200 / 550 (alleles: 67 / 72)

[Figure values: 200, 0.418, 0.178, 0.146, 0.073, 550]

in red the 200 sampled accessions

in green: excluded accessions

(22)

Further developments within the 2005 project

Project in progress!

Methods

Tree construction

# bootstrap for dissimilarities on linked markers

# standardization for weighted sum of partial dissimilarities

Measure of disequilibrium when the phase is unknown

# estimation of haplotype frequencies from the data

Free software implementing the Expectation-Maximization algorithm (Hill, 1974): Arlequin, Haplo…
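For two biallelic loci, the Expectation-Maximization idea can be sketched in a few lines. This is a textbook-style illustration, not the code of any of the packages named above: it assumes random union of haplotypes (Hardy-Weinberg equilibrium), a uniform starting point, and genotype counts stored in a 3×3 list `g` where `g[a][b]` is the number of individuals with `a` copies of allele A and `b` copies of allele B. Only double heterozygotes are ambiguous.

```python
def em_haplotypes(g, n_iter=50):
    """EM estimates of AB, Ab, aB, ab frequencies from unphased genotypes."""
    n_hap = 2 * sum(sum(row) for row in g)
    # Haplotypes that can be counted without ambiguity.
    base = {
        "AB": 2 * g[2][2] + g[2][1] + g[1][2],
        "Ab": 2 * g[2][0] + g[2][1] + g[1][0],
        "aB": 2 * g[0][2] + g[1][2] + g[0][1],
        "ab": 2 * g[0][0] + g[1][0] + g[0][1],
    }
    freq = dict.fromkeys(base, 0.25)
    double_het = g[1][1]
    for _ in range(n_iter):
        # E-step: share of double heterozygotes carrying AB/ab
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        share = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update the haplotype counts and frequencies
        counts = {
            "AB": base["AB"] + double_het * share,
            "ab": base["ab"] + double_het * share,
            "Ab": base["Ab"] + double_het * (1 - share),
            "aB": base["aB"] + double_het * (1 - share),
        }
        freq = {h: c / n_hap for h, c in counts.items()}
    return freq

# 4 aabb, 2 AaBb and 4 AABB individuals.
estimates = em_haplotypes([[4, 0, 0], [0, 2, 0], [0, 0, 4]])
```

In this example all unambiguous haplotypes are AB or ab, so the double heterozygotes are progressively assigned to AB/ab and the estimates approach 0.5 for each.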

(23)

Further developments within the 2005 project

Validation

compare samples defined from SSR and RFLP on Sorghum data

sampling on SSR and test on RFLP disequilibrium

and conversely

test these samples on LD reduction for closely linked markers (4 cM)

Tools

management of missing data

optimise algorithms

direct import of files in GCP format

validate documentation and user interface

distribute the software through GCP network

(24)

Proposal for commissioned 2006 projects

many SP1 projects use SSR markers

Problem of hypervariability of SSR markers (cf poster)

# allele pooling:

- for each locus, aggregation on statistical kernels

- minimizing the loss of mutual information between pairs of loci

- minimizing tree structure perturbations

# fuzzy code?

haplotype frequency estimations when phase is unknown

HWE assumption for EM algorithms does not hold

coupling disequilibrium estimations and haplotype frequency estimations

Bayesian approach?

polyploidy

[Figure: allele distribution at locus xtxp15 (allele sizes 216 to 252), with pooled classes I to V]

…but there are still some pending methodological questions for sampling and

(25)


(26)

2006 project

Chennai Workshop SP1 “Molecular Markers for Allele Mining”, Aug 22-26 2005

Software for efficient and effective sampling of germplasm

DARwinN

SP1 scientists desire support in the sampling of germplasm and in statistical analysis, especially association analysis.

New project aimed at supporting the SP1 scientists:

(27)


Proposal for 2006 (P.I. Marco Bink WUR)

Helpdesk to support SP1 projects

- in sampling of germplasm, using the tools developed in the current project

- but also in association mapping analyses that should follow the sampling step

• a contact person at WUR that

- either helps the SP1-researcher directly

- or directs him to an appropriate other person for short consultancies on data analysis strategies

help on tools and software

improvement or customization of tools…

- is responsible for proper feedback

• a website: manuals, examples, links to software, exchanges between SP1- researchers…

Involved institutions: WUR (+ CIRAD and other resource persons)

HOWEVER

- Helpdesk / websites already present at other SP’s

- Association mapping courses also organised by CIMMYT (Nov ‘05, Nairobi)

ALSO

- Consultancy/support to SP1 scientists -> Doing the analysis for SP1 …..

- Proper Linkage Disequilibrium mapping software still lacking