Analysis of Path-Based Approaches to Genomic Physical Mapping

Alan Rimm-Kaufman and James B. Orlin

WP #3742-94, November 1994; revised December 1994

A. Rimm-Kaufman, Operations Research Center, MIT, Cambridge, MA 02139, USA
J. B. Orlin, Sloan School of Management and the Whitehead Institute/MIT Center for Genome Research

Abstract

An integrated genomic physical map combines multiple sources of data to position landmarks and clones along a genome. "Path-based" strategies for constructing integrated maps employ paths of overlapping clones to bridge intervals between genetically mapped markers. There is a difficulty with path-based strategies: a small rate of clone overlap error may create numerous spurious paths and result in many false linkages in the physical map. We propose an analytical method to evaluate path-based approaches in light of this difficulty, and demonstrate the method on an integrated map of the human genome.

Integrated genomic physical maps combine many varieties of structural and positional data to order clones and landmarks along the genome.
Genetically-mapped sequence tagged sites (STSs) can provide a backbone for an integrated physical map, with overlapping yeast artificial chromosomes (YACs) used to provide coverage of the genetic intervals between adjacent markers. The overlaps between YACs may be detected through a variety of means including fingerprinting and hybridization data.

To discuss the application of our analysis to integrated physical maps, we first introduce some terminology. We consider a data set in which each STS is assigned to a genetic locus. A path of length $k$ between genetic locus $i$ and genetic locus $j$ is defined as two STSs, $s_i$ and $s_j$, and $k$ YACs, $y_1, y_2, \ldots, y_k$, such that (i) $s_i$ genetically maps to locus $i$, (ii) $s_j$ maps to locus $j$, (iii) $s_i$ lies in $y_1$ and $s_j$ lies in $y_k$ by STS content mapping, and (iv) for each step $(y_m, y_{m+1})$ the data indicate that the YACs $y_m$ and $y_{m+1}$ overlap. Paths of length 1 include only one YAC and correspond to traditional STS content mapping, while longer paths depend on the detection of YAC overlaps. An integrated map will assign a YAC to an interval connecting genetically close loci if there is a short path joining the two loci that contains the YAC.

A serious concern with such path-based mapping strategies is that falsely detected YAC overlaps can create short false paths between genetically-mapped STSs, resulting in spurious assignment of YACs to genetic intervals, and in spurious coverage of the genome.
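
The path definition above can be read as a shortest-path problem on a graph whose vertices are YACs and whose edges are declared overlaps. The following is a minimal sketch under that reading, not the procedure used by any particular mapping package; the function name, the dictionary layout, and the YAC labels are hypothetical.

```python
from collections import deque

def shortest_path_length(start_yacs, end_yacs, overlaps):
    """Return the number of YACs on a shortest path, or None if none exists.

    start_yacs: YACs containing an STS mapped to locus i (by STS content).
    end_yacs:   YACs containing an STS mapped to locus j.
    overlaps:   dict mapping each YAC to the set of YACs it is declared
                to overlap (hypothetical layout, assumed symmetric).
    """
    targets = set(end_yacs)
    # A path of length 1 is a single YAC, so every start YAC has length 1.
    dist = {y: 1 for y in start_yacs}
    queue = deque(dist)
    while queue:
        y = queue.popleft()
        if y in targets:
            return dist[y]
        for z in overlaps.get(y, ()):
            if z not in dist:
                dist[z] = dist[y] + 1
                queue.append(z)
    return None

# Hypothetical data: y1-y2 and y2-y3 are declared overlaps.
overlaps = {"y1": {"y2"}, "y2": {"y1", "y3"}, "y3": {"y2"}}
print(shortest_path_length({"y1"}, {"y3"}, overlaps))
```

A single falsely declared overlap adds an edge to this graph exactly as a true overlap does, which is why a breadth-first search alone cannot tell a valid path from a spurious one.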

[...] structures, paths of bounded length suffice to connect essentially all pairs of points. This phenomenon has recently gained popular attention through John Guare's award-winning play, "Six Degrees of Separation," in which it is asserted that any two people in the world can be connected through a path of at most six acquaintances. Within an integrated genomic physical map, this same phenomenon could occur even if the majority of the clone overlaps are valid. A small rate of falsely declared clone overlaps can lead to spurious coverage of large regions of the genome.

To address this concern, we developed a methodology to assess the probability that a shortest path joining two loci is correct. We then applied this methodology to the CEPH-Genethon 'first-generation' integrated physical map of the human genome (2). We describe our methodology in the context of this integrated physical map.

Briefly, the CEPH-Genethon physical mapping data involved screening the 33,000-clone CEPH mega-YAC library by two different methods, STS content mapping and Alu-PCR probe hybridization. In the first method, 2100 genetically-mapped STSs (3) were screened against the YAC library (with half the STSs screened completely and half screened partially to obtain 1-2 positives). In the second method, Alu-PCR products were prepared from 5300 individual YACs and were screened by hybridization against spotted Alu-PCR products from a subset of 25,000 YACs and monochromosomal hybrid cell lines. In addition, many YACs were subjected to [...]

A path of length $k$ between genetic locus $i$ and genetic locus $j$ is defined as before.
Two YACs are determined to overlap if (a) at least one of the two YACs was used as an Alu-PCR probe and hybridized to the other YAC, or (b) the two YACs had similar restriction fragment fingerprints (4). A chromosomally allowable path is defined as a path where (i) loci $i$ and $j$ lie on the same chromosome, $c$, and (ii) each YAC $y_m$ on the path that was used as an Alu-PCR probe either gave no signal when hybridized to the monochromosomal panel or hybridized to a set of chromosomes that included chromosome $c$. (For technical reasons, chromosomal assignments were not always unique: 50% could be assigned to a single chromosome, 19% hybridized to multiple chromosomes, and 31% could not be assigned to any chromosome.)
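
The two conditions for a chromosomally allowable path translate directly into a predicate. The sketch below assumes a hypothetical data layout in which `probe_hits` records, for each YAC used as an Alu-PCR probe, the set of chromosomes it hybridized to on the monochromosomal panel (an empty set meaning no signal); YACs not used as probes are simply absent. None of these names come from the CEPH-Genethon software.

```python
def is_chromosomally_allowable(path_yacs, chromosome_i, chromosome_j, probe_hits):
    """Check conditions (i) and (ii) for a chromosomally allowable path.

    path_yacs:    the YACs y_1..y_k on the path, in order.
    chromosome_i: chromosome carrying locus i.
    chromosome_j: chromosome carrying locus j.
    probe_hits:   dict YAC -> set of chromosomes hit on the panel
                  (hypothetical layout; empty set = no signal).
    """
    # Condition (i): both loci lie on the same chromosome c.
    if chromosome_i != chromosome_j:
        return False
    c = chromosome_i
    # Condition (ii): every probe YAC gave no signal or hit a set containing c.
    for yac in path_yacs:
        if yac in probe_hits:
            hits = probe_hits[yac]
            if hits and c not in hits:
                return False
    return True

# Hypothetical example: y2 hybridized to chromosomes 7 and 12, y3 gave no signal.
probe_hits = {"y2": {"7", "12"}, "y3": set()}
print(is_chromosomally_allowable(["y1", "y2", "y3"], "7", "7", probe_hits))
```

Because a probe with no signal, or with several chromosomal hits, never rules out a path, the filter is permissive: it removes only paths that are contradicted by an unambiguous panel result.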

The CEPH-Genethon integrated physical map was defined (2) to be the set of all chromosomally allowable paths of a specified length connecting loci within 10 cM. Neither analytic nor experimental analysis was advanced to suggest that most paths of a given length were correct. It was noted that as longer paths are allowed, the coverage of the genome increases. With paths of length 1, 3, 5 and 7 YACs, the strategy covered 11%, 30%, 70%, and 87%, respectively, of the total genetic length of the genome.
(The percentage coverage is defined in (2) as the proportion of total centiMorgans lying between connected loci. This is a conservative measure of coverage when restricted to valid paths, since the coverage does not account for ends of contigs that extend beyond the genetic loci. For example, YACs creating a path between two genetically indistinguishable STSs do not provide incremental coverage on the genome by this definition.)

We set out to evaluate the CEPH-Genethon paths using data from the February, 1994 CEPH-Genethon data release (5) and using the publicly released computer package QUICKMAP (6) which accompanies the CEPH-Genethon data. We first constructed the minimum length chromosomally allowable path connecting every pair of loci on the same chromosome, regardless of the genetic distance between them (7). Figure 1 shows the proportion of loci that may be connected, as a function of the path length and the width of the genetic interval between the loci.

We were interested in determining what fraction of these observed paths were valid and what fraction were spurious. A simple way to estimate the proportion of false connections is to consider loci separated by >50 cM and connected by short paths. Such paths must surely be spurious inasmuch as the average YAC length is 1 Mb, corresponding to only about 1 cM in the human genome.
The proportion of such widely-separated loci pairs connected by chromosomally allowable paths of length 1, 3, 5, and 7 is 0.05, 4, 33, and 63%, respectively. In particular, the curve rises dramatically for path lengths exceeding four, indicating that random connections dominate at these distances. The proportion of connected loci pairs at distances 5-10, 10-20, and 20-50 cM was not much higher than for loci pairs at distances >50 cM.
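
The estimate of false connections from widely separated loci can be sketched as a simple proportion. The function below is illustrative only: the input layout (one record per intra-chromosomal loci pair, holding its genetic distance and the length of its shortest path, or `None` if unconnected) and the sample values are hypothetical.

```python
def connected_fraction(pairs, max_len, min_cm=50.0):
    """Fraction of loci pairs more than min_cm apart that are joined by a
    shortest path of at most max_len YACs.

    pairs: iterable of (distance_cm, shortest_path_length or None).
    Returns None when no pair is farther apart than min_cm.
    """
    far = [length for cm, length in pairs if cm > min_cm]
    if not far:
        return None
    hits = sum(1 for length in far if length is not None and length <= max_len)
    return hits / len(far)

# Hypothetical records: four pairs beyond 50 cM, one close pair.
pairs = [(60, 3), (80, None), (120, 5), (55, 7), (10, 1)]
for a in (1, 3, 5, 7):
    print(a, connected_fraction(pairs, a))
```

Since a path of a few YACs cannot physically span 50 cM, every connection counted here is presumed spurious, so the resulting fractions serve as an empirical estimate of the spurious-connection rate at each path length.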
Thus spurious paths span close loci pairs at approximately the same rate as distant loci pairs.

We attempted to estimate the proportions of valid paths. We partitioned the pairs of loci into six groups on the basis of their inter-locus distance: the groups consisted of loci at 0-2, 2-5, 5-10, 10-20, 20-50, and >50 cM, respectively. For a randomly chosen pair of loci at distance range $d$, let $A_d$ denote the length of the shortest path between them, let $V_d$ denote the length of the shortest valid path between them, and let $I_d$ denote the length of the shortest invalid path between them. With these definitions, $A_d = \min(V_d, I_d)$. Figure 1, then, may be viewed as the cumulative probability distribution of $A_d$ for each of the six groups.

We next made two assumptions: (i) the frequency of spurious paths between two loci is independent of the distance between them, and (ii) the length of the shortest spurious path joining two loci is independent of the length of the shortest valid path. Under these assumptions, one can estimate an empirical joint probability density for $V_d$ and $A_d$.

Let $P_d(a)$ = the probability that two points in distance range $d$ are connected by a valid path of length $\le a$. Let $S_d(a)$ = the probability that two points in distance range $d$ are connected by a spurious path of length $\le a$. Let $\pi_d(a)$ = the probability that two points in distance range $d$ are connected by any path of length $\le a$. It follows from the independence of $I_d$ and $V_d$ that

$$\pi_d(a) = P_d(a) + S_d(a) - P_d(a)\,S_d(a).$$

Thus,

$$P_d(a) = \frac{\pi_d(a) - S_d(a)}{1 - S_d(a)}.$$

By assumption (i), $S_d(a)$ is independent of $d$, and so can be calculated from points separated by at least 50 cM. Therefore, $P_d(a)$ can be calculated.

One can then estimate, for each distance group and each path length, the probability $P(V_d = a \mid A_d = a)$ that there is at least one valid shortest path among the observed shortest paths (Figure 2). This is the probability

$$P(V_d = a \mid A_d = a) = \frac{P(V_d = a \text{ and } A_d = a)}{P(A_d = a)}.$$

The denominator is $\pi_d(a) - \pi_d(a-1)$. The numerator is the probability that there is a valid path of length $a$ and there is no path (valid or invalid) of length at most $a-1$. Under the independence of $V_d$ and $I_d$, the numerator is equal to $(P_d(a) - P_d(a-1))(1 - S_d(a-1))$. With this probability, we provide a corrected estimate of the observed proportion of loci pairs connected by at least one valid path (Figure 3) by subtracting out the expected fraction of loci pairs whose shortest paths are invalid.
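
The two estimates above reduce to a few lines of arithmetic. The sketch below implements them for path lengths 1 through n, taking every cumulative probability at length 0 to be zero; the function names and the input values are hypothetical and chosen only to exercise the formulas, not taken from the CEPH-Genethon data.

```python
def valid_cdf(pi_d, s):
    """P_d(a) = (pi_d(a) - S(a)) / (1 - S(a)) for a = 1..n.

    pi_d[k] and s[k] hold the cumulative probabilities at length k + 1.
    """
    return [(p - q) / (1.0 - q) for p, q in zip(pi_d, s)]

def prob_shortest_valid(pi_d, s):
    """P(V_d = a | A_d = a) for a = 1..n, or None where P(A_d = a) = 0."""
    p_valid = valid_cdf(pi_d, s)
    result = []
    for k in range(len(pi_d)):
        prev_pi = pi_d[k - 1] if k else 0.0
        prev_p = p_valid[k - 1] if k else 0.0
        prev_s = s[k - 1] if k else 0.0
        denominator = pi_d[k] - prev_pi                      # P(A_d = a)
        numerator = (p_valid[k] - prev_p) * (1.0 - prev_s)   # P(V_d = a and A_d = a)
        result.append(numerator / denominator if denominator > 0 else None)
    return result

# Hypothetical cumulative proportions for path lengths 1, 2, 3.
pi_d = [0.10, 0.30, 0.50]   # any path, one distance group
s = [0.00, 0.10, 0.20]      # spurious path, from pairs more than 50 cM apart
print(valid_cdf(pi_d, s))
print(prob_shortest_valid(pi_d, s))
```

With these made-up inputs, the second list falls as the path length grows, the same qualitative pattern the analysis reports: a longer shortest path is less likely to be a valid one.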

The results indicate that for two loci within 5 cM of one another and spanned by shortest paths of $\le 3$ YACs, the shortest paths are likely to include at least one valid path. On the other hand, if the shortest paths are $\ge 4$ YACs or if the two loci are more than 5 cM apart, then the shortest paths are likely to be invalid. Considering only shortest paths of at most 3 YACs between loci within 5 cM, and using the definition of coverage given in (2), the CEPH/Genethon data cover only about 27% of the human genome.
This coverage of the genome is far less than indicated in (2).

The previous analysis was carried out on the complete data set including both Alu-PCR connections and fingerprint data. We carried out two additional analyses, one with Alu-PCR connections but no fingerprints and one with fingerprint data but no Alu-PCR connections, to investigate whether the coverage by valid paths might increase. The amount of valid coverage may possibly increase with the omission of a data source if the omitted data has provided many spurious connections and few valid ones. The results in both of these analyses were qualitatively the same, and both additional cases resulted in a decreased coverage of the genome as compared to the first analysis.

In the analysis involving Alu-PCR connections but no fingerprints, for two loci within 5 cM of one another and spanned by shortest paths of $\le 3$ YACs, the shortest paths are likely to include at least one valid path. On the other hand, if the shortest paths are $\ge 4$ YACs or if the two loci are more than 5 cM apart, then the shortest paths are likely to be invalid. (For example, for two loci that are within 2 cM of each other, and spanned by a shortest path of 4 YACs, the probability is 55% that the shortest path is invalid.) Considering only shortest paths of at most 3 YACs between loci within 5 cM, the Alu-PCR data (without fingerprints) cover 24% of the human genome.
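
The coverage percentages quoted here follow the definition in (2): the proportion of total centiMorgans lying between connected loci. A minimal sketch of that calculation is given below. Merging overlapping intervals so that no stretch is counted twice is our assumption about how the measure is meant to be applied, and the function name, input layout, and sample values are hypothetical.

```python
def coverage_fraction(connected_intervals, chromosome_lengths_cm):
    """Proportion of total genetic length lying between connected loci.

    connected_intervals:   iterable of (chromosome, start_cm, end_cm), one
                           per pair of loci joined by an accepted path.
    chromosome_lengths_cm: dict chromosome -> genetic length in cM.
    Overlapping intervals are merged before summing (assumption).
    """
    by_chromosome = {}
    for chrom, a, b in connected_intervals:
        by_chromosome.setdefault(chrom, []).append((min(a, b), max(a, b)))
    covered = 0.0
    for spans in by_chromosome.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                cur_end = max(cur_end, end)      # extend the current merged span
            else:
                covered += cur_end - cur_start   # close it and start a new one
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered / sum(chromosome_lengths_cm.values())

# Hypothetical example: two overlapping intervals, one zero-width interval
# (genetically indistinguishable STSs), and one interval on a second chromosome.
intervals = [("1", 0, 3), ("1", 2, 5), ("1", 10, 10), ("2", 4, 6)]
print(coverage_fraction(intervals, {"1": 20.0, "2": 20.0}))
```

The zero-width interval contributes nothing, which matches the remark that a path between two genetically indistinguishable STSs adds no incremental coverage under this definition.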

In the analysis including fingerprint data but no Alu-PCR connections, the fingerprint data generate longer valid paths: loci within 5 cM of one another and spanned by shortest fingerprint-only paths of $\le 5$ YACs are likely to be covered by at least one valid fingerprint-only path. While the fingerprint data support longer valid paths, the relative paucity of these paths leads to lower genomic coverage overall. Fingerprint-only paths of $\le 5$ YACs [...]

These specific analyses address neither the quality of the CEPH/Genethon data nor alternative path-based strategies (8). These calculations consider only the quality of shortest paths in the CEPH/Genethon data as generated from QUICKMAP. A larger fraction of reliable paths might be obtained using alternative analysis strategies, as well as additional data since added by CEPH and Genethon to their February, 1994 release. Identifying and removing noisy data, modifying the CEPH/Genethon definition of a path, considering longer paths in addition to the minimum length paths, or employing a denser genetic map are possible alternatives that might yield a larger fraction of veracious paths covering more of the genome. Further, while our analysis can indicate the probability that there is a valid path, it does not provide a means to distinguish valid paths from invalid paths. Only additional laboratory experiments can provide final confirmation or invalidation of a path.

Aside from an application to a specific data set, we have outlined a general method for estimating the percentage of valid paths in the presence of false connections leading to invalid paths. In order to estimate the probability that a shortest path is valid, one needs to subtract out the probability that it is invalid. This, in turn, is estimated by evaluating the percentage of distant loci pairs connected by shortest paths of each length. This methodology extends to other analyses as well, including analyses using alternative criteria for determining YAC overlaps or alternative definitions of paths.

References and Notes

(1) B. Bollobas, Random Graphs (Academic Press, London, 1985).
(2) D. Cohen, I. Chumakov, and J. Weissenbach, Nature, 366, 698 (1993).
(3) J. Weissenbach et al., Nature, 359, 794 (1992).
(4) C. Bellanne-Chantelot et al., Cell, 70, 1059 (1992).
(5) These data included 6602 Alu-PCR probes and 3450 STSs. 2035 of the STSs were genetically mapped to a specific chromosomal location; the resulting genetic map contained 1266 loci. These data are the same data employed in (2), with an additional 1270 Alu-PCR probes added to the data set by CEPH/Genethon beyond that described in (2).
(6) P. Rigault et al., available from rigault@ceph.cephb.fr.
(7) We employed CLONESPATH, one module of the QUICKMAP package, to generate chromosomally allowable paths between selected loci. CLONESPATH offers a variety of options and parameters; the results described in this paper correspond to the default CLONESPATH settings, which in turn correspond to the path creation rules described in (2). For the "link-type" parameter, we conducted three analyses: once using ALU and fingerprint links, once using only ALU links, and once using only fingerprint links. The figures correspond to using ALU and fingerprint links; the results were qualitatively the same using ALU links only.
(8) Varying the criteria used to declare YAC overlaps or modifying the criteria used to declare a genetic interval to be covered leads to alternative path-based construction strategies. A few examples involving overlap detection include (i) requiring two or more ALU probes to link two YACs, and (ii) requiring both an ALU probe and a [...] of covering a genetic interval include (iii) requiring two or more YAC-disjoint paths to cover the genetic interval, and (iv) restricting the use of each Alu-PCR probe to a specific region on the genome.
(9) We thank D. Cohen, I. Chumakov, and J. Weissenbach for sharing this valuable data with us and the scientific community at large. We thank E. Lander, D. Page, L. Kruglyak, L. Stein, D. Cohen, I. Chumakov, and P. Rigault for helpful discussions and comments on the manuscript. Supported in part by NIH grant HG00098.

Figure Legends

Figure 1. Observed cumulative proportion of connected loci pairs, by inter-locus distance and path length. Minimal paths were constructed between all intra-chromosomal loci pairs.

Figure 2. Estimated probability that there is at least one valid shortest path spanning an inter-locus interval among all the observed shortest paths spanning that interval, by inter-locus distance and path length.

Figure 3. Estimated true proportion of connected loci pairs, by inter-locus distance and path length. The probability an interval is covered correctly is the probability that at least one of the paths spanning the interval is valid.

[Figure 1: plot with vertical axis marked 25% to 100% and horizontal axis "Path Length"]

[Figure 2: plot with vertical axis marked 10% to 60% and horizontal axis "Path Length"]