Gene Mapping by Pattern Discovery

6.3 Haplotype Patterns as a Basis for Gene Mapping

6.3.2 Haplotype Patterns

Haplotype patterns serve as discriminators for chromosomal regions that are potentially shared IBD by a set of chromosomes in the dataset. The language L of haplotype patterns consists of haplotypes over subsets of the marker map, with a few constraints. Marker maps with over a hundred markers are not uncommon today; in the near future, maps of several thousand markers can be expected. The number of possible haplotypes grows exponentially with the number of markers in the map. It is not possible to consider all possible haplotypes in the analysis, but on the other hand, not all haplotype patterns are biologically conceivable. Meaningful patterns correspond to IBD sharing between chromosomes, so the markers included in a pattern should form a contiguous block. Allowing a restricted number of wildcards within a pattern may be desirable, as there may be marker mutations breaking an otherwise IBD region, or there may be markers with many missing or erroneous allele values. Additionally, haplotypes extending over very long genetic distances are highly unlikely to survive over many generations and meioses, and therefore the set of patterns to be considered can be restricted with an upper limit on the genetic distance between the leftmost and rightmost markers that are assigned an allele.

Let p = [p₁ · · · p_m] be a haplotype pattern, where p_j ∈ A_j ∪ {∗}, A_j is the set of alleles at marker j, and ∗ is a wildcard symbol that matches any allele in the data. Pattern p overlaps marker j, or marker j is within pattern p, if j is between the leftmost and rightmost markers bound in p (inclusive).

The length of p can be defined as either (1) the genetic distance between the leftmost and rightmost markers bound in p or (2) the number of markers between and including the leftmost and rightmost markers bound in p. We define the language L of patterns as the set of vectors p = [p₁ · · · p_m] for which length(p) ≤ ℓ and either (1) the number of wildcards (∗) within p is at most w or (2) the number of stretches of consecutive wildcards within p is at most g and the length of each such stretch is at most G. The pattern parameters ℓ, w, g, and G are given by the user.
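A membership test for L can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in HPM: length is taken as the marker count (definition 2), gap lengths are measured in markers, a pattern is assumed to bind at least one marker, and the names `in_language`, `max_len`, `max_wildcards`, `max_gaps`, and `max_gap_len` (standing for the length limit, w, g, and G) are hypothetical.

```python
WILDCARD = "*"

def bound_span(pattern):
    """Return (left, right) indices of the outermost bound markers, or None."""
    bound = [j for j, allele in enumerate(pattern) if allele != WILDCARD]
    return (bound[0], bound[-1]) if bound else None

def gap_lengths(pattern):
    """Lengths of the maximal wildcard runs inside the bound span."""
    span = bound_span(pattern)
    if span is None:
        return []
    left, right = span
    gaps, run = [], 0
    for allele in pattern[left:right + 1]:
        if allele == WILDCARD:
            run += 1
        else:
            if run:
                gaps.append(run)
            run = 0
    return gaps

def in_language(pattern, max_len, max_wildcards=None,
                max_gaps=None, max_gap_len=None):
    """Check the length limit and one of the two wildcard constraints."""
    span = bound_span(pattern)
    if span is None:
        return False  # assumption: an all-wildcard pattern is not in L
    left, right = span
    if right - left + 1 > max_len:
        return False
    gaps = gap_lengths(pattern)
    if max_wildcards is not None:
        # Constraint (1): total number of wildcards within the pattern.
        return sum(gaps) <= max_wildcards
    if max_gaps is None or max_gap_len is None:
        raise ValueError("give max_wildcards, or both max_gaps and max_gap_len")
    # Constraint (2): number and length of wildcard stretches.
    return len(gaps) <= max_gaps and all(g <= max_gap_len for g in gaps)

p = ["*", "*", 1, "*", 2, "*"]
print(in_language(p, max_len=3, max_wildcards=1))
```

Only wildcards between the leftmost and rightmost bound markers are counted; leading and trailing wildcards merely indicate that the pattern does not extend to those markers.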

Haplotype i matches pattern p iff, for all markers j, p_j = ∗ or p_j = D_ij. The frequency of pattern p, freq(p), is the number of haplotypes matching p. With genotype data things are more complicated: a match is certain only if at most one of the markers assigned an allele in the pattern is heterozygous in the genotype. A match is possible if at least one of the alleles at each marker in the genotype matches the corresponding allele in the pattern. One possibility for handling the uncertain cases is optimistic matching, where a genotype matches a pattern if any of the possible haplotype configurations matches it: genotype i matches pattern p iff, for all markers j, p_j = ∗ or p_j = g₁ or p_j = g₂, where (g₁, g₂) = D_ij. In Section 6.4.3 we will show that this simplistic approach works surprisingly well. More elaborate schemes are possible; e.g., genotypes can be weighted by 2¹⁻ⁿ, where n is the number of heterozygous markers in the genotype that are also assigned an allele in the pattern.

Example 6.3.1. Let p = [∗ ∗ 1 ∗ 2 ∗] be a haplotype pattern over markers (1, . . . , 6). The pattern p overlaps markers 3, 4, and 5 and is matched by, for example, haplotype [3 2 1 4 2 0] and genotype [(1,1) (1,2) (1,1) (2,4) (1,2) (2,3)]. Genotype [(1,1) (1,2) (1,2) (2,4) (1,2) (2,3)] may match p, depending on whether allele 1 at marker 3 and allele 2 at marker 5 are from the same chromosome or not. With optimistic matching, we consider this possible match as a match.
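The matching rules and Example 6.3.1 can be checked with a short sketch. The representation is an assumption made for illustration: patterns and haplotypes are Python lists, the wildcard is the string "*", a genotype is a list of allele pairs, and the function names are hypothetical.

```python
WILDCARD = "*"

def matches_haplotype(pattern, haplotype):
    """Haplotype matches iff every bound marker carries the pattern's allele."""
    return all(p == WILDCARD or p == a for p, a in zip(pattern, haplotype))

def possible_match(pattern, genotype):
    """Optimistic matching: each bound allele occurs in the genotype's pair."""
    return all(p == WILDCARD or p in pair for p, pair in zip(pattern, genotype))

def certain_match(pattern, genotype):
    """A possible match with at most one heterozygous bound marker."""
    if not possible_match(pattern, genotype):
        return False
    heterozygous = sum(
        1 for p, (g1, g2) in zip(pattern, genotype)
        if p != WILDCARD and g1 != g2
    )
    return heterozygous <= 1

def freq(pattern, haplotypes):
    """Number of haplotypes matching the pattern."""
    return sum(matches_haplotype(pattern, h) for h in haplotypes)

p = ["*", "*", 1, "*", 2, "*"]
haplotype = [3, 2, 1, 4, 2, 0]
genotype_a = [(1, 1), (1, 2), (1, 1), (2, 4), (1, 2), (2, 3)]
genotype_b = [(1, 1), (1, 2), (1, 2), (2, 4), (1, 2), (2, 3)]

print(matches_haplotype(p, haplotype))   # True
print(certain_match(p, genotype_a))      # True
print(certain_match(p, genotype_b))      # False: two heterozygous bound markers
print(possible_match(p, genotype_b))     # True: counted under optimistic matching
```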

In the instances of HPM we have used, the qualification predicate is based on a minimum frequency: q(p) ≡ freq(p) ≥ f_min, where the minimum frequency is either given by the user or derived from other parameters and some summary statistics of the data, such as sample size and the number of disease-associated and control observations.

6.3.3 Scores

The purpose of the scoring function is to produce a test statistic for each marker, measuring total association of the marker to the trait over all haplotype patterns that are relevant at the marker. The higher the score, the stronger the association. We define the set R_j of relevant patterns at marker j as the set of patterns overlapping marker j.

A very simple—yet powerful—scoring function, used in [403, 404], counts the number of strongly disease-associated patterns overlapping the marker:

$$s(Q_j, Y) = \bigl|\{\, p \in Q_j \mid A(p, Y) \ge a_{\min} \,\}\bigr|, \tag{6.1}$$

where Q_j = Q ∩ R_j and A(·) is a measure of pattern–trait association or correlation. The association threshold a_min is a user-specified parameter.

Table 6.1 illustrates the procedure.

Table 6.1. This table illustrates the computation of markerwise scores with association threshold Z_min = 3. The patterns are ordered by the strength of association. Note that a wildcard within a pattern counts toward the score of its marker.

Pattern   M1   M2   M3   M4   M5   M6     Z
p1        ∗    ∗    2    ∗    1    ∗    5.8
p2        ∗    1    2    1    3    ∗    4.4
p3        ∗    2    2    ∗    1    ∗    4.0
p4        1    2    2    ∗    1    ∗    3.4
p5        ∗    1    2    1    3    3    2.8

Score     1    3    4    4    4    0
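The score row of Table 6.1 can be reproduced with a few lines of code. This is a minimal sketch of the counting score (6.1), assuming the patterns have already passed the qualification predicate and that their association values are given; `overlaps` and `marker_scores` are hypothetical names.

```python
WILDCARD = "*"

def overlaps(pattern, j):
    """True if marker index j lies between the outermost bound markers."""
    bound = [i for i, allele in enumerate(pattern) if allele != WILDCARD]
    return bool(bound) and bound[0] <= j <= bound[-1]

def marker_scores(patterns, associations, a_min):
    """For each marker, count overlapping patterns with association >= a_min."""
    n_markers = len(patterns[0])
    return [
        sum(1 for p, a in zip(patterns, associations)
            if a >= a_min and overlaps(p, j))
        for j in range(n_markers)
    ]

# Patterns p1..p5 and their Z values from Table 6.1.
patterns = [
    ["*", "*", 2, "*", 1, "*"],
    ["*", 1, 2, 1, 3, "*"],
    ["*", 2, 2, "*", 1, "*"],
    [1, 2, 2, "*", 1, "*"],
    ["*", 1, 2, 1, 3, 3],
]
z_values = [5.8, 4.4, 4.0, 3.4, 2.8]

print(marker_scores(patterns, z_values, a_min=3.0))  # [1, 3, 4, 4, 4, 0]
```

Pattern p5 is the only one overlapping M6, but its Z value of 2.8 is below the threshold, so M6 scores 0.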

Another scoring function, used in [306, 359], measures the skew of the distribution of pattern–trait association in the set of overlapping patterns.

The skew is defined as a distance between the P values of pattern–trait association tests for the patterns in Q_j and their expected values if there were no association:

$$s(Q_j, Y) = \frac{1}{k} \sum_{i=1}^{k} \bigl(P_i(Y) - U_i\bigr) \log \frac{P_i(Y)}{U_i}, \tag{6.2}$$

where k = |Q_j|, P₁(Y), . . . , P_k(Y) is the list of P values sorted into ascending order, and U₁, . . . , U_k are the expected ranked P values assuming that there is no association and that the patterns are independent, U_i = i/(k + 1).
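The skew score (6.2) can be sketched directly from its definition. The function name `skew_score` is hypothetical, the natural logarithm is assumed, all P values are assumed to be strictly positive, and returning 0 for an empty pattern set is a choice made here, not something the text specifies.

```python
import math

def skew_score(p_values):
    """Mean of (P_i - U_i) * log(P_i / U_i) over the ranked P values."""
    k = len(p_values)
    if k == 0:
        return 0.0  # assumption: no overlapping patterns means no evidence
    ranked = sorted(p_values)
    total = 0.0
    for i, p in enumerate(ranked, start=1):
        u = i / (k + 1)  # expected i-th smallest P value under no association
        total += (p - u) * math.log(p / u)
    return total / k

print(skew_score([0.001, 0.002]) > skew_score([0.3, 0.6]))
```

Each term is non-negative, because P_i − U_i and log(P_i/U_i) always have the same sign, and the score is zero exactly when the ranked P values coincide with their expected values.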

Both scoring functions consider each pattern as an independent source of evidence. In reality, the patterns are far from independent, but the assumption of independence is a useful approximation. An ideal scoring function would take the structure in Q_j into account.

In all current instances of HPM, the scoring function measures the pattern–trait association independently for each pattern. A pattern whose occurrence correlates with the trait is likely to do well in discriminating the chromosomes bearing the mutation. A meaningful test for this correlation depends on the type of data. With a dichotomous trait, e.g., affected–unaffected, association can be tested simply using the Z-test (or χ²-test) or Fisher's exact test for a 2-by-2 contingency table, where the rows correspond to the trait value and the columns to the occurrence of the pattern:

         M       N
A      n_AM    n_AN    n_A
U      n_UM    n_UN    n_U
       n_M     n_N     n

Let us assume that there are n_M observations that match pattern p and n_N observations that do not match p, and that there are n_A affected and n_U unaffected observations in total. Let the frequencies in the 2-by-2 contingency table, where the rows correspond to the trait value (A or U) and the columns to matching (M) or not matching (N) p, be n_AM, n_AN, n_UM, and n_UN, respectively. The value of the test statistic

$$Z = \frac{(n_{AM} n_{UN} - n_{UM} n_{AN}) \sqrt{n}}{\sqrt{n_M n_N (n_{AM} + n_{AN})(n_{UM} + n_{UN})}} \tag{6.3}$$

is approximately normally distributed. A one- or two-tailed test can be used.

A one-tailed test is appropriate for patterns with a positive correlation to the trait. Assuming that there are no missing alleles in the data, it is possible to derive a lower bound for pattern frequency given the association threshold:

$$f_{\min} = \frac{n_A n x}{n_U n + n_A x}, \tag{6.4}$$

where x is the association threshold for the χ² statistic, or the Z threshold squared (see [403] for details). No pattern with a frequency lower than f_min can be strongly associated. Even if there are missing alleles, this lower bound can still be used, since it is not imperative that all strongly associated patterns satisfy q.
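The statistic (6.3) and the bound (6.4) can be checked numerically. The sketch below takes the denominator of (6.4) to be n_U n + n_A x, with n_U the number of unaffected (control) observations. That reading follows from the best case in which every matching observation is affected (n_AM = f, n_UM = 0), where Z² = n f n_U / ((n − f) n_A); requiring Z² ≥ x and solving for f gives the bound. The function names are hypothetical, and returning 0 for a degenerate table is a choice made here.

```python
import math

def z_statistic(n_am, n_an, n_um, n_un):
    """Z statistic of a 2-by-2 table (rows: A/U, columns: match/no match)."""
    n_a, n_u = n_am + n_an, n_um + n_un   # row totals
    n_m, n_n = n_am + n_um, n_an + n_un   # column totals
    n = n_a + n_u
    denom = math.sqrt(n_m * n_n * n_a * n_u)
    if denom == 0:
        return 0.0  # assumption: a degenerate table carries no evidence
    return (n_am * n_un - n_um * n_an) * math.sqrt(n) / denom

def min_frequency(n_a, n_u, x):
    """Lowest pattern frequency that can reach chi-squared threshold x."""
    n = n_a + n_u
    return n_a * n * x / (n_u * n + n_a * x)

# 100 affected and 100 unaffected chromosomes, threshold Z >= 3 (x = 9).
f_min = min_frequency(100, 100, x=9.0)
print(f_min)                          # between 8 and 9
print(z_statistic(9, 91, 0, 100))     # best case at frequency 9: above 3
print(z_statistic(8, 92, 0, 100))     # best case at frequency 8: below 3
```

In this setting a pattern seen on 8 chromosomes cannot reach Z = 3 even if all 8 are affected, whereas a pattern seen on 9 can, which is what the bound predicts.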

With a quantitative trait, the two-sample t-test can be used to test for identical means between the group of chromosomes matching the pattern and those not matching it. The number of degrees of freedom (number of chromosomes minus two) is usually large enough to justify the use of the Z-test instead of the t-test.
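For a quantitative trait, the test statistic can be sketched as below. The pooled-variance (equal variance) form of the two-sample t-test is assumed here because it is the form with n − 2 degrees of freedom mentioned above; the text does not state which variant is used, and `two_sample_t` is a hypothetical name. Each group needs at least two observations.

```python
import math
from statistics import mean, variance

def two_sample_t(matching, non_matching):
    """Pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(matching), len(non_matching)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(matching)
              + (n2 - 1) * variance(non_matching)) / df
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("trait values are constant; t is undefined")
    return (mean(matching) - mean(non_matching)) / se, df

# Hypothetical trait values for chromosomes matching / not matching a pattern.
t, df = two_sample_t([5.1, 4.9, 5.3, 5.0], [4.0, 4.2, 3.9, 4.1, 3.8])
print(t, df)
```

With many chromosomes, the t distribution with n − 2 degrees of freedom is close to the standard normal, so the same statistic can be compared against a Z threshold.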

If explanatory covariates are included in the data, a linear model can be formulated,

$$Y_i = \alpha_1 X_{i1} + \dots + \alpha_k X_{ik} + \alpha_{k+1} I_i + \alpha_0, \tag{6.5}$$

where Y_i is the trait value for chromosome i, X_ij is the value of the jth covariate for the ith observation, and I_i is an indicator variable for the occurrence of the tested pattern: its value is 1 if the pattern matches the ith observation and 0 otherwise. The significance of the pattern as an explanatory variable can be tested by comparing the best-fit model to the best-fit model where α_{k+1} = 0.
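One common way to compare the two nested best-fit models is an F statistic on their residual sums of squares. The text does not name a specific test, so the F-test is an assumption of this sketch, as are the function names and the toy data. Least squares is solved through the normal equations in pure Python to keep the example self-contained; a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(rows, y):
    """Residual sum of squares of the least-squares fit of y on rows."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def pattern_f_statistic(covariates, indicator, y):
    """F statistic for adding the pattern indicator to the covariate model."""
    reduced = [[1.0] + [float(v) for v in x] for x in covariates]
    full = [row + [float(i)] for row, i in zip(reduced, indicator)]
    rss_reduced, rss_full = rss(reduced, y), rss(full, y)
    df = len(y) - len(full[0])          # n minus number of fitted coefficients
    return (rss_reduced - rss_full) / (rss_full / df), df

# Toy data: one covariate, trait raised by 3 when the pattern matches.
covariates = [[x] for x in range(8)]
indicator = [0, 1, 0, 1, 0, 1, 0, 1]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [1 + 2 * x[0] + 3 * i + e for x, i, e in zip(covariates, indicator, noise)]

f_stat, df = pattern_f_statistic(covariates, indicator, y)
print(f_stat, df)
```

Under the null hypothesis α_{k+1} = 0 and the usual normality assumptions, this statistic follows an F distribution with 1 and n − k − 2 degrees of freedom.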

Missing alleles in the observations are dealt with in a conservative manner: if an allele is missing at a marker bound in pattern p and there is a mismatch at any other marker, then the observation is counted as a mismatch. Otherwise we cannot know for sure whether p occurs in the observation, and to avoid any bias we ignore the observation when calculating the association for pattern p.
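This rule amounts to a three-valued match, which can be sketched as follows. Missing alleles are encoded as `None` here, which is an assumption about the data representation, and the function names are hypothetical.

```python
WILDCARD = "*"
MISSING = None

def match_status(pattern, haplotype):
    """Return True (match), False (mismatch), or None (cannot be decided)."""
    unknown = False
    for p, allele in zip(pattern, haplotype):
        if p == WILDCARD:
            continue                 # wildcards match anything, even missing
        if allele is MISSING:
            unknown = True           # undecided unless a mismatch is found
        elif allele != p:
            return False             # a definite mismatch settles the case
    return None if unknown else True

def contingency_counts(pattern, haplotypes, affected):
    """Counts n_AM, n_AN, n_UM, n_UN, skipping undecidable observations."""
    counts = {"AM": 0, "AN": 0, "UM": 0, "UN": 0}
    for haplotype, is_affected in zip(haplotypes, affected):
        status = match_status(pattern, haplotype)
        if status is None:
            continue                 # ignored to avoid biasing the test
        key = ("A" if is_affected else "U") + ("M" if status else "N")
        counts[key] += 1
    return counts

p = ["*", 1, 2]
print(match_status(p, [3, 1, 2]))      # True
print(match_status(p, [3, None, 1]))   # False: mismatch at the third marker
print(match_status(p, [3, None, 2]))   # None: observation is ignored
```

Because undecidable observations are dropped, the totals n_A, n_U, and n in the contingency table can differ from pattern to pattern.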

6.3.4 Searching for Potentially Interesting Haplotype Patterns

From the document Advanced Information and Knowledge Processing (pages 115-119)