Modeling ORF and SNP Densities in the Human Genome

Marko Salmenkivi and Heikki Mannila

5.2 Bayesian Approach and MCMC Methods

5.3.2 Modeling ORF and SNP Densities in the Human Genome

Next we consider modeling event sequence data, the ORF and SNP densities in the human genome. A more detailed example of intensity modeling of event sequence data is given in [112].

The event sequence is modeled as a Poisson process with a time-dependent intensity function λ(t). Intuitively, the intensity function expresses the average number of events in a time unit (see, e.g., [18, 161]).

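The Poisson log-likelihood of the event sequence S with occurrence times t₁, ..., t_n is given by (see, e.g., [163])

$$\log L(S \mid \lambda) = -\left(\int_{S_s}^{S_e} \lambda(t)\,dt\right) + \sum_{j=1}^{n} \ln\bigl(\lambda(t_j)\bigr), \tag{5.9}$$

where the integral is taken over the observation period [S_s, S_e].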
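As a minimal sketch of how the log-likelihood (5.9) can be evaluated, the following Python function assumes the piecewise-constant intensity implied by the priors on pieces, levels, and change points used in this section. The function name `piecewise_loglik` and its argument names are illustrative assumptions, not taken from the original implementation.

```python
import math
from bisect import bisect_right

def piecewise_loglik(events, change_points, levels, start, end):
    """Poisson log-likelihood (5.9) for a piecewise-constant intensity.

    events:        occurrence times t_1, ..., t_n in [start, end]
    change_points: sorted interior change points (one fewer than levels)
    levels:        positive intensity level of each piece
    """
    if len(levels) != len(change_points) + 1:
        raise ValueError("need exactly one level per piece")
    bounds = [start, *change_points, end]
    # Integral of lambda(t): each piece contributes level * length.
    integral = sum(lam * (b - a)
                   for lam, a, b in zip(levels, bounds, bounds[1:]))
    # Sum of ln(lambda(t_j)): find the piece that contains each event.
    log_sum = sum(math.log(levels[bisect_right(change_points, t)])
                  for t in events)
    return -integral + log_sum

# Hypothetical toy data: three events, one change point at t = 5.
print(piecewise_loglik([1.0, 2.0, 7.0], [5.0], [0.5, 0.2], 0.0, 10.0))
```

With a single level and no change points the function reduces to the log-likelihood of a homogeneous Poisson process.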

We ran four trials that were identical except for the prior specification for the number of pieces m. In the last trial we also changed the prior of the intensity levels. The prior distributions were

$$m \sim \mathrm{Geom}(\gamma), \qquad \lambda_i \sim \mathrm{Gamma}(\nu, \eta), \qquad c_i \sim \mathrm{Unif}(S_s, S_e), \tag{5.10}$$

where m is the number of pieces, λ_i are the intensity levels, and c_i are the change points. The hyperparameters of the geometric distribution are given in Table 5.1. In the first and fourth trial, large values of m were strongly weighted; in the third one we supported small values.
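The joint prior in (5.10) can be sketched as a log-density in Python. This is an illustration under stated assumptions rather than the authors' code: the geometric distribution is assumed to have support m = 1, 2, ..., the gamma distribution is taken in its shape-rate form (so that its mean is ν/η), and the change points are treated as independent uniform draws with no ordering factor. All function names are hypothetical.

```python
import math

def log_geometric(m, gamma):
    # Assumed support m = 1, 2, ...: P(m) = gamma * (1 - gamma)**(m - 1).
    return math.log(gamma) + (m - 1) * math.log1p(-gamma)

def log_gamma_density(lam, nu, eta):
    # Gamma density with shape nu and rate eta (mean nu / eta).
    return (nu * math.log(eta) - math.lgamma(nu)
            + (nu - 1) * math.log(lam) - eta * lam)

def log_prior(levels, change_points, gamma, nu, eta, start, end):
    """Log joint prior of a segmentation with m = len(levels) pieces."""
    m = len(levels)
    if len(change_points) != m - 1:
        raise ValueError("need exactly one level per piece")
    if any(not (start < c < end) for c in change_points):
        return -math.inf  # change point outside the observation period
    lp = log_geometric(m, gamma)
    lp += sum(log_gamma_density(lam, nu, eta) for lam in levels)
    # Each change point has uniform density 1 / (end - start).
    lp -= (m - 1) * math.log(end - start)
    return lp

# Hypothetical two-piece model under the trial 1 hyperparameters.
print(log_prior([0.02, 0.005], [40.0], 0.001, 0.005, 0.5, 0.0, 100.0))
```

Every additional piece adds one geometric factor, one gamma density, and one uniform density to the product, which is why the dimension and level priors interact.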

For the intensity levels, the gamma prior with hyperparameters ν = 0.005 and η = 0.5 was used in all but the last trial. In trial 4, we used hyperparameters ν = 0.001 and η = 0.1 instead (see Table 5.1). The expectation of the gamma distribution is ν/η, so the prior had the same mean 1/100 in all the trials. The variance, however, is ν/η²; thus in the first three trials the prior variance was 0.005/0.5² = 1/50, and in the last one it was 0.001/0.1² = 1/10.

Table 5.1. Parameter values for the different trials for modeling ORF density. The prior distribution of the intensity levels λ_i was a gamma distribution; ν and η indicate the hyperparameters of the gamma distribution in each trial.

Trial                   1       2       3       4
γ (m ∼ Geom(γ))         0.001   0.5     0.9     0.001
ν                       0.005   0.005   0.005   0.001
η                       0.5     0.5     0.5     0.1
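The moments of the gamma priors can be checked directly from the formulas ν/η and ν/η². The short sketch below uses exact rational arithmetic to avoid rounding; the helper name `gamma_moments` is an assumption introduced for illustration.

```python
from fractions import Fraction

def gamma_moments(nu, eta):
    """Mean and variance of a gamma distribution with shape nu and rate eta."""
    return nu / eta, nu / eta**2

# Hyperparameters (nu, eta) as listed in Table 5.1.
trials = {
    "trials 1-3": (Fraction(5, 1000), Fraction(1, 2)),
    "trial 4": (Fraction(1, 1000), Fraction(1, 10)),
}
for name, (nu, eta) in trials.items():
    mean, variance = gamma_moments(nu, eta)
    print(f"{name}: mean = {mean}, variance = {variance}")
```

Both priors share the same mean, so any difference in the posterior number of segments between trials 1 and 4 comes from the spread of the level prior, not its location.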

We will return to the interesting question of the effect of priors later.

During the burn-in period, a change of the value of m, that is, inserting or deleting a segment, was proposed approximately 50,000,000 times, and during the actual simulation run nearly 400,000,000 times. In the case of chromosome 1, for instance, a candidate state was accepted in 0.11% of the cases. Since the acceptance rates of the other parameters were much higher, they were updated less often; the intensity level of the first segment, λ₁, for instance, was updated approximately 40,000,000 times during the actual run. The value of the parameter m was recorded at approximately every 100th iteration; that is, the sample size of m was 4,000,000.

Figure 5.6 shows the posterior average and standard deviation of the number of segments for human chromosomes 1–22 in four trials with different prior distributions. For each chromosome there are four error bars in the figure, each of which presents the posterior average and standard deviation of the number of pieces in one trial. There are clear differences between chromosomes. The differences are not explained by the sizes of the chromosomes, though the size and the number of segments correlate. For instance, there seem to be relatively few segments in chromosome 4 and many segments in chromosomes 7 and 11. Chromosomes 16 and 18 are about the same size, but the number of segments needed to model the distribution of ORFs on chromosome 16 is about twice the number of segments for chromosome 18. Still, there seems to be a lot of variation in chromosome 18 within a single trial as well as between the trials.

The variation is also remarkably large in chromosome 1, while chromosome 15 is divided into 16 or 17 segments in all the trials. The segmentations of chromosomes 21 and 22 stay almost the same as well.

Figure 5.7 shows the posterior averages and deviations of the intensities of ORF occurrence frequency for chromosomes 15 and 18. There are clear segment boundaries in the chromosomes, indicating that various parts of the chromosomes are qualitatively different. The 16 or 17 segments of chromosome 15 that resulted in all the trials can easily be identified in the figure.

98 Data Mining in Bioinformatics

[Figure 5.6: error-bar plot. Horizontal axis: Chromosome (1–22). Vertical axis: Number of segments. Series: Trials 1–4, mean and sd.]

Fig. 5.6. The posterior average and standard deviation of the number of segments for ORF occurrences on human chromosomes 1–22 in four different trials. For the parameter values, see Table 5.1.

[Figure 5.7: axis label: Intensity.]

Fig. 5.7. The posterior means and 99% percentiles of the intensity of ORF occurrence in human chromosomes 15 and 18.

Chromosome 18 has many sharp change points and boundaries with smaller variation between them. The slight changes inside the relatively stable periods explain the instability of the number of segments in the different trials.

The Bayesian framework and RJMCMC methods provide a conceptually sound way of comparing models over the whole space of segmentations.

The likelihood of the model can always be improved by adding more parameters to the model. By supplying prior probabilities to all the combinations of the model parameters, the problem is shifted in Bayesian analysis to investigating the joint probability distribution of the data and the parameters. The question then takes the form of whether the advantage gained in likelihood by adding more parameters exceeds the possible loss in prior probabilities.
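The trade-off between likelihood gain and prior loss can be illustrated with a small sketch. It assumes a piecewise-constant Poisson intensity whose levels are set to their maximum-likelihood values, and it shows only the geometric part of the prior cost of one extra segment (the densities of the new level and the new change point would add further terms). The event counts and all function names are hypothetical.

```python
import math

def max_loglik(pieces):
    """Maximised Poisson log-likelihood of a piecewise-constant intensity.

    pieces: list of (event_count, length) pairs, one per segment.
    Each level is set to its maximum-likelihood value count / length.
    """
    total = 0.0
    for count, length in pieces:
        if count > 0:  # an empty segment contributes 0 at level 0
            total += count * math.log(count / length) - count
    return total

def split_gain(n_left, len_left, n_right, len_right):
    """Log-likelihood gained by splitting one segment into two."""
    merged = max_loglik([(n_left + n_right, len_left + len_right)])
    split = max_loglik([(n_left, len_left), (n_right, len_right)])
    return split - merged  # never negative

def geometric_penalty(gamma):
    """log P(m + 1) - log P(m) under a Geom(gamma) prior on m."""
    return math.log1p(-gamma)

# Hypothetical counts: 30 events in the left half, 10 in the right half.
gain = split_gain(30, 50.0, 10, 50.0)
for gamma in (0.001, 0.5, 0.9):
    print(f"gamma = {gamma}: gain + penalty = {gain + geometric_penalty(gamma):.3f}")
```

The gain is zero when both halves have the same event rate and positive otherwise, so without a prior penalty the split would always be at least as good as the merged segment.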

From the point of view of data mining, a particularly interesting problem in segmentation, and in clustering more generally, is finding the optimal number of segments based on the given data. In a typical data mining problem, the amount of data is large, and there is little previous knowledge of the process generating the data.

In the experiments on the ORF distribution, the results clearly indicate differences between different chromosomes. The priors for the number of segments are less informative in the first and fourth trials. In the second trial small values were given considerably higher probability, and they were emphasized even more strongly in the third trial. While giving higher prior probabilities to smaller segment counts naturally decreases the expected number of segments, the magnitude of the effect seems to be quite different in different chromosomes.

An important aspect is that the posterior distribution of the number of segments may be influenced more by the prior specification of the intensity levels than by the prior for the dimension of the model. This is because in the higher dimension the joint prior density of the model contains one more intensity-level prior density as a factor than in the lower dimension.


This fact may cause problems when estimating the number of segments.

Assume, for instance, that very little prior knowledge is available as to the possible intensity values. Accordingly, we would like to give wide uniform prior distributions to the intensity levels. This practice would make sense if the model dimension were fixed. However, for models with variable dimension, the wider the prior distribution is, the more the joint prior density drops when a new segment is inserted.
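The dependence on the prior width can be made concrete with a short sketch. It assumes, for illustration only, a Unif(0, width) prior on each intensity level, a Geom(γ) prior on the number of segments, and a uniform prior for the new change point over an observation period of length `span`. The function name and the numerical values are hypothetical.

```python
import math

def log_prior_change_on_insert(width, gamma, span):
    """Change in the log joint prior when one segment is inserted.

    One extra geometric factor (1 - gamma), one extra level density
    1 / width, and one extra change-point density 1 / span.
    """
    return math.log1p(-gamma) - math.log(width) - math.log(span)

# The same insertion becomes more costly as the level prior widens.
for width in (1.0, 100.0, 10000.0):
    change = log_prior_change_on_insert(width, gamma=0.001, span=1000.0)
    print(f"width = {width}: log prior change = {change:.3f}")
```

Widening the level prior by a factor of 100 lowers the log prior of every inserted segment by ln 100, regardless of the geometric hyperparameter.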

The fourth trial illustrates this effect on the ORF distribution example. Gamma(0.005, 0.5) priors were specified for the intensity levels λ_i in all the trials except for the fourth one, for which a Gamma(0.001, 0.1) distribution was used instead. The prior distribution of the last trial increases the prior variance of the intensity levels fivefold (ν/η² rises from 1/50 to 1/10). In all chromosomes, this change of prior has a stronger impact on the posterior number of segments than increasing the hyperparameter of the geometric prior of the number of segments from 0.001 to 0.5.

The sequence of SNP occurrences provides an example of a dataset where segmentation is of no use for obtaining a condensed representation of the data. Still, the RJMCMC methods can be used to model the continuous intensity. Figure 5.8 shows examples of intensities of the SNP occurrences from chromosomes 10 and 14. Only a few constant periods can be found as the posterior average of the number of segments is several thousand in both cases.

[Figure 5.8: two panels. Horizontal axis: Location (1000 bp). Vertical axis: Intensity.]

Fig. 5.8. The posterior means and 99% percentiles of the intensity of SNP occurrence in human chromosomes 10 and 14.

From the document Advanced Information and Knowledge Processing (pages 100–104)