Materials and Methods

Data analysis was performed following the workflow reported in Fig 19. Genomic profiles were obtained using Affymetrix Human Mapping GeneChip 6.0 arrays (Affymetrix, Santa Clara, CA, USA). Initial array intensity data were obtained using Affymetrix GeneChip Command Console Software (AGCC), with 270 Caucasian HapMap samples as a reference dataset for normalization. Raw copy number (CN) and allelic imbalance (AI) were extracted from CEL files using the Affymetrix Power Tools software package with standard parameters (Affymetrix, Santa Clara, CA). We applied 'probe weeding' by removing probes (about 20%) with an interprobe distance of less than 180 bp. Without probe weeding, hot spots of high probe density were prone to artifacts in the form of tiny amplifications or deletions. A minimum probe distance of 180 bp ensures that probes are more or less independent of the PCR fragment size. Allelic imbalance (AI) was corrected for copy number. The copy number for the X chromosome was centered to the median value.
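The probe weeding step can be illustrated with a minimal sketch. The exact selection rule is not specified here, so the greedy left-to-right rule below, the function name `weed_probes` and the example coordinates are assumptions made for illustration only.

```python
def weed_probes(positions, min_distance=180):
    """Return the indices of probes kept after weeding.

    positions: sorted genomic coordinates (bp) of the probes on one chromosome.
    A probe is kept only if it lies at least `min_distance` bp downstream of
    the previously kept probe (greedy rule: an assumption of this sketch).
    """
    kept = []
    last_pos = None
    for i, pos in enumerate(positions):
        if last_pos is None or pos - last_pos >= min_distance:
            kept.append(i)
            last_pos = pos
    return kept


# Hypothetical probe coordinates, for illustration only.
positions = [1000, 1050, 1200, 1300, 1390, 1600]
kept = weed_probes(positions)
print([positions[i] for i in kept])  # [1000, 1200, 1390, 1600]
```

With this rule, every pair of neighbouring retained probes is at least 180 bp apart. The fraction of probes removed depends on the local probe density.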

Because existing algorithms are sensitive to genomic waves and tend towards hypersegmentation, we devised a new segmentation algorithm, 'Fast first-derivative segmentation' (FFSEG; Kwee et al, submitted), based on edge detection using first derivatives of the raw copy number signal. FFSEG is considerably more robust against genomic waves and hypersegmentation, is extremely fast, and compares well with established algorithms such as mBPCR [218] and circular binary segmentation (CBS) [219]. For details on mBPCR and CBS, see the references.

FFSEG uses non-segmented, raw copy number data that has not necessarily been corrected for genomic waves. For each chromosome, FFSEG locally computes first derivatives by convolving the raw signal with the SGED kernel

F(x) = -sign(x) * exp(-(x/n)**2)

where x is the position index and n is a scale parameter. This kernel was preferred over the more commonly used derivative of a Gaussian because it localized edge positions better, owing to its sharp boundary at x=0.
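To make the kernel concrete, the following sketch samples F(x) at integer positions and convolves it with a synthetic step signal. It is not the FFSEG implementation: the truncation of the kernel at 3n, the replication of edge values at the chromosome ends, and the names `sged_kernel` and `first_derivative` are assumptions of this example.

```python
import math


def sged_kernel(n, half_width=None):
    """Sample F(x) = -sign(x) * exp(-(x/n)**2) at integer x in [-w, w]."""
    # Truncating the kernel at w = 3n is an assumption of this sketch.
    w = 3 * n if half_width is None else half_width
    return [
        0.0 if x == 0 else -math.copysign(1.0, x) * math.exp(-(x / n) ** 2)
        for x in range(-w, w + 1)
    ]


def first_derivative(signal, n):
    """Discrete convolution of `signal` with the kernel at scale n."""
    kernel = sged_kernel(n)
    w = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            x = j - w
            # Replicate the first/last value beyond the ends (assumption).
            idx = min(max(i - x, 0), last)
            acc += signal[idx] * weight
        out.append(acc)
    return out


# Synthetic signal with a single upward step between indices 49 and 50.
step = [0.0] * 50 + [1.0] * 50
response = first_derivative(step, n=4)
peak = max(range(len(response)), key=response.__getitem__)
print(peak)
```

Under this convention the response is positive for an upward step and negative for a downward step. It is largest at the two probes adjacent to the step and falls off within about n probes on either side.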

Candidate edges were defined as statistically significant outliers of the first-derivative signal (p<0.05 after Bonferroni multiple test correction). Edge detection was repeated at multiple scales, n = (8, 16, 32, 64, 128, 256, 512, 1024, 2048); we found that this set of scales detected most edges in our test cases. The sets of edges were then combined and duplicate positions were removed. This final set served as candidate positions of true chromosomal breakpoints. A subsequent pruning step removed breakpoint candidates whose flanking regions did not differ significantly in mean signal (t-test, p<0.05). Finally, the copy number level of each region was estimated by its mean value. For larger scales, we first binned the raw signal to a coarser resolution, detected the approximate positions of edge candidates, and then refined the edges by re-estimating their positions with the SGED kernel at the full probe resolution. This sped up the computation considerably without loss of accuracy.

We applied a multi-step filtering strategy to minimize false positives. Segments containing fewer than 10 probes or smaller than 2000 bp were discarded. Segments were filtered out as copy number variants (CNVs) if they overlapped by at least 50% with a known CNV according to the Database of Genomic Variants (http://projects.tcag.ca/variation/).

Segments spanning the centromere were filtered out.
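The three filters can be combined in a short sketch. The segment representation (a dictionary with `start`, `end` and `n_probes`), the measurement of CNV overlap relative to segment length, and the definition of "spanning" as covering the whole centromere interval are assumptions made for this example.

```python
def overlap_fraction(start, end, intervals):
    """Fraction of [start, end) covered by the union of `intervals`."""
    covered = 0
    cursor = start
    for a, b in sorted(intervals):
        a, b = max(a, cursor), min(b, end)
        if b > a:
            covered += b - a
            cursor = b
    return covered / (end - start)


def keep_segment(seg, known_cnvs, centromere,
                 min_probes=10, min_size=2000, max_cnv_overlap=0.5):
    """Return True if the segment passes all three filters."""
    start, end = seg["start"], seg["end"]
    # Filter 1: too few probes or too small.
    if seg["n_probes"] < min_probes or end - start < min_size:
        return False
    # Filter 2: at least 50% of the segment lies in known CNVs (assumed basis).
    if overlap_fraction(start, end, known_cnvs) >= max_cnv_overlap:
        return False
    # Filter 3: the segment spans the centromere interval.
    cen_start, cen_end = centromere
    if start < cen_start and end > cen_end:
        return False
    return True


# Hypothetical coordinates on a single chromosome, for illustration only.
seg = {"start": 0, "end": 10_000, "n_probes": 20}
print(keep_segment(seg, known_cnvs=[(0, 6_000)], centromere=(50_000, 60_000)))  # False
```

The known CNV intervals and the centromere interval are expected to refer to the same chromosome as the segment.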

After CNV filtering, we employed 'gap filling'. Small gaps of less than 500 kb that were flanked by aberrant regions larger than the gap itself were closed. Larger gaps between 500 kb and 10 Mb were filled if they were flanked by regions of at least 1.5 times their size. Finally, neighbouring segments were iteratively merged if their copy number levels differed by less than 0.15 (log2 ratio).
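A minimal sketch of the gap-filling rule and the iterative merge is given below. Several details are assumptions: both flanking regions are required to satisfy the size condition, the closest pair of neighbours is merged first, and the merged level is the probe-weighted mean. The function names are hypothetical.

```python
def should_fill_gap(gap, left, right):
    """Decide whether a gap (bp) between two aberrant regions is closed.

    left, right: sizes (bp) of the flanking aberrant regions. Requiring both
    flanks to satisfy the size condition is an assumption of this sketch.
    """
    if gap < 500_000:
        return left > gap and right > gap
    if gap <= 10_000_000:
        return left >= 1.5 * gap and right >= 1.5 * gap
    return False


def merge_neighbours(segments, max_diff=0.15):
    """Iteratively merge adjacent segments with similar log2 ratios.

    segments: list of (n_probes, mean_log2ratio) tuples in genomic order.
    The closest pair is merged first; the merged mean is probe-weighted.
    """
    segs = list(segments)
    while len(segs) > 1:
        diffs = [abs(segs[i][1] - segs[i + 1][1]) for i in range(len(segs) - 1)]
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= max_diff:
            break
        (n1, m1), (n2, m2) = segs[i], segs[i + 1]
        segs[i:i + 2] = [(n1 + n2, (n1 * m1 + n2 * m2) / (n1 + n2))]
    return segs


print(should_fill_gap(300_000, 400_000, 1_000_000))  # True
print(merge_neighbours([(100, 0.0), (100, 0.1), (50, 0.8)]))
```

In the last call the first two segments differ by 0.1 and are merged, whereas the third differs from the merged level by 0.75 and is left separate.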

As a final quality control step, the filtered and merged segmented profiles, together with their raw profiles, were visually inspected. Profiles were considered to be of poor quality and were discarded from further analyses if they showed severe over-segmentation or no aberrations at all, as evaluated by two independent investigators.

Copy number segments were discretized as heterozygous deleted (HDEL), loss (LOSS), normal (NORM), gain (GAIN), or amplified (AMPL) using the thresholds -0.60, -0.10, 0.10 and 0.60 (log2 ratio). Segments were classified as showing loss of heterozygosity (LOH) if the absolute AI was >0.25.
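The discretization can be expressed as a simple threshold lookup, sketched below. How values exactly equal to a threshold are assigned is not stated, so the boundary handling here is an assumption, as are the function names.

```python
def discretize(log2ratio, thresholds=(-0.60, -0.10, 0.10, 0.60)):
    """Map a segment's mean log2 ratio to one of five CN classes.

    Values exactly on a threshold are assigned to the less extreme class
    (an assumption of this sketch).
    """
    hdel, loss, gain, ampl = thresholds
    if log2ratio < hdel:
        return "HDEL"
    if log2ratio < loss:
        return "LOSS"
    if log2ratio <= gain:
        return "NORM"
    if log2ratio <= ampl:
        return "GAIN"
    return "AMPL"


def is_loh(ai, cutoff=0.25):
    """A segment is called LOH if its absolute allelic imbalance exceeds the cutoff."""
    return abs(ai) > cutoff


for value in (-0.8, -0.3, 0.0, 0.3, 0.9):
    print(value, discretize(value))
```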

Clustered heatmaps were generated using Euclidean distance and Ward linkage on the discretized CN/LOH values after resampling segments uniformly to about 10000 probes. The X and Y chromosomes were omitted from the clustering.

Partial frequencies were computed for LOH and for each CN aberration class. The overall frequencies of LOSS and GAIN correspond to the cumulative frequencies of (LOSS+HDEL) and (GAIN+AMPL), respectively.
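As an illustration of the partial and overall frequencies at a single locus, the sketch below counts discretized calls across samples. The input format (one class label per sample) and the function name are assumptions; the example calls are invented.

```python
from collections import Counter

CLASSES = ("HDEL", "LOSS", "NORM", "GAIN", "AMPL")


def aberration_frequencies(calls):
    """Compute partial and overall frequencies at one locus.

    calls: one discretized CN class label per sample.
    Returns (partial, overall), where overall LOSS = LOSS + HDEL and
    overall GAIN = GAIN + AMPL.
    """
    n = len(calls)
    counts = Counter(calls)
    partial = {c: counts[c] / n for c in CLASSES}
    overall = {
        "LOSS": partial["LOSS"] + partial["HDEL"],
        "GAIN": partial["GAIN"] + partial["AMPL"],
    }
    return partial, overall


# Invented calls for ten samples, for illustration only.
calls = ["HDEL", "LOSS", "LOSS", "NORM", "NORM",
         "NORM", "GAIN", "AMPL", "NORM", "NORM"]
partial, overall = aberration_frequencies(calls)
print(partial)
print(overall)
```

In this example 30% of samples carry a loss in the overall sense (two LOSS and one HDEL) and 20% carry a gain (one GAIN and one AMPL).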

Minimal common regions (MCR) were calculated according to the algorithm by Lenz et al [220]. Briefly, the MCR algorithm iteratively searches for high-frequency peaks and determines the smallest region of maximum overlap as the corresponding MCR. Four distinct types of MCR were defined: short recurrent abnormality (SRA), long recurrent abnormality (LRA), abnormal chromosome arm (ACA) and abnormal whole chromosome (AWC). A maximum segment size of 25 Mb and a gap size of 500 kb were applied for SRA analysis, and a minimal segment size of 15 Mb and a gap size of 10 Mb for LRA analysis [220]. MCR frequencies were calculated as the fraction of samples bearing the same aberration contained in the MCR, without regard to MCR type. MCRs containing the genes encoding the immunoglobulin heavy chain (IGHV) and the kappa and lambda light chains were discarded, since CN changes in these regions probably represent physiological rearrangements occurring in B-cells. Focal aberrations were additionally identified using GISTIC 2.0 [221] on the filtered and merged segments.

Figure 19. Workflow followed for data analysis.
