2 Materials and Methods

2.7 Sequence analysis

2.7.1 Sequence annotation and comparative DNA sequence analysis

CG13617 coding regions were identified in the D. buzzatii sequence using BLASTX (translated nucleotide query vs. protein database) (MCGINNIS and MADDEN 2004) and compared with the D. melanogaster predicted protein annotation in FlyBase (TWEEDIE et al. 2009). DNA and protein sequence alignments were performed with CLUSTALW (LARKIN et al. 2007) and/or MUSCLE (EDGAR 2004) and visualized with BIOEDIT (HALL 1999). The CG13617 orthologous sequences and annotations from other Drosophila species correspond to the CAF1 (Comparative Analysis Freeze 1) assemblies of the 12 sequenced genomes (CLARK et al. 2007) and were obtained from the DroSpeGe Browser (GILBERT 2007), accessed through the Assembly/Alignment/Annotation of 12 related Drosophila species website. The GLEANR consensus annotation (CLARK et al. 2007) was verified for each species by translating the predicted coding region and comparing the resulting protein with those of D. melanogaster and D. buzzatii. CG13617 similarity searches in species outside the genus Drosophila were performed with BLASTP (protein query vs. protein database) and TBLASTN (protein query vs. translated nucleotide database) (MCGINNIS and MADDEN 2004), in both cases using the D. buzzatii 2st protein as the query against the corresponding GenBank non-redundant databases. Searches of the available completed genomes of all organisms added no new sequences to the list of putative orthologous proteins.

CG13617 upstream flanking sequences were analyzed with mVISTA and rVISTA (LOOTS et al. 2002, FRAZER et al. 2004) to search for conserved non-coding sequences (CNSs) among the different Drosophila species that could indicate the presence of promoter or regulatory elements. The DNA sequences of the five Drosophila subgenus species, spanning from the first two exons of gene nAcRβ-96A to the first two exons of CG13617, were submitted to the VISTA server together with their respective annotations and aligned with LAGAN (BRUDNO et al. 2003), using translated anchoring to improve the alignment of distant homologues. A phylogenetic tree showing the relationships among the species was entered manually. The criterion used to identify significantly conserved sequences was 70% identity over a 100 bp window. Both mVISTA and rVISTA were used to define the CNSs, and rVISTA was also used to search for potential transcription factor binding sites (TFBSs). MATCH™ public version 1.0 (Biobase) was also used to search for TFBSs, using all matrices from the TRANSFAC® database and a cut-off to minimize false positives. Promoter predictions were performed with the 2006 fly version of MCPROMOTER (OHLER et al. 2002, OHLER 2006) at the highest sensitivity level, or with the NEURAL NETWORK PROMOTER PREDICTION tool (REESE 2001) at the Berkeley Drosophila Genome Project site.
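The conservation criterion above (70% identity over a 100 bp window) can be illustrated with a minimal sketch. This is not the VISTA implementation. The function name `conserved_windows`, the treatment of gap columns as mismatches, and the merging of overlapping windows are assumptions made only for illustration.

```python
def conserved_windows(seq_a, seq_b, window=100, min_identity=0.70):
    """Return merged (start, end) half-open column ranges whose windows pass the cutoff.

    seq_a and seq_b are two rows of an existing alignment (equal length,
    '-' for gaps). Assumption: a column with a gap never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if n < window:
        return []
    # 1 where both rows carry the same base, 0 otherwise
    match = [1 if a == b and a != "-" else 0
             for a, b in zip(seq_a.upper(), seq_b.upper())]
    regions = []
    count = sum(match[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # slide the window one column to the right
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # overlaps or touches the previous region: extend it
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Small window used here only to keep the example readable
print(conserved_windows("AAAAAAAAAA", "AAAAATTTTT", window=4, min_identity=0.75))
```

With the default arguments the function applies the 100 bp / 70% criterion to a pairwise alignment. A multi-species comparison such as the one described here would repeat the test for each species pair.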

The other genes sequenced in D. buzzatii (TABLE 5, see Results) were identified in the D. mojavensis and D. virilis CAF1 genomes by searching the available annotations in the DroSpeGe Browser (GILBERT 2007) at the Assembly/Alignment/Annotation of 12 related Drosophila species website. When the exact coding regions were in doubt, the consensus GLEANR annotation (CLARK et al. 2007) was followed. Homology with the corresponding D. melanogaster gene was always verified by BLASTN searches against the D. melanogaster genome. Finally, the nucleotide sequences of the two Drosophila subgenus species (D. mojavensis and D. virilis) were aligned for each gene with MUSCLE (EDGAR 2004) to identify conserved regions in which interspecific primers could be designed.

2.7.2 Protein analysis

Protein sequences were obtained by conceptual translation of the predicted CG13617 coding regions in the different available Drosophila genomes or of the sequences generated experimentally in this work. To determine the best alignment of the CG13617 proteins of 14 Drosophila species, we tried several alignment methods, including MUSCLE, CLUSTALW and T-COFFEE, with different parameters. The different methods gave very similar results, differing only in the position and/or length of some gaps. The final alignment was based on the MUSCLE alignment (EDGAR 2004), with minor modifications according to the regular T-COFFEE alignment (NOTREDAME et al. 2000). The multiple sequence alignment was visualized with BIOEDIT (HALL 1999). Identity and similarity values between the CG13617 proteins of the different Drosophila species were calculated with MATGAT (CAMPANELLA et al. 2003), based on pairwise alignments performed by the same program using BLOSUM62 as the scoring matrix.
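As an illustration of how identity and similarity percentages can be derived from a pairwise alignment, the following sketch counts identical residues and residue pairs with a positive substitution score. It is not the MATGAT code. The function name, the use of the alignment length as denominator, and the "score > 0 means similar" rule are assumptions, and `TOY_SCORES` is a three-entry illustrative table rather than the full BLOSUM62 matrix.

```python
def identity_and_similarity(aln_a, aln_b, scores):
    """Percent identity and percent similarity for two aligned rows.

    scores maps an unordered residue pair (a frozenset) to a substitution
    score. Assumptions: pairs scoring > 0 count as similar, gap columns count
    as neither, and the denominator is the full alignment length.
    """
    if not aln_a or len(aln_a) != len(aln_b):
        raise ValueError("rows must be non-empty and of equal length")
    identical = 0
    similar = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            continue  # gap column
        if a == b:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif scores.get(frozenset((a, b)), 0) > 0:
            similar += 1
    length = len(aln_a)
    return 100.0 * identical / length, 100.0 * similar / length


# Illustrative subset only; a real analysis would use the complete matrix
TOY_SCORES = {frozenset("IV"): 3, frozenset("KR"): 2, frozenset("AG"): 0}

ident, simil = identity_and_similarity("MKIV-A", "MRVVGA", TOY_SCORES)
print(f"identity {ident:.1f}%, similarity {simil:.1f}%")
```

A complete scoring matrix could be supplied in the same dictionary form, for example after loading it from a library such as Biopython.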

The putative protein domains and motifs were detected using different prediction programs. The C2H2-type zinc finger was identified in the different proteins using INTERPROSCAN (ZDOBNOV and APWEILER 2001). The COILS software (LUPAS et al. 1991) was used to predict and characterize the coiled coil regions in each sequence individually, with the following parameters: MTIDK matrix (weighted and unweighted), a 28-residue window to determine the presence or absence of a coiled coil structure, a 21-residue window to define the ends of coiled coil segments more accurately, and residues with probabilities >50% considered to be part of a coiled coil. In the D. buzzatii protein sequence, the nuclear localization signal (NLS) was predicted with the PSORTII software (NAKAI and KANEHISA 1992) and the nuclear export signal (NES) with the NETNES 1.1 server (LA COUR et al. 2004). Finally, the presence of PEST sequences was determined with EPESTFIND (RECHSTEINER and ROGERS 1996).
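The >50% probability rule for assigning residues to a coiled coil can be sketched as a simple thresholding step over per-residue probabilities. The sketch assumes the probabilities have already been computed by a predictor such as COILS. The function name `coiled_coil_segments` and the example values are hypothetical.

```python
def coiled_coil_segments(probabilities, threshold=0.5):
    """Return 1-based inclusive (start, end) runs with probability > threshold.

    probabilities is a list of per-residue coiled coil probabilities (0 to 1),
    one value per residue in sequence order.
    """
    segments = []
    start = None  # start of the run currently being extended, if any
    for pos, p in enumerate(probabilities, start=1):
        if p > threshold:
            if start is None:
                start = pos
        elif start is not None:
            segments.append((start, pos - 1))
            start = None
    if start is not None:
        # the last run reaches the end of the sequence
        segments.append((start, len(probabilities)))
    return segments


# Made-up probabilities, for illustration only
print(coiled_coil_segments([0.1, 0.6, 0.9, 0.5, 0.7, 0.8]))
```

Because the comparison is strict, a residue with a probability of exactly 0.5 is left out of the segment, in line with the ">50%" wording.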

For the protein conservation analysis we used the web version of the AL2CO program (PEI and GRISHIN 2001) on the multiple alignment of the CG13617 proteins of 14 Drosophila species (excluding the Dbuz_2j sequence). All positions with gaps in more than 50% of the sequences (77 positions) were excluded from the analysis. To visualize the conservation patterns of the protein, we tried different window sizes (the optimum size for motif identification according to the AL2CO documentation is 3) and selected a 10 aa window as the one giving the most interpretable results, although qualitatively similar results were obtained with other sizes. The different methods for calculating conservation give very similar results overall (as pointed out by PEI and GRISHIN 2001), and we chose the Sum of Pairs measure with the BLOSUM62 matrix because it takes into account the similarity between different amino acids, not only identity. To calculate amino acid frequencies we used the modified Henikoff & Henikoff method, which corrects for unequal distances between the sequences (similar sequences receive lower weights) and has been widely used. Finally, to make the conservation indices of invariant positions equal to each other, we used the S’ normalization method.
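Two of the steps described above, excluding alignment positions with gaps in more than 50% of sequences and averaging conservation indices over a window, can be sketched as follows. This is not the AL2CO code. The function names `keep_columns` and `window_mean` are hypothetical, and the choice to average only over complete windows is an assumption.

```python
def keep_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns whose gap fraction does not exceed max_gap_fraction.

    alignment is a list of equal-length aligned sequences ('-' for gaps).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have equal length")
    kept = []
    for col in range(length):
        gaps = sum(1 for row in alignment if row[col] == "-")
        # a column is dropped only when gaps exceed the allowed fraction
        if gaps / len(alignment) <= max_gap_fraction:
            kept.append(col)
    return kept


def window_mean(values, window=10):
    """Mean of every complete window of consecutive per-position values."""
    if window < 1 or window > len(values):
        return []
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide by one position
        means.append(total / window)
    return means


toy_alignment = ["MK-A", "M--A", "MR-A", "M-QA"]  # made-up sequences
print(keep_columns(toy_alignment))
print(window_mean([1, 2, 3, 4], window=2))
```

In this sketch a column with gaps in exactly half of the sequences is retained, since only positions with gaps in more than 50% of sequences are excluded.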