Bioinformatics tools to assist drug candidate discovery in venom gland transcriptomes


Thesis

Reference


KOUA, Dominique Kadio

Abstract

Current pharmaceutical research is actively exploring the field of natural peptides. Venomics addresses this issue with the study of toxins. The concomitant development of sequencing techniques is opening new perspectives of understanding biological mechanisms.

Transcriptome sequencing of specific tissues is undertaken to better understand and characterize the context of gene expression. In this framework, the transcriptomic data made available require automated processing workflows and user-friendly interfaces for data exploitation and comprehension. We present TATools, a bioinformatic platform that provides a unique management environment for understanding transcriptome data by merging the results of diverse classical sequence analyses. Additional features and dedicated viewer pages make TATools a valuable solution for highlighting novelty in a single transcriptome as well as for cross-analysis of several transcriptomes in the same environment. TATools is validated in the context of venomics. This thesis reports the genesis of the design of TATools as presented in two published articles and a manuscript (at this stage under [...]

KOUA, Dominique Kadio. Bioinformatics tools to assist drug candidate discovery in venom gland transcriptomes. Thèse de doctorat : Univ. Genève, 2012, no. Sc. 4471

URN : urn:nbn:ch:unige-239511

DOI : 10.13097/archive-ouverte/unige:23951

Available at:

http://archive-ouverte.unige.ch/unige:23951

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE, FACULTÉ DES SCIENCES

Département d'informatique: Professor Ron D. Appel

Institut Suisse de Bioinformatique: Dr. Frédérique Lisacek

LABORATOIRES ATHERIS: Dr. Reto Stöcklin

Bioinformatics tools to assist drug candidate discovery in venom gland transcriptomes.

THESIS

presented to the Faculty of Science of the Université de Genève to obtain the degree of Doctor of Science (Docteur ès sciences), specialization in Bioinformatics

by

Dominique Kadio Koua

from Bouaké (Côte d'Ivoire)

Thesis No. 4471

Geneva

Centre d'impression UNIGE

1 October 2012


Thesis

Bioinformatics tools to assist potential drug candidate discovery in venom gland transcriptomes.

KOUA, Dominique Kadio

Abstract

Current pharmaceutical research is actively exploring the field of natural peptides. Venomics addresses this issue with the study of toxins. The concomitant development of sequencing techniques is opening new perspectives for understanding biological mechanisms. Transcriptome sequencing of specific tissues is undertaken to better understand and characterize the context of gene expression. In this framework, the transcriptomic data made available require automated processing workflows and user-friendly interfaces for data exploitation and comprehension. We present TATools, a bioinformatic platform that provides a unique management environment for understanding transcriptome data by merging the results of diverse classical sequence analyses. Additional features and dedicated viewer pages make TATools a valuable solution for highlighting novelty in a single transcriptome as well as for cross-analysis of several transcriptomes in the same environment. TATools is validated in the context of venomics.

This thesis reports the genesis of the design of TATools as presented in two published articles and a manuscript (at this stage under revision), and it describes the final outcome of this work with the support of a submitted manuscript detailing the analysis workflow. The use of TATools is illustrated with the study of the Conus consors venom gland transcriptome and the subsequent identification and classification of conopeptides. Other applications of parts of the TATools platform are shown in two further published articles.



Acknowledgements

My deepest thanks to:

Professor Ron D. Appel of the Université de Genève, President of the jury,

Professor Amos Bairoch of the Université de Genève, internal expert,

Professor Oliver Hartley of the Université de Genève, internal expert,

Professor Jordi Molgo of the CNRS in France, external expert,

Doctor Frédérique Lisacek of the Institut Suisse de Bioinformatique, co-supervisor,

Doctor Reto Stöcklin of Laboratoires Atheris in Geneva, co-supervisor,

the honourable members of the jury, who agreed to devote their time to evaluating and improving the work I presented.

I would like to thank everyone whose trust, support and assistance made this thesis work possible.

Many thanks to Sylvie and Reto Stöcklin for giving me the opportunity to carry out my thesis work at Laboratoires Atheris. Thanks to Frédérique Lisacek, who welcomed me into the PIG group of the Institut Suisse de Bioinformatique and who always had confidence in me. Thanks to Reto and to all the partners of the CONCO project for the rewarding experiences of international collaboration in which I had the chance to take part. Thanks to my thesis co-supervisors for always being present and available to guide this work. Many thanks for the friendship you continue to show me.

I would also like to thank Philippe Favreau (Philou) for his advice and guidance as well as for his unfailing availability.

I would like to thank all my colleagues at Laboratoires Atheris. It was a daily pleasure to work in a team as competent as it is friendly. May Estelle B., Roman M., Coralie D., Aude V., Cecile C., Nicolas H., Frederic P., Francine A., Florence B., Xavier S., Daniel B., Vera O., Hadrien G., Alain C., Sebastien D., Florence A., and everyone else find here my gratitude for the excellent moments we shared.

I also thank my parents, my brothers and sisters for their affection and their unfailing support despite the distance. Dearest André K., Joséphine B., Suzanne K., Jean-Baptiste A., Jacques A., Florence K., Eugène G., Innocent K., Vincent K., Denis K., I tried to draw from your courage and optimism the strength to carry forward the tasks that fell to me. Thank you for always being there.

My gratitude also goes to those in charge of the Résidence Universitaire de Champel for the very friendly, family-like atmosphere and for their ever-attentive support. Thanks in particular to Joachim H., Manuel L., Alfred F., Lukas W., Philippe M., Hans F., Albert O., Albert M., Carlos S., Peter R., as well as to all the residents I had great pleasure in meeting.

Many thanks to Véronique M., Jocelyne B., Dolnide D., Laure V., Sylvie S. and Gabrielle de B. for their invaluable help with administrative matters.

Thanks to all my colleagues at the SIB for their friendship and the stimulating example of their scientific excellence. Thanks in particular to Patricia P., Christian S., Béatrice C., Lorenzo C., Markus M., Fréderic N., Erik A. I would also like to express my sincere thanks to everyone at the Institut Suisse de Bioinformatique and the Université de Genève for their availability and their always cordial assistance.

Thanks to all my friends, here and elsewhere, for their support.

MANY THANKS TO YOU ALL.

Deo Omnis Gloria !


General presentation

Modern pharmaceutical research relies essentially on the high-throughput screening of candidate molecules in order to select active ingredients that specifically target biological receptors involved in the diseases to be treated. However, over the past fifteen years or so, the number of new molecules put forward by the pharmaceutical industry has been steadily declining. It is therefore essential to consider exploring new sources of bioactive compounds. In this context, natural peptides occupy an increasingly important place in research programmes. In particular, animal venoms, known to be cocktails of highly active and specific compounds, have been widely studied and have already revealed a large part of their richness. With the emergence of new high-throughput sequencing techniques, exploring the full range of proteins being expressed by 'reading' the messenger RNAs (the transcriptome) of venom glands has become possible and economically affordable. This new approach has the advantage of allowing a more detailed exploration of the potential of the venom apparatus. However, improvements in sequencing techniques lead to the production of ever larger transcriptomes made up of millions of reads. Bioinformatic analysis of transcriptomes to identify peptides of potential interest is therefore a crucial step in pharmaceutical research based on transcriptome exploration.

A review of the classical approaches to the bioinformatic analysis of transcriptomes brought to light several problems for which the present study proposes a solution. Because of the growing volume of transcriptomic data produced and the variety of existing analysis tools, the practical exploitation of transcriptomes is still very limited. Four main problems were identified in the present work:

1- The current analysis methodology is not optimal and, above all, is too costly in computing hours.

2- Identifying peptides of interest and annotating their potential functions remains a long and tedious activity that requires an experienced biologist able to explore and manually compile heterogeneous results from, among others, sequence similarity searches, conserved domain searches and links to ontologies.

3- The quantity of results to be validated manually, as well as most of the commonly used bioinformatics tools, do not allow the discovery of truly novel compounds.

4- The whole exploration and discovery process is severely hampered by the lack of adequate visualization tools.

The subject of the present work is a computing environment that assists the discovery of peptides of pharmaceutical interest from the analysis of venom gland transcriptomes. This thesis proposes and describes TATools, an efficient and user-friendly solution that addresses the four concerns raised above. The use of TATools and of the new analysis scheme within the European CONCO project¹ contributed satisfactorily to several research questions, both fundamental and applied. As regards pharmaceutical research in particular, the proposed approach made it possible to identify and/or confirm the existence, within the venom of Conus consors, of analogues of XEP-018, a compound previously isolated from the venom of this predatory gastropod. In addition, the specific modelling of conopeptides (cone snail peptides) enabled the identification of many interesting compounds from the analysis of the transcriptome of a single specimen of Conus adamsonii.

Origin of the questions

Venomous animals are found at almost all latitudes and in many phyla: snakes and other reptiles, marine molluscs, fish, amphibians, insects, arachnids, myriapods and even a few mammals. They possess highly specialized exocrine glands coupled to a sometimes very sophisticated system (fangs, stingers, harpoons) for delivering the secreted venom. Venomous animals and their venoms have long been the subject of scientific study, especially because envenomations are a relatively important cause of death and/or disability worldwide. Moreover, venoms are very complex mixtures of peptides and proteins whose interest for pharmaceutical research has grown steadily in recent years. The appeal of venoms lies in the extreme specialization and impressive efficacy of the peptides and small proteins that make them up. Studies have shown that these compounds are active on a broad spectrum of molecular targets. Several drugs derived from venom peptides or their derivatives are already on the market (Capoten/Captopril; Integrilin/Eptifibatide; Aggrastat/Tirofiban; Prialt/Ziconotide; ...), while many other peptides are at various stages of validation or approval.

1 Applied venomics of the cone snail species Conus consors for the accelerated, cheaper, safer and more ethical production of innovative biomedical drugs (http://www.conco.eu/)

Biological analyses

The elucidation of venom composition is tied to the evolution of analytical techniques. Progress in separation methods such as gel electrophoresis (SDS-PAGE) and liquid chromatography (HPLC), advances in mass spectrometry and nuclear magnetic resonance (NMR) spectroscopy, and the miniaturization of biological assays were very quickly applied to the study of venoms (venomics or venomology). Venom proteomics has made it possible to draw up an increasingly complete inventory of the peptides and proteins that compose venoms. In addition, recent advances in sequencing have opened new horizons for understanding venom systems. Venom gland transcriptomics is increasingly used to complement proteomics in identifying compounds of potential interest for pharmaceutical research. In this context, bioinformatic analysis is an essential step insofar as it allows, thanks to the many tools available, better exploitation of the richness of transcriptomes. Without these bioinformatics tools, analysing transcriptomic data would be slow and tedious work. Conversely, the development of these tools facilitates and accelerates interpretation and guides the formulation of new hypotheses.

Problem statement

The questions raised above call for the ability to recognize peptides of interest and to optimize both the quality of identification and the speed of detection of those peptides. What method should be used to scan transcriptome data as quickly and efficiently as possible? How can sequences identical or similar to the peptide of interest be detected? How can a peptide or a peptide family be characterized so as to speed up the detection of homologous sequences?

Secondly, the exploitation of transcriptomic data must be extended beyond a single family or a single compound. Identifying and classifying sequences classically involves public databases that catalogue known sequences together with automatic and/or manual annotations. This approach presupposes the ability to access the annotations available online in order to infer those of the transcriptomic data produced. The major question remains what conclusion to draw when no external annotation is available for part of the transcriptome. In that case, a more thorough analysis must be carried out, since it may lead to the discovery of potentially unknown compounds.

There remains the very rarely addressed question of the (re)presentation of the results obtained. How can the results of the various analyses performed on the transcriptome be displayed in a way that is both concise and precise? How can interactivity with users be ensured? Visualizing data and analysis results is a challenge in itself, all the more so because the quality of the conclusions and interpretations drawn by users may depend on the quality of the visualization.

Methodology

Given the range of problems to be addressed, the solution we propose integrates a relational database and robust, high-performance analysis tools, all running in an interactive and user-friendly web environment. TATools (a bioinformatic environment for transcriptome analysis) allows novices and specialists alike to make the most of the immense potential of transcriptomes, made accessible at ever more affordable cost by new generations of sequencing instruments and techniques. The aim of the analyses is to optimize and facilitate the identification and detection of sequences that may be of interest for pharmacological research. To this end, TATools includes classical analysis tools such as BLAST for sequence similarity searches in databases, Gene Ontology (GO) for annotation transfer and the prediction of function or activity, SignalP for signal sequence prediction and MAFFT for multiple sequence alignment (all these tools are briefly described in Appendix 1).

From a methodological point of view, the great strength of the proposed environment lies in the fruitful combination of classical and complementary analyses based on probabilistic models whose effectiveness has been widely demonstrated. Two of the articles published as a result of our work address the use of these methods and demonstrate their benefits over previously used methods, notably BLAST. For the analysis of venom gland transcriptomes, generalized profiles (PSSMs) and hidden Markov models (HMMs) were built for the known toxin families. These models are used to detect analogues, and their applicability on a large scale is a major asset in identifying proteins of interest. TATools is therefore the result of merging, within a single environment, the results of classical tools and of pattern search methods for the analysis of large quantities of data.
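To make the profile idea concrete, the following is a minimal sketch of how a position-specific scoring matrix (PSSM) can be built from a few aligned peptides and slid along a translated transcript. It is an illustration only and not the TATools implementation: the toy alignment, the uniform background distribution, the pseudocount of 1 and the function names `build_pssm` and `best_window` are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, gap-free aligned peptides."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        # One column per alignment position: log2(observed / background).
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring ungapped window.

    Residues outside the 20 standard amino acids contribute 0.0.
    Returns (-inf, -1) when the sequence is shorter than the profile.
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    family = ["ACCKA", "GCCKA", "ACCRA"]  # hypothetical toy alignment
    profile = build_pssm(family)
    score, start = best_window(profile, "MLLTACCKAGG")
    print(f"best window starts at {start} with score {score:.2f}")
```

Real profile tools add gap handling, sequence weighting and realistic background frequencies. The sketch only shows why a profile can reward residues conserved across a family, which a single-sequence similarity search cannot do.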

Application to conopeptides

TATools was developed within the European CONCO project (www.conco.eu), coordinated by Laboratoires Atheris. It enabled the detection of interesting analogues of XEP-018, the project's flagship peptide. Furthermore, the complete analysis of the Conus consors and Conus adamsonii transcriptomes enabled the characterization of numerous peptides from many conopeptide families. The CONCO project was also an opportunity to highlight, once again, the close relationship between proteomics and transcriptomics. The proteomic studies (especially chromatographic fractionation and mass spectrometry analysis) conducted in parallel with the transcriptomic studies confirmed the presence, in venoms collected from live or dissected animals, of mature peptides whose precursors were detected by bioinformatic analysis of the transcriptome. Likewise, the transcriptome helped resolve some ambiguities encountered during the proteomic analysis of the venom.

It should also be noted that progress in proteomic analysis and sequencing methods has gone hand in hand with progress in the chemical production of proteins. Venoms are generally available in small quantities, which makes it relatively difficult to isolate and purify the peptides of interest. However, sequencing and chemical synthesis already make it possible to circumvent the resolution problems encountered in the proteomic analysis of small quantities of venom. Furthermore, bioinformatic analysis of transcriptomes is a relatively fast and efficient method for detecting the compounds sought, which can then be chemically synthesized and purified in order to validate their pharmacological activity.

Thesis outline

This manuscript has three parts. Since the aim of the study is to analyse and exploit transcriptomes in order to discover sequences with characteristics of interest for a protein target, Chapter I of the first part gives a general overview of the steps in the development of new drugs. In this process, the discovery of the active compound remains the central and critical point around which the validation and optimization steps are carried out. Transcriptomics offers a promising alternative for identifying candidate molecules. This approach and its technological context are therefore briefly described in this chapter. Chapter II of the first part introduces venomous animals and their venoms. Emphasis is placed on the characteristics of venoms that make them potential sources of candidates for the discovery of tomorrow's drugs. Chapter III is devoted to the bioinformatics activities carried out within the CONCO project, which is surely one of the most accomplished venomics initiatives, as it brought together specialists from various fields of expertise around the study of Conus consors, a venomous marine mollusc. Since these activities mostly concerned cone snail toxins (conotoxins), this chapter presents the challenges related to their classification as well as the approaches developed during the project to identify them within transcriptomes.

The second part is devoted to the problems related to the exploitation of transcriptomic data and to the methodology proposed to solve them (Chapter IV). Applying the proposed methodology led to the development of a bioinformatic environment for transcriptome analysis named TATools. The main lines of the implementation of the transcriptome analysis platform are described first, notably the modules included in the data analysis process and the structure of the underlying databases (Chapter V). Then, a detailed user guide for future users presents the main interfaces of the software and the main analysis activities that can be carried out with it (Chapter VI).

Finally, the third part presents the results of applying the proposed methodology. Chapter VII illustrates the use of the platform and gives some results from application examples. The manuscript ends with Chapter VIII, devoted to a discussion of the performance of the proposed platform. Prospects are also outlined for developments that could be implemented to improve the level of exploration and exploitation of transcriptomes.

The appendices include, among other things, articles presenting intermediate results obtained while solving the problems identified during the evaluation of the classical schemes, as well as the results obtained using the platform.

Keywords:

Next-generation sequencing, transcriptome, venom, toxin, peptide, protein, pharmacological activity, bioinformatics, protein family, generalized profiles, hidden Markov models, sequence analysis, automatic annotation, integrated environment, web tool.


Publications

Identification and classification of conopeptides using profile Hidden Markov Models.

Laht S., Koua D., Kaplinski L., Lisacek F., Stöcklin R., Remm M. 2012.

Biochim Biophys Acta. 1824(3):488-92. Epub 2011 Dec 30.

Position-Specific Scoring Matrix and Hidden Markov Model complement each other for the prediction of conopeptide superfamilies.

Koua D., Laht S., Kaplinski L., Stöcklin R., Remm M., Favreau P., Lisacek F.

Revised manuscript submitted to Biochim Biophys Acta.

ConoDictor: a tool for prediction of conopeptide superfamilies.

Koua D., Brauer A., Laht S., Kaplinski L., Favreau P., Remm M., Lisacek F., Stöcklin R. 2012.

Nucleic Acids Research. 40(Web Server issue):W238-41. Epub 2012 May 31. doi: 10.1093/nar/gks337.

TATools, a bioinformatic environment for transcriptomes analysis.

Koua D., Mylonas R., Favreau P., Stöcklin R. and Lisacek F.

Manuscript ready for submission to BMC Bioinformatics.

Pattern Searches in Protein Sequences.

Koua D. and Lisacek F. 2012.

In: eLS 2012, John Wiley & Sons, Ltd: Chichester. http://www.els.net/

[DOI: 10.1002/9780470015902.a0006222.pub2]


Table of contents

Acknowledgements

General presentation

Publications

Table of contents

List of figures

List of appendices

Introduction

Thesis overview

Chapter I. Brief overview of drug discovery and next-generation sequencing

1- Drug discovery: classical approach of a lead compound

1.1- The target-based approach

1.2- In-silico support in the drug discovery process

2- Next-generation sequencing techniques: opening promising opportunities

2.1- Sequencing techniques

2.2- Transcripts analysis and interpretation

Concluding remarks

Chapter II. Venomics: discovery platform for tomorrow's drug candidates

Introduction

1- Venomous animals and their venom

2- Venom component as drug candidate

2.1- Attractive characteristics of venoms

2.2- Venomics: a flourishing field

3- Venom Proteomics

3.1- Venomics co-evolved with analysis techniques

3.2- Venom compounds characterisation

4- Transcriptomics of venom glands

Chapter III. Bioinformatics for CONCO: conopeptide classification marathon

1- Project overview

2- Conus consors description

3- Conotoxins: nomenclature, classification and pharmacological interest

4- Conotoxins: bioinformatics classification tools

5- Concluding remarks

Chapter IV. Study methodology: needs for an improved analysis platform

1- Overview of classical analysis workflow

2- Evaluation of the classical analysis workflow

2.1- Problems opened by the classical workflow

2.2- T-ACE, classical transcriptome analysis and organization platform

3- Methodology: a drug-discovery oriented analysis workflow

3.1- Problem A: Time consuming analysis workflow

3.2- Problem B: Highlighting sequences of interest

3.3- Problem C: Cross-validation of bioinformatics results

Chapter V. TATools implementation

Introduction

1- Platform use cases

2- TATools methods

3- View and exploit results

Chapter VI. Interfaces

Introduction

1- Login page

2- TATools home page

2.1- Enter a new profile

2.2- Enter a new transcriptome

2.3- Run analysis

3- Transcriptome viewer

4- List viewers

4.1- Global results for BLAST, model match and signal detection

4.2- Simplified list view

5- Specialized viewers

5.1- Compiled results of a translated transcript

5.2- TATools contig viewer

5.3- Cluster summary

5.4- Clipboard

6- Anticipate biologists' needs

6.1- Enriched BLAST viewer

6.2- Pseudo-precursor detection

6.3- Multiple sequence alignment manager

6.4- Additional tools to assist drug discovery

Chapter VII. Transcriptome analysis: a step forward in venomics

1- First case study: alpha conotoxins from Conus adamsonii

1.1- Importance of alpha conotoxin

1.2- Presentation of Conus adamsonii

1.3- Transcriptome map of Conus adamsonii

1.4- Alpha conopeptides from Conus adamsonii

2- Second case study: identification of analogues for the XEP-018

2.1- Conopeptide distribution in Conus consors venom gland transcriptome

2.2- Presentation of XEP-018

2.3- XEP-018 analogues detection in venom gland transcriptome

Chapter VIII. Discussion and Prospects

1- Too many sequences in the bin

2- More analysis approaches, more matches, more confidence

3- Family distribution in transcriptomes

4- Bringing out novelty

5- Comparative transcriptomics

6- Results annotation

Conclusion

Bibliography


List of figures

Figure 1: Thesis graphical situation

Figure 2: Target-based drug discovery pipeline

Figure 3: Computational interventions in target-based drug discovery

Figure 4: A Conus consors

Figure 5: Typical regions of a conopeptide precursor

Figure 6: Summary of classical transcriptome analysis operations

Figure 7: Simplified and efficient BLAST-based annotation workflow

Figure 8: Automated cross-validation of results is summarized into a "Transcriptome map"

Figure 9: Relational database schema for the newly proposed analysis workflow

Figure 10: Complete improved workflow for transcriptome analysis

Figure 11: TATools use cases diagram

Figure 12: Sequence extraction activity diagram

Figure 13: TATools homepage

Figure 14: New transcriptome submission interface

Figure 15: Interface for setting analysis parameters

Figure 16: General result page. Example from Conus adamsonii transcriptome analysis

Figure 17: Viewer for transcripts belonging to a class of the transcriptome map

Figure 18: Viewer for transcripts associated to a GO term

Figure 19: TATools list viewer with annotation interface

Figure 20: TATools transcript viewer

Figure 21: TATools contig viewer displays reads used to construct a given contig

Figure 22: TATools contig cluster viewer

Figure 23: TATools clipboard helps to manage user selection

Figure 24: Conus consors Transcriptome map

Figure 25: Distribution of matches obtained for conopeptide superfamilies by searching the Conus consors transcriptome with conopeptide pHMMs and PSSMs

Figure 26: New isoforms of mu-conotoxin identified from the Conus consors venom gland transcriptome


List of appendices

Appendix 1. Presentation of main bioinformatics tools referred to in the manuscript

Appendix 2. Conopeptides superfamilies characteristics

Appendix 3. Identification and classification of conopeptides using pHMM

Appendix 4. PSSM and HMM complement each other for conopeptide prediction

Appendix 5. ConoDictor: a tool for prediction of conopeptide superfamilies

Appendix 6. TATools, a bioinformatic environment for transcriptomes analysis

Appendix 7. Pattern Searches in Protein Sequences

Appendix 8. Molecular phylogeny of conopeptides

Introduction

Main motivation

The present work is a bioinformatics contribution that was designed to meet the requirements emerging from three different sources. First, this work is bound to pharmacology and the discovery of new drugs from natural compounds. As drug discovery is now increasingly influenced by the expansion of –omics technologies, this work is secondly related to large-scale sequencing initiatives and the current upsurge in generating vast amounts of sequences. Thirdly, this work was motivated by recent developments in venom gland studies now covered by the topic of venomics. Indeed, natural compounds for drug discovery are expected to be found in venoms.

In this context, a large collection of bioinformatics tools is available, but these tools are often unconnected, while analytical pipelines are very much in demand, especially in the field of venomics. Consequently, this thesis is an attempt to select and integrate a variety of concepts from different origins into a single computer environment. As shown in Figure 1, the three fields of drug discovery, venomics and Next-Generation Sequencing (NGS) are related to one another. It is worth noting that most of the overlap between these fields is mediated by bioinformatics.

This work is at the confluence of Venomics, Next-Generation sequencing and Drug discovery.

Figure 1: Thesis graphical situation

Venomics and drug discovery

Millions of years of evolution have allowed nature to patiently select and propagate favourable gene products. It was recently recognised that venoms are among the most interesting of these products.

Made from a complex and concentrated mixture of peptides, proteins and smaller organic molecules, these cocktails have been developed through years of natural selection and adapted to specific species and environments, making any venom potentially unique (Mebs, 2002). The extent of the venomous animal realm means that these species must survive in a large diversity of environments, thus developing highly specific mechanisms of attack and/or defence, often using highly efficient cocktails of molecules specifically designed to intimidate, paralyse, kill and/or eventually pre-digest their prey or their attackers.

Venoms have been actively subjected to proteomics screening, and technical advances have led to an ever more complete and precise elucidation of their peptide content. Progress in analytical chemistry and biochemistry has led to the marketing of a number of drugs derived from venom peptides or their derivatives. Drugs derived from venoms can have many advantages for therapeutic applications. Through millions of years of evolution, nature has developed highly stable, disulfide-bridged peptides with high activity and specificity. These properties mean that only small amounts are necessary to obtain the desired effect and, consequently, production costs are decreased.

These peptides are often highly soluble in water, which gives them a low toxicity because they can be easily cleared through the kidneys. Moreover, because of their small size, they exhibit poor immunogenicity.

In addition, we are now realising that the genetic information contained in the various apparatus evolved to produce and deliver venoms can be an extraordinary resource of knowledge. Millions of years of evolutionary information are potentially available to us in transcriptomes.

NGS, venomics and drug discovery

The advent of Next-Generation Sequencing (NGS) is currently giving us the tools to take full advantage of the proteins patiently selected by nature. Transcriptomics has emerged as an alternative to bioassay-guided screening of venoms. Screening fractionated venom for bioactive compounds can be a time-consuming and expensive process. In the transcriptomic approach, however, lead discovery and optimisation have to some extent already been achieved by nature. On the other hand, being able to properly identify a conserved functional domain in a protein is key to obtaining clues about the protein's function and/or identifying related homologues in other organisms. Various identification and classification approaches have been proposed for protein characterisation. Most of them are based on common amino acid patterns or sequence similarity searches to characterise protein families. The results of automated searches for such patterns are used to qualify protein structure and function and to explore evolutionary relationships.

Back to bioinformatics

Considering the increasing number of DNA and protein sequences generated by high throughput technologies, bioinformatics tools must follow the pace of data generation and support the interpretation of transcriptomes. Up to now, only one tool has been proposed for the organisation and visualization of full transcriptome annotation projects (Philipp et al., 2012).

Recent work in venomics includes various attempts at bioinformatic analysis of venom gland transcriptomes. These analyses mostly rely on sequence similarity searches, scanning for known domains and manual annotation by experts.

In this study, we present TATools, a web-based user-friendly platform for transcriptome data analysis and visualization. TATools is based on open-source languages and tools. It offers HTML interfaces for data visualisation and exploitation; simplified forms are also provided for results annotation. The development of TATools was guided throughout by a concern for usability and applicability to solve the questions raised by experimentalists.

Thesis overview

Modern drug discovery research is essentially based on high-throughput screening of candidate molecules to try to identify those specifically targeting a biological pathology-related mechanism.

However, despite the conceptual and practical advantages of the target-based approach, the number of new active principles that reach the clinical stage or the market is dramatically low and continuously decreasing. This situation has led to the exploration of new sources of biologicals.

Among others, animal venoms have attracted interest because of the richness and efficiency of their compounds. The intensive exploration of venom by classical proteomics approaches has already proved the effectiveness of the venom-based drug discovery effort. In addition, thanks to the development of sequencing techniques, it is now possible to obtain transcriptomes of very good quality at an affordable price. The transcriptome is the set of all RNA transcripts expressed in an organ at a particular time of its life. The transcriptomics approach allows a deeper and more detailed exploration of the peptides potentially produced by the venomous apparatus. However, the improvement of sequencing techniques leads to the production of an ever larger number of reads. The bioinformatic analysis of these transcriptomes to highlight peptides of interest for pharmacological applications is therefore ever more challenging. The evaluation of classical transcriptome analysis workflows revealed a number of limitations to be addressed:

1- The current analysis methodology is not optimized and is excessively time consuming.

2- Data interpretation and results validation as well as putative function annotations remain time consuming and require an expert intervention to point out sequences of interest.

3- The amount of data to be analyzed as well as the heterogeneity of bioinformatic analysis outputs make it tedious to cross-link and cross-validate the results obtained in order to reliably select interesting and novel sequences.

4- The lack of data visualization interfaces increases the difficulty of the whole data validation and interpretation process.

The purpose of this work is to provide a bioinformatic environment to assist the discovery of peptides of pharmaceutical interest from animal venoms. This thesis proposes and describes TATools, an efficient and convenient bioinformatic solution addressing the four main concerns presented above.

The use of TATools and the application of the newly proposed analysis workflow in the framework of the European project CONCO contributed to solving both theoretical and applied problems. The new analysis workflow led to the identification and/or confirmation of analogues of XEP-018 in the venom gland transcriptome of Conus consors. XEP-018 is an active compound previously isolated from the venom of this predatory cone snail. In addition, we constructed specific models for conopeptides (cone snail peptides). The model-based search allowed the identification of novel conopeptides from the venom gland transcriptome of a single specimen of Conus adamsonii.

Origin of questions

Venomous animals are found in almost all latitudes and in multiple phyla: snakes and other reptiles, marine mollusks, fish, amphibians, insects, arachnids, centipedes and even some mammals. They have highly specialized exocrine glands coupled with a sometimes very sophisticated delivery system (hooks, darts, spears) to inject the secreted poison. Venomous animals and their venoms have been subjected to scientific studies, especially because envenomations constitute a relatively important cause of death and/or disability worldwide. In addition, venoms are highly complex mixtures of peptides and proteins whose interest in pharmaceutical research has grown steadily in recent years. The attraction of venoms is due to the extreme specialization and the impressive efficacy of the peptides and small proteins they contain. The large number of studies carried out has highlighted the efficacy of these compounds for a broad spectrum of molecular targets. Thus, several drugs derived from venom peptides or their derivatives are already marketed (Capoten / Captopril; Integrilin / Eptifibatide; Aggrastat / Tirofiban; Prialt / Ziconotide; ...) while many other peptides are in various stages of validation or approval.

Biological analyses of venoms

The elucidation of the composition of venoms is related to the development of analytical techniques. Advances in methods of separation by gel electrophoresis (SDS-PAGE) or liquid chromatography (HPLC), advances in mass spectrometry and nuclear magnetic resonance spectroscopy (NMR) and the miniaturization of biological assays have rapidly been applied to the study of venoms (or Venomics). Proteomics has made it possible to elucidate an increasing number of the peptides and proteins that compose venom. Moreover, recent advances in sequencing have opened new horizons for the understanding of venomous systems. The use of venom gland transcriptomics is becoming common to complement proteomics in identifying compounds of potential interest in pharmaceutical research. In this context, bioinformatics analysis is a must, since reliable software helps to better exploit the wealth of transcriptomes. In the absence of such bioinformatics tools, the analysis of transcriptome data would be slow and tedious work. In fact, the development of these tools speeds up and eases the work of interpretation and guides researchers in the formulation of new hypotheses.

Problem statement

Rapid and reliable identification of proteins encoded in transcriptomes plays a pivotal role in next-generation data interpretation. This requires the ability to recognize peptides of interest and to optimize both the quality of their identification and the speed of their detection. This raises several questions: What method should be used to browse transcriptome data as quickly and efficiently as possible? How can sequences identical or similar to the peptide of interest be detected? How can a family of peptides or a single peptide be characterized to speed up the detection of homologous sequences?

On closer inspection, it appears that the exploitation of transcriptomic data must be extended beyond a single family or a single compound. Identifying and classifying sequences typically requires the use of public databases that list known sequences with associated automatic and/or manual annotations.

The tricky question is what conclusion to draw when no external annotation is available for part of the transcriptome. In this case, a more careful analysis must be conducted since it can reveal potentially new and unknown compounds.

Finally, the question of data display and result visualization must also be solved. How can all the results of the analyses made on the transcriptome be displayed at once in a concise and precise way? How can interactivity with users be provided? Visualization is a challenge in its own right, especially because the quality of the interpretations and conclusions drawn by users may depend on the quality of data visualization.

Methodology

Given the range of questions to be addressed, our solution actually integrates a relational database with classical analysis tools and powerful model matching strategies running in an interactive and user-friendly web environment. TATools (Bioinformatic Environment for Transcriptome Analysis) allows both novices and experts to get the most out of the immense potential of transcriptomes.

The purpose of this platform is to optimize and facilitate the identification and/or detection of sequences that may be relevant to pharmacological research. It integrates conventional analysis tools such as BLAST for similarity searches in sequence databases, Gene Ontology (GO) for transferring annotations about potential function or activity, SignalP for predicting signal sequences and MAFFT for producing multiple sequence alignments. A brief description of these tools, and of other tools that are mentioned throughout the text, is given in Appendix 1.

The main feature of the methodology implemented in this new environment is the successful combination of classical analysis tools with probabilistic modelling methods whose effectiveness has been widely demonstrated. Two published articles describing our work demonstrate the benefits of model-based classification methods compared to commonly used sequence similarity search methods such as BLAST. Thus the transcriptome analysis of Conus spp. venom glands was enhanced by using generalized profiles (PSSM) and Hidden Markov Models (HMM) prepared for families of known toxins. These models led to the detection of analogues and their applicability to large scale studies appeared to be a major asset in identifying peptides of interest.
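To illustrate the idea behind model-based searches, the following is a minimal sketch of how a position-specific scoring matrix can be derived from aligned family members and used to scan a sequence. It is not the implementation used in TATools: the aligned motif examples, the query sequence, the uniform background frequencies and the pseudocount are all hypothetical choices made for illustration, and real generalized profiles and HMMs additionally model insertions and deletions, which this sketch ignores.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds matrix from gap-free aligned sequences of equal length.

    A uniform background frequency is assumed (illustrative simplification).
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for col in range(len(aligned[0])):
        # Start every residue at the pseudocount so unseen residues get a finite score.
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background) for aa in ALPHABET})
    return pssm


def best_hit(pssm, sequence):
    """Slide the matrix along the sequence and return (score, start, window) of the best window."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


# Hypothetical cysteine-rich motif examples, not real conopeptide sequences.
motif_examples = ["CCGAC", "CCSAC", "CCGKC", "CCGAC"]
pssm = build_pssm(motif_examples)
score, start, window = best_hit(pssm, "MKLTVRCCGACPQ")
```

Because each column is scored independently, a variant that differs from every known family member at one position can still obtain a high total score, which is one intuition for why such models can detect analogues that a single pairwise comparison might rank poorly.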

Application to conopeptides

TATools was developed in the framework of a European project named CONCO (www.conco.eu) coordinated by Atheris Laboratories. It allowed the detection of analogues of interest of XEP-018, the lead peptide of the project. Moreover, the full analysis of the transcriptomes of Conus consors and Conus adamsonii allowed the characterization of many peptides from various conopeptide families. The CONCO project was an opportunity to highlight the close collaboration between proteomics and transcriptomics. In parallel with transcriptomic analysis, proteomic studies (especially fractionation by liquid chromatography and analysis by mass spectrometry) were carried out on the venom collected from living or dissected animals. Proteomics confirmed the presence in the venom of mature peptides whose precursors were detected by bioinformatic analysis of the transcriptome. Similarly, the transcriptome made it possible to resolve ambiguities encountered in the proteomic analysis of venoms.

Organisation of the thesis

This manuscript consists of three parts. Since the motivation of our work is the analysis and exploration of transcriptomes to identify sequences with interesting features given a pathology-related molecular target, the first part begins with the main steps of drug discovery. In this process, the discovery of the active compound is the central and critical point which guides validation and optimization procedures. Transcriptomics offers a promising alternative for accelerating drug candidate discovery. This approach as well as some technological background are discussed in Chapter I. In Chapter II, we present the interest of venoms and the main results of previous studies conducted on this subject. Chapter III is devoted to the presentation of the bioinformatics activities carried out during the CONCO project, which united specialists from various fields of biology, biochemistry and bioinformatics around the study of Conus consors, a venomous marine snail. These activities have mainly focused on analyzing the cone snail toxins (conotoxins). This analysis is complicated by open classification issues. This chapter covers the approaches developed during the project for conotoxin identification in the transcriptome.

The second part is devoted to the presentation of our bioinformatic environment for transcriptome analysis named TATools. The new methodological approach proposed for transcriptomes analysis is described in Chapter IV. Then, the main aspects of the implementation of the transcriptome analysis platform are described, in particular modules included in the data analysis process and the structure of the underlying databases (Chapter V). Finally, a more detailed user guide presents some user interfaces and explains how to perform analysis on the platform (Chapter VI).

The third part presents results obtained with the new transcriptome analysis environment and opens a discussion on the future of TATools. Chapter VII illustrates the use of the platform and describes some results with two Conus species. Chapter VIII discusses the performance of the proposed platform. Prospects are also outlined for further features that could be implemented to improve the level of exploration and exploitation of transcriptomes.

Appendices include articles presenting the methodological aspects of conopeptide classification and related applications as well as a manuscript describing TATools.

Keywords:

Next-generation sequencing techniques, transcriptome analysis, venom, toxin, peptide, protein, drug discovery, bioinformatics, protein family, generalised profiles, hidden Markov models, sequence annotation, web-based tool, integrated platform.

Part 1:

Background and Thesis Context

Chapter I. Brief overview on drug discovery and Next-generation sequencing

1- Drug discovery: classical approach of a lead compound

In the early days, drug research was purely empirical. The starting point was a remedy known to be effective in human disease but whose mechanism was not understood. The goal then was to elucidate the mechanism and to use this knowledge to improve the therapeutic properties of the active principle.

In the last fifteen years target-based drug discovery has reversed the overall trajectory of research in the pharmaceutical industry. Drug targets are now chosen on the basis of a hypothesis about the pathophysiology of the disease. Initial pharmacological tests of these hypotheses in man are not undertaken until after years of preparatory work (Hurko, 2012).

These preparatory research activities involve various scientific specialities and are time consuming and expensive. The whole process is referred to as drug discovery. Drug discovery can be conducted at three different levels: mechanism, function and physiology (Drews, 2003).

1- The physiology-based approach seeks to induce a therapeutic effect by reducing disease-specific symptoms or physiological changes. The screening is usually conducted in isolated organ systems or in whole animals. The physiology-based approach was the first drug discovery paradigm, and has resulted in many effective treatments. It is still used extensively but suffers from a very low screening capacity and difficulty in identifying the mode of action of compounds.

2- The function-based approach seeks to induce a therapeutic effect by normalizing a disease-specific functional abnormality. Functional parameters represent a higher level of organism complexity because function requires the integrated action of many mechanisms. The screening capacity of these two approaches (physiology-based and function-based) is low, so they cannot be used for library screening.

3- The mechanism-based approach, which corresponds to the target-based approach, seeks to produce a therapeutic effect by targeting a specific mechanism. It screens for compounds with a specific mode of action. It is the most commonly used strategy because of its ability to screen huge compound libraries (Sams-Dodd, 2005). The present work relates to this case.

1.1- The target-based approach

In the target-based approach (Figure 2), novel mechanisms are identified based on fundamental research in biology and clinical findings. These mechanisms are validated based on expression patterns and knock-out mice. After target selection, a high-throughput screening (HTS) in vitro assay is developed to measure the selectivity of compounds for the target. HTS normally results in several compounds, preferably belonging to different chemical classes, with medium to high affinity for the target. In the lead identification phase, small-scale screening of analogues around these structure classes is performed to determine the feasibility of reaching a selective compound with appropriate drug-like properties (Sams-Dodd, 2006). The lead structures can be tested in a disease model to determine whether the targeted mechanism has therapeutic potential and, if the outcome is positive, the lead optimisation programme begins. This programme is mainly an elucidation of the structure-activity relationship. During this step, a large number of analogues are produced around the lead structures and are screened for target selectivity, pharmacokinetic and metabolic properties. At the end of the lead optimisation phase, suitable compounds are tested in an in vivo disease model for proof-of-principle and, if the study is positive, the compound is selected for development.

The lead optimisation and validation steps are long (2-4 years) and expensive. It is estimated that a marketable drug results from the systematic study of 10⁵ to 10⁶ molecules over almost 12 years for an assumed cost of at least a billion dollars. However, despite the fact that the target-based approach is highly advantageous from a scientific and practical viewpoint, it does not translate into a high success rate for novel targets. Indeed, there has been a steady decline in the number of new molecules and biologicals that enter clinical development and reach the market (Chanda and Caldwell, 2003; Van den Haak et al., 2004).

Figure 2: Target-based drug discovery pipeline.

(figure adapted from http://sydney.edu.au)

1.2- In-silico support in the drug discovery process

All steps of target-based discovery are assisted by computers. For instance, computational genomics and protein crystal structure determination are used to improve target identification and validation. These computational techniques, by modeling both target and protein active conformations, help to deduce useful interactions. This strategy could be considered as in silico systems biology. In the same manner, prediction of protein-ligand structure guides analogue preparation and screening. Finally, at the early clinical trial stage, biosimulation can be used to evaluate adverse effect rates according to drug interactions and the genetic background of the targeted population. In silico activities are summarized in Figure 3.

Concluding remarks

To counter the decline in the number of drugs and/or biologicals entering the market, natural products are actively explored. The improvement in throughput and quality as well as the decrease in cost of next-generation sequencing offer drug discovery a new source of compounds to be screened.

The opening of the sequencing era offers new perspectives for target-based drug discovery. Indeed, high-throughput transcriptome data provide hundreds of putative drug candidates to be tested. In the next section we briefly present sequencing approaches and introduce the importance of bioinformatics for transcriptome data analysis.

Figure 3: Computational interventions in target-based drug discovery.

2- Next-generation sequencing techniques: opening promising opportunities

Genome and transcriptome analyses have become indispensable in elucidating biological processes. Today, sequencing not only provides long reads of good quality but is also relatively affordable. The high quality and accessibility of current techniques, combined with extensive computational capabilities, have given genome and transcriptome analyses a prevalent role in biological studies.

2.1- Sequencing techniques

Sequencing techniques have evolved very rapidly in the past decades. Major improvement in terms of throughput, speed and cost led to major reappraisal at distinct time points. This explains why sequencing is qualified with “generations”.

The Sanger method, known as the first generation sequencing technique, has been the most widely adopted and used sequencing technology probably because of its very low error rate². This method requires a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleotide triphosphates (dNTPs) incorporated in the newly synthesized strand in a cycle reaction, and modified nucleotides (dideoxyNTPs) that terminate DNA strand elongation. When this mixture is fractionated by electrophoresis on denaturing acrylamide gels the pattern of bands shows the distribution of dTs in the newly synthesized DNA. By using analogous terminators for the other nucleotides in separate incubations and running the samples in parallel on the gel, a pattern of bands is obtained from which the sequence can be read off (Sanger et al., 1975; Sanger et al., 1977). A typical Sanger sequencing reaction yields sequences with length up to 700–800 bp, after which the quality of the sequences decreases (Casals et al., 2012).

The second generation of sequencing techniques, also called high-throughput or Next-Generation Sequencing (NGS) technologies, has exponentially increased the quantity of sequences generated, producing up to several billion bases (gigabases, Gb) in a single run. The first and critical step in all NGS technologies is library preparation. Library preparation is broadly the same in all cases and consists in DNA/RNA purification and random fragmentation by physical or enzymatic reactions to generate fragments of the desired average size. The resulting fragments are then ligated to short DNA fragments called adaptors. DNA/RNA is then amplified by PCR, and the sequencing reaction is performed. Sequencing techniques include the pyrosequencing technique proposed by Roche/454, the Illumina approach proposed by Solexa, the sequencing-by-ligation system used by Applied Biosystems' SOLiD technology, and the Ion Torrent semiconductor sequencing methodology proposed by Life Technologies. Table 1 summarizes the main features of the most commonly used sequencing platforms. A detailed summary of each of these major sequencing techniques is added in Appendix 3. Other recently available second-generation technologies include Polonator G.007 (Shendure et al., 2005) and the nanoball sequencing from Complete Genomics (Drmanac et al., 2010).

² Historically, the first DNA sequencing method was proposed by Maxam and Gilbert. This first method was a chemical procedure that breaks a terminally labeled DNA molecule partially at each repetition of a base. Four different reactions were proposed to preferentially cleave DNA at specific nucleotides. The method required radioactive labelling of the DNA at one 5' end by a kinase reaction using gamma-32P ATP. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The initial DNA sequence was reconstituted from the migration pattern of radioactive bands obtained by electrophoresis on a polyacrylamide gel (Maxam and Gilbert, 1977).

Table 1: Comparison of different sequencing platforms.

(This table was originally published by Thudi et al., 2012)

The main novelty of third generation (also called single molecule) sequencing technologies is that they are able to detect light emitted from a single molecule (Braslavsky et al., 2003; Harris et al., 2008). Helicos Biosciences, the first company that presented a third generation sequencing technology, has developed Helicos True Single Molecule Sequencing (tSMS). Sequencing is performed by synthesis, with the four fluorescently labelled nucleotides added one at a time. A laser makes the nucleotides emit light that is detected by the sequencer. Pacific Biosciences has also produced a third generation system, following a real-time sequencing-by-synthesis method in which the sequencing is not halted, resulting in very short run times and longer reads. Single DNA polymerase molecules are attached to “zero-mode waveguides”, nanophotonic structures able to measure the fluorescence of labelled nucleotides in real time in reduced volumes, enabling parallelization (Levene et al., 2003; Eid et al., 2009; Metzker, 2009). The main current limitation of single molecule sequencing technologies is their higher error rate (Schadt et al., 2010). Other third generation technologies, not yet available, include the detection of individual DNA bases as they pass through a nanopore, or microscopy techniques for direct imaging of single DNA molecules.

2.2- Transcripts analysis and interpretation

Data produced by sequencing platforms are raw reads that need to be assembled for reconstructing the original genetic information expressed in the studied sample. Most assembling software produces contiguous nucleotide sequences named contigs. Other software is also available for contig annotation.

In the case of model organisms such as human, mouse or yeast, complete genomes are available. Transcriptome analysis of model organisms therefore mainly consists in mapping operations. This has been fully described and discussed in the literature and still constitutes an interesting challenge. However, assembly for non-model organisms remains even more challenging.

In the case of non-model organisms, limited a priori sequence information exists. Assembly must be performed without the aid of a reference genome. Two main approaches are proposed for de novo assembly: overlap graphs and de Bruijn graphs. In the first case, overlaps between each pair of reads are computed and compiled into a graph. Each node of this graph represents a single sequence read and an edge represents an overlap between two reads. The consensus is computed by following the overlap graph. This algorithm is computationally intensive and most effective in assembling fewer reads with a high degree of overlap. De Bruijn graphs, on the other hand, break reads into smaller sequences called k-mers (usually 25-50 bp). These k-mers are joined where they share an overlap of k-1 nucleotides to create contigs. The use of k-mers, which are shorter than the reads, reduces the computational intensity of this method.
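The de Bruijn approach can be sketched in a few lines, under strong simplifying assumptions: the three reads below are hypothetical and error-free, k is set to 4 rather than the 25-50 bp used in practice, and the graph contains no branches. Real assemblers additionally handle sequencing errors, repeats, coverage and reverse complements, none of which is attempted here.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it.

    Nodes are (k-1)-mers; each k-mer contributes one edge prefix -> suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow unambiguous edges from `start`, appending one base per step."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads covering one short transcript.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
```

Note that no pairwise read comparison is performed: the overlaps emerge from shared (k-1)-mers, which is why this approach scales better than overlap graphs as the number of reads grows. The walk stops at the first node with zero or several successors, which is where a real assembler would have to resolve a branch.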

After de novo assembly, the analysis of contigs relies on comparison with annotated genes or gene products of other organisms. Elucidating the role of a specific sequence requires a comparative analysis with annotated sequences from different organisms described in a wide range of databases (NCBI nr, UniProtKB/SwissProt, Gene Ontology (GO), KEGG, InterPro and others). Searching these databases generates rich and voluminous outputs that also need to be evaluated by scientists. The management of results produced by bioinformatic analyses of assembled transcriptomes also constitutes a bottleneck for transcriptome interpretation. This aspect will be discussed later in Chapter IV.

Concluding remarks

The combination of target-based drug discovery and high-throughput transcriptome sequencing has opened the era of transcriptome-based drug discovery. In this context, the present work focuses on the detection of novel sequences inside transcriptomes after the reads have been assembled.

The next chapter presents the CONCO project, a successful initiative of unravelling the transcriptome of a non model organism in order to discover new drug candidates.

Chapter II- Venomics: discovery platform for tomorrow's drug candidates

Introduction

The term “venomics” was introduced to embrace techniques and methods intended to understand and characterize venom and venom glands toxin contents (Ménez et al., 2006). The venomics approach currently encompasses transcriptomic, proteomic, peptidomic and/or glycomic studies of venom and venom glands (de Graaf et al., 2009). The recent development of venomics has made this field a must for identifying tomorrow's drug candidates.

1- Venomous animals and their venom

In any habitat there is competition for resources, and every ecosystem on Earth supporting life contains poisonous or venomous organisms. One of the most fascinating techniques of capturing prey or defending oneself is the use of poison or venoms. Venom represents an adaptive trait and an example of convergent evolution (Fry, 2008). The animal kingdom includes more than 100,000 venomous species spread through major phyla such as chordates (reptiles, fishes, amphibians, mammals), echinoderms (starfishes, sea urchins), mollusks (cone snails, octopi), annelids (leeches), nemertines, arthropods (arachnids, insects, myriapods) and cnidarians (sea anemones, jellyfish, corals) (Mebs, 2002). Venomous animals typically possess venom-producing exocrine glands coupled to a delivery system including barbs, beaks, fangs, harpoons, pincers, proboscises, spines, spurs and stingers (Fry, 2009). The ecological advantages conferred by a venom system are evident from the extraordinarily diverse range of animals that have evolved venoms for the purposes of predation, defense or competitor deterrence (King, 2011).

Venoms are deadly cocktails, each comprising unique mixtures of peptides and proteins tailored by natural selection to act on vital systems of the prey or victim. Venom toxins disturb the activity of critical enzymes, receptors, or ion channels, thus disarranging the central and peripheral nervous systems, the cardiovascular and the neuromuscular systems, blood coagulation and homeostasis (Ménez et al., 2002). Venoms often include protease inhibitors and stabilizing agents that protect them from internal and external (high temperature) detrimental effects, and hence preserve them in the glands for weeks. It is estimated that they are composed of a mixture of 200 to 1000 peptides and proteins, most of which have not been characterised (Ménez et al., 2006; Biass et al., 2009). Multiplying the number of potential venom components by the number of venomous species makes it easy to understand what a natural resource of bioactives venoms represents (Escoubas and King, 2009).