Contents
1 Introduction 1
1.1 Some numbers about proteins . . . . 4
1.2 Relation between sequence and structure . . . . 6
1.3 Experimentally investigating protein structure . . . . 8
1.3.1 Protein Crystallography . . . . 8
1.3.2 Nuclear Magnetic Resonance data . . . . 9
1.4 Anfinsen’s dogma, dynamics and protein behavior . . . 11
1.5 Protein dynamics and protein disorder . . . 15
1.6 Estimation of dynamics from sequence . . . 17
1.7 Proteins as probabilistic entities . . . 18
1.8 Computational models of biological processes . . . 18
1.9 Sequence analysis in bioinformatics . . . 19
1.10 Goal of the thesis . . . 20
1.11 Contributions and list of publications . . . 20
2 Methods 23
2.1 Building models for biology . . . 23
2.1.1 Machine learning . . . 23
2.1.2 Supervised and unsupervised models . . . 23
2.1.3 Training and testing the model: the concept of overfitting . . 24
2.1.4 Performance evaluation . . . 26
2.1.5 Markov chains and Hidden Markov Models . . . 27
2.1.6 Support Vector Machines . . . 31
2.1.7 A brief overview of Neural Networks . . . 33
2.2 Tracking protein evolution . . . 36
2.2.1 The most used alignment tools . . . 37
2.2.1.1 Clustal . . . 38
2.2.1.2 MAFFT . . . 38
2.3 Predicting properties of protein sequences . . . 40
2.3.1 Disorder prediction . . . 40
2.3.1.1 IUPred . . . 41
2.3.1.2 ESpritz . . . 42
2.3.2 Protein beta aggregation prediction . . . 42
2.3.3 Annotation of Archaeal DNA-binding proteins . . . 44
3 Contributions 47
3.1 Rigapollo: feature-based protein alignment . . . 47
3.1.1 Translating amino acids into feature vectors . . . 50
3.1.2 Emission probabilities using SVMs . . . 52
3.1.3 Summary of the methodology . . . 54
3.1.4 Performance evaluation . . . 57
3.1.5 Dataset design . . . 57
3.1.6 Results . . . 58
3.1.7 Discussion . . . 63
3.2 AgMata: a beta-amyloid propensity predictor . . . 64
3.2.1 Approach . . . 65
3.2.1.1 Datasets . . . 66
3.2.1.2 Selection of the structural data . . . 66
3.2.1.3 Probability of pairing as a discriminative problem . . . 66
3.2.1.4 Feature Vectors and application of a discriminative model . . . 67
3.2.1.5 Single-residue interaction probability calculation . . . 67
3.2.1.6 Beta Pairing Propensity . . . 68
3.2.2 Results . . . 69
3.2.3 Discussion . . . 70
3.3 DisoMine: a webserver for disorder prediction . . . 74
3.3.1 Training and testing datasets . . . 74
3.3.2 Approach . . . 74
3.3.3 Results . . . 77
3.3.4 Discussion . . . 77
3.4 Xenusia: identification of archaeal DNA-binding proteins . . . 78
3.4.1 Datasets . . . 79
3.4.2 Approach . . . 80
3.4.2.1 Prediction of DNA-interacting residues . . . 80
3.4.2.2 Predicting the DNA binding domain . . . 81
3.4.3 Validation on archaeal proteins . . . 82
3.4.4 Discussion . . . 84
4 Experimental applications 86
4.1 In-silico mutagenesis of human Ataxin-3 . . . 86
4.1.1 Preliminary experimental results . . . 90
4.2 Identification of archaeal DNA-binding proteins . . . 99
4.2.1 Asgard . . . 99
4.2.2 Sulfolobus acidocaldarius . . . 101
4.2.3 Experimental investigation . . . 102
4.3 Summary of the experimental procedures . . . 107
4.3.1 In-silico mutagenesis of human Ataxin-3 . . . 107
4.3.2 DNA-binding assay . . . 108
5 Conclusions and future work 109
5.1 Obtaining reliable predictions from noisy data . . . 109
5.2 Computational and non-computational sciences . . . 110
5.3 Dynamics and predicted dynamics . . . 110
5.4 Concerns about the developed tools . . . 112
5.4.1 Protein Alignments and Rigapollo . . . 112
5.4.2 Predicting protein disorder with DisoMine . . . 113
5.4.3 DNA-binding protein identification in archaea . . . 113
5.4.4 Beta aggregation prediction with AgMata . . . 114
5.5 Future work . . . 115
6 Appendix 117
6.1 Glossary . . . 117
6.2 Supplementary details on methods . . . 120
6.3 Acronyms . . . 121
7 Acknowledgements 123
Bibliography 124