Contents

1 Introduction 1

1.1 Some numbers about proteins . . . . 4

1.2 Relation between sequence and structure . . . . 6

1.3 Experimentally investigating the protein structure . . . . 8

1.3.1 Protein Crystallography . . . . 8

1.3.2 Nuclear Magnetic Resonance data . . . . 9

1.4 Anfinsen’s dogma, dynamics and protein behavior . . . 11

1.5 Protein Dynamics and Protein Disorder . . . 15

1.6 Estimation of dynamics from sequence . . . 17

1.7 Proteins as probabilistic entities . . . 18

1.8 Computational models of biological processes . . . 18

1.9 Sequence analysis in bioinformatics . . . 19

1.10 Goal of the thesis . . . 20

1.11 Contributions and list of publications . . . 20

2 Methods 23

2.1 Building models for biology . . . 23

2.1.1 Machine learning . . . 23

2.1.2 Supervised and unsupervised models . . . 23

2.1.3 Training and testing the model: the concept of overfitting . . 24

2.1.4 Performance evaluation . . . 26

2.1.5 Markov chains and Hidden Markov Models . . . 27

2.1.6 Support Vector Machines . . . 31

2.1.7 A brief overview on Neural Networks . . . 33

2.2 Tracking protein evolution . . . 36

2.2.1 The most used alignment tools . . . 37

2.2.1.1 Clustal . . . 38

2.2.1.2 MAFFT . . . 38

2.3 Predicting properties of Protein sequences . . . 40

2.3.1 Disorder prediction . . . 40

2.3.1.1 IUPred . . . 41

2.3.1.2 ESpritz . . . 42

2.3.2 Protein beta aggregation prediction . . . 42

2.3.3 Annotation of Archaeal DNA-binding proteins . . . 44

3 Contributions 47

3.1 Rigapollo: feature-based protein alignment . . . 47

3.1.1 Translating amino acids into feature vectors . . . 50

3.1.2 Emission probabilities using SVMs . . . 52

3.1.3 Summary of the methodology . . . 54

3.1.4 Performance evaluation . . . 57

3.1.5 Datasets design . . . 57

3.1.6 Results . . . 58

3.1.7 Discussion . . . 63

3.2 AgMata: a beta-amyloid propensity predictor . . . 64

3.2.1 Approach . . . 65

3.2.1.1 Datasets . . . 66

3.2.1.2 Selection of the structural data . . . 66

3.2.1.3 Probability of pairing as discriminative problem . . . 66

3.2.1.4 Feature Vectors and application of a discriminative model . . . 67

3.2.1.5 Single-residue interaction probability calculation . . . 67

3.2.1.6 Beta Pairing Propensity . . . 68

3.2.2 Results . . . 69

3.3 DisoMine: a webserver for disorder prediction . . . 74

3.3.1 Training and testing Datasets . . . 74

3.3.2 Approach . . . 74

3.3.3 Results . . . 77

3.4 Xenusia: identification of archaeal DNA-binding proteins . . . 78

3.4.1 Datasets . . . 79

3.4.2 Approach . . . 80

3.4.2.1 Prediction of DNA-interacting residues . . . 80

3.4.2.2 Predicting the DNA binding domain . . . 81

3.4.3 Validation on archaeal proteins . . . 82

4 Experimental applications 86

4.1 In-silico mutagenesis of human Ataxin-3 . . . 86

4.1.1 Preliminary experimental results . . . 90

4.2 Identification of archaeal DNA-binding proteins . . . 99

4.2.1 Asgard . . . 99

4.2.2 Sulfolobus acidocaldarius . . . 101

4.2.3 Experimental investigation . . . 102

4.3 Summary of the experimental procedures . . . 107

4.3.1 In-silico mutagenesis of human Ataxin-3 . . . 107

4.3.2 DNA-binding assay . . . 108

5 Conclusions and Future work 109

5.1 Obtaining reliable predictions from noisy data . . . 109

5.2 Computational and non-computational sciences . . . 110

5.3 Dynamics and predicted dynamics . . . 110

5.4 Concerns about the developed tools . . . 112

5.4.1 Protein Alignments and Rigapollo . . . 112

5.4.2 Predicting protein disorder with DisoMine . . . 113

5.4.3 DNA-binding protein identification in archaea . . . 113

5.4.4 Beta aggregation prediction with AgMata . . . 114

5.5 Future work . . . 115

6 Appendix 117

6.1 Glossary . . . 117

6.2 Methods Supplementary details . . . 120

6.3 Acronyms . . . 121

7 Acknowledgements 123

Bibliography 124