Files and databases


A. Search for all amino acid sequences of glutamate dehydrogenase (GDH)

● 1. Go to the NCBI: http://www.ncbi.nlm.nih.gov/

● 2. Type: "glutamate dehydrogenase OR GDH"

● 3. Select "Protein": 3807 hits (03/2007)

● 4. Choose the option "Preview/Index". It lets you add keywords with the option "Add Term(s) to Query or View Index" and combine them in a Boolean search ("AND" / "OR" / "NOT").

The number of hits is indicated on the right of the screen (3807). To see the results ("Summary"), click on this number in your browser.
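The same first query can also be prepared programmatically. The sketch below is an assumption and not part of the procedure above: it only builds a request URL for the NCBI E-utilities "esearch" service with the Python standard library and does not contact the server. The function name esearch_url is hypothetical, and current hit counts will differ from the 03/2007 figures quoted here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="protein", count_only=True):
    """Build (but do not send) an esearch request URL for a query term."""
    params = {"db": db, "term": term}
    if count_only:
        # Ask only for the number of hits instead of the list of identifiers.
        params["rettype"] = "count"
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Query typed in step 2, restricted to the "Protein" database of step 3.
url = esearch_url("glutamate dehydrogenase OR GDH")
print(url)
```

Opening the printed URL in a browser returns an XML document whose count corresponds to the number of hits shown on the right of the screen.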

● 1. In the field "Add Term(s)", choose the option "All fields"

● 2. Type the following terms (copy and paste them). Warning: this list has not been updated.

peptide NOT partial NOT mutant NOT phenylalanine NOT leucine NOT synthase NOT putative NOT caspase NOT quinoprotein NOT membrane NOT decarboxylase NOT topoisomerase NOT monooxygenase NOT transaminase NOT kinase NOT oxidase NOT thioredoxin NOT glycerate NOT glyoxylate NOT glucose NOT glucose-1-dehydrogenase NOT glutamate dehydrogenase-related NOT glutamine NOT glutamyl NOT glucarate NOT glycerol NOT proline NOT valine NOT semialdehyde NOT aldehyde NOT glyceraldehyde NOT dihydropyrimidine NOT formyltetrahydrofolate NOT fatty NOT isocitrate NOT saccharopine NOT methylmalonate NOT probable NOT possible NOT related NOT similarity NOT similar NOT homolog NOT homologue NOT synthetic NOT unknown NOT hypothetical NOT patent NOT transcriptional NOT thymidine NOT reductase NOT resolvase NOT regulatory NOT zinc NOT 3-dehydroquinate NOT adhesion NOT ammonium

● 3. Click on the Boolean operator "NOT". All keywords and Boolean operators are written in the main field (top of the page)

● 4. Click on "Preview": 290 hits (03/2007)

● What is the goal of this selection?

● What is the effect of the Boolean operators "OR" and "NOT"?

● Some hits are not GDH. Which ones?

See the answers
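To make the role of the Boolean operators concrete, here is a minimal sketch. It assumes that the operators are applied one after the other from left to right, as the Entrez help describes. The helper names build_query and apply_left_to_right, and the record identifiers P1 to P4, are hypothetical and only serve the illustration.

```python
def build_query(base, excluded):
    """Append each excluded keyword to the base query with NOT."""
    return " NOT ".join([f"({base})"] + list(excluded))


def apply_left_to_right(first_hits, steps):
    """Combine sets of record identifiers, one Boolean operator at a time."""
    result = set(first_hits)
    for operator, hits in steps:
        if operator == "AND":
            result &= hits   # keep records present in both sets
        elif operator == "OR":
            result |= hits   # keep records present in either set
        elif operator == "NOT":
            result -= hits   # drop records present in the second set
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result


print(build_query("glutamate dehydrogenase OR GDH", ["peptide", "partial"]))

# Made-up identifiers: which toy records contain which word.
with_full_name = {"P1", "P2"}
with_gdh = {"P2", "P3", "P4"}
with_partial = {"P4"}
with_synthase = {"P3"}

kept = apply_left_to_right(
    with_full_name,
    [("OR", with_gdh), ("NOT", with_partial), ("NOT", with_synthase)],
)
print(sorted(kept))
```

In this toy case "OR" widens the selection to four records, and each "NOT" then removes the records that carry an unwanted word.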

B. Removing redundant sequences

This part is the most tedious and time-consuming one. It can be done using "Multalin" from the Institut National de la Recherche Agronomique (INRA).

● What type of file can be used to find the name of the organism?

● Why are there multiple files for the same protein from the same organism?

● To what kind of information are the various accession numbers in those files linked?

See the answers

● 1. In the field "Add Term(s)", choose the option "Organism"

● 2. Type the chosen name: for example "Homo sapiens"

● 3. Click on the Boolean operator "AND"

● 4. Click on "Preview": 8 hits (03/2007) for this organism

● 5. Click on this number (8). The "Summary" files are returned

● 6. In the field "Display", choose the option "Fasta". FASTA is one of the data formats used by sequence alignment algorithms

● 7. Click on "Display". The files in FASTA format are returned

● 8. In the field "Send to", choose "Text": a new HTML page is returned. Copy the data

See screenshots


● 9. Open a new browser window and go to the Web interface "Multalin" (Florence CORPET) for multiple alignment: http://prodes.toulouse.inra.fr/multalin/multalin.html

● 10. Paste the data and adjust the parameters

● 11. Start the software. The multiple alignment shows which sequences are identical and therefore redundant. Note their accession numbers.

● What can you conclude?

See screenshots
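Strictly identical sequences can also be spotted without an alignment. The sketch below is not Multalin and does not replace it: it only parses FASTA text and groups entries whose sequences match exactly, so near-identical sequences (one substitution, a shorter fragment) still need the alignment. The function names and the three example entries are hypothetical.

```python
from collections import defaultdict


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_identical(records):
    """Return the lists of headers that share exactly the same sequence."""
    groups = defaultdict(list)
    for header, sequence in records:
        groups[sequence.upper()].append(header)
    return [headers for headers in groups.values() if len(headers) > 1]


# Made-up entries: seq_A and seq_B are identical, seq_C differs by one residue.
example = """>seq_A hypothetical entry
MKVLA
AGT
>seq_B hypothetical entry
MKVLAAGT
>seq_C hypothetical entry
MKVLSAGT
"""

records = parse_fasta(example)
duplicates = group_identical(records)
print(duplicates)
```

The data copied in step 8 can be passed to parse_fasta in place of the example text; each returned group lists entries that are candidates for removal.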

● Go back to the NCBI. Remove the false-positive hits (CRYL1) and the sequences shorter than 20 amino acids.

● Note: nomenclature for a range of sequence lengths: 3000:4000[SLEN] (see HELP from "Entrez")

● Redo the selection for "Homo sapiens"

● Make a new multiple alignment.

● Do those 5 sequences correspond to 5 different proteins?
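The refined selection can be written as a single query string. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the upper length bound of 100000 is an arbitrary choice standing for "no upper limit", and excluding CRYL1 by keyword is only one possible way to remove that false positive. The [SLEN] range syntax is the one given in the note above, and [Organism] is the Entrez field tag matching the "Organism" option.

```python
def slen_range(min_len, max_len):
    """Format a sequence-length range in the min:max[SLEN] nomenclature."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    return f"{min_len}:{max_len}[SLEN]"


def refine_query(base, organism, min_len, max_len, excluded=()):
    """Restrict a base query by organism and length, then exclude keywords."""
    parts = [
        f"({base})",
        f'AND "{organism}"[Organism]',
        f"AND {slen_range(min_len, max_len)}",
    ]
    parts += [f"NOT {term}" for term in excluded]
    return " ".join(parts)


# Keep human sequences of at least 20 residues (assumed upper bound: 100000).
query = refine_query(
    "glutamate dehydrogenase OR GDH",
    "Homo sapiens",
    20,
    100000,
    excluded=["CRYL1"],
)
print(query)
```

The printed string can be pasted into the main search field instead of rebuilding the selection term by term.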