• Aucun résultat trouvé

Perl Programming Fundamentals for the Computational Biologist

N/A
N/A
Protected

Academic year: 2022

Partager "Perl Programming Fundamentals for the Computational Biologist"

Copied!
6
0
0

Texte intégral

(1)

Perl Programming Fundamentals for the Computational Biologist

Lab 2

Marine Biological Laboratory, Woods Hole

“Advances in Genome Technology and Bioinformatics”

Fall 2004

Andrew Tolonen

Chisholm lab, MIT

(2)

Goal of the lab: In the previous lab, you read a FASTA file of

genes, built a hash, and searched each gene sequence for a pattern.

Now you can ask “If I BLAST each of those genes against the swissprot database, what will the top hit be?” Here is what you will do:

copy makeHash.plx to create a new script called blastGenes.plx that will accomplish the following:

BLAST the sequence of each of the genes in your hash from last time against a local copy of the swissprot database.

read the output files to grab the top hit for each gene.

print the top hit for each the to an output file

Prelab: download the exercises, input files, and their solutions from cyano.mit.edu. I really hope it will work this time!

1. make a directory for lab 2 % mkdir Lab2

2. cd into that directory

% cd Lab2

3. Open your browser

4. Type “ftp://[email protected]” into the browser window password: perlmbl

5. Retrieve the archive by “save target as” for Lab2.tar.gz into your Lab2 directory

6. Unzip the archive by running the following command at the shell prompt. (Note: the % sign signifies the shell prompt. It may look different on your computer).

% gunzip Lab2.tar.gz 7. Untar the archive

% tar -xvf Lab2.tar

(3)

now if you 'ls', you should see a list of files in your current working directory:

% ls

genes.faa makeHash.plx perlLab2.pdf Solutions

Step 1. Your goal is to modify makeHash.plx to BLAST the genes in your hash. The first step is to print the gene name and its sequence to an output file. Replace the print statements with these

instructions. Run the script. Open the file “blast.faa”. Does it contain a gene entry? If you are having problems, see /

Solutions/exercise1.plx

foreach $x (keys(%genes)) {

# first print the gene out to a file for blasting

open (BLAST, ">blast.faa") or die "cant open blast file\n";

print BLAST "$x\n"; # print out the gene name line print BLAST "$genes{$x}\n"; # print out the sequence close BLAST;

}

Step 2. Now you are ready to BLAST the gene sequences. To do this, use backticks to make a system call to the blastall executable.

blastall can take many different arguments. For your purposes this will work:

`blastall -m 8 -p blastp -d swissprot -i ./blast.faa -o ./

blast.out`;

Here is what the blastall arguments mean:

Argument Meaning

-m 8 make tabular output

-p blastp BLAST protein seqs

-d swissprot BLAST against

swissprot database

-i ./blast.faa specify input file

(4)

-o ./blast.out specify output file

When you insert the blastall statement in the foreach loop after the open BLAST statements, you will be BLASTing each gene. After running the script, examine the contents of ./blast.out. Does it contain tabular blast output? See /Solutions/exercise2.plx for the solution.

Note: You can use backticks to execute any linux command line utility from within your Perl script. If you want, try it out!

Step 3. Grab the top hit from each BLAST search and print it to a separate file. This is easy because the information about the top hit is the first line of the output file. See Solutions/exercise3.plx for the solution.

# now grab the top hit from the blast output file

open (IN, "<./blast.out") or die "cant open blast output file\n";

$line = <IN>; # grab the first line of the file (contains # top hit)

push(@tophits, $line); # push the tophit info into an array close IN;

Now your foreach loop is done! It should contain code to execute the following commands:

foreach $x (keys(%genes)) {

# 1. print a gene entry to a file for blasting # 2. BLAST the gene against swissprot

# 3. grab the top hit and print it to a file }

# now print each element of the @tophits array to

# an output file

You have an array @tophits that contains the information about the top hit of each blast search. Print the @tophits array to an output file.

From outside the foreach loop.

(5)

# print each element in @tophits to ./tophits.txt open (HITS, ">./tophits.txt") or die "cant open hits file\n";

foreach $x (@tophits) {

print HITS "$x\n";

}

close HITS;

Open the file ./tophits.txt. Does it contain tabular BLAST hit information for the top hit of each gene?

Extra Credit:

1. What if you only wanted to BLAST the genes that contained a given motif?

$motif = “STG”;

foreach $x (keys(%genes)) {

if ($genes{$x} =~ /$motif/) {

# blast the genes }

}

# print the top hits to an output file

Could you use regular expressions to make the motif more flexible.?

ie. “A serine or a threonine followed by a Glycine or an Alanine, etc.”

2. You could use fastacmd to retrieve the sequences of the top hits.

Step 1. First of all, try running fastacmd from the command line.

Open the file “tophits.txt” and write down the gene ID for the first top hit. In the line below, the gene ID is 38605646.

gene gi|38605646|sp||P02562_1 32.20 59 33 2 81 138 27 79 1.4 30.42

% fastacmd -d swissprot -s 38605646

(6)

Step 2. In order to run fastacmd on each top hit, you need to grab the gene ID number. You can do this using regular expressions:

$line =~ /gi\|(\d+)\|sp/;

Once you have grabbed the gene ID's try running fastacmd from within your Perl script.

Références

Documents relatifs

- Perl just has three data structures (types of boxes to hold your data). – scalars (variables denoted with

You just returned from a seminar where the speaker described a novel amino acid motif found in genes involved infection.. Your goal is to use Perl to search all the genes in your

– You can download thousands of modules from CPAN (Comphrehensive Perl Archive Network) www.cpan.org. Installing modules on a

A first example of this approach could be the extension of an organization’s knowledge base to documents related to its territorial and social context: in this way the

“The UK TranSAS Mission shall address all modes of transport (road, rail, maritime and air) with an emphasis on maritime transport, and shall consider all relevant aspects of

Each model was a logistic exposure model (built in Program MARK) assessing the relationship between daily nest survival rate (dependent variable) and independent variables

We show, in the limit of a one-dimensional flow computed by our two-dimensional solver, or for flows in a cylindrical geometry, that the scheme recovers the classical Godunov

by Vigeral [12] and Ziliotto [13] dealing with two-person zero-sum stochastic games with finite state space: compact action spaces and standard signalling in the first case,