Part II: The Bioinformatics Workstation
SEQRES 1 357 GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN 2MNR 106 SEQRES 2 357 VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR 2MNR 107
5.9 Playing Nicely with Others in a Shared Environment
5.9.4 Creating Archives of Your Data
So, after months of your time, hundreds of megabytes of files, and several layers of subdirectories, the otter project is finally complete. Time to move on to the next project with a clean slate. But as refreshing as it may sound, you can't just type:
% rm -rf otter/
Other people may need to look back at your findings or use them as a starting point for their own research. At the other extreme, you can't leave your files lying around or laboriously copy them a few at a time to another location.
Not every file needs to be accessible at all times; some files are replaced, while others are more conveniently stored elsewhere. This section covers the tools provided by Unix for archiving your data so you don't have to worry about it on a day-to-day basis but can find things later when you need them.
5.9.4.1 tar: Hold the feathers
Safari | Developing Bioinformatics Computer Skills -> 5.9 Playing Nicely with Others in a Shared Environment
http://safari.oreilly.com/main.asp?bookname=bioskills&snode=58 (6 of 9) [6/2/2002 8:59:45 AM]
Usage: tar functions [options] [arguments] filenames
After going through all the effort of setting up your filesystem rationally, it seems like a waste to lose that structure in the process of storing it away, like hastily packed dishes in an unexpected cross-country move. Fortunately, there is a Unix command that lets you work with whole directories of files while retaining the directory structure. tar compacts a directory and all its component files and (if you ask for it) subdirectories into a single file with the name of the compacted directory and a .tar extension. The options for tar break down into two types: functions (of which you must choose one) and options. tar is short for "tape archive," since the utility was originally designed to read and write archives stored on magnetic tape. Another common use of tar is to package software in a form that can be easily transferred over the Internet.
To run tar, you must choose one of the following functions:
c
Creates a new tape archive r
Appends the files to an existing archive u
Adds files to the archive if they aren't present or are modified x
Extracts files from an existing archive t
Prints a table of contents of the archive The options for tar are as follows:
f archive
Performs the specified operation on archive, which can either be a device (such as a tape drive or a removable disk) or a tar file
v
(verbose mode) Prints the name of each file archived or extracted with a character to indicate the function (a for archived; x for extracted)
w
(whiny mode) Asks for confirmation at every step
Note that neither functions nor options require the hyphen that usually precedes Unix command options.
If you type:
% tar cvf otter/
the otter/ directory and all its subdirectories are rolled into a single file called otter.tar. It's good practice to use the v option, so you can see if something is going horribly wrong while the archive is being processed.
If, on the other hand, you want to make an archive of the otter/ directory on the tape drive nftape, you can type:
% tar cvf /dev/nftape otter/
A couple of warnings about tar are in order. First, before you use tar on your system, you should use which to find out whether the GNU or the standard version is installed. Several of the options mean different things to each version;
the ones listed earlier are the same in each version.
Second, the tar file you create will be as large as all the contents of the directory and subdirectories beneath it. This condition has dire implications if your archived directory is large and you have limited disk space, or you need to transfer large amounts of tar 'd data. In these cases, you should break down the directory into subdirectories of a more manageable size, and tar those instead.
Safari | Developing Bioinformatics Computer Skills -> 5.9 Playing Nicely with Others in a Shared Environment
http://safari.oreilly.com/main.asp?bookname=bioskills&snode=58 (7 of 9) [6/2/2002 8:59:45 AM]
If you don't have the space on your current filesystem or partition for your files and the archive you are creating to exist simultaneously, or you wish to download a whole archive file and unpack it just to retrieve a few files, you can transfer your archive over the network or even just to another partition using a combination of ftp and tar commands.
Sending an archive this way and then extracting it at the destination can be less time-consuming than a cp -r if a large number of files are involved. The ftp program recognizes a form in which a command replaces the input filenames.
The command is executed in a subshell on the local machine and operates on files on the local filesystem. The construct is:
ftp command "| command" filename
Inside the ftp program, here's how to send the output of the tar command, enclosed in quotes, into the filename specified as the target on the remote machine:
put "|tar cvBf - *" filename
Here's how to direct the downloaded archive through the tar command, resulting in extraction of only the files in the specified directory within the archive:
get filename.tar "|tar xvf - dirname"
Finally, here's how to list the contents of the remote archive:
get filename.tar "|tar t - *"
5.9.4.2 compress
Usage: compress -[options] filenames
Ultimately, you don't want to be left with large—if more manageable—tar files cluttering up your filesystem. In this situation, data-compression utilities are important, since they allow you to cheat and reduce the amount of space that files take up on your hard disk. compress is the standard Unix file-compression command. It's the opposite of
uncompress, the command used in Chapter 3 to open compressed papers and software. compress adds a .Z to the end of the filename.
Here are the most useful options for compress : -f
Forces compression; even if there is already a compressed version of the file, the main effect is to not overwrite an existing compressed file
-v
(verbose mode) Prints percentage compression achieved by the file -r
(recursive mode) If compress is applied to a directory that contains subdirectories, compresses their contents as well as those of the original directory
If you have a text file named stoat.txt and the tar file of the otter/ directory from the last section, and you want to compress both and look at the resulting compression ratio achieved, type:
% compress -v stoat.txt otter.tar
This command produces two files stoat.txt.Z and otter.tar.Z. The files can be uncompressed using the uncompress command or gzip -d (described next). In case you were wondering, natural languages (the kind humans use) end up with a compression ratio around 60%, and programming languages get around 40%. Try compressing the sequences of some of your favorite proteins to see what sort of ratio you get: the values can be wildly variable, depending on whether there are repeats in the sequence.
5.9.4.3 gzip
Usage: gzip -[options] filenames
As usual, in addition to the standard Unix compress, there's a faster and more efficient GNU utility: gzip. gzip behaves in much the same way as compress, except that it gets better compression on average, since it uses a superior
Safari | Developing Bioinformatics Computer Skills -> 5.9 Playing Nicely with Others in a Shared Environment
http://safari.oreilly.com/main.asp?bookname=bioskills&snode=58 (8 of 9) [6/2/2002 8:59:45 AM]
algorithm. gzip adds the suffix .gz to a file that it compresses. It emulates the compress options described earlier and adds a few of its own:
-N
(default setting) Preserves the original name and timestamp from the file being compressed -q
(quiet mode) Suppresses warnings when running -d
Returns a file that has been compressed by gzip to its uncompressed state; gzip can also recognize and uncompress files produced by compress
Delivered for Maurice ling Swap Option Available: 7/15/2002
Last updated on 10/30/2001 Developing Bioinformatics Computer Skills, © 2002 O'Reilly
< BACK Make Note | Bookmark CONTINUE >
Index terms contained in this section
archives, creating for data
© 2002, O'Reilly & Associates, Inc.
Safari | Developing Bioinformatics Computer Skills -> 5.9 Playing Nicely with Others in a Shared Environment
http://safari.oreilly.com/main.asp?bookname=bioskills&snode=58 (9 of 9) [6/2/2002 8:59:45 AM]
Show TOC | Frames My Desktop | Account | Log Out | Subscription | Help
Programming > Developing Bioinformatics Computer Skills > III: Tools for Bioinformatics See All Titles
< BACK Make Note | Bookmark CONTINUE >
158127045003020048038218232180015152050067001135112006120214154184001152226225090070035