Filesystem Basics - Files and Directories in Unix

Part II: The Bioinformatics Workstation

Chapter 4. Files and Directories in Unix

4.1 Filesystem Basics

All computer filesystems, whether on Unix systems or desktop PCs, are basically the same. Files are named locations on the computer's storage device. Each filename is a pointer to a discrete object with a beginning and end, whether it's a program that can be executed or simply a set of data that can be read by a program. Directories or folders are containers in which files can be grouped. Computer filesystems are organized hierarchically, with a root directory that branches into subdirectories and subdirectories of subdirectories.

This hierarchical system can help organize and share information, if used properly. Like the taxonomy of species developed by the early biologists, your file hierarchy should organize information from the general level to the specific. Each time the filesystem splits into subdirectories, it should be because there are meaningful divisions to be created within a larger class of files.

Why should you organize your computer files in a systematic, orderly way? It seems like an obvious question with an obvious answer. And yet, a common problem faced by researchers and research groups is failure to share information effectively. Problems with information management often become apparent when a research group member leaves, and others are required to take over his project.

Imagine you work with a colleague who keeps all his books and papers piled in random stacks all over his office. Now imagine that your colleague gets a new job and needs to depart in a hurry—leaving behind just about everything in his office. Your boss tells you that you can't throw away any of your colleague's papers without looking at them, because there might be something valuable in there. Your colleague has not organized or categorized any of his papers, so you have to pick up every item, look at it, determine if it's useful, and then decide where you want to file it. This might be a week's work, if you're lucky, and it's guaranteed to be a tough job.

This kind of problem is magnified when computer files are involved. First of all, many highly useful files, especially binaries of programs, aren't readable as text files by users. Therefore, it's difficult to determine what these files do if they're not documented. Other kinds of files, such as files of numerical data, may not contain useful header information. Even though they can be read as text, it may be next to impossible to figure out their purpose.

Second, space constraints on computer system usage are much more nebulous than the walls of an office. As disk space has become cheaper, it's become easier for users of a shared system simply never to clean up after themselves. Many programs produce multiple output files and, if there's no space constraint that forces you to clean up while running them, can produce a huge mess in a short time.

How can you avoid becoming this kind of problem for your colleagues? Awareness of the potential problems you can cause is the first step. You need to know what kinds of programs and files you should share with others and which you should keep in your own directories. You should establish conventions for naming datafiles and programs and stick to these conventions as you work. You should structure your filesystem in a sensible hierarchy. You should keep track of how much space you are using on your computer system and create usable archives of your data when you no longer need to access it frequently. You should create informative documentation for your work within the filesystem and within programs and datafiles.

The nature of the filesystem hierarchy means that you already have a powerful indexing system for your work at your fingertips. It's possible to do computer-based research and be just as disorganized as that coworker who piles all his books and papers in random stacks all over his office. But why would you want to do that? Without much more effort, you can use your computer's filesystem to keep your work organized.

4.1.1 Moving Around the Directory Hierarchy

As in all modern operating systems, the file hierarchy on a Unix system is structured as a tree. You may be used to this from PC operating systems. Open one folder, and there can be files and more folders inside it, layered as deep as you want to go. There is a root directory, designated as /. The root directory branches into a finite number of files and subdirectories. On a well-organized system, each of these subdirectories contains files and other subdirectories pertaining to a particular topic or system function.

Safari | Developing Bioinformatics Computer Skills -> 4.1 Filesystem Basics

http://safari.oreilly.com/main.asp?bookname=bioskills&snode=45 (1 of 5) [6/2/2002 8:54:45 AM]

Of course, there's nothing inside your computer that really looks like a tree. Files are stored on various media, most commonly the hard disk, which is a recordable device that lives in your computer. As its name implies, the hard disk is really a disk. And the tree structure that you perceive in Unix is simply a way of indexing what is on that disk or on other devices such as CDs, floppy disks, and Zip disks, or even on the disks of every machine in a group of networked computers. Unix has extensive networking capabilities that allow devices on networked computers to be mounted on other computers over the network. Using these capabilities, the filesystems of several networked computers can be indexed as if they were one larger, seamless filesystem.

4.1.2 Paths to Files and Directories

Each file on the filesystem can be uniquely identified by a combination of a filename and a path. You can reference any file on the system by giving its full name, which begins with a / indicating the root directory, continues through a list of subdirectories (the components of the path) and ends with the filename. The full name, or absolute path, of a file in someone's home directory might look like this:

/home/jambeck/mustelidae/weasels.txt

The absolute path describes the relationship of the file to the root directory, /. Each name in the path represents a subdirectory of the prior directory, and / characters separate the directory names.
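As a brief illustration, here is a minimal sketch in Python (a language this section does not otherwise use) that splits the example path into its components with the standard pathlib module. PurePosixPath only manipulates the text of the path; it does not check whether the file actually exists.

```python
from pathlib import PurePosixPath

# The example path from the text; nothing here touches the real filesystem.
path = PurePosixPath("/home/jambeck/mustelidae/weasels.txt")

print(path.is_absolute())  # True: the path starts at the root directory
print(path.parts)          # ('/', 'home', 'jambeck', 'mustelidae', 'weasels.txt')
print(path.parent)         # /home/jambeck/mustelidae
print(path.name)           # weasels.txt
```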

Every file or directory on the system can be named by its absolute path, but it can also be named by a relative path that describes its relationship to the current working directory. Files in the directory you are in can be uniquely identified just by giving the filename they have in the current working directory. Files in subdirectories of your current directory can be named in relation to the subdirectory they are part of. From jambeck's home directory, he can uniquely identify the file weasels.txt as mustelidae/weasels.txt. The absence of a preceding / means that the path is defined relative to the current directory rather than relative to the root directory.

If you want to name a directory that is on the same level or above the current working directory, there is a shorthand for doing so. Each directory on the system contains two links, ./ and ../, which refer to the current directory and its parent directory (the directory it's a subdirectory of), respectively. If user jambeck is working in the directory /home/jambeck/mustelidae/weasels, he can refer to the directory /home/jambeck/mustelidae/otters as ../otters. A subdirectory of a directory on the same level of the hierarchy as /home/jambeck/mustelidae would be referred to as ../../didelphiidae/opossums.
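The way relative paths resolve against the working directory can be sketched in Python. The helper name resolve is our own, chosen for illustration; it relies on the standard posixpath module, which collapses . and .. purely as text and therefore ignores complications such as symbolic links.

```python
import posixpath

def resolve(cwd, relative):
    """Join a relative path to the working directory and collapse . and .. textually."""
    return posixpath.normpath(posixpath.join(cwd, relative))

cwd = "/home/jambeck/mustelidae/weasels"

print(resolve(cwd, "../otters"))                    # /home/jambeck/mustelidae/otters
print(resolve(cwd, "../../didelphiidae/opossums"))  # /home/jambeck/didelphiidae/opossums
print(resolve(cwd, "./"))                           # /home/jambeck/mustelidae/weasels
print(resolve("/home/jambeck", "mustelidae/weasels.txt"))
```

The last call shows the earlier case of a file named relative to the home directory; it yields the same absolute path used as the first example in this section.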

Another shorthand naming convention, which is implemented in the popular csh and tcsh shell environments (and in most other modern shells, such as bash), is that the path of the home directory can be abbreviated as ~. The directory /home/jambeck/mustelidae can then be referred to as ~/mustelidae.
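The substitution the shell performs can be sketched as follows. The function expand_tilde is a simplified, hypothetical helper written for this illustration: it takes the home directory as an argument and does not handle the ~username form. In real Python programs, the standard function os.path.expanduser does this job using the current user's actual home directory.

```python
import posixpath

def expand_tilde(path, home):
    """Replace a leading ~ with the given home directory (simplified: no ~user form)."""
    if path == "~":
        return home
    if path.startswith("~/"):
        # Drop the "~/" prefix and any extra slashes before joining.
        return posixpath.join(home, path[2:].lstrip("/"))
    return path

home = "/home/jambeck"  # the home directory used in the text's examples

print(expand_tilde("~/mustelidae", home))  # /home/jambeck/mustelidae
print(expand_tilde("~", home))             # /home/jambeck
print(expand_tilde("/tmp/~notes", home))   # /tmp/~notes (a ~ elsewhere is left alone)
```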

4.1.3 Using a Process-Based File Hierarchy

Filesystems can be deep and narrow or broad and shallow. It's best to follow an intuitive scheme for organizing your files. Each level of hierarchy should be related to a step in the process you've used to carry out the project. A filesystem is probably too shallow if the output from numerous processing steps in one large project is all shoved together in one directory. However, a project directory that involves several analyses of just one data object might not need to be broken down into subdirectories. The filesystem is too deep if versions of output of a process are nested beneath each other or if analyses that require the same level of processing are nested in subdirectories. It's much easier for you to remember and for others to understand the paths to your data if they clearly symbolize steps in the process you used to do the work.

As you'll see in the upcoming example, your home directory will probably contain a number of directories, each containing data and documentation for a particular project. Each of these project directories should be organized in a way that reflects the outline of the project. Each directory should contain documentation that relates to the data within it.

4.1.4 Establishing File-Naming Conventions for Your Work

Unix allows an almost unlimited variability in file naming. Filenames can contain any character other than the / or the null character (the character whose binary representation is all zeros). However, it's important to remember that some characters, such as a space, a backslash, or an ampersand, have special meaning on the command line and may cause problems when naming files. Filenames can be up to 255 characters in length on most systems. However, it's wise to aim for uniformity rather than uniqueness in file naming. Most humans are much better at remembering frequently used patterns than they are at remembering unique 255-character strings, after all.
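These constraints can be turned into a small checker, sketched here in Python. The function name check_filename and the set of shell-special characters are our own assumptions for illustration; the character set is a partial list, and the 255 limit is measured in bytes here, which is how many filesystems count it.

```python
# Characters the shell treats specially; an illustrative subset, not a complete list.
SHELL_SPECIAL = set(" \\&*?|;<>()$`'\"!")
MAX_NAME_BYTES = 255  # the common limit mentioned in the text; varies by filesystem

def check_filename(name):
    """Return a list of human-readable problems found in a single filename."""
    problems = []
    if not name:
        problems.append("empty name")
    if "/" in name or "\0" in name:
        problems.append("contains / or the null character (not allowed)")
    if len(name.encode("utf-8")) > MAX_NAME_BYTES:
        problems.append("longer than 255 bytes")
    awkward = sorted(set(name) & SHELL_SPECIAL)
    if awkward:
        problems.append("shell-special characters: " + repr("".join(awkward)))
    return problems

print(check_filename("PDB-unique-25.fasta"))     # []
print(check_filename("my results & notes.txt"))  # flags the spaces and the ampersand
print(check_filename("data/raw.txt"))            # flags the slash
```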

A common convention in file naming is to name the file with a unique name followed by a dot (.) and then an extension that uniquely indicates the file type.

As you begin working with computers in your research and structuring your data environment, you need to develop your own file-naming conventions, or preferably, find out what naming conventions already exist and use them consistently throughout your project. There's nothing so frustrating as looking through old data sets and finding that the same type of file has been named in several different ways. Have you found all the data or results that belong together? Can the file you are looking for be named something else entirely? In the absence of conventions, there's no way to know this except to open every unidentifiable file and check its format by eye. The next section provides a detailed example of how to set up a filesystem that won't have you tearing out your hair looking for a file you know you put there.

Here are some good rules of thumb to follow for file-naming conventions:

● Files of the same type should have the same extension.

● Files derived from the same source data should have a common element in their unique names.

● The unique name should contain as much information as possible about the experiment.

● Filenames should be as short as is possible without compromising uniqueness.
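One payoff of following these rules is that files can be sorted mechanically by type and by source. The following Python sketch groups a handful of names by extension; the names are examples used elsewhere in this section, and the grouping code itself is only an illustration, not part of any established tool.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Example filenames drawn from elsewhere in this section.
names = [
    "PDB-unique-25.list",
    "PDB-unique-25.fasta",
    "PDB-unique-25-7.shard",
    "weasels.txt",
]

by_extension = defaultdict(list)
for name in names:
    p = PurePosixPath(name)
    # suffix is the final extension including the dot; stem is everything before it
    by_extension[p.suffix].append(p.stem)

for ext, stems in sorted(by_extension.items()):
    print(ext, stems)
```

Because the three PDB files share the element PDB-unique-25 in their unique names, their stems also reveal at a glance that they derive from the same source data.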

You'll probably encounter preestablished conventions for file naming in your work. For instance, if you begin working with protein sequence and structure datafiles, you will find that families of files with the same format have common extensions. You may find that others in your group have established local conventions for certain kinds of datafiles and results. You should attempt to follow any known conventions.

4.1.5 Structuring a Project: An Example

Let's take a look at an example of setting up a filesystem. These are real directory layouts we have used in our work; only the names have been changed to protect the innocent. In this case, we are using a single directory to hold the whole project.

It's useful to think of the filesystem as a family tree, clustering related aspects of a project into branches. The top level of your project directory should contain two text files that explain the contents of the directories and subdirectories. The first file should contain an outline of the project, with the date, the names of the people involved, the question being investigated, and references to publications related to this project. Tradition suggests that such informational files should be given a name along the lines of README or 00README. For example, in the shards project, a minimal README file might contain the following:

98-05-22

Project: Shards

Personnel: Per Jambeck, Cynthia Gibas

Question: Are there recurrent structural words in the three-dimensional structure of proteins?

Outline: Automatic construction of a dictionary of elements of local structure in proteins using entropy maximization-based learning.

The second file should be an index file (named something readily recognizable like INDEX) that explains the overall layout of the subdirectories. If you haven't really collected much data yet, a simple sketch of the directories with explanations should do. For example, the following file hierarchy:

98-03-22 PJ
Layout of the Shards directory
(see README in subdirectories for further details)
/shards
/shards/data/results/globins
/shards/data/test_cases
/shards/graphics
/shards/text
/shards/text/notebook
/shards/text/reports
/shards/programs
/shards/programs/source
/shards/programs/scripts
/shards/programs/bin

may also be represented in graphical form, as shown in Figure 4-1.

Figure 4-1. Tree diagram of a hierarchy

In this directory, we've made the first distinction between programs and data (programs contains the software we write, and data contains the information we get from databases, or files the programs generate). Within the data directory, we further distinguish between types of data (in this case, protein structures and protein sequences), results gleaned from running our programs on the data (for two sets of proteins, the enolase family and the globin superfamily), and some test cases. Programs are also subdivided according to types, namely whether they are the human-readable program listings (source code), scripts that aid in running the programs, or the binaries of the programs.
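A layout like this can be created in one step rather than directory by directory. The Python sketch below builds the directories named in the INDEX file inside a temporary location, so it is safe to run; the function name make_project and the LAYOUT list are our own, and only the directory names come from the example.

```python
import tempfile
from pathlib import Path

# Directories listed in the INDEX file, relative to the project root.
LAYOUT = [
    "data/results/globins",
    "data/test_cases",
    "graphics",
    "text/notebook",
    "text/reports",
    "programs/source",
    "programs/scripts",
    "programs/bin",
]

def make_project(root):
    """Create the project directories under root and return them as sorted relative paths."""
    root = Path(root)
    for sub in LAYOUT:
        # parents=True also creates intermediate directories such as data/ and data/results/
        (root / sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())

# Build the tree in a throwaway temporary directory so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    for d in make_project(Path(tmp) / "shards"):
        print(d)
```

To set up a real project, you would pass the intended project path to make_project instead of a temporary one, and then add the README and INDEX files described above.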

As we mentioned earlier, when you store data in files, you should try to use a terse and consistent system for naming files. Excessively long filenames that describe the exact contents of a file but change for different file types (like all-GPCR-loops-in-SWISSPROT-on-99-7-14.text) will cause problems once you start using the facilities Unix provides for automatically searching for and updating files. In the shards project, we began with protein structures taken from the Protein Data Bank (PDB). We then used a homegrown Perl program called unique.pl to generate a nonredundant database, in which no protein's sequence had greater than 25% similarity to any other protein in the set. Thus, we can represent this information economically using the filename PDB-unique-25 for files related to this data set. For example, the list of the names of proteins in the set, and the file containing the proteins' sequences in FASTA format (a common text-file format for storing macromolecular sequence data), are stored, respectively, in:

PDB-unique-25.list
PDB-unique-25.fasta

Files containing derived data can be named consistently as well. For example, the file containing all seven-residue pieces of protein structure derived from the nonredundant set is called PDB-unique-25-7.shard. This way, if you need to do something with all files pertaining to this nonredundant database, you can use the wildcard PDB-unique-25*, ignoring databases generated by different programs or those generated with unique.pl at different similarity thresholds.
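The effect of that wildcard can be checked with Python's standard fnmatch module, which applies shell-style patterns to names. In this sketch the directory listing is hypothetical: the three PDB-unique-25 names come from the text, while PDB-unique-40.list and README are assumed entries added to show what the pattern leaves out.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing; only the PDB-unique-25 names come from the text.
listing = [
    "PDB-unique-25.list",
    "PDB-unique-25.fasta",
    "PDB-unique-25-7.shard",
    "PDB-unique-40.list",  # assumed: same program, different similarity threshold
    "README",
]

matches = [name for name in listing if fnmatchcase(name, "PDB-unique-25*")]
print(matches)  # ['PDB-unique-25.list', 'PDB-unique-25.fasta', 'PDB-unique-25-7.shard']
```

Note that the * matches any run of characters, so a name such as PDB-unique-250.list would also be selected. That is one more reason to settle on a naming scheme in which such collisions cannot occur.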

File naming conventions can take you only so far in organizing a project; the simple naming schemes we've laid out here will become more and more confusing as a project grows. For larger projects, you should consider using a database management system (DBMS) to manage your data. We introduce database concepts in Chapter 13.


