
Merging Datafiles with join


5.4 Transformations and Filters

5.4.6 Merging Datafiles with join

Usage: join -[options] file1 file2

join merges two files based on the contents of a specified join field, where lines from the two files having the same value in the join field are assumed to correspond. Files are assumed to have a tabular format consisting of fields and a consistent field separator, and are assumed to be sorted in increasing order of their join fields.

Command-line options for join include:

-1 fieldnum

Uses the specified field number as the join field in file 1

-2 fieldnum

Uses the specified field number as the join field in file 2

-t character

Uses the specified character as the field delimiter for both input and output

-e string

Replaces empty output fields with the specified string

-a filenum

Produces output for each unpairable line in the specified file, in addition to the paired lines; can be specified for both input files; fields belonging to the other input file are empty

-v filenum

Produces output only for unpairable lines in the specified file

-o list

Constructs the output lines from the list of specified fields, where the format of each item in the field list is filenum.fieldnum; multiple items in the list can be separated by commas or whitespace
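The options above can be tried on a pair of small tables. The following is a minimal sketch, not an example from the book: the file names function.txt and location.txt and their contents are invented for illustration, and LC_ALL=C is set on the assumption that byte-order collation is wanted so that sort and join agree. In the -o list, 0 stands for the join field itself (a POSIX feature of join), and -e is combined with -o because it fills fields selected by that list.

```shell
#!/bin/sh
# Hypothetical sample data; the file names and contents are invented.
export LC_ALL=C                      # byte-order collation for sort and join
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two colon-delimited tables, each already sorted on field 1.
printf '%s\n' 'geneA:kinase' 'geneB:protease' 'geneC:ligase' > function.txt
printf '%s\n' 'geneA:chr1' 'geneC:chr7' 'geneD:chrX' > location.txt

# Default behaviour: only keys present in both files are reported.
paired=$(join -t : function.txt location.txt)

# -v 1: only the lines of file 1 that have no partner in file 2.
only_in_first=$(join -t : -v 1 function.txt location.txt)

# -a 1 -a 2 keeps unpairable lines from both files; -o fixes the output
# columns (0 is the join field) and -e fills the resulting gaps with NA.
full=$(join -t : -a 1 -a 2 -e NA -o 0,1.2,2.2 function.txt location.txt)

printf '%s\n' "$paired" '--' "$only_in_first" '--' "$full"
```

With this data, the last command keeps geneB and geneD even though each appears in only one file, with NA marking the missing column.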

join is quite useful for constructing data tables from multiple files, and a sequence of join operations can construct a complicated file. In a simple example, there are three files:

mustelidae.color:

badger black
ermine white
long-tailed tan
otter brown
stoat tan

mustelidae.prey:

ermine mouse
badger mole
stoat vole
otter fish
long-tailed mouse

mustelidae.habitat:

river otter
snowfield ermine
prairie long-tailed
forest badger
plains stoat

First, combine mustelidae.color and mustelidae.prey. The field both have in common is the name of the animal, which is the first field in each file. mustelidae.prey isn't yet sorted. The form of the join command needed is:

% sort mustelidae.prey | join mustelidae.color - > outfile

which writes the following lines to outfile:

badger black mole
ermine white mouse
long-tailed tan mouse
otter brown fish
stoat tan vole

Now combine the resulting file with mustelidae.habitat. If you want the resulting output to be in the form habitat animal prey color, use the command:

% sort -k2 mustelidae.habitat | join -1 2 -2 1 -o 1.1,2.1,2.3,2.2 - outfile

This operates on the standard input and the output file from the previous step to produce the output:

forest badger mole black
snowfield ermine mouse white
prairie long-tailed mouse tan
river otter fish brown
plains stoat vole tan
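The two-step merge can be reproduced end to end with a short script. This is a sketch under a few assumptions: the three files are recreated in a temporary directory with the contents listed earlier, LC_ALL=C is set so that sort and join use the same byte-order collation, and the variable name final is introduced only for illustration.

```shell
#!/bin/sh
export LC_ALL=C                      # keep sort and join collation consistent
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three input files from the text.
printf '%s\n' 'badger black' 'ermine white' 'long-tailed tan' \
    'otter brown' 'stoat tan' > mustelidae.color
printf '%s\n' 'ermine mouse' 'badger mole' 'stoat vole' \
    'otter fish' 'long-tailed mouse' > mustelidae.prey
printf '%s\n' 'river otter' 'snowfield ermine' 'prairie long-tailed' \
    'forest badger' 'plains stoat' > mustelidae.habitat

# Step 1: sort the prey file on the fly and join it to the color file.
# The "-" tells join to read its second input from standard input.
sort mustelidae.prey | join mustelidae.color - > outfile

# Step 2: sort the habitat file on its second field (the animal), join on
# that field, and reorder the columns as habitat, animal, prey, color.
final=$(sort -k2 mustelidae.habitat |
    join -1 2 -2 1 -o 1.1,2.1,2.3,2.2 - outfile)

printf '%s\n' "$final"
```

Writing the intermediate result to outfile mirrors the text; a longer chain of joins could repeat the same pattern with one intermediate file per step.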

Usage: sort -[general options] -o[outfile] -[key interpretation options] -t[char] -k[keydef]... [filenames]

The sort command can sort a single file, sort a group of files and simultaneously merge them into a single file, or check a file for sortedness. This function has many applications in data processing. Each line in the file is treated as a single field by default, but keys can also be defined by the user on the command line.

The main options for sort are:

-c

Tests a file for sortedness based on the user-selected options

-m

Merges several input files

-u

Displays only one instance of lines that compare as equal

-o outfile

Sends the output to a file instead of sending it to standard output

-t char

Uses the specified character to delimit fields
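The main options can be exercised on two short lists. This is an illustrative sketch only: run1.txt and run2.txt are invented file names with invented contents, and LC_ALL=C is an assumption made to get predictable byte-order sorting. Note that -c reports its result through the exit status rather than through output, and that -m expects its inputs to be sorted already.

```shell
#!/bin/sh
# Hypothetical sample data; the file names and contents are invented.
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'otter' 'badger' 'stoat' 'badger' > run1.txt   # unsorted, one duplicate
printf '%s\n' 'ermine' 'otter' 'weasel' > run2.txt           # already sorted

# -c only checks: exit status 0 means sorted, nonzero means out of order.
if sort -c run1.txt 2>/dev/null; then run1_state=sorted; else run1_state=unsorted; fi
if sort -c run2.txt 2>/dev/null; then run2_state=sorted; else run2_state=unsorted; fi

# -u drops the repeated "badger"; -o writes the result to a file.
sort -u -o run1.sorted run1.txt

# -m merges inputs that are already sorted; adding -u collapses the
# "otter" entry that appears in both files.
merged=$(sort -m -u run1.sorted run2.txt)

printf '%s\n' "run1: $run1_state" "run2: $run2_state" "$merged"
```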

Options that determine how keys are interpreted can be used as global options, but they can also be used as flags on a particular key. The key interpretation options for sort are:

-b

Ignores leading whitespace when determining where a sort key starts and ends.

-r

Reverses the sort order for a particular key.

-d

Uses "dictionary order" in making comparisons; i.e., characters other than letters, digits, and whitespace are ignored.

-f

Reclassifies lowercase letters as uppercase for the purpose of making comparisons. Normally, L and l would be separated from each other due to being in uppercase and lowercase character sets; with the -f flag, all L's end up together, whether capitalized or not.
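The effect of the key interpretation options is easiest to see side by side. The sketch below applies -f, -r, and -d as global options to two invented files (names.txt and species.txt are hypothetical). It assumes LC_ALL=C, where uppercase letters sort before lowercase ones and the hyphen sorts before letters; other locales may order the plain cases differently.

```shell
#!/bin/sh
# Hypothetical sample data; the file names and contents are invented.
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'lysine' 'Leucine' 'alanine' 'Zinc' > names.txt
printf '%s\n' 'longhorn' 'long-tailed' > species.txt

# Plain byte order: all uppercase initials come before lowercase ones.
plain=$(sort names.txt)        # Leucine Zinc alanine lysine

# -f folds case, so the L words end up together.
folded=$(sort -f names.txt)    # alanine Leucine lysine Zinc

# -r reverses the plain order.
reversed=$(sort -r names.txt)  # lysine alanine Zinc Leucine

# Without -d the hyphen takes part in the comparison; with -d it is
# ignored, so "long-tailed" is compared as if it were "longtailed".
with_hyphen=$(sort species.txt)       # long-tailed longhorn
dictionary=$(sort -d species.txt)     # longhorn long-tailed

printf '%s\n' "$plain" '--' "$folded" '--' "$reversed" '--' "$with_hyphen" '--' "$dictionary"
```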

5.4.7.1 Specifying sort keys

Key definitions are arguments of the -k option. The form of a key definition is position1,position2. Each is a numerical value that specifies where within the line the key starts and ends. Positions can have the form field.character, where field specifies the field position in the input line, and character specifies the position of the starting character of the key within its individual field. If the key is flagged with one of the key interpretation options, the form of the key is field.character[flags]. If the key interpretation option isn't applied to the whole sort, but merely to one key, then it's appended to the key definition without a preceding hyphen.
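A few key definitions illustrate the notation. This sketch uses an invented colon-delimited file (records.txt is a hypothetical name) so that -t gives unambiguous field boundaries, and it assumes LC_ALL=C for byte-order comparisons. It shows a key restricted to one field, a key with the r flag appended without a hyphen, and a key that starts and ends at a character offset inside a field.

```shell
#!/bin/sh
# Hypothetical sample data; the file name and contents are invented.
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Fields: animal:habitat:color
printf '%s\n' 'stoat:plains:tan' 'badger:forest:black' \
    'otter:river:brown' 'ermine:snowfield:white' > records.txt

# -k2,2: the key starts and ends at field 2 (the habitat).
by_habitat=$(sort -t : -k2,2 records.txt)

# -k3,3r: key on field 3 only, with the r flag attached to that key.
by_color_desc=$(sort -t : -k3,3r records.txt)

# -k1.3,1.3: the key is the third character of field 1
# (d for badger, m for ermine, o for stoat, t for otter).
by_third_char=$(sort -t : -k1.3,1.3 records.txt)

printf '%s\n' "$by_habitat" '--' "$by_color_desc" '--' "$by_third_char"
```

Giving both a start and an end position keeps the key from running on to the end of the line, which is what a bare -k2 would do.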


Last updated on 10/30/2001 Developing Bioinformatics Computer Skills, © 2002 O'Reilly
