FILE SYSTEMS - Recording formats - Information Retrieval: Data Structures & Algorithms


6.3 FILE SYSTEMS

Virtually all of the research into file systems for optical disks has concentrated on WORM and CD-ROM optical disks. These are the oldest forms of optical disk technology and the types that differ most from magnetic disks. Commercially available erasable optical disks are a relatively recent phenomenon and are close enough in capabilities to magnetic disks that conventional file systems can usually be adapted. Thus, we concentrate our discussion on file systems developed for WORM and CD-ROM optical disks. Before proceeding, however, we briefly discuss some technical issues affecting the implementation of file structures on optical disks.

6.3.1 Technical Issues for File Systems

Optical disk technology is similar enough to magnetic disk technology that the same types of file structures that are used on magnetic disks can usually be used in some form on optical disks. There are, however, some differences that can make some choices better than others.

For example, conventional pointer linked file structures (e.g., B-trees) are a poor choice for WORM optical disks. The reason for this is that each modification of the file usually requires some of the pointers linking the file structure together to change; rebalancing a B-tree after an insert or delete is a good example. Since storage space cannot be reclaimed on a WORM optical disk, changing the value of a pointer requires the new value to be stored in a new disk sector, consuming space. Thus, in the course of normal file maintenance operations, extra storage space is consumed.

This type of problem is present when linked lists or trees are used (as is true for B-trees). If an element or node is modified, then it must be stored on the disk in a new position and all pointers to the position of the old version of the node must be updated to point to the position of the new version. Changing those pointers in turn changes their positions on the disk, requiring any pointers to their old positions to be changed as well. This means that if an element of a list is modified, all elements between it and the head of the list must be duplicated. The same is true for trees: if a node is modified, all nodes on the path up to, and including, the root, must be duplicated.
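The cascade of duplicated list elements can be illustrated with a minimal sketch. The WormDisk class and the function names below are hypothetical and introduced only for illustration; a "sector" is modeled as one slot of an append-only Python list, and each list element occupies one sector.

```python
class WormDisk:
    """Hypothetical append-only store: every write consumes a fresh sector."""

    def __init__(self):
        self.sectors = []

    def write(self, payload):
        self.sectors.append(payload)
        return len(self.sectors) - 1  # address of the sector just written

    def read(self, addr):
        return self.sectors[addr]


def write_list(disk, values):
    """Store a linked list, one (value, next_address) pair per sector.

    The tail is written first so that every pointer refers to a sector
    that already exists. Returns the address of the head.
    """
    next_addr = None
    for value in reversed(values):
        next_addr = disk.write((value, next_addr))
    return next_addr


def read_list(disk, head):
    """Follow the pointers from the head and collect the values."""
    values = []
    addr = head
    while addr is not None:
        value, addr = disk.read(addr)
        values.append(value)
    return values


def update_element(disk, head, index, new_value):
    """Change one element; returns the address of the new head.

    The modified element and every element between it and the head are
    rewritten, because each of them holds a pointer that must change.
    """
    path = []
    addr = head
    for _ in range(index + 1):
        value, next_addr = disk.read(addr)
        path.append((value, next_addr))
        addr = next_addr
    _, after_target = path[-1]
    new_addr = disk.write((new_value, after_target))
    for value, _ in reversed(path[:-1]):
        new_addr = disk.write((value, new_addr))
    return new_addr
```

In this sketch, changing the third element of a four-element list consumes three new sectors (the element itself and its two predecessors), while the old head still leads to the unmodified version of the list.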

This is not the only problem with using pointer linked structures on WORM optical disks; the direction that the pointers point is also of considerable consequence. A forward pointer is one that points from an older disk sector (written before) to a younger disk sector (written after), and a backward pointer points from a younger sector to an older sector. It is a characteristic of WORM optical disks that it is not possible to detect a bad disk sector before the sector has been written; thus, if a forward pointer is stored on the disk, there is a chance that the sector it points to may subsequently turn out to be unusable, making the forward pointer invalid. Sector substitution schemes that might deal with this problem can be envisioned, but a better solution is to simply avoid the problem where possible by using backward pointers. Backward pointers do not have this problem as they always point to valid sectors.
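A small sketch may make the difference concrete. It assumes a hypothetical drive model (FlakyWormDisk) in which a bad sector is discovered only when a write to it is attempted, at which point the writer moves on to the next sector. Every record carries a backward pointer to the previously written record, whose address is already known to be good.

```python
class FlakyWormDisk:
    """Hypothetical WORM model: bad sectors are found only when written."""

    def __init__(self, bad_sectors=()):
        self.bad = set(bad_sectors)  # sectors that will fail on write
        self.sectors = {}            # address -> payload, written once
        self.next_free = 0

    def write(self, payload):
        """Write to the next usable sector and return the address used."""
        while True:
            addr = self.next_free
            self.next_free += 1
            if addr not in self.bad:
                self.sectors[addr] = payload
                return addr


def append_record(disk, value, prev_addr):
    """Store a record with a backward pointer to the previous record."""
    return disk.write((value, prev_addr))


def read_chain(disk, newest):
    """Walk the backward pointers from the newest record to the oldest."""
    values = []
    addr = newest
    while addr is not None:
        value, addr = disk.sectors[addr]
        values.append(value)
    return values
```

If sectors 1 and 2 turn out to be bad, the second record lands in sector 3. A forward pointer stored with the first record would have had to guess that address in advance, whereas the backward pointers are all written after their targets and remain valid.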

Preallocation of disk space on a WORM disk can also lead to problems. Most disk drives are unable to differentiate between reading a blank sector and reading a defective sector (i.e., they both are unreadable). Thus, if space has been reserved on a disk, it will be impossible to detect the difference between the beginning of preallocated (and blank) space on the disk and a previously written sector that is now unreadable (possibly because of a media defect, dirt, or scratches). This inability will render unreliable any organization that depends on preallocated space on a WORM disk.

A further aspect of WORM optical disk technology to consider when designing file structures is the granularity of data registration. On a WORM disk the smallest unit of data storage is the disk sector which has a typical capacity of 1 kilobyte. It is not possible to write a portion of a disk sector at one time and then another portion of the same sector later. A sector may be written once, and only once, and can never be updated. This restriction comes from the error correction code added to the data when the disk sector is written. If a sector is modified after its initial write, its contents would become inconsistent with the error correction code and the drive would either correct the "error" without comment or report a bad sector. With the inability to update the contents of a disk sector, there is a danger of wasting the storage capacity of the disk through sector fragmentation.
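The cost of sector fragmentation can be estimated with simple arithmetic. The sketch below assumes the 1-kilobyte sector quoted above and compares writing each small record to the disk as it arrives with collecting the records first and writing them together; the function names are illustrative only.

```python
import math

SECTOR_BYTES = 1024  # typical sector capacity quoted in the text


def sectors_needed(record_sizes, batch=False):
    """Sectors consumed when records are written one by one or as a batch."""
    if batch:
        return math.ceil(sum(record_sizes) / SECTOR_BYTES)
    return sum(math.ceil(size / SECTOR_BYTES) for size in record_sizes)


def wasted_fraction(record_sizes, batch=False):
    """Fraction of the consumed sector space that holds no data."""
    consumed = sectors_needed(record_sizes, batch) * SECTOR_BYTES
    if consumed == 0:
        return 0.0
    return 1 - sum(record_sizes) / consumed
```

For ten 100-byte records, writing each record separately consumes ten sectors and leaves roughly 90 percent of that space unused, while writing them as one batch consumes a single sector.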

Not all of the characteristics of optical disks lead to problems for the implementation of file structures. On CD-ROM disks, for example, since the data never changes some optimizations are possible. Also, the spiral track found on CLV format disks lends itself nicely to hash file organizations by allowing hash buckets of arbitrary size to be created, allowing bucket overflow to be eliminated.
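One way to picture such a hash file is sketched below, under several assumptions: the complete set of records is known before the disk is mastered, the spiral track is modeled as one contiguous Python list, and CRC-32 stands in for whatever hash function a real system would use. Each bucket is laid out at exactly the length it needs, so no overflow area is required.

```python
import zlib


def bucket_of(key, n_buckets):
    """Deterministic bucket number (CRC-32 is an arbitrary choice here)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def build_static_hash_file(records, n_buckets):
    """Lay out variable-length buckets back to back on one linear track.

    Returns (directory, track), where directory[b] is the pair
    (start position, number of records) for bucket b.
    """
    buckets = [[] for _ in range(n_buckets)]
    for key, value in records:
        buckets[bucket_of(key, n_buckets)].append((key, value))
    track = []
    directory = []
    for bucket in buckets:
        directory.append((len(track), len(bucket)))
        track.extend(bucket)
    return directory, track


def lookup(directory, track, key):
    """Read exactly one bucket and scan it for the key."""
    start, length = directory[bucket_of(key, len(directory))]
    for stored_key, value in track[start:start + length]:
        if stored_key == key:
            return value
    return None
```

Because the data never change after mastering, the directory can be computed once and every lookup touches a single contiguous bucket, however unevenly the keys happen to hash.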

6.3.2 Write-Once B-Tree

The Write-Once B-Tree (WOBT) of Easton (1986) is a variation of a B-tree organization developed for WORM disks. The difference between the two structures is the manner in which the contents of the tree's nodes are maintained.

In a conventional B-tree, when a node is modified, its previous value or state is discarded in favor of the new value. This practice recycles storage space, but is not desirable in all situations. For some applications the ability to access any previous state of the tree is a distinct advantage; the WOBT allows this.

The WOBT manages the contents of the tree nodes in a manner that preserves all versions of a node throughout the life of the tree. This is accomplished by not overwriting nodes when they change, but instead appending new time-stamped entries to the node. The most current state of the tree is represented by the latest version of each entry (including pointers) in a node.

When a node is split in a WOBT, only those entries that are valid at the current time are copied into the new version of the node. The old version is retained intact. Deletions are handled by inserting a deletion marking record that explicitly identifies the record to be deleted and its deletion time.

The diagram in Figure 6.1 illustrates a WOBT with three nodes containing the records C, D, F, G, and H. The diagram shows the root node labeled 1 and two children, nodes 2 and 3. The root has an extra entry used to link different versions of the root node together. Being the first root of the tree, node 1 has a NULL (0) pointer for this entry. The rest of the entries in the root point to the children of the root. The first data entry indicates that C is the lowest record in node 2, and F the lowest in node 3.

Figure 6.1: Write-once B-tree

When a record A is added to the tree, it is simply appended to node 2 in the space available for it, and an entry is propagated up to the parent of node 2, the root. This is illustrated in Figure 6.2.

Figure 6.2: Write-once B-tree after insertion of "A"

The root now contains two entries which point to node 2. When we access the tree with respect to the current time, we use only the most recent values of the record/pointer pairs, in this case A/2 and F/3, which are found by a sequential search of the node. If we access the tree with respect to a time before the insertion of A, we find the record/pointer pairs C/2 and F/3. The record A in node 2 would not be found in that search because its later time-stamp would disqualify it.
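The time-qualified sequential search described above can be sketched as follows. The representation is an assumption made for illustration: an index node is an append-only list of (timestamp, key, child) entries, a leaf is an append-only list of (timestamp, key, is_deletion) entries, and the integer timestamps are invented. A later entry for the same child is taken to supersede an earlier one, which is how the example treats C/2 and A/2.

```python
def index_view(entries, as_of):
    """Record/pointer pairs of an index node as seen at time as_of.

    Entries are scanned in the order they were written; entries with a
    later time-stamp are ignored, and the most recent remaining entry
    for each child wins.
    """
    latest = {}
    for timestamp, key, child in entries:
        if timestamp <= as_of:
            latest[child] = key
    return sorted((key, child) for child, key in latest.items())


def leaf_view(entries, as_of):
    """Records of a leaf as seen at time as_of, honoring deletion markers."""
    live = set()
    for timestamp, key, is_deletion in entries:
        if timestamp > as_of:
            continue
        if is_deletion:
            live.discard(key)
        else:
            live.add(key)
    return sorted(live)
```

With the root of Figure 6.2 written as [(1, "C", 2), (1, "F", 3), (2, "A", 2)], a search as of time 2 sees the pairs A/2 and F/3, and a search as of time 1 sees C/2 and F/3.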

When we insert a further record "J" into the tree, we must split both node 3 and the root node 1. The result is illustrated in Figure 6.3. When we split node 3 we end up with two new nodes 4 and 5. When node 1 is "split" in this case, we can make room in the new node by not including the data/pointer pair C/2 which is now obsolete. This results in one new node, node 6. The extra entry in the new root node now points back to the old root, node 1, so that accesses with respect to a previous time are still possible.

Node 3 is no longer in the current version of the tree, but can still be accessed as part of the old version of the tree (prior to the insertion of "J") through the old root.

Figure 6.3: Write-once B-tree after insertion of "J"
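The creation of a new root version can be sketched in the same spirit. Everything about the representation is assumed for illustration: nodes live in a dictionary keyed by node number, each root carries a "prev_root" back pointer (0 for NULL, as in the text), a "created" time, and a list of (timestamp, key, child) entries. The separator keys chosen for nodes 4 and 5 in the usage note are hypothetical, since they depend on the figure.

```python
NULL = 0  # the text uses 0 as the NULL pointer of the first root


def current_pairs(entries):
    """Most recent key for each child, scanning entries in write order."""
    latest = {}
    for _timestamp, key, child in entries:
        latest[child] = key
    return latest


def new_root_version(nodes, old_root_id, new_root_id,
                     retired_child, replacements, now):
    """Write a new root that holds only the currently valid pairs.

    The old root is left untouched; the new root points back to it so
    that searches with respect to an earlier time remain possible.
    """
    pairs = current_pairs(nodes[old_root_id]["entries"])
    pairs.pop(retired_child, None)
    for key, child in replacements:
        pairs[child] = key
    ordered = sorted(pairs.items(), key=lambda item: item[1])
    nodes[new_root_id] = {
        "prev_root": old_root_id,
        "created": now,
        "entries": [(now, key, child) for child, key in ordered],
    }
    return new_root_id


def root_as_of(nodes, root_id, as_of):
    """Follow the back pointers to the root version in force at as_of."""
    while root_id != NULL:
        if nodes[root_id]["created"] <= as_of:
            return root_id
        root_id = nodes[root_id]["prev_root"]
    return None
```

Starting from node 1 with entries C/2, F/3, and A/2, retiring child 3 and adding hypothetical pairs F/4 and H/5 produces node 6 with three entries and a back pointer to node 1. The obsolete pair C/2 is not copied, and node 1 itself is unchanged.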

The WOBT has several drawbacks. Search times will be slower than for a conventional B-tree, as extra time is required for a sequential search of each node to determine the current values of the data/pointer pairs. The WOBT also has problems with sector fragmentation, since all modifications of the tree are stored directly on the WORM disk as they are made. It is also not particularly efficient with frequent updates to a single record, since each update must be stored on the disk in at least one disk sector.

However, despite these drawbacks, the WOBT is a robust (using backward pointers only) and elegant solution to the problem of efficiently implementing a multiway tree on an indelible storage device.

6.3.3 Time-Split B-Tree

The Time-Split B-tree (TSBT) of Lomet and Salzberg (1989) is an enhancement of the Write-once B-tree that eliminates some of its problems while adding to its utility. The basic structure and operation of the TSBT are identical to those of the WOBT. The difference between the two is that the TSBT distributes the contents of the tree over both a magnetic disk and a WORM optical disk, and employs a slightly different approach to node splitting that reduces redundancy.

Combining magnetic and optical storage technologies allows their individual strengths to complement each other. This is an idea also employed in the buffered hashing organization of Christodoulakis and Ford (1989a). Write-once optical disks have a very low cost per stored bit and also ensure that data cannot be accidentally or maliciously deleted. Magnetic disks offer faster access and allow modifications to stored data without consuming storage space. A TSBT employs a magnetic disk to store the current changeable contents of the tree and the write-once disk to store the unchangeable historical contents.

The migration of the historical contents of the TSBT to the write-once disk is a consequence of node splitting. The TSBT divides node splits into two types: Time Splits and Key Splits. A time split occurs when a node is full of many historical (i.e., not in the current version of the tree) entries. A time split makes a new version of the node that omits the historical versions of records. In the TSBT, the old version of the node is removed from the magnetic disk and stored on the WORM optical disk.

A key split is the normal type of split associated with conventional B-trees and occurs when a node is full and most of the records are current (in a conventional B-tree, the records are always current). The result is two new nodes stored on the magnetic disk, each with roughly half of the records (historical and current) of the original node.

The advantages of the Time-split B-tree over the Write-once B-tree are many. The magnetic disk allows faster access to the current data than would be possible if the data were stored on an optical disk alone. It also allows transactions to make nonpermanent entries before they commit; these can be deleted if the transaction aborts. Records can also be updated without consuming disk space. Sector fragmentation is also lower, since the magnetic disk buffers many small changes into one larger change.

The splitting policy of a TSBT can be skewed to favor different objectives. If the total amount of storage space consumed is a concern, then key space splits should be favored as time splits tend to increase redundancy in the database. If the size of the current version of the B-tree (i.e., the part of the tree being stored on the magnetic disk) is a concern, then time splits should be favored as they free occupied space on the magnetic disk.
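The two kinds of split, and a simple policy for choosing between them, can be sketched as follows. This is not the algorithm of Lomet and Salzberg; it is a minimal illustration under stated assumptions: a node is a list of (key, timestamp, value) record versions, the latest version of each key is the current one, the write-once disk is an append-only list, and the single parameter historical_threshold is a hypothetical knob for skewing the policy.

```python
def current_versions(entries):
    """Keep only the latest version of each key."""
    latest = {}
    for key, timestamp, value in entries:
        if key not in latest or timestamp >= latest[key][1]:
            latest[key] = (key, timestamp, value)
    return sorted(latest.values(), key=lambda e: e[0])


def time_split(entries, worm):
    """Move the old node to the write-once disk; keep current versions."""
    worm.append(list(entries))
    return current_versions(entries)


def key_split(entries):
    """Divide all versions, historical and current, around a middle key."""
    ordered = sorted(entries, key=lambda e: (e[0], e[1]))
    keys = sorted({key for key, _, _ in ordered})
    pivot = keys[len(keys) // 2]
    low = [e for e in ordered if e[0] < pivot]
    high = [e for e in ordered if e[0] >= pivot]
    return low, high


def split_full_node(entries, worm, historical_threshold=0.5):
    """Choose a time split when enough of the node is historical."""
    n_current = len(current_versions(entries))
    historical_fraction = 1 - n_current / len(entries)
    if historical_fraction >= historical_threshold:
        return "time", [time_split(entries, worm)]
    return "key", list(key_split(entries))
```

Lowering historical_threshold favors time splits, which free space on the magnetic disk; raising it favors key splits, which avoid the redundancy that time splits introduce.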

When implementing the TSBT, care must be taken to recognize that the size of the tree is limited not by the capacity of the optical disk, but by the capacity of the magnetic disk, as it stores all of the current contents of the tree. If the magnetic disk is full, the amount of remaining storage space on the optical disk will be irrelevant.

6.3.4 Compact Disk File System

The Compact Disk File System (CDFS) of Garfinkel (1986) is a system independent hierarchical file system for WORM optical disks. The goals of the CDFS are to be completely transportable across a variety of modern operating systems, to make efficient use of storage space, and to have a relatively high retrieval performance.

Unlike the write-once and time-split B-trees, the CDFS does not provide a structure for the organization of records, but rather a structure for the organization of groups of complete files. The application for which it is primarily intended is to organize files that experience few modifications, such as those belonging to a source code archive. The smallest unit of registration in the CDFS organization is the file.

The basic unit of organization in the CDFS is called a "transaction." A transaction results from the process of writing a complete group of files on the optical disk. All the files in a transaction group are placed on the disk immediately adjacent to the position of the previous transaction. Each individual file is stored contiguously. At the end of a transaction, an updated directory list for the entire file system is stored along with an "End of Transaction" (EOT) record. The EOT record contains a link to the EOT record of the previous transaction allowing access to historical versions of the organization (a dummy EOT record is stored at the start of an empty disk). The last transaction on the disk is the starting point for all accesses and the directory list it contains represents the current version of the file hierarchy.

The CDFS contains three types of "files": regular files, directories, and links. Each file is given a unique sequence number for the file system and a version number. If a file is updated by writing a new copy of it at some later time, the new copy retains the sequence number, but receives a new version number.
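The transaction structure can be sketched with an append-only list standing in for the disk. This is a simplified illustration, not Garfinkel's on-disk format: the class name CdfsSketch, the record tuples, and the header fields are all assumptions, there is a single root directory, and links and deletions are omitted.

```python
class CdfsSketch:
    """Hypothetical, much-simplified model of CDFS transactions."""

    def __init__(self):
        # A dummy EOT record is stored at the start of an empty disk.
        self.disk = [("EOT", None, None)]
        self.last_eot = 0
        self.next_seq = 1

    def _append(self, record):
        self.disk.append(record)
        return len(self.disk) - 1

    def root_at(self, eot_addr):
        """Root directory (name -> file address) as of a given EOT."""
        _, _prev_eot, dirlist_addr = self.disk[eot_addr]
        if dirlist_addr is None:
            return {}
        _, directory_addrs = self.disk[dirlist_addr]
        return self.disk[directory_addrs[0]][1]

    def commit(self, files):
        """Write one transaction: files, directory, directory list, EOT."""
        root = dict(self.root_at(self.last_eot))
        for name, body in files.items():
            if name in root:
                old_header = self.disk[root[name]][1]
                seq = old_header["seq"]            # sequence number is kept
                version = old_header["version"] + 1
            else:
                seq, version = self.next_seq, 1
                self.next_seq += 1
            header = {"name": name, "seq": seq, "version": version}
            root[name] = self._append(("FILE", header, body))
        dir_addr = self._append(("DIR", root))
        dirlist_addr = self._append(("DIRLIST", [dir_addr]))
        self.last_eot = self._append(("EOT", self.last_eot, dirlist_addr))
        return self.last_eot

    def read(self, name, eot_addr=None):
        """Return (header, body) of a file, optionally as of an older EOT."""
        root = self.root_at(self.last_eot if eot_addr is None else eot_addr)
        _, header, body = self.disk[root[name]]
        return header, body

    def history(self):
        """Addresses of all EOT records, newest first."""
        chain = []
        addr = self.last_eot
        while addr is not None:
            chain.append(addr)
            addr = self.disk[addr][1]
        return chain
```

Committing three files and then a new copy of the second yields two transactions. The newest EOT record gives the current hierarchy, in which the second file has version number 2, and following the EOT chain backward recovers the earlier state, in which it has version number 1. Every pointer in the sketch is a backward pointer.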

Each stored file entry consists of two parts, a file header and a file body. The header, which is invisible to a user, stores a large amount of explicit information about the file. This is an attempt by the CDFS to span the entire space of file characteristics that any given operating system might record or require. For example, the file header contains the name of the owner of the file; on some systems (e.g., UNIX) this information must be derived by consulting a system database. This explicit information allows the contents of a single disk employing a CDFS to appear to be a native file system on more than one operating system (with appropriate drivers for each system).

A directory is a special type of file that contains entries identifying other files known as members. These entries include pointers to the disk positions of the members and other information such as file sizes and modification times. This extra information is redundant since it is also stored in the file header, but it serves to improve the performance of directory list operations by eliminating the seeks required to access each member.

A link entry is simply a pointer to a file or a directory and allows a file to be a member of more than one directory.

The directory list stored at the end of the files in the transaction is an optimization to reduce seeks. It is a list of the positions of all current directories and subdirectories in the hierarchical file system. Using the directory list improves performance by reducing the seeks needed to traverse the file directory tree.

The diagram in Figure 6.4 illustrates how an instance of a CDFS is stored in a series of disk sectors. The example is for two transactions for a CDFS consisting of three files. The second transaction is used to store a second expanded version of the second file in the file system. The arrows in the diagram represent the pointers which link the various constituents of the CDFS together. For example, (backward) pointers exist between the EOT records, between each EOT record and its directory list, between the directory list and the directories (in this case just one, the root), and between the root and the three files.

Figure 6.4: State of compact disk file system after two transactions

The CDFS is an efficient means of organizing an archive of a hierarchical file system. Its main drawback is that it does not allow efficient updates to files. Any change to a single file requires the entire file to be rewritten along with a new copy of the directory list. Extra information could be added to the file header to allow files to be stored noncontiguously. This would allow portions of the file to be changed while other parts remained intact.

The robustness of the CDFS inherent in the degree of redundancy found in the organization, coupled with the access it allows to all previous versions of a file, makes it ideal for use in storing file archives.

Being relatively system independent, the CDFS is also an excellent organization for data interchange. It would be possible, for example, to copy a complete UNIX file system to an optical disk employing the CDFS and then transport it to an appropriate VMS installation and access the archived UNIX file system as if it were a native VMS file system.

6.3.5 The Optical File Cabinet

The Optical File Cabinet (OFC) of Gait (1988) is another file system for WORM optical disks. Its goals are quite different from those of the CDFS described previously. Its main objective is to use a WORM disk to simulate an erasable file system such as that found on a magnetic disk, and to appear to an operating system just as if it were any other magnetic disk file system. It does this by creating a logical disk block space which can be accessed and modified at random on a block-by-block basis through a conventional file system interface.

The mapping between the logical and physical blocks is provided by a structure called the File System Tree (FST), which resides on the WORM optical disk (see Figure 6.5). To translate between logical and physical blocks, the logical block number is used to find a path through the FST to a leaf; the leaves of the tree are the physical disk blocks of the WORM optical disk.

Figure 6.5: File system tree (FST) for optical file cabinet
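The translation from logical to physical blocks can be illustrated with a small sketch. The details are assumptions and are not taken from Gait's design: the tree has a fixed fan-out and depth, the digits of the logical block number select the child at each level, the disk is an append-only list, and an update writes a new leaf plus new copies of the interior nodes on its path. Buffering in main memory is ignored.

```python
FANOUT = 4  # hypothetical number of children per interior node
DEPTH = 3   # hypothetical tree depth; FANOUT**DEPTH logical blocks


def path_digits(logical_block):
    """Digits of the logical block number, most significant first."""
    if not 0 <= logical_block < FANOUT ** DEPTH:
        raise ValueError("logical block number out of range")
    digits = []
    for _ in range(DEPTH):
        digits.append(logical_block % FANOUT)
        logical_block //= FANOUT
    return digits[::-1]


class OpticalFileCabinetSketch:
    def __init__(self):
        self.disk = []    # append-only list of physical blocks
        self.root = None  # physical address of the current root node

    def _append(self, block):
        self.disk.append(block)
        return len(self.disk) - 1

    def read(self, logical_block):
        """Follow the path chosen by the block number down to a leaf."""
        addr = self.root
        for digit in path_digits(logical_block):
            if addr is None:
                return None
            addr = self.disk[addr][digit]
        return None if addr is None else self.disk[addr]

    def write(self, logical_block, data):
        """Append a new leaf and new copies of the nodes on its path."""
        digits = path_digits(logical_block)
        path = []
        addr = self.root
        for digit in digits:
            if addr is None:
                node = [None] * FANOUT
            else:
                node = list(self.disk[addr])
            path.append(node)
            addr = node[digit]
        child = self._append(data)
        for node, digit in zip(reversed(path), reversed(digits)):
            node[digit] = child
            child = self._append(tuple(node))
        self.root = child
```

In this sketch each logical write consumes one physical block for the data and one for each level of the tree, and nothing already written is ever changed, so the logical block space appears rewritable even though the medium is not.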

Both the interior nodes and the leaves of the tree are buffered in main memory. The buffers are periodically flushed (e.g., every 5 minutes) and written on the optical disk in a process called
