2.1 The characteristics of Relational data and DBMS

Relational data bases contain one or more physically independent tables - rows and columns - of data. Such a table is referred to as a "Relation". Each row or record of the relation is referred to as a "tuple" (as in "N-tuple"). The tuples of any one Relation must all have the same logical structure and content. The data items in any one column of a Relation must all belong to the same "Domain" of information.

The standard data manipulation operations available in a Relational DBMS consist simply of "SELECT", "PROJECT" and "JOIN".

SELECT operates on a single table to select certain records according to content. PROJECT operates on a single table to extract (to "project out") entire columns. Either or both of these two operations applied to a table create a new table as a row and/or column subset of the original.

The JOIN operation takes the records of two separate tables and "joins" them together wherever the same value of a specific field is found in the records of both tables. This again creates a new table. A JOIN operation is, however, only meaningful if the JOIN is made on two columns, one in each table, which both belong to the same domain.
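The three operations can be sketched in a few lines of Python. This is only an illustration, assuming a relation is held as a list of dictionaries; the function names and the sample tables (`lines`, `blocks`) are hypothetical and do not come from any particular DBMS.

```python
def select(relation, predicate):
    """Return the rows of a relation that satisfy a predicate on their content."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Keep only the named columns, dropping any duplicate rows that result."""
    result = []
    for row in relation:
        reduced = {c: row[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left, right, column):
    """Combine rows of two relations wherever `column` holds the same value."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]


# Hypothetical sample relations; `block_no` is the shared domain.
lines = [
    {"line_no": 10, "block_no": 1},
    {"line_no": 20, "block_no": 1},
    {"line_no": 30, "block_no": 2},
]
blocks = [
    {"block_no": 1, "area": "north"},
    {"block_no": 2, "area": "south"},
]

# JOIN, then SELECT, then PROJECT: each step yields a new relation.
north_lines = project(
    select(join(lines, blocks, "block_no"), lambda row: row["area"] == "north"),
    ["line_no"],
)
print(north_lines)  # [{'line_no': 10}, {'line_no': 20}]
```

Each function returns a new list, which mirrors the point that every operation creates a new table rather than modifying an existing one.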

It was shown by Codd [5], the originator of this model, that an admixture of these three basic operations, applied recursively and repeatedly as required, could isolate any definable subset of the entire data base as a single Relation. Furthermore, it is not necessary to create any predefined access paths, pointers, linkages or indices to attain this goal.

In order for these operations to proceed correctly, however, it is necessary that the following rules apply to the Relations:

1) every record must have one or more fields whose value(s) is (are) unique within the relation (a unique key).

2) all other fields within the record must be "functionally dependent" on this key.

3) all other fields must be functionally independent of each other.

These rules together constitute what is called "third normal form" and the process of putting the data base in such order is known as "Normalisation". (The phrase "The Key, the whole Key, and nothing but the Key, so help me Codd" is a helpful mnemonic for this.)
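The three rules can be probed on sample data with a small sketch. It assumes a relation held as a list of dictionaries, and the helper names and records are hypothetical. Note that a sample can only refute a functional dependency, never prove one, so a `True` result means "not contradicted by this data".

```python
def is_unique_key(relation, key):
    """Rule 1: no two records share the same value(s) in the key field(s)."""
    values = [tuple(row[k] for k in key) for row in relation]
    return len(values) == len(set(values))


def determines(relation, lhs, rhs):
    """True if equal `lhs` values always go with equal `rhs` values in this sample."""
    seen = {}
    for row in relation:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # setdefault stores the first `right` seen for `left`, then returns it.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical records: `aircraft` depends on `line_no`, not directly on `fid`.
records = [
    {"fid": 1, "line_no": 10, "aircraft": "A"},
    {"fid": 2, "line_no": 10, "aircraft": "A"},
    {"fid": 3, "line_no": 20, "aircraft": "B"},
]

print(is_unique_key(records, ["fid"]))                   # True  (rule 1 holds)
print(determines(records, ["fid"], ["line_no"]))         # True  (rule 2 not contradicted)
print(determines(records, ["line_no"], ["aircraft"]))    # True  (rule 3 is suspect)
```

The last result flags a dependency between two non-key fields, which is the situation rule 3 forbids; normalising would move `aircraft` into a separate relation keyed on `line_no`.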

2.1.1 The general advantages of the Relational Model

The relational model offers certain potential advantages over the Hierarchical and Network models:-

1) It greatly simplifies the user's view of the data base. Instead of complex graphical structures, we now have sets of simple tables - data in the form we normally conceive of it.

2) No pre-defined retrieval bias is built in to the relational DBMS in the form of hierarchical linkages or network pointers, etc. Hence retrievals from any and all viewpoints are equally feasible.

3) Operations apply to whole tables and create new whole tables.

Consequently, the relational approach is currently very much favoured by the user community.

It must be noted, though, that this is employed in practice as a conceptual model which governs the users' view of the data base and formally defines the structures of the data base and the manipulation processes applicable to it. All existing Relational DBMSs that are acceptably efficient rely internally, beyond the user's view, on older, well-proven techniques taken straight from their Hierarchical and Network progenitors.

2.2 The characteristics of large geoscience data bases

Among the largest geoscience data bases that currently exist are those derived from large scale aero-geophysical surveys, and the volume of data gathered by such surveys is steadily increasing. This increase is exhibited by all three possible degrees of freedom - size of survey area, data acquisition rates and number of survey parameters recorded.

Bristow [6] describes an airborne gamma ray spectrometry system capable of sampling up to 1024 energy channels at rates of up to 4 samples per second. Each channel/sample is a 16 bit word, hence the maximum data acquisition rate is 8 Kbytes/second or approximately 30 Megabytes/hr. 100 hours of survey flight would therefore acquire approximately 3 Gigabytes of digital data.
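The acquisition rate can be checked directly from the quoted figures (1024 channels, 4 samples per second, 16 bit words). The short calculation below is only a worked check; the variable names are illustrative, and decimal Megabytes and Gigabytes are assumed for the hourly and 100-hour totals.

```python
CHANNELS = 1024          # energy channels per sample
SAMPLES_PER_SECOND = 4   # maximum sampling rate
BYTES_PER_WORD = 2       # one 16 bit word per channel

bytes_per_second = CHANNELS * SAMPLES_PER_SECOND * BYTES_PER_WORD
bytes_per_hour = bytes_per_second * 3600
bytes_per_100_hours = bytes_per_hour * 100

print(bytes_per_second / 1024)      # 8.0 Kbytes per second
print(bytes_per_hour / 1e6)         # about 29.5 Megabytes per hour
print(bytes_per_100_hours / 1e9)    # about 2.95 Gigabytes per 100 hours
```

On these figures, 100 hours of flight at the maximum rate yields roughly 3 x 10^9 bytes.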

The Thailand aerogeophysical survey [7] currently in progress contains about 1,000,000 kilometers of survey flight line. Magnetics, gamma-spectrometry, VLF and electronic navigation (ENS) data are all recorded digitally. The volume of raw data (as recorded in-flight) will be several Gigabytes. The processed final data (interpolated grids, etc.) will be in the Terabyte range.

2.2.1 The logical content of survey data bases

Regardless of type, such survey data bases generally contain the following logical data entities:-

i) FIELD - A single number, e.g. a spatial coordinate, a geophysical measurement.

ii) STATION - A group of fields of different types all related to the same space/time point.

iii) LINE - An ordered sequence of stations.

iv) BLOCK - A group of lines which can be processed as an independent unit.

v) SURVEY - One or more blocks constituting an entire project.

vi) MAP - A subset of a block or survey within an arbitrarily defined polygonal boundary.

vii) GRID - A set of geodata values all of the same type arranged in grid order over all or part of the survey area.
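The containment relationships among the first five entities can be sketched as nested data classes. This is a minimal illustration only: the class and attribute names are assumptions, and MAP and GRID are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Station:
    # FIELD values keyed by name, all tied to one space/time point.
    fields: Dict[str, float]


@dataclass
class Line:
    line_no: int
    stations: List[Station] = field(default_factory=list)  # ordered sequence


@dataclass
class Block:
    block_no: int
    lines: List[Line] = field(default_factory=list)


@dataclass
class Survey:
    name: str
    blocks: List[Block] = field(default_factory=list)


# Hypothetical survey with one block, one line and two stations.
survey = Survey(
    name="demo",
    blocks=[
        Block(
            block_no=1,
            lines=[
                Line(
                    line_no=10,
                    stations=[
                        Station(fields={"x": 0.0, "y": 0.0, "mag": 51200.0}),
                        Station(fields={"x": 50.0, "y": 0.0, "mag": 51210.5}),
                    ],
                )
            ],
        )
    ],
)

station_count = sum(
    len(line.stations) for block in survey.blocks for line in block.lines
)
print(station_count)  # 2
```

Each entity simply owns a list of the entities one level below it.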

2.2.2 The inherent structure of survey data bases

The essential function of a DBMS applied to geoscience data is to support efficient storage and retrieval of the above described logical entities.

These entities, however, clearly exhibit inherent structures more suited to the oldest data model than to the Relational model. For example, "Survey - Block - Line - Station - Field" is a strictly one-to-many structure, as are "Survey - Map - Line" and "Survey - Grid - Value", i.e. all are strict hierarchies.

Furthermore, the retrieval requirements when processing this data have strictly "top-down" hierarchical search paths, i.e. the basic requirement is to retrieve the lowest level entities - a line of stations - from a specified survey and map.

This writer has never encountered an application requiring an inverse search, i.e. to find the line, map and survey which contain a specific aeromagnetic field value or Uranium count.
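The difference between the two search directions can be seen in a small sketch. It assumes a hypothetical nested layout (survey, then map, then line, then a list of stations) held in Python dictionaries; the names and values are invented for illustration.

```python
# Hypothetical nested layout: survey -> map -> line -> list of stations.
database = {
    "survey_A": {
        "map_1": {
            10: [{"fid": 1, "mag": 51200.0}, {"fid": 2, "mag": 51210.5}],
            20: [{"fid": 3, "mag": 51190.0}],
        },
        "map_2": {
            30: [{"fid": 4, "mag": 51300.0}],
        },
    },
}


def get_line(db, survey, map_name, line_no):
    """Top-down retrieval: one key lookup per level of the hierarchy."""
    return db[survey][map_name][line_no]


def find_value(db, field_name, value):
    """Inverse search: every station must be visited to locate a value."""
    hits = []
    for survey, maps in db.items():
        for map_name, lines in maps.items():
            for line_no, stations in lines.items():
                for station in stations:
                    if station[field_name] == value:
                        hits.append((survey, map_name, line_no))
    return hits


print(get_line(database, "survey_A", "map_1", 20))  # [{'fid': 3, 'mag': 51190.0}]
print(find_value(database, "mag", 51300.0))         # [('survey_A', 'map_2', 30)]
```

In this layout the top-down request follows the built-in path directly, whereas the inverse request has no path to follow and falls back on a full scan.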

The above considerations indicate that this type of data is inherently suited to a hierarchical DBMS. Other considerations exist which indicate its inherent unsuitability to a Relational DBMS.

2.3 Geoscience data bases and the Relational DBMS

There is nothing inherent in the types of data entities described above that would actually prevent them from being put into and served by a relational DBMS. All of the different data sets encountered in geoscience could be structured as sets of relations (it can be shown, in fact, that any data base of any kind and complexity can be restructured as a set of relations with no loss of information.)

The important question to answer is what are the advantages and disadvantages of doing so? The general advantages of the Relational model have been described, as have certain considerations which indicate the inherent suitability of survey data to a Hierarchical model. There also exist clear disadvantages in applying a Relational DBMS to survey data.

An obvious disadvantage is one of the cited general advantages of the Relational model, i.e. the absence of pre-defined access paths. As described above, survey data bases have distinctly preferred access paths. No disadvantages ensue from having these built in, and substantial improvements in efficiency result from having them. Hence a Relational DBMS prevents exploitation of the inherent structures of the data base.

The greatest disadvantage, however, arises from the need to normalise the relations. To illustrate this, consider one of the larger data groups employed in survey data processing - the interpolated data grid.

Such grids, interpolated across and between the data traverses, are an essential prerequisite to mapping and three dimensional interpretation processes. The grids are normally stored in row or column order as sequences of binary numbers. Each number represents an interpolated value at one node of the grid. A separate "header" record defines associated grid parameters such as cell dimensions, orientation, coordinates of grid origin, data type, etc.

A single grid value cannot be assumed to be unique within the set of grid values. Hence the data as usually stored violates the first rule of normalisation (section 2.1 above). In order to normalise the data, it would be necessary to store explicitly the row and column index numbers with each grid node. As no two grid nodes can have the same row/column indices, these two fields together would constitute the mandatory unique key in each record of the grid relation.

Hence, normalisation would triple the number of data items to be stored for any grid. Although it is technically simple to expand the grid data in this way, the resulting threefold increase in storage volume and retrieval time is totally unacceptable in practice.
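The cost of normalising a grid can be illustrated with a toy example. The sketch assumes a row-order grid whose dimensions live in a separate header, as described above; the grid size, values and helper names are hypothetical.

```python
n_rows, n_cols = 3, 4  # hypothetical grid dimensions
header = {"n_rows": n_rows, "n_cols": n_cols}

# As usually stored: values only, in row order; position is implied.
grid_values = [float(r * n_cols + c) for r in range(n_rows) for c in range(n_cols)]


def node_value(values, header, row, col):
    """Recover a node from its implied position using the header alone."""
    return values[row * header["n_cols"] + col]


# Normalised form: (row, col) stored explicitly as the unique key of each record.
normalised = [
    (index // n_cols, index % n_cols, value)
    for index, value in enumerate(grid_values)
]

items_as_stored = len(grid_values)
items_normalised = sum(len(record) for record in normalised)

print(node_value(grid_values, header, 2, 1))  # 9.0
print(normalised[9])                          # (2, 1, 9.0)
print(items_normalised / items_as_stored)     # 3.0
```

The row-order form needs no stored key because the position in the sequence already encodes the row and column; the normalised form carries three items per node to hold the same information.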

Not all survey data groups are as intractable as grid data. The in-flight data set, for example, exists from the beginning as a fully normalised relation, i.e. each record contains a fiducial which is unique, and in-flight measurements which are functionally dependent on this key and independent of each other. Even so, such data is still not acceptable as-is for manipulation by a Relational DBMS because of the prerequisites of the JOIN operation.

Creation of a specific map as a subset of the entire in-flight data set involves "joining" this data set with another relation defining the map involved. The in-flight data as stored, however, even though it is normalised, does not contain a domain in common with the map definition data. Hence, the requisite join cannot be made.

In order to facilitate the join, data from the "map no." domain would have to be added as another column to the in-flight data. Likewise, for block or survey extraction. This again would reduce storage and retrieval efficiency.
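The join problem can be made concrete with a short sketch. It assumes relations held as lists of dictionaries; the records and the `map_of_fid` lookup are hypothetical. In practice the map number for each record would presumably be derived by testing its coordinates against the map's polygonal boundary, a step replaced here by a simple lookup table.

```python
def common_columns(left, right):
    """Columns present in both relations, i.e. candidate join domains."""
    return set(left[0]) & set(right[0])


def join(left, right, column):
    """Combine rows of two relations wherever `column` holds the same value."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]


# Hypothetical in-flight relation: normalised, keyed on the fiducial.
in_flight = [
    {"fid": 1, "mag": 51200.0},
    {"fid": 2, "mag": 51210.5},
    {"fid": 3, "mag": 51190.0},
]
# Hypothetical map definition relation.
map_defs = [{"map_no": 1, "title": "Sheet 1"}, {"map_no": 2, "title": "Sheet 2"}]

print(common_columns(in_flight, map_defs))  # set(): no shared domain, no join

# Assumed assignment of fiducials to maps (stand-in for a polygon test).
map_of_fid = {1: 1, 2: 1, 3: 2}
extended = [{**row, "map_no": map_of_fid[row["fid"]]} for row in in_flight]

sheet_1 = [row for row in join(extended, map_defs, "map_no") if row["map_no"] == 1]
print(len(sheet_1))  # 2
```

The extra `map_no` value carried by every record of `extended` is the storage overhead referred to above, and a similar column would be needed for each further level (block, survey) to be extracted by a join.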

3. ADAPTATION OF THE RELATIONAL MODEL TO GEOSCIENCE NEEDS