
Types of Documents

In the document Advances in Pattern Recognition (pages 35-38)

The wide coverage of the definition of ‘document’ makes it impossible to provide an exhaustive systematization of the various kinds thereof. Nevertheless, some intuitive and useful categorizations of documents, based on different (although sometimes overlapping) perspectives, can be identified. In the following, a short list will be provided, according to which the matter of this book will be often organized:

Support A first distinction can be made between tangible and intangible supports.³ The latter category includes digital documents, while the former can, in turn, be sensibly split into written documents (such as those on paper, papyrus or parchment) and documents whose content must be impressed on their supports (e.g., stone or clay) using other techniques.

Time of production A rough distinction can also be made between legacy documents (i.e., documents that were produced using classical, non-computerized techniques, for which an original digital source/counterpart is not available) and ‘current’ ones. The former typically correspond to tangible documents and the latter to digital ones, but the correspondence is not sharp. Indeed, tangible documents are still produced nowadays, although they have lost predominance, and recent years have seen the widespread use of both kinds at the same time.

Historical interest Similar to the previous distinction, but in some sense neater and born from a specific perspective, is the one between historical documents and ‘normal’ ones. Here, the focus is on the value of a document based not just on its content, but on its physical support as well. In this case the document as an object cannot be separated from its essence, because even a scanned copy would not replace the original. Hence, this perspective is strictly related to the problems of preservation, restoration and wide access.

³ Another term sometimes used to convey the same meaning is ‘physical’. However, it is a bit odd because even intangible supports require a physical implementation to be perceived, e.g., air is the physical support of the intangible ‘sound’ documents, and magnetic polarization or electrical signals are the physical support of the intangible ‘digital’ documents.

Table 1.1 Some types of documents and their categorization

             | Support    | Production | Interest   | Medium               | Structure | Formalism
-------------+------------+------------+------------+----------------------+-----------+---------------------
Speech       | intangible | any        | normal     | sound                | no        | natural language
Greek roll   | tangible   | legacy     | historical | text, graphic        |           |
Picture      | tangible   | legacy     | any        | graphic              | no        | light
Music        | intangible | any        | normal     | sound                | no        | temperament
Program code | intangible | current    | normal     | text                 | content   | programming language
Web page     | intangible | current    | normal     | text, graphic, sound | content   | tagged text

Medium Another obvious distinction is based on the kind of medium that is used to convey the information. In the case of paper documents, the choice is limited to text and graphics, while digital documents may also include other kinds of media, such as sound.

Structure Moving towards the document content, different categories can be defined according to the degree of structure they exhibit. The concept of structure typically refers to the meaningful arrangement of document components to form the document as a whole, but can sometimes refer to the inner structure of the single components as well.

Representation formalism Also related to the content is the categorization by representation formalism. There are different formalisms both for paper documents, whose interpretation is intended for humans, and for digital documents, whose interpretation is intended for computer and telecommunication systems. However, digital documents are often intended for humans as well, in which case both levels of representation are involved.

As can be easily noted, these categorizations show some degree of orthogonality, and thus allow several possible combinations (Table 1.1 reports a few sample cases). Of course, not all combinations bear the same importance or interest, but some are more typical or outstanding and deserve particular attention. Specifically, this book focuses on documents (and document components) from a visual and understanding perspective. Thus, a selection on the support and media types is immediately obtained: as to the former, we are interested in digital and paper documents only, while as to the latter only text and graphics will be considered. In other words, only ‘printable’ documents and components will be dealt with, not (e.g.) music.
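The selection just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `DocumentType`, the constant `PRINTABLE_MEDIA` and the method `printable_components` are hypothetical names, not part of the book. The rows are those of Table 1.1; the two cells of the Greek roll row that are not legible here are left as `None`. Only the medium criterion is modeled, because the table records supports as tangible or intangible rather than as paper or digital.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Media that the text treats as 'printable' (assumed name, for illustration).
PRINTABLE_MEDIA = frozenset({"text", "graphic"})


@dataclass(frozen=True)
class DocumentType:
    """One row of Table 1.1: a document type described along six dimensions."""
    name: str
    support: str
    production: str
    interest: str
    media: FrozenSet[str]
    structure: Optional[str] = None   # None: cell not legible in this copy
    formalism: Optional[str] = None   # None: cell not legible in this copy

    def printable_components(self) -> FrozenSet[str]:
        """Return the media of this type that are text or graphics."""
        return self.media & PRINTABLE_MEDIA


TABLE_1_1 = [
    DocumentType("Speech", "intangible", "any", "normal",
                 frozenset({"sound"}), "no", "natural language"),
    DocumentType("Greek roll", "tangible", "legacy", "historical",
                 frozenset({"text", "graphic"})),
    DocumentType("Picture", "tangible", "legacy", "any",
                 frozenset({"graphic"}), "no", "light"),
    DocumentType("Music", "intangible", "any", "normal",
                 frozenset({"sound"}), "no", "temperament"),
    DocumentType("Program code", "intangible", "current", "normal",
                 frozenset({"text"}), "content", "programming language"),
    DocumentType("Web page", "intangible", "current", "normal",
                 frozenset({"text", "graphic", "sound"}), "content", "tagged text"),
]

# A type is of interest as soon as it has at least one printable component.
in_scope = [d.name for d in TABLE_1_1 if d.printable_components()]
print(in_scope)
```

Because the dimensions are kept as independent fields, any combination of values can be represented, which mirrors the orthogonality noted above. A Web page stays in scope through its text and graphic components even though it may also carry sound.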

An interesting essay on what a document is in general, and a digital document in particular, can be found in [5, 6]; it is summarized and discussed hereafter.

A historical survey of the literature reveals that a ‘document’ is, traditionally, often intended as a ‘textual record’. Such a quite restrictive definition is extended by some authors to include any kind of artifact that fulfills the ‘object as a sign’ perspective. The underlying assumption is that a document must be a physical item that is generally intended and perceived as a document and (although this is less convincing for generic documents than, as we will see shortly, for digital ones) must also be the product of some (human) processing [4]. As a further requirement, it has to be cast into an organized system of knowledge. Indeed, the question concerning the distinction between medium, message and meaning is a very old one.

This question becomes topical again with digital technology, which represents everything as a sequence of bits. In fact, once one accepts that a word processor output (probably the digital artifact most similar to what is classically intended as a document) is to be considered a document, it must also be accepted that everything having the same representation is a document as well. It is only a matter of interpretation, just like one alphabet (say, the Latin one) can express many different natural languages. This sets the question of ‘what a document is’ free from any dependence on specific supports, because all media reduce, in the end, to a single medium: a sequence of (possibly electronically stored) bits. Indeed, some define a digital document as anything that can be represented as a computer file. This is the most minimal definition one can think of, and at the same time the most general. At a deeper analysis, the point is that there is no medium at all: the bits exist by themselves, without the need for the support of physical means for making them concrete, although this is obviously needed to actually store and exploit them.
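The remark that everything is a matter of interpretation can be made concrete with a minimal Python sketch. The four byte values below are arbitrary and chosen only for illustration; the same sequence of bits is read as text, as a number and as a row of grey levels.

```python
# The same four bytes (an arbitrary, illustrative choice) under several readings.
data = bytes([72, 105, 33, 10])

# The underlying medium: a plain sequence of bits.
bits = "".join(f"{byte:08b}" for byte in data)

# Interpretation 1: characters encoded in ASCII.
as_text = data.decode("ascii")

# Interpretation 2: a single unsigned integer, most significant byte first.
as_number = int.from_bytes(data, byteorder="big")

# Interpretation 3: four grey levels of a tiny one-row image (0 = black, 255 = white).
as_grey_levels = list(data)

print(bits)
print(repr(as_text))
print(as_number)
print(as_grey_levels)
```

Nothing in the bits themselves says which reading is the intended one; that knowledge lies outside the sequence, in the representation formalism agreed upon by producer and consumer.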

Moreover, accepting the thesis that what characterizes a document is its being a ‘proof’ of something, two consequences immediately follow. First, it must carry information, and information is expressed and measured in bits [10]. Thus, anything that can (at least in principle, and even with some degree of approximation) be expressed in terms of sequences of bits is a document. Second, it must be fixed in some shape that can be distributed and permanently preserved. Hence, the document needs a support, and thus any support capable of, and actually carrying, significant bits in a stable way is a document from a physical perspective. Thus, digital documents can settle the controversy on whether the physical support is decisive in determining if an object is a document or not.

The problem of digital documents, with respect to all other kinds of documents preceding the digital era, is that they are the first human artifact that is outside human control. Indeed, all previous kinds of documents (principally text and picture ones, but others as well) could be, at least in principle, directly examined by humans without the need for external supporting devices, at least from the syntactic perspective. Although the digital form allows far more efficient processing, the need for such devices imposes a noteworthy bias on the final users, and introduces the risk of not being able to access the document information content because of technology obsolescence, which after all deprives the object of the very feature that makes it a document.
