Document Processing Needs - Advances in Pattern Recognition

Depending on the combination of document types, environments and user objectives, many different document processing tasks can be identified, each serving the needs of some document-based activity or institution. For new documents, the needs mainly concern the semantics and interpretation of the content, and hence information retrieval and extraction. For legacy documents, additional problems concerning syntax and representation arise and are layered on top of those.

For legacy documents, the problem is how to turn them into a digital format suitable for management by computer systems. The most straightforward way to do this is scanning, which usually introduces, as a side effect, various kinds of noise that must be removed. Typical problems are bad illumination, page image distortions (such as skew angles and warped paper) and the introduction of undesired layout elements, such as specks and border lines due to shadows. Additional problems are intrinsic to the original item:

• Presence of undesired layout elements, as in the case of bleedthrough;

• Overlapping components, such as stamps and background texture;

• Lack of a standard layout, in the document organization or in its content, as with handwritten documents or non-standard alphabets and letters.
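As an illustration of the simplest kind of scanning noise mentioned above, the following sketch removes isolated specks from a binarized page image. It is only a minimal example under stated assumptions: the image is a small list of rows of 0/1 values (1 being ink), a speck is taken to be a single ink pixel with no ink among its eight neighbours, and the name remove_specks is hypothetical. Real despeckling procedures are considerably more elaborate.

```python
def remove_specks(image):
    """Return a copy of a binary image with isolated ink pixels cleared.

    `image` is a list of rows, each a list of 0/1 values (1 = ink).
    A pixel counts as a speck when none of its 8 neighbours is ink
    (an assumption made for this sketch only).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_ink_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical 5x5 page: a 2x2 block of ink plus one isolated speck.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
cleaned_page = remove_specks(page)
```

The connected block survives because each of its pixels has ink neighbours, while the lone pixel in the fourth row is cleared. Working on a copy keeps the original scan available, which matters when the cleaning criterion later proves too aggressive.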

Moreover, the digitization process must often include extraction of the document content as well, so that each component can be stored in a way suited to its specific kind of content. In this case, simply scanning or photographing the document is not sufficient. To be properly stored and processed, the textual part of the document should be identified and represented as text, while images should be cropped and stored separately. This, in turn, might impose additional requirements on how scanning is performed.

A serious problem related to the preservation of legacy documents is the durability of the copy and its actual usability in the future. Indeed, despite appearances, paper is more durable than many other media (e.g., magnetic disks and tapes are easily damaged, and the actual endurance of current optical disks is not known).

An example may clarify the matter: a few decades ago some Public Administrations decided to transfer many of their archived papers, often important official documents, onto microfilm. Microfilms have limited quality, are delicate and can be read only by specific instruments. Such instruments, and the production of microfilms itself, were very costly, and for this reason microfilm technology was soon abandoned in favor of scanning by means of computer systems. Those instruments are no longer produced, and the existing ones are progressively disappearing, so the preservation problem strikes back. Even worse, in some cases the administrations, after transferring their legacy documents onto microfilm, destroyed the originals to make room in their archives. Thus they are now unable to read the microfilms and do not have the originals either: there is a real danger that those documents have been lost. The lesson learned from this negative experience is that the adoption of a new technology must ensure that the documents will remain accessible in the future, independently of the availability of the instruments by which they were created.

For printable documents (probably the most widespread and important ones, such as administrative records, deeds and certificates) the best solution might be to provide, in addition to the digital version (to be exploited in current uses), a few ‘official’ paper copies, to be stored and exploited in case, for some reason (e.g., technological obsolescence or failure, black-outs, etc.), the digital counterpart is not accessible.

For historical documents, the main interest lies in their restoration and preservation for cultural heritage purposes. Restricting attention to written documents, the source medium is typically paper, papyrus or parchment, and hence digitization can generally be obtained by a scanning process that preserves the original appearance of the artifact in addition to its content. All the problems of legacy documents are still present, but additional ones arise from the low quality and lack of standardization of the original, as in the case of missing pieces (particularly in ancient fragments).

For digital documents (both born-digital ones and legacy ones that have been digitized according to the above criteria), a first issue is to develop suitable representation formalisms and formats that can express different kinds of content in an effective, compact and machine-processable way (three requirements that often conflict with each other, thus imposing some kind of trade-off). More generally, the interest lies in properly indexing them to improve retrieval performance, and in extracting from them relevant information that can be exploited for many purposes, such as indexing itself, categorization, understanding, etc.
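To make the link between indexing and retrieval concrete, the following sketch builds an inverted index, a structure that maps each term to the set of documents containing it. It is only one illustrative possibility: the function names build_index and search, the document identifiers and the sample texts are all hypothetical, and tokenization is reduced to lowercased runs of letters and digits.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini-collection used only for illustration.
documents = {
    "d1": "Scanned deed of sale",
    "d2": "Birth certificate, scanned copy",
    "d3": "Deed registry",
}
index = build_index(documents)
matches = search(index, "scanned deed")
```

With the index in place, answering a query requires only set intersections over the query terms rather than a scan of every document, which is where the improvement in retrieval performance comes from.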

Lastly, when switching from single documents to homogeneous collections, organizational aspects also come into play and must be properly tackled. These aspects obviously include (or at least involve) the indexing and information extraction issues described in the previous paragraph, but also raise additional needs strictly related to the overall purposes of the collection, and not only to the document content itself. Categorization and clustering of documents is often a fundamental requirement in all contexts, but distribution and delivery of the proper documents to the various users of the system might play a crucial role as well. For instance, a digital library is interested in associating documents with categories of interest; an e-mail system can derive its success from the ability to precisely discriminate spam messages; an on-line bookshop aims at making effective recommendations according to users’ desires; in an office, typical needs are interoperability, compliance with external specifications and security aspects; and so on.


Digital Formats

Recent developments in computer science disciplines have caused an ever-increasing spread of digital technologies, which has consequently turned upside down nearly all human activities in many of their aspects. At the same time, the fragmentation of the computer science landscape and the tough competition among software producers have led to a proliferation of different, and unfortunately often incompatible, ways of representing documents in digital format. Only very recently have initial efforts been started to rationalize and standardize such formats, although it must be said that some (limited) range of variability is needed and desirable in order to properly represent documents having very different characteristics (for instance, it is hardly conceivable that kinds of content as different as, say, text, images and music will ever be represented using the same ‘language’).

In the following, some of the most widespread formats will be introduced, grouped according to the kind and degree of structure they allow to be expressed in a document content representation. Raster image formats will be dealt with in more depth, to show and compare different perspectives on how visual information can be encoded. PS and PDF will also be given somewhat more room because, although famous from an end-usage perspective, they are not well known as regards their internal representation. For comparison with the raster formats, particular attention will be given to their image representation techniques. Markup languages will be discussed with reference to HTML and XML. However, the aim here is not to provide an HTML/XML programmer’s manual (many valid books on this specific subject are available); rather, their representational rationale and approach will be stressed. HTML will be discussed just to consider a tag-based format and the possibilities it provides, while XML will be presented to give an idea of how flexible information representation can be obtained by a general tag definition mechanism supported by external processing techniques.
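As a small preview of the idea of a general tag definition mechanism supported by external processing, the following sketch describes a document with tags chosen specifically for it and then reads that description with a generic XML parser from the Python standard library. The tag names (deed, title, party, issued), the attribute names and all the values are hypothetical and chosen only for illustration; they do not belong to any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag vocabulary is invented for this example.
record = """
<deed id="d-001">
  <title>Deed of sale</title>
  <party role="seller">First Party</party>
  <party role="buyer">Second Party</party>
  <issued>1957</issued>
</deed>
"""

# A generic parser handles the custom vocabulary without prior knowledge of it.
root = ET.fromstring(record)

# External processing: collect the parties by role and read the issue year.
parties = {party.get("role"): party.text for party in root.findall("party")}
issued = root.findtext("issued")
```

The parser knows nothing about deeds: the meaning of the tags lies entirely in the processing code that consumes them, which is what makes the representation flexible.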

S. Ferilli, Automatic Digital Document Processing and Management, Advances in Pattern Recognition,
