Document Image Analysis

In the document L’Université de La Rochelle (Pages 23-26)

1.3 Historical Document Project: ALPAGE
1.3.1 Scientific background and objectives
1.3.2 Global Methodology
1.4 Cadastral Map
1.5 Conclusion

1.1 Foreword

This chapter presents the overall concepts of the thesis. It begins with a general introduction to document image analysis. It then explains the need for, and importance of, dedicated services for historical documents, and presents a related project named ALPAGE. Finally, it introduces the main focus of this work: Content-Based Map Retrieval within an ancient collection of color cadastral maps.

The scope, objectives and organization of this thesis are provided at the end of this chapter.

1.2 Document Image Analysis

With the improvement of printing technology since the 15th century, a huge number of printed documents have been published and distributed. The printed book quickly became an everyday object. By 1501 there were 1000 printing shops in Europe, which had produced 35,000 titles and 20 million copies¹. Since then, a vast number of books have decayed and degraded. This means that not only the books themselves are disappearing, but also the knowledge of our ancestors.

Therefore, many attempts have been made to preserve, organize and restore ancient printed documents. With modern digital technology, digitization is one of the main preservation methods for these old documents. However, digitized documents are of limited benefit without the ability to retrieve and extract information from them, which can be achieved using document analysis and recognition techniques.

¹ http://communication.ucsd.edu/bjones/Books/printech.html

Figure 1.1: Hierarchy of document image processing; adapted from [Kasturi 2002]

Document analysis, or more precisely document image analysis (DIA), is the process that performs the overall interpretation of document images. [Nagy 2000] gave the following short definition: "DIA is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer". DIA is the subfield of digital image processing that aims at converting document images to symbolic form for modification, storage, retrieval, reuse, and transmission. In practice, a document analysis system performs the basic tasks of image segmentation, layout understanding, symbol recognition and application of contextual rules in an integrated manner. The objective of document image analysis is to recognize the text and graphics components in images and to extract the intended information as a human would. Two components of document image analysis can be defined: textual processing and graphical processing (see Figure 1.1).

Figure 1.2 illustrates a common sequence of steps in document image analysis. After data capture, the image undergoes pixel-level processing and feature analysis; text and graphics are then treated separately for recognition.
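The sequence described above can be illustrated with a toy sketch. This is not the method used in the thesis or in [Kasturi 2002]; it is a minimal illustration under stated assumptions: the page is a small grayscale grid, pixel-level processing is reduced to a fixed-threshold binarization, feature analysis to 4-connected component extraction, and text/graphics separation to a simple size heuristic. The function names and the parameters `threshold` and `max_text_size` are hypothetical.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Pixel-level processing: dark pixels (ink) become 1, background 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def connected_components(binary):
    """Feature analysis: bounding boxes (top, left, bottom, right)
    of 4-connected ink regions, in raster-scan order."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill from the first unseen ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def split_text_graphics(boxes, max_text_size=3):
    """Assumed heuristic: small components go to the text chain,
    larger ones to the graphics chain."""
    text, graphics = [], []
    for box in boxes:
        top, left, bottom, right = box
        size = max(bottom - top + 1, right - left + 1)
        (text if size <= max_text_size else graphics).append(box)
    return text, graphics


# Toy page: a small character-like blob and a long horizontal line.
W, B = 255, 0  # white background, black ink
page = [
    [W, W, W, W, W, W, W, W],
    [W, B, B, W, W, W, W, W],
    [W, B, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W],
    [B, B, B, B, B, B, B, B],
    [W, W, W, W, W, W, W, W],
]

boxes = connected_components(binarize(page))
text_boxes, graphic_boxes = split_text_graphics(boxes)
print("text:", text_boxes)
print("graphics:", graphic_boxes)
```

On real scans, a fixed size threshold like this one breaks down quickly, for example when characters touch graphical elements; it is only meant to make the data flow of the pipeline concrete.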

Figure 1.2: A sequence of steps for document analysis; adapted from [Kasturi 2002]

The analysis of ancient documents follows the same concept. However, the task is more challenging, and the text/graphic separation question is more delicate to address. A strict separation of the data flow between the text and graphic processing chains assumes that a "good" text/graphic segmentation within the document is always possible. This hypothesis does not always hold as documents become denser. This is because ancient documents carry more significance and more complexity than ordinary ones. Firstly, ancient documents have historical meaning: details that would be negligible in the structure of a recent document can be very important in the historical domain. Secondly, most ancient documents are degraded and susceptible to further decay over time. The digitization process therefore has to be handled carefully and precisely, at higher resolution. In addition, due to the large volume of ancient documents, the capturing device must cope with both quality and quantity requirements. Thirdly, the layouts or structures of ancient documents are organized differently: texts and graphics are arranged in complex ways, including the styles and fonts used in publishing. Finally, ancient documents are of interest to different users, from general users to experts; consequently, document understanding systems should take this fact into account. This raises the question of how to structure the information extracted from ancient documents so as to respond to different user requirements. Usages and user needs are hard to circumscribe because of their plurality. Usage can be either individual or collective, which conditions the way the information is structured.

Efforts to manage ancient documents are still in progress. In France, these efforts were at first generally fragmented. There was a lack of global and strategic management tools, and no common policies on handling ancient document resources or on setting management priorities. This created a risk of wasted resources, effort and investment. Digitization is also costly and needs large budgets, often based on public funding. Fortunately, with the support of the French government and the collaboration of many research laboratories, the MADONNE and NAVIDOMASS projects were set up to preserve and exploit ancient documents. These pioneering projects opened the way to increasingly challenging collaborations between the ICT and HSS communities (Information & Communications Technology, Humanities and Social Sciences); the ALPAGE project, for instance, was born in this spirit. French and European initiatives such as the French digital library Gallica² and the British Library³ show the commitment to this cause. In particular, as part of its effort to get out of the recession, the French government has approved a huge investment: a digitization program supported by a 750 M€ fund is under way. This effervescence underlines the importance of digitizing our cultural heritage.
