Graphic Construction of Textual Marts

Stage 5: Automatic generation of views: for every element e chosen by the user, the system must generate a view containing several attributes:

• the first attribute corresponds to the document number of the element e;

• the second attribute is the number of the element constituting the ancestor common to all the elements chosen by the user;

• the last attribute corresponds to the information of the element e.

Note: The number of “s_compose” for the attributes is determined by calculating the level number between the corresponding element (document or ancestor element) and the chosen element e.
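The level calculation in the note can be sketched in Python as follows. This is only an illustration: the parent map, the intermediate element "header", and the helper names `levels_between` and `s_compose_path` are assumptions and are not part of the system described here.

```python
def levels_between(parent_of, element, target):
    """Count the s_compose steps needed to climb from element up to target."""
    steps = 0
    current = element
    while current != target:
        if current not in parent_of:
            raise ValueError(f"{target!r} is not an ancestor of {element!r}")
        current = parent_of[current]
        steps += 1
    return steps


def s_compose_path(alias, steps):
    """Build the path expression, e.g. 'e.s_compose.s_compose' for 2 steps."""
    return alias + ".s_compose" * steps


# Hypothetical logical structure (assumption): publications > paper > header > type
parent_of = {"type": "header", "header": "paper", "paper": "publications"}

to_ancestor = levels_between(parent_of, "type", "paper")      # chosen element -> common ancestor
to_root = levels_between(parent_of, "type", "publications")  # chosen element -> root element

print(s_compose_path("e", to_ancestor) + ".num")
print(s_compose_path("e", to_root) + ".itsdoc.num")
```

With this assumed hierarchy, the chosen element "type" is two levels below the common ancestor "paper" and three levels below the root, so the two attributes use two and three "s_compose" steps respectively.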

For the first dimension, the view that the system must generate is the following:

CREATE VIEW Dimension_n (Doc, Anc, Inf) AS (1)
SELECT e.s_compose … s_compose.itsdoc.num, (2)
e.s_compose … s_compose.num, (3)
e.content (4)
FROM Spe_Elts e
WHERE e.s_compose … s_compose.itsdoc.doc = "Name_Doc" (5)
e.s_compose … s_compose.itsdoc.belong.doctype = "Name_SL" (6)
AND e.inherit.ge_name = "Elt_m" ;

(1) Or Fact_n (Doc, Anc, Inf)

(2) Attribute 1: number of the document

(3) Attribute 2: number of the element constituting the ancestor common to all the chosen elements

(4) Attribute n: content of the chosen element

(5) If the user chose a document

(6) If the user chose a logical structure
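As a rough sketch of how such a view statement could be assembled from the template, the following Python function fills in the path expressions and the selection condition. The function name `generate_view` and its parameters are hypothetical. The values are interpolated directly into the text for illustration only, and the sketch assumes the level counts are already known.

```python
def generate_view(view_name, attrs, steps_to_root, steps_to_ancestor,
                  element_name, doc_name=None, structure_name=None):
    """Return the CREATE VIEW text for one chosen element.

    Exactly one of doc_name / structure_name must be given, mirroring
    alternatives (5) and (6) of the template.
    """
    if (doc_name is None) == (structure_name is None):
        raise ValueError("give exactly one of doc_name or structure_name")
    # One ".s_compose" per level between the chosen element and the target.
    root = "e" + ".s_compose" * steps_to_root
    ancestor = "e" + ".s_compose" * steps_to_ancestor
    if doc_name is not None:
        selection = f'{root}.itsdoc.doc = "{doc_name}"'                   # case (5)
    else:
        selection = f'{root}.itsdoc.belong.doctype = "{structure_name}"'  # case (6)
    return "\n".join([
        f"CREATE VIEW {view_name} ({', '.join(attrs)}) AS",
        f"SELECT {root}.itsdoc.num,",
        f"       {ancestor}.num,",
        "       e.content",
        "FROM Spe_Elts e",
        f"WHERE {selection}",
        f'AND e.inherit.ge_name = "{element_name}" ;',
    ])


# Element "Type", three levels below the root and two below the common
# ancestor, selected through the logical structure "publications".
print(generate_view("Dimension_1", ["doc", "paper", "type"], 3, 2, "Type",
                    structure_name="publications"))
```

A real generator would also need to validate the names it receives before placing them in a statement; that step is left out here.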

In the same way, the system must generate all the other views. We will then have four views: Dimension_1 (doc, paper, type), Dimension_2 (doc, paper, topic), Dimension_3 (doc, paper, year), and Fact_1 (doc, paper, title). For example, Dimension_1 is generated as follows:

CREATE VIEW Dimension_1 (doc, paper, type) AS
SELECT e.s_compose.s_compose.s_compose.itsdoc.num, e.s_compose.s_compose.num, e.content
FROM Spe_Elts e
WHERE e.s_compose.s_compose.s_compose.itsdoc.belong.doctype = "publications"
AND e.inherit.ge_name = "Type" ;

From these views, the system must then generate another view by a join on the first two attributes (in our example “doc” and “paper”):

CREATE VIEW Joint (type, topic, year, title) AS
SELECT d1.type, d2.topic, d3.year, f1.title
FROM Dimension_1 d1, Dimension_2 d2, Dimension_3 d3, Fact_1 f1
WHERE d1.paper = d2.paper AND d2.paper = d3.paper AND d3.paper = f1.paper
AND d1.doc = d2.doc AND d2.doc = d3.doc AND d3.doc = f1.doc ;

To generate the last view, the system must apply the aggregation operation, grouping by the dimensions:

CREATE VIEW Distribution (type, topic, year, number) AS
SELECT j.type, j.topic, j.year, count(j.title)
FROM Joint j
GROUP BY j.type, j.topic, j.year ;

Figure 9. Multidimensional Table “Distribution”

The system must then display the generated view as a multidimensional table, “Distribution.”
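To make the join and aggregation steps concrete, the sketch below replays them in Python with the standard sqlite3 module. It rests on several assumptions: the four views are replaced by flat tables, the rows are made-up sample data, and SQLite stands in for the object-relational system used by the warehouse. Only the definitions of Joint and Distribution follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Flat stand-ins for the four generated views; all rows are made-up sample data.
cur.executescript("""
CREATE TABLE Dimension_1 (doc INTEGER, paper INTEGER, type TEXT);
CREATE TABLE Dimension_2 (doc INTEGER, paper INTEGER, topic TEXT);
CREATE TABLE Dimension_3 (doc INTEGER, paper INTEGER, year INTEGER);
CREATE TABLE Fact_1 (doc INTEGER, paper INTEGER, title TEXT);

INSERT INTO Dimension_1 VALUES (1, 1, 'journal'), (1, 2, 'conference'), (2, 1, 'journal');
INSERT INTO Dimension_2 VALUES (1, 1, 'IR'), (1, 2, 'OLAP'), (2, 1, 'IR');
INSERT INTO Dimension_3 VALUES (1, 1, 2001), (1, 2, 2001), (2, 1, 2001);
INSERT INTO Fact_1 VALUES (1, 1, 'Title A'), (1, 2, 'Title B'), (2, 1, 'Title C');

-- Join on the first two attributes (doc, paper) of every view.
CREATE VIEW Joint (type, topic, year, title) AS
SELECT d1.type, d2.topic, d3.year, f1.title
FROM Dimension_1 d1, Dimension_2 d2, Dimension_3 d3, Fact_1 f1
WHERE d1.paper = d2.paper AND d2.paper = d3.paper AND d3.paper = f1.paper
  AND d1.doc = d2.doc AND d2.doc = d3.doc AND d3.doc = f1.doc;

-- Aggregation: count the titles for each combination of dimension values.
CREATE VIEW Distribution (type, topic, year, number) AS
SELECT j.type, j.topic, j.year, count(j.title)
FROM Joint j
GROUP BY j.type, j.topic, j.year;
""")

rows = cur.execute(
    "SELECT type, topic, year, number FROM Distribution ORDER BY type, topic, year"
).fetchall()

# Each row is one cell of the multidimensional table: (type, topic, year) -> number.
cells = {(doc_type, topic, year): number for doc_type, topic, year, number in rows}
for coordinates, number in cells.items():
    print(coordinates, number)

conn.close()
```

With this sample data, the two papers sharing the values (journal, IR, 2001) fall into the same cell with a count of 2, and the single (conference, OLAP, 2001) paper forms a cell with a count of 1.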

CONCLUSION

The concept of textual warehouses we propose allows the documents of a heterogeneous collection to be manipulated by their structures and their contents, unlike other systems that impose a predefined structure. Indeed, the proposed generic model is suitable for storing heterogeneous documents according to their logical structures and for applying the techniques of information retrieval (returning passages rather than whole documents), data interrogation (returning factual information), and multidimensional analysis (analyzing data along several dimensions through a graphic language that is simple for users).

Several experiments have been carried out on two aspects: first, the integration of large collections of heterogeneous documents drawn from the laboratory intranet, and second, the analysis and use of the warehouse content by several inexperienced users. Distinguishing between generic and specific structures improved the expressiveness of a large document collection for retrieving, exploiting, and analyzing its content. The graphic language is also open enough to allow any user to construct any query, even a complex one.

At present, our main goal is to continue merging the techniques developed in information retrieval and data warehousing. In particular, the specifications of the document warehouse need to be extended in order to:

• define a query language suited to the warehouse, instead of using SQL, to simplify query syntax;

• apply the multidimensional operators to textual marts either textually, according to a formalism, or graphically;

• extract statistical information and knowledge to explain user behavior and define user profiles.

Let us assume that the document warehouse is the basis for the definition of a business memory; it is intended for any person in an organization who must quickly access and analyze any useful information. This memory must contain any knowledge extracted from document content (i.e., from structure and textual parts). Our future work will aim to extend the process of textual analysis to integrate personalization criteria and metadata (supplied by the user or by an automatic process).

ACKNOWLEDGMENTS

We would like to acknowledge Mr. Franck Ravat, assistant professor at Toulouse I University, for his helpful comments and discussions on this research. We also thank Mr. Mohamed Mbarki (Master's student) for his contribution to the implementation of the system.

APPENDIX
