A Source Centric Temporal Model


Dario Colazzo (1), François-Xavier Dudouet (2), Ioana Manolescu (3), Benjamin Nguyen (4)

(1) LRI, Université de Paris Sud, France, [email protected]
(2) CNRS IRISES UMR 7170, France, [email protected]
(3) INRIA Saclay–Île-de-France, France, [email protected]
(4) Université de Versailles-Saint-Quentin, France, [email protected]

ABSTRACT

Research in social sciences typically involves collecting and analyzing a corpus of information annotated with time and source information. Such applications need to model when a given fact has happened, according to whom it has happened, and also when a given source claimed a given fact; they may need to handle unknown, but constrained (bounded) moments in time.

We present a generic source-centric temporal model, suited to the information modelling needs of social sciences. We present the conceptual model, a possible XML representation, and demonstrate its usefulness on data and queries of a real application. We explain the advantages of our model over those currently employed in temporal, probabilistic, and provenance databases.

Rationale

“My second purpose today is to provide you with additional information, to share with you what the United States knows about Iraq’s weapons of mass destruction as well as Iraq’s involvement in terrorism.” Colin Powell addressing the UN Security Council, February 5, 2003

“There were some people in the intelligence community who knew at the time that some of those sources were not good, and shouldn’t be relied upon, and they didn’t speak up. That devastated me.” Colin Powell on ABC News, in 2005, on his address to the UN Security Council

1. INTRODUCTION

Research in social sciences often involves building and analyzing a database of facts relevant for a given area of study, out of diverse sources ranging from official U.N. documents to Web sites such as Wikipedia. Time is an essential dimension of such facts, and temporal databases have been intensively studied. However, such databases [9] mainly consider two notions: transaction time and validity time. Transaction time is simply the moment the information is stored in the database, and validity time is the time (usually a duration) during which this information is assumed correct. Any other time dimension is modelled as user-defined time and its management is not addressed. As our examples show, social science applications need more sophistication in the management of time, in particular to model who knew what when, that is, intertwining source and time dimensions.

1.1 Overview

As databases grow more powerful, and larger volumes of data fall in the scope of automatic processing, many sources of information, potentially contradictory and incomplete, become available. For instance, think of what media sources report on complex events such as political or financial crises, armed conflicts etc. Moreover, users have come to expect the database system itself to construct (or to assist the user in constructing) a cleaned-up, coherent sequence of events out of incomplete and/or contradictory data, when this is possible. For instance, based on a subscription to a financial news stream, one could track what was known about a given company and its business area, to whom, and when, in the days before a suspected insider trading operation. In an investigation, such information may be complemented e.g. with data about phone calls between main actors, in order to infer passing of information from one to another. Again, time is of crucial importance here.

One way of handling imprecision in databases (including imprecision on time intervals) is to assign probabilities to various facts and to ask queries, and return answers, with specific probabilities [5]. This approach is well suited to some applications, but is not natural in others, most notably in social sciences. Consider for instance a study of an armed insurrection. A database about the conflict would include information from local politicians, from local and foreign newspapers, reports from foreign diplomatic personnel, news agency feeds etc. Thus, it may state that according to newspaper 1, 10,000 soldiers were moving from A to B on day D. Possibly, according to government sources, only 200 rebels had left A on day D. A social scientist needs both facts, although they partially contradict each other, and it cannot even be established (yet) if they are about the same group of people. Invaluable to the scientist is the notion of who made a given statement: a local newspaper and government sources, respectively. This resembles the notion of provenance [2, 3, 1, 4], but with some twists: (i) we may want to model nested sources, e.g. according to newspaper 2, which cites a government source..., and (ii) time is also essential in modelling the transmission of information from one source to another: when did the government source tell this to newspaper 2? In this setting, assigning numerical probabilities to facts and reasoning on these probabilities is unsuited. In contrast, social scientists may ask questions such as: how much of the government information was repeated by newspapers 1 and 2 on day D? How did this change later? Which were the sources of newspaper 2? Who published information that did not originate with the government?

Figure 1: Sample instance of our generic model. (Nodes shown: the fact f1; the timed facts tf1, from 03/2005 to now, and tf2, from 01/2004 to 12/2005; the sourced facts sf1 by Liz, from 01/2008 to now, sf2 by Tom, from 04/2008 to now, and sf3 by Ann, from $t1 to now.)

1.2 Motivating example

In prior work, the social scientist involved in this paper has built a (relational) database analysing the evolution since 1921 of the licit production of many drugs, basing his study mainly on information provided by the United Nations [6]. This database was overall coherent, i.e. it did not include contradictory information such as e.g. morphine manufacture in France in 1930 was X and morphine manufacture in France in 1930 was Y for X ≠ Y. Coherency was ensured (i) by relying mostly on a single source of information, which was appropriate in this context, and (ii) by the scientist manually resolving contradictions by suppressing less trusted information.

When extending the study to illicit production, the situation is much more complicated. Clearly, there are no consensual figures! Various agencies publish conflicting estimates. This information can be relayed by websites citing their sources, but sometimes modifying the figures. The original model used for licit drug production could not handle these issues. Observe that in this context, inconsistencies are unavoidable, and they are not “errors to fix”. On the contrary, they focus the sociologist’s attention on controversial issues, to understand why sources disagree.

The following (real) example illustrates the dimensions of sourced, timed information that need to be captured in our model. In January 2008, the scientist F-X.D. finds on the Wikipedia page on opium production in Afghanistan [15]: “Afghanistan saw a bumper opium crop of 4,600 metric tons in 1999”, citing the source: “United Nations (2004-11-18). Press conference on Afghanistan opium survey 2004. Press release. Retrieved on 2006-01-14.” The following information items need to be captured from this snippet: (i) a timed fact (opium production in 1999 is 4,600t); (ii) a source with a publication date (UN Press release 2004-11-18); (iii) the time Wikipedia became aware of the information published by the UN (2006-01-14); (iv) the time the sociologist became aware of all this (2008-01). This information contradicts some previous information gathered in 2002 by F-X.D. from US governmental sources dated 2001-10-03, which state a production of 2861t.

In this article, we propose a novel, generic time and source based model, capable of answering the needs of any application dealing with inconsistent temporal information. In this context, strongly-structured data (e.g. tables) seemed too restrictive; therefore, we settled for a very generic model in the style of RDF [14], to which we add (i) time and (ii) source, which we model using beliefs (also annotated with time). Based on concrete applications we studied, we propose an XML serialization of our model, and demonstrate its interest by querying a sample dataset of real sociological data.

In the following, Section 2 describes the model, whereas Section 3 presents a practical implementation based on XML and XQuery. We then survey related work and conclude.

2. MODEL

In this section we introduce our model, in a simple object-oriented style. We then describe a concrete representation based on XML.

2.1 Conceptual model

Facts represent any kind of information needed by the application. For maximal generality, we do not constrain fact structure; one can see a fact as an arbitrary RDF statement (which, of course, may refer to other RDF statements). By convention, we use f (possibly with subscripts) to denote facts. Timed facts are (f, s, e) tuples, where f is a fact, and s, e are the start and end dates, respectively. Timed facts will be denoted by tf, possibly with subscripts. Finally, a sourced fact is a tuple (w, (tf | b′), s, e), where w (who) denotes the entity (person, organization, information source etc.) which states the timed fact tf or the sourced fact b′, and s, e are the start and end moments of the sourced fact, i.e. they capture when and for how long a given entity has known (maintained) a given timed fact. Observe that in practice a source may be associated in a variety of ways to a given timed fact: a television station may broadcast the fact, a journal may print it, a person may state it over the phone, or the social scientist may record it in the database simply because he believes it. One could say that the timed fact holds according to the source. We use the generic term sourced fact to uniformly denote these possibilities. The structure of source entities is unconstrained.

Clearly, it would have been possible to model timed facts as facts (since facts can have any property, including start and end), just as it is possible to model sourced facts as particular cases of timed facts. We distinguish these three notions in our model in order to bring it close to the social scientist using the database.

We illustrate this model through examples, which can be traced on Figure 1. In this figure, boxes represent facts and timed facts, whereas other information items are shown in oval nodes.

Consider the statement “John works for ACME Corp.”. We model this as a fact f1 (with some internal structure, that we do not consider further).

Now, consider the statement: “John has been working for ACME Corp. since 03/2005”. This is modelled as a timed fact tf1, such that tf1.f = f1, tf1.s = 3/2005, and tf1.e = now. Here, we borrow the now symbol well-known in temporal databases, denoting that the fact is assumed valid until it is explicitly invalidated.

We now consider the statement: “Liz has known since January 2008 that John had been working for ACME Corp. since 03/2005.” We model this as a sourced fact sf1, such that sf1.w = Liz, sf1.tf = tf1, sf1.s = 01/2008 and sf1.e = now. Now, assume Tom learns in 04/2008 what Liz believes. Then, Tom’s knowledge can be modelled as sf2 such that sf2.w = Tom, sf2.b = sf1, sf2.s = 04/2008 and sf2.e = now. Observe that while Liz’s sourced fact concerns a timed fact, Tom’s belief concerns the sourced fact of Liz.
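The three notions introduced so far can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class names, the field name about (standing for either tf or b), the string encoding of dates, and the helper chain are hypothetical and not part of the model definition.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Union

NOW = "now"  # symbolic end of a still-running interval


@dataclass(frozen=True)
class Fact:
    descr: str  # fact structure is unconstrained; a plain string is assumed here


@dataclass(frozen=True)
class TimedFact:
    f: Fact
    s: str  # start: a date such as "03/2005" or a time variable such as "$t1"
    e: str  # end: a date, a time variable, or NOW


@dataclass(frozen=True)
class SourcedFact:
    w: str                                  # who: the source entity
    about: Union[TimedFact, SourcedFact]    # the timed fact tf or the sourced fact b
    s: str
    e: str


def chain(sf):
    """Walk from a sourced fact down to its timed fact.

    Returns the list of sources met on the way (outermost first)
    and the timed fact at the bottom of the nesting.
    """
    sources = []
    node = sf
    while isinstance(node, SourcedFact):
        sources.append(node.w)
        node = node.about
    return sources, node


# The running example: f1, tf1, Liz's sourced fact sf1, Tom's sourced fact sf2.
f1 = Fact("John works for ACME Corp.")
tf1 = TimedFact(f1, "03/2005", NOW)
sf1 = SourcedFact("Liz", tf1, "01/2008", NOW)
sf2 = SourcedFact("Tom", sf1, "04/2008", NOW)
```

Calling chain(sf2) walks through Tom's and then Liz's sourced facts before reaching tf1, which mirrors the nesting of sources described above.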

Sourced facts may disagree. For instance, assume Ann knows that John has worked for ACME Corp. from 01/2004 to 12/2005 (this contradicts what Liz thinks). To model this, we start by introducing a timed fact tf2 such that tf2.f = f1, tf2.s = 01/2004, and tf2.e = 12/2005. Then, we introduce sf3 such that sf3.w = Ann, sf3.tf = tf2, sf3.s = $t1 and sf3.e = now. Observe that we do not know when Ann started to believe in the timed fact tf2, but we know she still believes it now. Therefore, we introduce a time variable $t1, recording that it is before the current time: $t1 < tcrt, where tcrt is a concrete time moment obtained by calling the system’s time function. Contrast tcrt, a concrete moment, with now, which corresponds to a still-running time interval.

Finally, consider the statement “John has worked for ACME Corp.”. We model this as a timed fact tf3 (Figure 2), based on the same fact f1, but with the start date $t2 and the end date $t3, such that $t2 < $t3 < tcrt. This captures the natural constraint between the start and end of a fact, as well as the relationship between that interval and the moment when we make the statement. For instance, if Max was hired by XYZ in 2000, then the start date of his working there is t4 such that 12/1999 < t4 < 1/2001.

In summary, our model consists of: (i) facts; (ii) timed facts; (iii) timed sourced facts; and (iv) time variables constrained by (in)equality predicates.

2.2 Handling time variables

We have seen that time variables allow a flexible modelling of unknown and/or imprecise time information. A few aspects deserve further discussion.

Occurrence. Time variables may play the role of start and end dates for timed and sourced facts. As explained above, in our model, a start or end date belongs to exactly one timed or sourced fact. In the same style, a time variable can appear exactly once as a start or end date in our database. If we need to assign the same unknown start date to two facts, we use two variables, and separately add an equality predicate.

Figure 2: Model instance with constrained variables. (Nodes shown: the fact f1; the timed fact tf3 with start $t2 and end $t3, under the constraint $t2 < $t3 < 04/2008; the sourced fact sf4 by Ann, from 01/2008 to now.)

Coherence. (In)equality predicates (more generally called constraints) among variables, or among variables and precise time moments, make up a directed graph structure. We say the graph is coherent if it does not contain cycles among constant time moments. For instance, the graph G1 = ({$t2, $t3}, {$t2 ≤ $t3, $t3 < 01/2008}) is coherent. If we add the constraint 03/2008 < $t2, we obtain the graph G2 which is incoherent, since it contains both 01/2008 < 03/2008 (by the normal interpretation of time values) and 03/2008 < 01/2008 (via $t2 and $t3). Observe that if we add to G1 the constraint $t3 ≤ $t2, the resulting graph G3 is still coherent; according to this graph, $t2 = $t3, which is not per se an error.
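The coherence test on G1, G2 and G3 can be sketched as follows. This is a hedged illustration, not the implementation used in our system: the encoding of constraints as (left, op, right) triples, of time moments as (year, month) tuples, of variables as strings, and the function names are all assumptions made for the example. Following the definition above, only contradictions between constant time moments are reported; a strict cycle involving variables only is not checked here.

```python
def is_variable(x):
    # Assumed encoding: variables are strings such as "$t2",
    # constant time moments are (year, month) tuples.
    return isinstance(x, str)


def reachable(start, constraints):
    """Map each node reachable from start to a strictness flag.

    The flag is True if some path from start to the node uses at least
    one strict edge (<), and False if only <= edges can be used.
    """
    succ = {}
    for a, op, b in constraints:
        succ.setdefault(a, []).append((b, op == "<"))
    strict_to = {}
    seen = set()
    stack = [(start, False)]
    while stack:
        node, strict = stack.pop()
        if (node, strict) in seen:
            continue
        seen.add((node, strict))
        for nxt, edge_strict in succ.get(node, []):
            s = strict or edge_strict
            strict_to[nxt] = strict_to.get(nxt, False) or s
            stack.append((nxt, s))
    return strict_to


def is_coherent(constraints):
    """Check that no path between two constants contradicts their natural order."""
    constants = {n for a, _, b in constraints for n in (a, b) if not is_variable(n)}
    for c1 in constants:
        for node, strict in reachable(c1, constraints).items():
            if is_variable(node):
                continue
            # The graph implies c1 < node (strict path) or c1 <= node.
            if strict and not c1 < node:
                return False
            if not strict and not c1 <= node:
                return False
    return True


# The three graphs discussed above.
G1 = [("$t2", "<=", "$t3"), ("$t3", "<", (2008, 1))]
G2 = G1 + [((2008, 3), "<", "$t2")]
G3 = G1 + [("$t3", "<=", "$t2")]
```

With this encoding, is_coherent accepts G1 and G3 and rejects G2, because the path from 03/2008 to 01/2008 through $t2 and $t3 contradicts the natural order of the two dates.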

Resolution. Assume the global constraint graph G is coherent. We say a time variable $t resolves to an interval (s, e) based on G, where s is a concrete time moment, and e is either a concrete time moment or the now symbol, if according to G, (i) s ≤ $t and s is the latest time moment for which this constraint is supported by G, and (ii) $t ≤ e, and e is the earliest time moment for which this constraint is supported by G. We denote this by $t ↓G (s, e). Intuitively, resolution seeks to frame each time variable inside the most precise interval that can be found according to G. For instance, in the graph G1 above, $t2 ↓G1 (tmin, 01/2008). In this formula, tmin is the earliest time moment representable in the system, and we similarly assume a tmax which is the latest representable moment. Now assume we add to G1 the constraints $t2 < 01/2005 < $t3, which leads to the graph G4. Then, $t2 ↓G4 (tmin, 01/2005), and $t3 ↓G4 (01/2005, 01/2008).

Handling incoherence. As long as new facts and beliefs are added to the database, only fresh time variables can be used, therefore there is no risk of introducing cycles in the time order graph. However, adding explicit constraints may lead to an incoherent graph such as G2 above. Two main approaches for handling incoherence can be pursued.

One possibility is to reject the addition of constraints that, together with the existing ones in the database, may lead to an incoherent graph. In this case, the system could return to the user the potential inequality loop as an explanation of why the new constraint is rejected.

The other possibility is to allow adding any constraint, regardless of whether it introduces cycles or not. Then, we need to record the viewpoint from which the constraint is considered valid. This allows users that do not trust the newly added constraint to keep their view of the world coherent, if they so choose. Thus, explicit constraints are modelled by a timed fact such as e.g. tfx($tw < $tz, tcrt, now), and a sourced fact such as sfy(tfx, Ann, tcrt, now). This way, if each user keeps his view of the world coherent (i.e., does not introduce mutually contradicting constraints), resolution is still possible restricted to the user’s constraints. We would write e.g. $t ↓G,{Liz} (s, e) to say that the time variable $t resolves to (s, e) according to Liz’s constraints in the graph G. It is of course possible to define resolution based on the constraints of more than one user, e.g. ↓G,{Liz,Tom} etc. Clearly, the second approach still allows a given entity to enter constraints incoherent among themselves.

    DB                f  ::= ε | fact[descr[any], tf], f
    Timed facts       tf ::= ε | t-fact[t, sf], tf
    Time              t  ::= time[start[χ], end[χ]]
    Sourced facts     sf ::= ε | s-fact[who[any], t, sf], sf
    Graph constraint  G  ::= ε | constr[χ θ χ], G

Figure 3: XML representation in BNF form

Both approaches have interesting potential. Coherence preserving is preferable for databases built and used mostly by a given entity. However, source proliferation may lead to global inconsistencies. Thus, we believe the approach that makes most sense for the users is (i) keeping each entity’s constraints coherent, (ii) allowing incoherent constraints issued from different sources, (iii) resolving time variables inside coherent fragments of the database only. From now on, without loss of generality, we will focus on a single coherent graph G, which (depending on the application) may stand for a restricted part of the larger, incoherent graph.
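Resolution, optionally restricted to the constraints of chosen sources, can be sketched as below. The sketch rests on several assumptions of ours: constraints are (left, op, right, source) tuples, time moments are (year, month) tuples, T_MIN and T_MAX are arbitrary stand-ins for tmin and tmax, and the attribution of constraints to Liz and Ann is invented for the example. The returned bounds are the latest constant below and the earliest constant above the variable; the sketch does not distinguish < from ≤ and assumes the selected constraints are coherent.

```python
T_MIN, T_MAX = (1, 1), (9999, 12)  # assumed earliest and latest representable moments


def is_variable(x):
    # Assumed encoding: variables are strings, constants are (year, month) tuples.
    return isinstance(x, str)


def _reach(start, edges):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def resolve(var, constraints, sources=None):
    """Frame var in the tightest (s, e) interval supported by the constraints.

    If sources is given, only constraints stated by one of these sources
    are taken into account.
    """
    used = [c for c in constraints if sources is None or c[3] in sources]
    up, down = {}, {}
    for a, _, b, _ in used:
        up.setdefault(a, []).append(b)    # from a, move towards later moments
        down.setdefault(b, []).append(a)  # from b, move towards earlier moments
    lower = [n for n in _reach(var, down) if not is_variable(n)]
    upper = [n for n in _reach(var, up) if not is_variable(n)]
    return max(lower, default=T_MIN), min(upper, default=T_MAX)


# G1 and G4 from the Resolution paragraph; the sources are hypothetical.
G1 = [("$t2", "<=", "$t3", "Liz"), ("$t3", "<", (2008, 1), "Liz")]
G4 = G1 + [("$t2", "<", (2005, 1), "Ann"), ((2005, 1), "<", "$t3", "Ann")]
```

Here resolve("$t2", G4) frames $t2 between T_MIN and 01/2005, whereas resolve("$t2", G4, sources={"Liz"}) ignores Ann's constraints and falls back to the looser bound 01/2008.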

2.3 XML representation

Observe that in our model, as illustrated by Figure 1, all the (nested) sourced facts related to the same fact are organized in a hierarchy on top of the fact. A natural representation (serialization) of such data is an XML tree inverting the hierarchy, where the fact is at the root, and including all sourced facts referring (directly or indirectly) to this fact as children (or descendants) of that fact. The following considerations help ensure that the hierarchy of timed and sourced facts rooted at a given fact (again, consider Figure 1 bottom-up) can be represented as a tree. (i) Each sourced fact is about a single timed or sourced fact. (ii) Each start and end date is a separate node in our model. (iii) If the same entity, say Liz, is involved in more than one sourced fact, our methodology recommends modelling the entity separately and using references to the entity whenever needed in the sourced fact hierarchy. Clearly, instead of representing plain, timed, and sourced facts in the same document, one could store facts separately and have sourced facts reference them. We use a single document for simplicity.

We characterize an XML database of timed and sourced facts according to the BNF grammar in Figure 3. We use ε to indicate the empty sequence, the symbol "," to denote sequence composition, χ as a metavariable ranging over time variables $t1, $t2, . . . plus date values of the form dd/mm/yyyy plus the special value now, and θ ranging over the comparison symbols <, ≤.

A database is a pair <f, G> where f is a sequence of facts, while G is a conjunction of time constraints.

Each fact in f is an element containing some textual description descr plus some temporal annotation. The latter consists of a sequence of t-fact elements, each one containing either a simple time interval time[start, end], where both start and end are dates, or a time interval plus a list of sourced facts. This last case covers the situation where several sourced facts refer to the same timed fact. Sourced facts sf are lists of s-fact elements, each one containing information about: its source (given by an element who), the time interval information about the sourced fact itself, and some other possible sourced facts that may refer to it.

Each constraint constr[χ θ χ] in G both defines time variables, and constrains such variables. The database <f, G> is well-formed if each variable used in f is mentioned (defined) in a constraint in G.
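As an illustration of the grammar of Figure 3, the following sketch builds the XML tree for the running example with Python's standard xml.etree.ElementTree module and checks the well-formedness condition. The helper names, the mm/yyyy date strings taken from the examples of Section 2.1, the encoding of G as (left, op, right) triples, and the constant 04/2008 standing for tcrt are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def time_el(start, end):
    """time[start[x], end[x]]"""
    t = ET.Element("time")
    ET.SubElement(t, "start").text = start
    ET.SubElement(t, "end").text = end
    return t


def s_fact(who, start, end, nested=()):
    """s-fact[who[any], t, sf]: nested holds sourced facts referring to this one."""
    sf = ET.Element("s-fact")
    ET.SubElement(sf, "who").text = who
    sf.append(time_el(start, end))
    sf.extend(nested)
    return sf


def t_fact(start, end, sourced=()):
    """t-fact[t, sf]"""
    tf = ET.Element("t-fact")
    tf.append(time_el(start, end))
    tf.extend(sourced)
    return tf


def fact(descr, timed=()):
    """fact[descr[any], tf]"""
    f = ET.Element("fact")
    ET.SubElement(f, "descr").text = descr
    f.extend(timed)
    return f


def well_formed(facts, constraints):
    """True if every time variable used in the facts is mentioned in a constraint."""
    used = {el.text for f in facts for el in f.iter()
            if el.tag in ("start", "end") and el.text.startswith("$")}
    defined = {x for a, _, b in constraints for x in (a, b) if x.startswith("$")}
    return used <= defined


# The instance of Figure 1: Liz and Tom on top of tf1, Ann on top of tf2.
db = fact("John works for ACME Corp.", [
    t_fact("03/2005", "now", [
        s_fact("Liz", "01/2008", "now", [s_fact("Tom", "04/2008", "now")]),
    ]),
    t_fact("01/2004", "12/2005", [s_fact("Ann", "$t1", "now")]),
])
G = [("$t1", "<", "04/2008")]  # 04/2008 stands for tcrt here (assumed value)
```

The fact sits at the root and each sourced fact is a descendant of the timed or sourced fact it is about, so Tom's s-fact is nested inside Liz's. Serializing with ET.tostring(db, encoding="unicode") yields the corresponding XML text.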

2.4 Database evolution

We have so far explained how a given corpus of information can be captured in our model. We now consider the evolution (or dynamics) of such a database with time.

In our applications, arbitrary XML update statements are potentially dangerous (because they may violate our model, such as e.g. editing a sourced fact to erase the fact it refers to!), and not very useful to social scientists, because the latter are not query experts, and view their database more in the style of our conceptual model. Therefore, a database evolves by applying operations from the following set:

Adding a fact is always possible.

Adding a timed fact requires that the underlying simple fact already exists in the database. Before performing the actual insertion, the system presents to the user the existing timed facts related to the same fact (if any), for information.

Adding a sourced fact requires that the sourced or timed fact on which the new one is based exists already in the database. Similarly, prior to the addition, the system presents to the user the existing sourced facts regarding the same timed fact, including the respective sources.

Updating a now value (the end date of some type of fact) is possible in order to replace it with a precise time value. Observe that this kind of editing does not change the graph of time constraints, since now values cannot participate in constraints (only time variables can).

Authoritative source propagation refers to a kind of trigger-based mechanism. Assume a source s is considered very trustworthy by a given user u. Then, she may want sourced facts from s, regarding any timed fact t, to be copied into sourced facts from u, regarding that timed fact. This can be seen as a mechanism for cascading insertions.

Deletions are possible as long as they do not violate the model integrity, i.e. it is not possible to delete a fact f and preserve a timed fact tf based on f , and similarly for sourced facts based on timed facts.

Lacking a standard XML language (and associated tools) for specifying the kind of active behaviour mentioned above, we embed such logic in the Web-based application interface.
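The guarded operations listed above can be sketched as a small in-memory class. This is a hypothetical illustration of the integrity rules only, not the code of our Web-based application: the class and method names, the identifier-based storage, and the choice to return the related existing facts (rather than display them before insertion) are assumptions. Authoritative source propagation and deletion of timed or sourced facts are left out for brevity.

```python
class ModelError(Exception):
    """Raised when an operation would violate the model integrity."""


class Database:
    def __init__(self):
        self.facts = set()   # fact identifiers
        self.timed = {}      # timed-fact id -> (fact id, start, end)
        self.sourced = {}    # sourced-fact id -> (who, target id, start, end)

    def add_fact(self, fid):
        # Adding a fact is always possible.
        self.facts.add(fid)

    def add_timed_fact(self, tid, fid, start, end):
        if fid not in self.facts:
            raise ModelError(f"unknown fact {fid!r}")
        # Existing timed facts on the same fact, returned for information.
        existing = [t for t, (f, _, _) in self.timed.items() if f == fid]
        self.timed[tid] = (fid, start, end)
        return existing

    def add_sourced_fact(self, sid, who, target, start, end):
        if target not in self.timed and target not in self.sourced:
            raise ModelError(f"unknown timed or sourced fact {target!r}")
        existing = [s for s, (_, t, _, _) in self.sourced.items() if t == target]
        self.sourced[sid] = (who, target, start, end)
        return existing

    def close_now(self, tid, end):
        # Replace a "now" end date by a precise time value.
        fid, start, old_end = self.timed[tid]
        if old_end != "now":
            raise ModelError(f"{tid!r} does not end with now")
        self.timed[tid] = (fid, start, end)

    def delete_fact(self, fid):
        # A fact cannot be deleted while a timed fact is based on it.
        if any(f == fid for f, _, _ in self.timed.values()):
            raise ModelError(f"timed facts still refer to {fid!r}")
        self.facts.discard(fid)


db = Database()
db.add_fact("f1")
db.add_timed_fact("tf1", "f1", "03/2005", "now")
shown = db.add_timed_fact("tf2", "f1", "01/2004", "12/2005")  # ["tf1"]
db.add_sourced_fact("sf1", "Liz", "tf1", "01/2008", "now")
```

In this sketch, trying to delete f1 raises ModelError because tf1 and tf2 are based on it, which corresponds to the deletion rule stated above.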

3. IMPLEMENTATION

In this section, we describe our experience with a proof-of-concept implementation, based on a real database. Our experimentation consists of (i) an implementation of our model using XML Schema, (ii) a manually built XML database, based on facts found on the Web and on information from [6], and (iii) a set of XQuery queries on this database.

    declare namespace tm="http://tmpURI/temp.xsd";
    declare namespace sourceData="http://tmpURI/opium.xsd";
    <results>{
      let $doc := doc("opium.xml")
      let $afghaID := $doc//sourceData:institution[sourceData:name="Afghanistan"]/@id
      for $prod in $doc//sourceData:production
      for $fact in $doc//tm:fact
      let $year := $fact//tm:timedFact/tm:timeValue/tm:start/tm:value
      where $prod/@id = $fact/tm:factIDREF
        and $year = 1999
        and $prod/sourceData:substance = "opium"
        and $prod/sourceData:productorInstitution = $afghaID
      return
        <result value="{xs:integer($prod/sourceData:quantity/sourceData:value)}"
                year="{xs:gYear($year)}" />
    }</results>

    <results>
      <result value="4565" year="1999"/>
      <result value="2861" year="1999"/>
      <result value="4600" year="1999"/>
    </results>

Table 1: Initial query formulation and results.

    let $firstSource := $fact//tm:timedFact/tm:sfacts/tm:sfact/tm:source/tm:sourceIDREF
    let $firstSourceName := $doc//sourceData:institution[@id=$firstSource]/sourceData:name
    let $sourceYear := $firstSource/../../tm:timeValue/tm:start

    <results>
      <result value="4565" year="1999">
        <source name="UNODC" year="2003"/>
      </result>
      <result value="2861" year="1999">
        <source name="US Gov" year="2001"/>
      </result>
      <result value="4600" year="1999">
        <source name="UN Press Briefing" year="2004"/>
      </result>
    </results>

Table 2: Modified query and results.

Data. Our data is a subset of the database built in [6] concerning production of drugs, complemented with information found on the Web. The data also contains information on the people involved in the arenas of international drug control, and various others such as the database operators. In this section, we consider only the production of opium in Afghanistan, from 1999 to 2007. These years are interesting because there are no consensual facts.

    declare function local:source($s as element(tm:sfact)) as element(source) {
      let $n := doc("opium.xml")//sourceData:institution
                  [@id=$s/tm:source/tm:sourceIDREF]/sourceData:name
      let $sy := $s/tm:timeValue/tm:start
      return
        if ($s/tm:sfacts/tm:sfact) then
          <source value="{xs:token($n)}" year="{xs:token($sy)}">
            {local:source($s/tm:sfacts/tm:sfact)}
          </source>
        else
          <source value="{xs:token($n)}" year="{xs:token($sy)}"/>
    };

Table 3: Source analysis function.

XML Schema. We have produced two schemas. The first, that we reference using the namespace tm, is a direct translation of our data model. For the sake of conciseness, our examples use year granularity to model time (the actual schema uses day-level granularity). Unknown temporal values and the now concept are captured using constants. The variables themselves are of type xs:ID, in order to allow global constraints. A variable may have several constraints. This schema is generic in so far as facts can be of any type. Our second schema, that we reference using the namespace userData, was built using an external reference to the first schema in the following way:

The fact database schema can be any XSD, but each element that we want to refer to, using either time or sourcing, must contain an id attribute of type xs:ID. By adding to the userData schema an element tm:root, under the root of the document element, we are able to manage both non-temporal and sourced, non-temporal information in the same XML document. Arbitrary simple facts (composed of just a string) are legal tm:factText elements.

We can add to each tm:fact one or more dates, which will temporalise the information. Each temporalised piece of information can then become a tm:sfact (sourced fact) element, having a source (text or xs:IDREF) and associated temporal information. Note that the term sfact covers the whole scope of sourcing defined in this paper, such as according to, believes, knows, says, published on, etc.

Queries. Consider the query in Table 1, listing production values for 1999 according to different sources. As we can see, these values are contradictory, although rather close. To better understand these differences, we modify the query to also return the first-level source, as shown in Table 2: we construct the source name and year to be returned in a new element. Observe that this query still does not allow understanding or interpreting the inconsistencies. To that purpose, we need to recursively query the data, which is achieved by introducing a function, shown in Table 3.

The results in Table 4 capture the example of Section 1.2, and we can see another source of information was also added. Such a query could for instance show mistakes made by sources when quoting each other. In our case, the 2003 report from UNODC (part of the UN) is quoted approximately in the UN Press Briefing. Rather than quoting the exact 2003 figure, Wikipedia only quotes the briefing.

    <results>
      <result value="4565" year="1999">
        <source value="UNODC" year="2003">
          <source value="F-X.D." year="2008"/>
        </source>
      </result>
      <result value="2861" year="1999">
        <source value="US Gov" year="2001">
          <source value="F-X.D." year="2002"/>
        </source>
      </result>
      <result value="4600" year="1999">
        <source value="UN Press Briefing" year="2004">
          <source value="Wikipedia" year="2006">
            <source value="F-X.D." year="2008"/>
          </source>
        </source>
      </result>
    </results>

Table 4: Sourced query results.

Many other sorts of queries relevant to sociologists are supported by the database, but omitted for lack of space. They mainly focus on finding conflicts, or aggregating information along the time dimension.

Resolution. We need to specify how the built-in resolution mechanism is handled, outside off-the-shelf XML query engines (such as Saxon in our study). From the variables and constraints in the database instance, we construct an in-memory data structure which eagerly performs resolution upon each database update. An external function resolve($in as xs:string), implemented in Java, exploits the in-memory structure to resolve the variable whose name is $in into a possibly more precise interval. By default, resolve calls are added on top of all returned time values.

Complexity. Adding the first-level sourcing costs us O(1), since given our schema, each of the additional let operations generates only one node. No joins are generated despite the fact that we separate data from sourced facts. Complexity of the recursive function is O(n) where n is the total number of sources related to the fact. Adding multi-level sourcing and time therefore incurs minimal computational complexity, since it is equal to the output complexity. XQuery lets us directly query this structure. We leave source-centric temporal schema-aware query optimization for future work.

4. RELATED WORK

Most work on temporal XML data focuses on the use of transaction and validity time intervals in order to model data evolution [12, 13, 11]. The same approach has been considered in some recent work aiming at adding and managing temporal information in RDF data [8]. In our context, we provide a data model where the nature of time intervals is identified by the sourced fact the time interval is associated to. We capture the transaction-validity approach in a simple way: transaction time intervals are those associated to a distinguished source ’SYSTEM’, while validity time intervals are those associated to the distinguished source ’US’.

The authors of [7] propose a multidimensional temporal XML model that can be used to do multiple versioning of documents not only with respect to time but also to other dimensions, such as language, degree of detail, etc. In any case, concerning the temporal dimension, only transaction-validity time is considered.

Another approach which is connected to ours is presented in [10], which proposes the use of time variables and constraints in relational databases in order to model absolute, relative, imprecise and infinite temporal information; the work also proposes a query algebra for temporal tables and deals with some issues related to query answering. Our work is in some sense a transposition of [10] to XML data, and we expect that in this different context new challenges will arise when studying issues related to query answering.

5. CONCLUSION AND PERSPECTIVES

In this article, we have presented a new semi-structured model to manage sociological sourced temporal data, and shown the viability of its implementation in XML. Future work mainly includes the optimization of the storage and querying of this data and, from the sociologists’ point of view, constructing and querying a large database of data retrieved from the Web using this model.

6. REFERENCES

[1] O. Biton, S. Cohen-Boulakia, S. B. Davidson, and C. Hara. Zoom*userviews: Querying relevant provenance in workflow systems. In VLDB, 2007.

[2] R. Bose and J. Frew. Lineage retrieval for scientific data processing: A survey. ACMCS, 37(1), 2005.

[3] P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In SIGMOD, 2006.

[4] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB Journal, 12(1), 2003.

[5] N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.

[6] F.-X. Dudouet. Le contrôle international des drogues, 1921-1999 (International Drug Control). PhD thesis, Université Paris X, 2002.

[7] M. Gergatsoulis and Y. Stavrakas. Representing changes in XML documents using dimensions. In XSym, 2003.

[8] C. Gutierrez, C. Hurtado, and A. Vaisman. Introducing time into RDF. TKDE, 19(2), 2007.

[9] C. S. Jensen. Temporal database management. Dr. Techn. Thesis, 2000. In Chapter Introduction to Temporal Research.

[10] M. Koubarakis. Representation and querying in temporal databases: the power of temporal constraints. In ICDE, 1993.

[11] A. Mendelzon, F. Rizzolo, and A. Vaisman. Indexing temporal XML documents. In VLDB, 2004.

[12] F. Wang and C. Zaniolo. XBiT: An XML-based bitemporal data model. In ER Conference, volume 3288 of LNCS. Springer, 2004.

[13] F. Wang, C. Zaniolo, and X. Zhou. Temporal XML? SQL strikes back! In TIME Conference, 2005.

[14] W3C RDF web page. http://www.w3.org/RDF/.

[15] Opium production in Afghanistan. http://en.wikipedia.org/wiki/Opium_production_in_Afghanistan, 2008.
