RDF Archiving - Storing Evolving Data - Storing and querying evolving knowledge graphs on the w

Storing Evolving Data

3.2.2. RDF Archiving

Linked Open Datasets typically change over time [38], creating a need for maintaining the history of the datasets [37]. Hence, RDF archiving has been an active area of research over the last couple of years. In the domain of non-RDF graph databases, several graph database extensions exist. These extensions are either wrapper-based [58, 59], which leads to sub-optimal querying due to the lack of indexing, or they are based on changing the graph model [60, 61], which complicates the writing of queries. Furthermore, none of the existing non-RDF graph stores offer native versioning capabilities at the time of writ-ing. We therefore only discuss RDF archiving for the remainder of this section.

Fernández et al. formally define an RDF archive [62] as follows: An RDF archive graph A is a set of version-annotated triples. Where a version-annotated triple (s, p, o):[i] is defined as an RDF triple (s, p, o) with a label i ∈ N representing the version in which this triple holds. The set of all RDF triples [3] is defined as (U ∪ B) × U × (U ∪ B ∪ L), where U, B, and L, respectively represent the disjoint, infinite sets of URIs, blank nodes, and literals. Furthermore, an RDF version of an RDF archive A at snapshot i is the RDF graph A(i) = {(s, p, o)|(s, p, o):[i] ∈ A}. For the remainder of this article, we use the no-tation V_i to refer to the RDF version A(i).

The DIACHRON data model [39] introduces the concept of diachronic datasets, i.e., datasets that contain diachronic entities, which are semantic entities that evolve over time. This data model formally defines a diachronic dataset as a set of dataset versions

set of records (i.e., tuples or triples), an associated schema, temporal information about this version and metadata specific to this version. Domain data must be reified in order to store it in the DIACHRON model. Due to the simplicity of RDF archive model compared to the domain-specific DIACHRON data model, we will use the model of Fernández et al. for the remainder of this document.

Systems for archiving Linked Open Data are categorized into three non-orthogonal stor-age strategies [37]:

The Independent Copies (IC) approach creates separate instantiations of datasets for each change or set of changes.

The Change-Based (CB) approach instead only stores change sets between versions.

The Timestamp-Based (TB) approach stores the temporal validity of facts.

In the following sections, we discuss several existing RDF archiving systems, which use either pure IC, CB or TB, or hybrid IC/CB. Table 5 shows an overview of the discussed systems.

3.2.2.1. Independent copies approaches

SemVersion [63] was one of the first works to look into tracking different versions of RDF graphs. SemVersion is based on Concurrent Versions System (CVS) concepts to maintain different versions of ontologies, such as diff, branching and merging. Their ap-proach consists of a separation of language-specific features with ontology versioning from general features together with RDF versioning. Unfortunately, the implementation details on triple storage and retrieval are unknown.

Name IC CB TB

SemVersion [63] X

Cassidy et. al. [64] X

R&WBase [65] X

R43ples [66] X

Hauptman et. al. [67] X

X-RDF-3X [68] X

RDF-TX [69] X

v-RDFCSA [70] X

Dydra [71] X

TailR [72] X X

Table 5: Overview of RDF archiving solutions with their corresponding storage strategy: Individual copies (IC), Change-based (CB), or Timestamp-based (TB).

3.2.2.2. Change-based approaches

Based on the Theory of Patches from the Darcs software management system [73], Cas-sidy et. al. [64] propose to store changes to graphs as a series of patches, which makes it a CB approach. They describe operations on versioned graphs such as reverse, revert and merge. An implementation of their approach is provided using the Redland python library and MySQL by representing each patch as a named graph and serializing it in TriG [74].

Furthermore, a preliminary evaluation shows that their implementation is significantly slower than a native RDF store. They suggest a native implementation of the approach to avoid some of the overhead.

Im et. al. [75] propose a CB patching system based on a relational database. In their ap-proach, they use a storage scheme called aggregated deltas which associates the latest version with each of the previous ones. While aggregated deltas result in fast delta queries, they introduce much storage overhead.

R&WBase [65] is a CB versioning system that adds an additional versioning layer to ex-isting quad-stores. It adds the functionality of tagging, branching and merging for datasets. The graph element is used to represent the additions and deletions of patches, which are respectively the even and uneven graph IDs. Queries are resolved by looking at the highest even graph number of triples.

Graube et. al. introduce R43ples [66] which stores change sets as separate named graphs, making it a CB system. It supports the same versioning features as R&WBase and intro-duces new SPARQL keywords for these, such as REVISION, BRANCH and TAG. As reconstructing a version requires combining all change sets that came before, queries at a certain version are only usable for medium-sized datasets.

3.2.2.3. Timestamp-based approaches

Hauptman et. al. introduce a similar delta-based storage approach [67] by storing each triple in a different named graph as a TB storage approach. The identifying graph of each triple is used in a commit graph for SPARQL query evaluation at a certain version. Their implementation is based on Sesame [76] and Blazegraph [77] and is slower than snap-shot-based approaches, but uses less disk space.

X-RDF-3X [68] is an extension of RDF-3X [41] which adds versioning support using the TB approach. On storage-level, each triple is annotated with a creation and deletion time-stamp. This enables time-travel queries where only triples valid at the given time are returned.

RDF-TX [69] is an in-memory query engine that supports a temporal SPARQL querying extension. The system is based on compressed multi-version B+Trees that outperforms similar systems such as X-RDF-3X in terms of querying efficiency. The required storage space after indexing is similar to that of X-RDF-3X.

v-RDFCSA [70] is a self-indexing RDF archive mechanism, based on the RDF self-in-dex RDFCSA [54], that enables versioning queries on top of compressed RDF archives as a TB approach. They evaluate their approach using the BEAR [62] benchmark and

BEAR-A BEAR-B

show that they can reduce storage space requirements 60 times compared to raw storage.

Furthermore, they reduce query evaluation times more than an order of magnitude com-pared to state of the art solutions.

Dydra [71] is an RDF graph storage platform with dataset versioning support. They intro-duce the REVISION keyword, which is similar to the GRAPH SPARQL keyword for re-ferring to different dataset versions. Their implementation is based on B+Trees that are indexed in six ways GSPO, GPOS, GOSP, SPOG, POSG, OSPG. Each B+Tree value in-dicates the revisions in which a particular quad exists, which makes it a TB approach.

3.2.2.4. Hybrid approaches

TailR [72] is an HTTP archive for Linked Data pages based on the Memento protocol [44] for retrieving prior versions of certain HTTP resources. It is a hybrid CB/IC approach as it starts by storing a dataset snapshot, after which only deltas are stored for each consecutive version, as shown in Fig. 15. When the chain becomes too long, or oth-er conditions are fulfilled, a new snapshot is created for the next voth-ersion to avoid long version reconstruction times.

Results show that this is an effective way of reducing version reconstruction times [72], in particular for many versions. Within the delta chain, however, an increase in version reconstruction times can still be observed. Furthermore, it requires more storage space than pure delta-based approaches.

The authors’ implementation is based on a relational database system. Evaluation shows that resource lookup times for any version ranges between 1 and 50 ms for 10 versions containing around 500K triples. In total, these versions require ~64MB of storage space.

Dans le document Storing and querying evolving knowledge graphs on the web (Page 74-77)