6.1.4. Publishing and Querying Evolving Data

Finally, Chapter 5 addressed the challenge of providing a query interface for evolving knowledge graphs. The main goal of this work was to determine whether (part of) the effort for executing continuous queries over evolving knowledge graphs could be moved from server to client, in order to allow the server to handle more concurrent clients. The outcome of this work was a polling-based Web interface for evolving knowledge graphs, and a client-side algorithm that is able to perform continuous queries using this interface.

The first research question of this work was:

Can clients use volatility knowledge to perform more efficient continuous SPARQL query evaluation by polling for data?

This question was answered by annotating dynamic data server-side with time annotations, and by introducing a client-side algorithm that detects these annotations and derives a polling frequency for continuous query evaluation from them. This way, clients only have to re-download data from the server when it has changed. Furthermore, static data only has to be downloaded from the server once, and can therefore be cached optimally by the client. In practice, one could argue that no data is ever truly static indefinitely, which is why practical implementations will require caches with a high maximum age for static data when performing continuous querying over long periods of time.
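As an illustration, the following is a minimal sketch of such an annotation-driven polling loop. It is not the actual algorithm from Chapter 5 (which also handles static data and query re-evaluation); the `fetchFragment` function and `Fragment` shape are assumptions introduced here, standing in for any fetcher that exposes the server's time annotations as an expiration timestamp.

```typescript
// Minimal sketch of annotation-driven polling; fetchFragment is assumed to be
// provided by the caller and to expose the server's time annotations.
interface Fragment {
  triples: string[]; // dynamic triples (serialized here for simplicity)
  expires: number;   // expiration time in Unix milliseconds, taken from the time annotations
}

type FetchFragment = (url: string) => Promise<Fragment>;

async function pollContinuously(
  url: string,
  fetchFragment: FetchFragment,
  onResults: (triples: string[]) => void,
): Promise<void> {
  // Poll indefinitely; in a real continuous query engine this loop drives re-evaluation.
  for (;;) {
    const fragment = await fetchFragment(url);
    onResults(fragment.triples);

    // Sleep until the announced expiration time, so unchanged data is never re-downloaded.
    const delay = Math.max(0, fragment.expires - Date.now());
    await new Promise(resolve => setTimeout(resolve, delay));
  }
}
```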

Our second research question was formulated as:

How does the client and server load of our solution compare to alternatives?

This question was answered by comparing the server and client load of our approach with state-of-the-art server-side engines. Results show a clear movement of load from server to client, at the cost of increased bandwidth usage and execution time. The benefit of this becomes especially clear when the number of concurrent clients increases. The server load of our approach scales significantly better than that of other approaches for an increasing number of clients. This is because each client now helps with query execution, which frees up a significant portion of server load. Since multiple concurrent clients also lead to server requests for overlapping URLs, a server cache should theoretically be beneficial as well. However, follow-up work has shown that such a cache leads to higher server load [129] due to the high cost of cache invalidation over dynamic data. This shows that caching dynamic data is unlikely to achieve overall performance benefits. More intelligent caching techniques may lead to better efficiency, for example by only caching data that will remain valid for at least a given time period, as sketched below.
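The following is a hedged sketch of such a time-aware caching policy: responses are only admitted to the cache when their remaining validity exceeds a configurable threshold, so the cost of invalidating highly volatile data is avoided. The class and threshold are illustrative, not part of the evaluated system.

```typescript
// Illustrative time-aware caching policy: only cache entries that remain valid
// long enough to amortize the cost of storing and invalidating them.
interface CachedEntry {
  body: string;
  expires: number; // Unix milliseconds
}

class TimeAwareCache {
  private entries = new Map<string, CachedEntry>();

  constructor(private minValidityMs: number) {}

  get(url: string): string | undefined {
    const entry = this.entries.get(url);
    if (!entry || entry.expires <= Date.now()) {
      this.entries.delete(url); // lazily invalidate expired entries
      return undefined;
    }
    return entry.body;
  }

  put(url: string, body: string, expires: number): void {
    // Skip caching when the data expires too soon to be worth storing.
    if (expires - Date.now() >= this.minValidityMs) {
      this.entries.set(url, { body, expires });
    }
  }
}
```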

The final research question was defined as:

How do different time-annotation methods perform in terms of the resulting execution times?

Results have shown that annotating dynamic data with expiration times through named graphs leads to the lowest total execution times compared to the other annotation approaches. This is because the named graphs approach requires fewer triples to be downloaded from the server, and since bandwidth usage has a significant impact on query execution times, reducing the number of downloaded triples directly reduces those times.
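To make the named graphs approach more concrete, the sketch below shows how a client might extract expiration times from quads whose named graph is annotated with an expiration triple in the default graph. The expiration predicate IRI and the quad representation are placeholders introduced for illustration; the actual vocabulary and data model are defined in Chapter 5.

```typescript
// Illustrative extraction of expiration-annotated dynamic data.
// The expiration predicate IRI is a placeholder, not the actual vocabulary.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
  graph: string; // named graph the triple belongs to ('' for the default graph)
}

const EXPIRATION = 'http://example.org/vocab#expiration'; // placeholder IRI

// Collect, per named graph, the expiration time announced in the default graph.
function expirationsPerGraph(quads: Quad[]): Map<string, number> {
  const expirations = new Map<string, number>();
  for (const quad of quads) {
    if (quad.predicate === EXPIRATION && quad.graph === '') {
      expirations.set(quad.subject, Date.parse(quad.object));
    }
  }
  return expirations;
}

// Keep only the dynamic quads whose enclosing named graph has not expired yet.
function validDynamicQuads(quads: Quad[], now: number): Quad[] {
  const expirations = expirationsPerGraph(quads);
  return quads.filter(q => {
    const expires = expirations.get(q.graph);
    return expires !== undefined && expires > now;
  });
}
```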

6.1.5. Overview

By investigating these four challenges, we can answer our main research question.

Concretely, evolving knowledge graphs with a low volatility (order of minutes or slower) can be made queryable on the Web through a low-cost polling-based interface, with a hybrid snapshot/delta/timestamp-based storage system in the back end. On top of this and other interfaces, intelligent client-side query engines can perform continuous queries.

This comes at the cost of an increase in bandwidth usage and execution time, but with a higher guarantee on result completeness, as server availability is improved. All of this can be evaluated thoroughly using synthetic evolving datasets, which can for example be generated with a mimicking algorithm for public transport networks.

This proves that evolving knowledge graphs can be published and queried on the Web.

Furthermore, no high-cost Web infrastructure is needed to publish or query such graphs, which lowers the barrier for publishing smaller, decentralized evolving knowledge graphs, without requiring the budget of a giant company.

6.2. Limitations

There are several limitations to my contributions that require attention, which will be discussed hereafter.

6.2.1. Generating Evolving Data

In Chapter 2, I introduced a mimicking algorithm for generating public transport datasets.

One could however question whether such domain-specific datasets are sufficient for testing evolving knowledge graph systems in general. As shown in Section 2.5, the introduced data model contains a relatively small number of RDF properties and classes.

While large domain-specific knowledge graphs like these are valuable, domain-overlapping knowledge graphs such as DBpedia [40] and Wikidata [132] contain many more distinct properties and classes, which place additional demands on systems. For such cases, multi-domain (evolving) knowledge graph generators could be created in future work.

Furthermore, the mimicking algorithm produces temporal data in a batch-based manner, instead of as a continuous streaming process. This requires an evolving knowledge graph to be produced with a fixed temporal range, and it does not allow knowledge graphs to evolve continuously for a non-predetermined amount of time. The latter would be valuable for stream processing systems that need to be evaluated over long periods of time, which would require an adaptation of the algorithm to make it streaming, as sketched below.
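One possible shape of such an adaptation, as a hedged sketch: instead of materializing a fixed temporal range up front, the generator could be wrapped in an asynchronous generator that keeps yielding change events step by step. The `Event` shape, `StepGenerator` type, and pacing are assumptions for illustration, not the algorithm from Chapter 2.

```typescript
// Hypothetical streaming variant of a batch-based generator:
// events are yielded one by one instead of materializing a fixed temporal range.
interface Event {
  time: number;      // simulation time of the event
  triples: string[]; // the triples describing this change
}

// Assumed to exist: produces all events for a single simulation step.
type StepGenerator = (step: number) => Event[];

async function* streamEvents(generateStep: StepGenerator, stepDelayMs: number): AsyncGenerator<Event> {
  for (let step = 0; ; step++) {
    // Generate one step at a time so the stream can run for an unbounded duration.
    for (const event of generateStep(step)) {
      yield event;
    }
    await new Promise(resolve => setTimeout(resolve, stepDelayMs));
  }
}
```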

6.2.2. Indexing Evolving Data

In Chapter 3, a storage mechanism for evolving knowledge graphs was introduced. The main limitation of this work is that ingestion times continuously increase as more versions are added. This is caused by the fact that versions are typically relative to the previous version, whereas this storage approach handles versions relative to the initial version.

Such versions therefore need to be converted at ingestion time, which takes increasingly longer as more versions accumulate. This shows that this approach can currently not be used for knowledge graphs that evolve for an indefinitely long time, such as DBpedia Live [78]. One possible solution to this problem would be to fully maintain the latest version for faster relative version recalculation.
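The simplified sketch below illustrates why this conversion cost grows: an incoming delta, expressed relative to the previous version, must be composed with the aggregated delta against the initial snapshot before it can be stored, and that aggregate keeps growing with the number of versions. The data structures are deliberately simplified; the actual storage layout is described in Chapter 3.

```typescript
// Simplified model of delta ingestion: deltas are stored relative to the initial snapshot,
// so each incoming delta (relative to the previous version) must be composed with
// the aggregated delta built up so far.
interface Delta {
  additions: Set<string>; // triples, serialized for simplicity
  deletions: Set<string>;
}

function composeWithAggregate(snapshot: Set<string>, aggregate: Delta, incoming: Delta): Delta {
  const additions = new Set(aggregate.additions);
  const deletions = new Set(aggregate.deletions);

  for (const triple of incoming.additions) {
    if (snapshot.has(triple)) {
      deletions.delete(triple); // re-added snapshot triple: no longer a deletion
    } else {
      additions.add(triple);    // new triple with respect to the snapshot
    }
  }
  for (const triple of incoming.deletions) {
    if (snapshot.has(triple)) {
      deletions.add(triple);    // snapshot triple removed in this version
    } else {
      additions.delete(triple); // previously added triple removed again
    }
  }
  return { additions, deletions };
}
```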

The second main limitation is the fact that delta (DM) queries do not efficiently support result offsets. As such, my approach is not ideal for use cases where random access within version differences is needed over very large evolving knowledge graphs, such as finding the 10th or 1000th most-read book between 2018 and 2019. My algorithm naively applies an offset by iterating over and discarding results until the offset amount is reached, as opposed to the more intelligent offset algorithms for the other versioned query types, where an index is used to apply the offset. One possible solution would be to add an additional index for optimizing offsets for delta queries, which would however also lead to increased storage space and ingestion times.
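For clarity, the naive offset handling described above amounts to the following sketch: results are pulled from the delta result iterator and discarded until the offset has been consumed, so the cost is linear in the offset. An index-based approach would instead jump to the offset position directly. The function is illustrative, not the actual implementation.

```typescript
// Naive offset handling over a delta (DM) result iterator:
// results are consumed and discarded until the requested offset has been skipped.
function* applyOffsetNaively<T>(results: Iterable<T>, offset: number, limit: number): Generator<T> {
  let skipped = 0;
  let emitted = 0;
  for (const result of results) {
    if (skipped < offset) {
      skipped++;    // discard results until the offset is reached
      continue;
    }
    if (emitted >= limit) {
      return;
    }
    emitted++;
    yield result;   // results after the offset are emitted as usual
  }
}
```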

6.2.3. Heterogeneous Web Interfaces

The main limitation of the Comunica meta query engine from Chapter 4 is its non-interruptible architecture. This means that once the execution of a certain query operation is started, it cannot be stopped before completion without killing the engine completely.

This means that meta-algorithms that dynamically switch between algorithms depending on their execution times cannot be implemented within Comunica. In order to make this possible, a significant change to the architecture of Comunica would be required, in which every actor can be interrupted after being started, and these interruptions are propagated through to chained operations.
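A hedged sketch of what such an interruptible design could look like is shown below, using the standard AbortSignal mechanism. This is not how Comunica is implemented today; it merely illustrates how an interruption could be propagated through chained operations by having all operators in a pipeline share the same signal.

```typescript
// Sketch of interruption propagation through chained operations using AbortSignal
// (not Comunica's actual architecture; names are illustrative).
async function* interruptible<T>(source: AsyncIterable<T>, signal: AbortSignal): AsyncGenerator<T> {
  for await (const item of source) {
    // Check between intermediate results, so a long-running operator
    // can be stopped without killing the whole engine.
    if (signal.aborted) {
      return;
    }
    yield item;
  }
}

// Chained operators all share the same signal, so aborting propagates through the pipeline.
async function runPipeline<T>(source: AsyncIterable<T>, signal: AbortSignal): Promise<T[]> {
  const results: T[] = [];
  for await (const item of interruptible(source, signal)) {
    results.push(item);
  }
  return results;
}

const controller = new AbortController();
// A meta-algorithm could call controller.abort() once a faster alternative algorithm is selected.
```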

Another limitation of Comunica is its development complexity, which is a consequence of its modularity. Practice has shown that there is a steep learning curve for adding new modules to Comunica, which is largely due to its error-prone dependency injection system. To alleviate this problem, tutorials are being created and presented, and tools are being developed to simplify the usage of the dependency injection framework. Furthermore, higher-level tools such as GraphQL-LD [133] and LDflex are being developed to lower the barrier for querying with Comunica.

6.2.4. Publishing and Querying Evolving Data

The main limitation of our publishing and querying approach for evolving data from Chapter 5 is the fact that it only works for slowly evolving data. From the moment that data changes at the order of one second or faster, the polling-based query approach becomes too slow, and results become outdated even before they are produced. This is mainly caused by the round-trip times of HTTP requests, and the fact that multiple of them are needed because of the Triple Pattern Fragments querying approach. For data that evolves much faster, a polling-based approach like this is not a good solution. Socket-like solutions, where client and server maintain an open connection, would be able to reach much higher data velocities, since servers can send updates to subscribed clients immediately, without having to wait for a client request, which reduces result latency.
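For illustration, a push-based alternative could look like the sketch below, where a browser-side client subscribes over a WebSocket and the server sends updates as soon as data changes. The endpoint URL, subscription message, and update format are hypothetical and not part of the work in Chapter 5.

```typescript
// Illustrative push-based alternative to polling: the server sends updates
// to subscribed clients as soon as data changes (endpoint and message format are hypothetical).
const socket = new WebSocket('wss://example.org/updates'); // placeholder endpoint

socket.onopen = () => {
  // Subscribe to updates for a given resource (message shape is illustrative).
  socket.send(JSON.stringify({ subscribe: 'http://example.org/stops/123' }));
};

socket.onmessage = (event: MessageEvent) => {
  // Updates arrive without any client-initiated request, reducing result latency.
  const update = JSON.parse(event.data as string);
  console.log('Received update', update);
};
```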

The second limitation to consider is the significantly higher bandwidth usage compared to other approaches, which has been shown in follow-up work [129]. This means that this approach is not ideal for use cases where bandwidth is limited, such as querying from low-end mobile devices, or querying in rural areas with a slow internet connection. This higher bandwidth usage is inherent to the Triple Pattern Fragments approach, since more data needs to be downloaded from the server, so that the client can process it locally.