Querying heterogeneous data in NoSQL document stores
Another line of work considers large volumes of data and enables querying of documents encoded in JSON. For instance, SQL++ (Ong et al., 2014) offers a query language designed to retrieve information from semi-structured data, e.g., documents. Furthermore, other works from the literature, such as Google Tenzing (Lin et al., 2011), Google Dremel (Melnik et al., 2010), Apache Drill (Hausenblas and Nadeau, 2013), and Apache Spark SQL (Armbrust et al., 2015), propose that users query data without first defining a schema for it. In Tenzing (Lin et al., 2011), the authors introduce a query mechanism inspired by the SQL query language and executing on MapReduce systems. To this end, they propose to infer relational models from the underlying documents. The limitation of this work is that it only supports flat structures; in other words, only documents composed of attributes of primitive types are supported. This does not always hold in the context of document stores, where nested structures are commonly used in current data-intensive applications. In contrast, other solutions such as Dremel (Melnik et al., 2010) and Drill support nested data. We also notice that systems such as Apache Spark SQL (Armbrust et al., 2015) fit the data into main memory using a custom data model called data frames. A data frame reuses the table structure, in which each column contains the values of one attribute. In the case of heterogeneous structures, data frames are built based on the most used structures, thus missing elements present in a limited number of documents. Furthermore, data frames can be materialised and loaded to optimise querying performance. However, if new structures are inserted into the collection of heterogeneous documents, the data frames need to be regenerated, which changes the column signatures and could affect existing workloads.
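The loss described above can be illustrated with a small Python sketch (hypothetical documents and helper names): a data-frame-like structure is built from the majority field-name signature, so fields that appear only in a few documents are silently projected away.

```python
from collections import Counter

# Hypothetical collection of heterogeneous JSON-like documents:
# most documents share one structure, a few use another.
docs = [
    {"title": "A", "year": 2014},
    {"title": "B", "year": 2015},
    {"title": "C", "year": 2016},
    {"title": "D", "editor": "X"},   # rare structure: 'editor' instead of 'year'
]

def dominant_columns(docs):
    """Pick columns from the most frequent structure (field-name set),
    mimicking a data frame built on the majority schema."""
    signatures = Counter(frozenset(d) for d in docs)
    return sorted(signatures.most_common(1)[0][0])

def to_data_frame(docs):
    cols = dominant_columns(docs)
    # Documents are projected onto the majority columns: fields outside
    # them (here, 'editor') are silently lost.
    return cols, [[d.get(c) for c in cols] for d in docs]

cols, rows = to_data_frame(docs)
print(cols)     # ['title', 'year']
print(rows[3])  # ['D', None] -- the 'editor' value is missing
```

Inserting a document with a new structure changes the signature counts, so the columns (and any workload bound to them) may change on regeneration, as the excerpt notes.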

Querying heterogeneous data in graph-oriented NoSQL systems

Graph-oriented systems belong to the “schemaless” framework [2, 7], which consists in writing data without any prior schema restrictions; i.e., each node and each edge has its own set of attributes, thus allowing a wide variety of representations [6]. This flexibility generates heterogeneous data and makes querying more complex for users, who are compelled to know the different schemas of the manipulated data. This paper addresses this issue and considers a straightforward approach for querying heterogeneous data in NoSQL graph-oriented systems. The proposed approach aims at simplifying the querying of heterogeneous data by limiting the negative impact of their heterogeneity, making this heterogeneity “transparent” for users.
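The problem can be sketched in a few lines of Python (hypothetical nodes and attribute names, not the paper's actual method): with schemaless nodes, a query written against one attribute name misses nodes that encode the same information under a variant, unless the variants are made transparent behind a single logical attribute.

```python
# Hypothetical property-graph nodes: schemaless, so each node carries
# its own attribute set (here, 'name' vs. 'label' for the same concept).
nodes = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "label": "Bob"},   # variant schema
    {"id": 3, "name": "Carol"},
]

# Naive query: the user must know every attribute variant, or miss nodes.
naive = [n["id"] for n in nodes if n.get("name") == "Bob"]

# "Transparent" query: a table of known variants hides the heterogeneity.
VARIANTS = {"name": ["name", "label"]}

def get(node, attr):
    """Resolve a logical attribute against any of its known variants."""
    for v in VARIANTS.get(attr, [attr]):
        if v in node:
            return node[v]
    return None

transparent = [n["id"] for n in nodes if get(n, "name") == "Bob"]
print(naive)        # [] -- node 2 is missed
print(transparent)  # [2]
```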

Querying heterogeneous document stores

Schema Discovery. Other works propose to infer implicit schemas from semi-structured documents. The idea is to give an overview of the different elements present in the integrated data (Baazizi et al., 2017) (Ruiz et al., 2015). In (Wang et al., 2015) the authors propose summarizing all document schemas under a skeleton to discover the existence of fields or sub-schemas inside the collection. In (Herrero et al., 2016) the authors suggest extracting collection structures to help developers while designing their applications. The heterogeneity problem here is detected when the same attribute is represented differently (different type, different position inside documents). Schema inference methods are useful for the user to get an overview of the data and to take the necessary measures and decisions during the application design phase. The limitation of such a logical view is the need for a manual process when building the desired queries, which must include the desired attributes and their possible navigational paths. In that case, the user is aware of the data structures but is still required to manage heterogeneity.
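A minimal sketch of the kind of skeleton such schema-discovery approaches produce (a simplification, not the cited systems' actual algorithms): collect every dotted field path in the collection with the set of types observed for it, so that both type heterogeneity and positional heterogeneity become visible.

```python
def field_paths(doc, prefix=""):
    """Yield (dotted path, type name) pairs for every field of a document,
    recursing into nested sub-documents."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_paths(value, prefix=path + ".")
        else:
            yield path, type(value).__name__

def skeleton(collection):
    """Summarise a collection: for each path, the set of observed types.
    Heterogeneity shows up as several types, or several paths, per attribute."""
    summary = {}
    for doc in collection:
        for path, tname in field_paths(doc):
            summary.setdefault(path, set()).add(tname)
    return summary

docs = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": "30"},        # same attribute, different type
    {"person": {"name": "Carol"}},       # same attribute, different position
]
print(skeleton(docs))
# {'name': {'str'}, 'age': {'int', 'str'}, 'person.name': {'str'}}
```

As the excerpt points out, such a summary informs the user but does not by itself rewrite queries: the user must still enumerate `age` vs. `"age"`-as-string, or `name` vs. `person.name`, by hand.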

Implementation of Multidimensional Databases with Document-Oriented NoSQL

Document-oriented systems are one of the most famous families of NoSQL systems. Data is stored in collections, which contain documents. Each document is composed of key-value pairs, and a value can itself be a nested sub-document. Document-oriented stores enable more flexibility in schema design: they allow the storage of complex structured data and heterogeneous data in one collection. Although document-oriented databases are declared to be “schemaless” (no schema needed), most uses conform to some data model.

Querying Key-Value Stores under Single-Key Constraints: Rewriting and Parallelization

We performed an experimental evaluation whose goal is to show the benefits of parallelization when querying key-value stores under semantic constraints. We deployed our tool on top of the key-value store MongoDB, version 3.6.3. Our experiments are based on the XMark benchmark, a standard testing suite for semi-structured data [11]. XMark provides a document generator whose output was translated to obtain JSON records complying with our setting. Precisely, we performed our experiments on a key-value store instance created by shredding XMark-generated data into JSON records. The results reported here concern an instance created from 100MB of XMark data and split into ∼60K records of size ∼1KB. XMark also provides a set of queries that were translated to our setting. To test query evaluation in the presence of constraints, we then extended the benchmark by manually adding a set of 68 rules on top of the data. These are “specialization” rules of the form k_new → k_xmark, where k_xmark is a key of the XMark
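Rule-based rewriting of this kind can be sketched as follows in Python (the rule table and key names are invented for illustration; the paper's 68 rules are over XMark keys): a query on a key is rewritten into the set containing the key plus all keys declared as its specializations, so records using specialized keys are still retrieved.

```python
# Hypothetical "specialization" rules of the form k_new -> k_xmark:
# the key on the left is a specialization of the key on the right.
RULES = {
    "homepage_eu": "homepage",
    "homepage_us": "homepage",
    "emailaddress_alt": "emailaddress",
}

def rewrite(key):
    """Rewrite a queried key into the set of keys that must be fetched:
    the key itself plus all of its declared specializations."""
    return {key} | {k for k, target in RULES.items() if target == key}

records = [
    {"homepage": "http://a.example"},
    {"homepage_eu": "http://b.example"},   # only matched after rewriting
]

keys = rewrite("homepage")
hits = [r for r in records if keys & set(r)]
print(sorted(keys))  # ['homepage', 'homepage_eu', 'homepage_us']
print(len(hits))     # 2
```

Since each rewritten key can be fetched independently, the expanded key set is a natural unit for the parallelization the evaluation measures.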

Document-Oriented Models for Data Warehouses

Capgemini, Toulouse, France

Keywords: NoSQL, Document-oriented, Data Warehouse, Multidimensional Data Model, Star Schema.

Abstract: There is an increasing interest in NoSQL (Not Only SQL) systems developed in the area of Big Data as candidates for implementing multidimensional data warehouses, due to the capabilities of data structuration/storage they offer. In this paper, we study implementation and modeling issues for data warehousing with document-oriented systems, a class of NoSQL systems. We study four different mappings of the multidimensional conceptual model to document data models. We focus on formalization and cross-model comparison. Experiments go through important features of data warehouses including data loading, OLAP cuboid computation and querying. Document-oriented systems are also compared to relational systems.

Towards schema-independent querying on document data stores

The first category of systems is designed to enable queries based on reliable knowledge of the schema or of the navigational paths to the desired values when dealing with nested data. Such systems offer complicated querying languages, such as regular expressions with XQuery or XPath [17], when dealing with XML data. XQuery works with the structure to retrieve precisely the desired results. However, if the user does not know the structure, it is impossible to write the relevant query. Moreover, a single query is generally not able to retrieve data when several schemas must be considered simultaneously. The same considerations apply to JSONiq [9], the extension of XQuery designed to deal with large-scale data such as JSON data. Other systems suggest a JavaScript query API, as in the case of MongoDB [5], to build a query by specifying a document with the properties expected to match the results. It offers a broad range of querying capabilities, in particular data processing pipelines. The API requires a complex syntax, and queries must explicitly include all the various schema structures within documents to access data. Otherwise, the query engine returns only documents that match the supplied criteria, even if the fields with the desired information exist but under paths other than those given in the query. Another kind of work, SQL++ [19], relies on the rich SQL querying interface. In this case, it is also mandatory to express all exact navigational paths in order to obtain the desired results.
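The exact-path behaviour described above can be simulated in pure Python (hypothetical documents; `lookup` mimics dotted-path resolution rather than calling a real driver): a query on one path misses documents holding the same value elsewhere, so the user must enumerate every known path explicitly.

```python
def lookup(doc, dotted_path):
    """Resolve a dotted navigational path (e.g. 'address.city')
    against a nested document; None if the path does not exist."""
    node = doc
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

docs = [
    {"name": "Alice", "address": {"city": "Toulouse"}},
    {"name": "Bob", "city": "Toulouse"},   # same information, other path
]

# Exact-path query: only documents matching the supplied path are returned.
exact = [d["name"] for d in docs if lookup(d, "address.city") == "Toulouse"]

# To retrieve both, the user must explicitly enumerate every known path.
paths = ["address.city", "city"]
union = [d["name"] for d in docs
         if any(lookup(d, p) == "Toulouse" for p in paths)]

print(exact)  # ['Alice'] -- Bob is missed despite holding the value
print(union)  # ['Alice', 'Bob']
```

The enumerated-paths workaround is exactly the manual burden that schema-independent querying aims to remove.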

Data Location Management Protocol for Object Stores in a Fog Computing Infrastructure

[Fig. 10: Star diagram summarising the characteristics of our proposed approach.]

… deleted because they are also used for other objects. To conclude this section, we have shown that our protocol is better adapted to Fog infrastructures than the DHT, because the location is found along the physical path from the current node to the root node. Finally, in addition to reducing the lookup latency, the creation of location records enables the sites to locate reachable object replicas in case of network partitioning, increasing the autonomy of Fog sites. The properties of the proposed protocol are summarised in Figure 10. Our protocol limits the network traffic exchanged while locating an object, and thus the impact on access times. A second advantage is that it also limits the minimal number of replicas needed and the knowledge of the topology required by the nodes. More precisely, in our protocol, location records are created only for objects accessed remotely, and each site knows only its parent. We note that the amount of data to move when a site is added or removed will be discussed in Section V.
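A minimal sketch of the lookup along the physical path (site names, the parent table, and the record layout are all invented for illustration): each site knows only its parent, and the lookup climbs toward the root until a location record is found.

```python
# Hypothetical tree of Fog sites: each site knows only its parent, and a
# site keeps location records only for objects it has resolved remotely.
PARENT = {"leaf1": "edge1", "leaf2": "edge1", "edge1": "root", "root": None}
RECORDS = {  # site -> {object_id: site holding a replica}
    "leaf2": {"obj42": "leaf2"},
    "edge1": {"obj42": "leaf2"},
    "root":  {"obj42": "leaf2"},
}

def locate(site, obj):
    """Walk the physical path from the current site up to the root,
    returning the first location record found and the hops taken."""
    hops = 0
    while site is not None:
        holder = RECORDS.get(site, {}).get(obj)
        if holder is not None:
            return holder, hops
        site = PARENT[site]
        hops += 1
    return None, hops

print(locate("leaf1", "obj42"))  # ('leaf2', 1) -- found at the parent
```

Because the answer is found on the way to the root, nearby replicas remain locatable even if the root is unreachable, which is the partition-tolerance argument made above.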

C-Set : a Commutative Replicated Data Type for Semantic Stores

The Thomas write rule [5] presents techniques by which a number of loosely coupled processes can maintain duplicate copies of a database, despite the unreliability of their only means of communication. The copies of the database can be kept consistent. However, in order to remove old deleted entries (“garbage collection”), they propose the following scheme: each site notifies the other sites whenever it hears about a deletion. If these notifications are transmitted in order with the “normal” sequence of modifications, then upon receipt of such a notification a site can be sure that the sending site has delivered any outstanding assignments to the deleted entry, has marked it as deleted, and will not generate any new assignments to it. This implies knowledge of all the sites in the system, a constraint that is not compatible with the P2P network context.
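The global-membership requirement can be made concrete with a small sketch (class and method names are invented; this is a simplification of the scheme, not C-Set itself): a tombstone may be purged only after deletion notices from every known site, so the replica must hold the full site list.

```python
class Replica:
    """Toy replica using the notification scheme described above:
    a tombstone is garbage-collected only once all sites have
    acknowledged the deletion -- hence the global membership set."""

    def __init__(self, site_id, all_sites):
        self.site = site_id
        self.all_sites = set(all_sites)  # must know every site in the system
        self.data = {}
        self.tombstones = {}             # key -> sites that acknowledged

    def delete(self, key):
        self.data.pop(key, None)
        self.tombstones[key] = {self.site}

    def on_deletion_notice(self, key, from_site):
        # Notices arrive in order with normal modifications, so the sender
        # has already applied all outstanding assignments to `key`.
        self.tombstones.setdefault(key, set()).add(from_site)
        if self.tombstones[key] >= self.all_sites:
            del self.tombstones[key]     # safe to garbage-collect

r = Replica("A", ["A", "B", "C"])
r.data["x"] = 1
r.delete("x")
r.on_deletion_notice("x", "B")
print("x" in r.tombstones)  # True -- still waiting for site C
r.on_deletion_notice("x", "C")
print("x" in r.tombstones)  # False -- tombstone collected
```

In a P2P network the `all_sites` set is unknown and churning, so the collection condition can never be evaluated, which is the incompatibility the excerpt points out.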

Querying RDF Data Using A Multigraph-based Approach

The experimental results for DBPEDIA are depicted in Figure 6 and Figure 7. The time performance (averaged over 200 queries) for star-shaped queries (Fig. 6a) affirms that AMbER clearly outperforms all the competitors. Further, the robustness of each approach, evaluated in terms of the percentage of unanswered queries within the stipulated time, is shown in Figure 6b. For the given time constraint, x-RDF-3X and Jena are unable to output results from size 20 and size 30 onwards, respectively. Although Virtuoso and gStore output results up to query size 50, their time performance is still poor. Moreover, as the query size increases, the percentage of unanswered queries for Virtuoso and gStore keeps increasing, from ∼0% to 65% and from ∼45% to 95%, respectively. On the other hand, AMbER answers >98% of the queries, even for queries of size 50, establishing its robustness. Analyzing the results for complex-shaped queries (Fig. 7), we underline that AMbER still outperforms all the competitors for all sizes. In Figure 7a, we observe that x-RDF-3X and Jena are the slowest engines; Virtuoso and gStore perform better than them but nowhere close to AMbER. We further observe that x-RDF-3X and Jena are the least robust, as they do not output results from size 30 onwards (Fig. 7b); on the other hand, AMbER is the most robust engine, as it answers >85% of the queries even for size 50. The percentage of unanswered queries for Virtuoso and gStore increases from 0% to ∼80% and from 25% to ∼70%, respectively, as the size increases from 10 to 50.

Flexible Querying of Web data to Simulate Bacterial Growth in Food

HAL Id: lirmm-00538961 — https://hal-lirmm.ccsd.cnrs.fr/lirmm-00538961 (submitted on 29 May 2020).

Querying RDF Data: A Multigraph Based Approach

entity graph to empower its search engine and provide further information extracted, for instance, from Wikipedia. Another example is supplied by recent question-answering systems [CAB 12, ZOU 14a] that automatically translate natural-language questions into SPARQL queries and subsequently retrieve answers by considering the information available in the different Linked Open Data sources. In all these examples, complex queries (in terms of size and structure) are generated to ensure the retrieval of all the required information. Since the use of large knowledge bases that are commonly stored as RDF triples is becoming a common way to ameliorate a wide range of ap-

A quality-aware spatial data warehouse for querying hydroecological data

Firstly, we propose a modeling step completely tailored for environmental purposes, as it pools heterogeneous spatiotemporal data within a unified model (i.e., it can reconcile data from a semantic standpoint). Once the model is unified, data can be properly structured and integrated to improve quality control in the data warehouse. Quality dimensions required by domain experts can be monitored during this step. Secondly, we integrate the data into the spatial data warehouse without using any cleaning rules. ETL

BayesDB : querying the probable implications of tabular data

The three models currently implemented in BayesDB (Naive Bayes, CRP Mixture (also known as DP Mixture), and CrossCat) are all generative probabilistic models.

Integrating heterogeneous data sources in the Web of data

approach undertaking the linking of all scientific assets into executable papers. The concept of an executable paper advocates the adaptation of traditional journal articles to the needs of data-intensive science. Concretely, an executable paper shall interlink articles, data, metadata, methods, software, etc., thus delivering a validated, citable, tractable, and executable experimental context. To achieve this, Linked Open Science builds on several key components: (i) Linked Data to annotate and/or represent articles, data and metadata; (ii) Open Source and Web-based environments to provide software tools and methods, thereby making it easy to reproduce experiments; (iii) cloud-computing environments to make it easy to repeat CPU- and space-intensive tasks; (iv) Creative Commons licensing to provide a legal framework in which all scientific assets can be reused. Note that the latter condition may not always be appropriate: some licenses are better suited to deal with data and metadata, e.g. Open Data Commons. LinkedScience.org is a community-driven project started in 2011, meant to showcase what Linked Open Science is about in practice. Not only does the project aim at reproducible science, but it also puts a specific stress on the need for education and dissemination of scientific results. Through different types of events, LinkedScience.org spurs Linked Science among scientific communities, and promotes tools and workflows that could facilitate the practice of Linked Science. Other initiatives address similar challenges, although not necessarily under the term Linked Science. This is the case of four major projects funded by the European Union FP7 program: BioMedBridges, CRIPS, DASISH and ENVRI are clusters of research infrastructures in biomedical sciences, physics, social sciences and humanities, and environmental sciences, respectively. They have come together to identify the common challenges across scientific disciplines with respect to data management, sharing and integration [Field et al., 2013]. They have drawn up a list of topics of interest covering notably traceable and citable research objects (data, software, user), semantic interoperability (interlinked vocabularies and ontologies, context and provenance metadata), and data processing services (description, composition, discovery, marketplace). For all of these topics, the recommendations rely extensively on the concepts of Linked Data, and Linked Science is implied in the idea of linking together all the assets needed to reproduce an experiment.

Parallel Data Loading during Querying Deep Web and Linked Open Data with SPARQL

3.1 A Concurrency Model for the Integrated RDF Graph

Regarding our approach, we need a model that can handle concurrent insertions. However, RDF stores like Jena do not handle concurrent insertions; they are only able to favor one type of operation, e.g., reads or insertions. This strategy is implemented with locks, but read and insert locks are mutually exclusive, i.e., they cannot be simultaneously activated. Existing RDF stores assume that there are more readers than writers and follow the multiple-readers/single-writer (MRSW) strategy. According to MRSW, many readers may read simultaneously, while a writer must have exclusive access. MRSW assumes writers have priority, to keep the data up to date. Nevertheless, in our proposed approach, data insertions are going to be more frequent than data reads. A reader is the query engine that accesses the integrated RDF graph during query execution, while the writers are the wrappers of the relevant views, which load the data into the integrated RDF graph. The query engine cannot execute the query more often than views are loaded into the integrated RDF graph, because executing the query is expensive, and doing so too often may lead to performance degradation.
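The MRSW strategy with writer preference that the excerpt describes can be sketched in Python with `threading.Condition` (class and variable names are ours; this illustrates the generic lock discipline, not Jena's internals): readers share the lock, a writer excludes everyone, and readers also yield to waiting writers.

```python
import threading

class MRSWLock:
    """Multiple-readers/single-writer lock with writer preference:
    readers share access; a writer has exclusive access; new readers
    wait while any writer is active or waiting."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._waiting_writers = 0

    def acquire_read(self):
        with self._cond:
            # Writer preference: readers yield to *waiting* writers too.
            while self._writer or self._waiting_writers:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = MRSWLock()
log = []

def writer(i):
    lock.acquire_write()
    log.append(f"w{i}")   # exclusive: no two writers append concurrently
    lock.release_write()

threads = [threading.Thread(target=writer, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(log))  # all three exclusive writes completed
```

Under the insert-heavy workload described above, this same discipline serializes the frequent wrapper insertions while the (rarer) query-engine reads wait, which motivates the concurrency model the section develops.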
