• Aucun résultat trouvé

Challenges in Big Data Management and Analytics Uncertainty, Structure, Intensionality

N/A
N/A
Protected

Academic year: 2022

Partager "Challenges in Big Data Management and Analytics Uncertainty, Structure, Intensionality"

Copied!
14
0
0

Texte intégral

(1)

Challenges in Big Data Management and Analytics Uncertainty, Structure,

Intensionality

Pierre Senellart

(2)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,

structured data), very diverse organization of data Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store Veracity: Data quality very diverse; imprecise, imperfect,

untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(3)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB)

Variety: Very diverse forms of data (text, multimedia, graphs, structured data), very diverse organization of data Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store Veracity: Data quality very diverse; imprecise, imperfect,

untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(4)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,

structured data), very diverse organization of data

Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store Veracity: Data quality very diverse; imprecise, imperfect,

untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(5)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,

structured data), very diverse organization of data Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store

Veracity: Data quality very diverse; imprecise, imperfect, untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(6)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,

structured data), very diverse organization of data Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store Veracity: Data quality very diverse; imprecise, imperfect,

untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(7)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,

structured data), very diverse organization of data Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store Veracity: Data quality very diverse; imprecise, imperfect,

untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(8)

The Five Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,

structured data), very diverse organization of data Velocity: Data produced or changing at high speed (LHC:

100,000,000 collisions / second), more than able to store Veracity: Data quality very diverse; imprecise, imperfect,

untrustworthy information

Value: Making sense of potentially very valuable data, but with a value not immediately apparent

Special focus within IPAL:Web Data (Web pages, social networks, e-commerce data, Semantic Web, Linked Open Data, etc.)

(9)

Uncertain data is everywhere

Numerous sources of uncertain data:

Measurement errors

Data integration from contradicting sources

Imprecise mappings between heterogeneous schemas

Imprecise automatic processes (information extraction, natural language processing, etc.)

Imperfect human judgment Lies, opinions, rumors

(10)

Structured data is everywhere

Data is structured, not flat:

Variety of representation formats of data in the wild:

relational tables

trees, semi-structured documents

graphs, e.g., social networks or semantic graphs data streams

complex views aggregating individual information Heterogeneous schemas

Additionalstructural constraints: keys, inclusion dependencies

(11)

Intensional data is everywhere

Lots of data sources can be seen as intensional: accessing all the data in the source (in extension) isimpossible orvery costly, but it is possible to access the data through views, with someaccess constraints, associated with some access cost.

Indexesover regular data sources

Deep Web sources: Web forms, Web services

The Web or social networks as partial graphs that can be expanded bycrawling

Outcome of complex automated processes: information extraction, natural language analysis, machine learning, ontology matching Crowd data: (very) partial views of the world

(12)

Interactions between

uncertainty, structure, intensionality

If the data has complex structure, uncertain models should represent possible worlds over these structures (e.g., probability distributions over graph completions of a known subgraph in Web crawling).

If the data is intensional, we can use uncertainty to representprior distributionsabout what may happen if we access the data.

Sometimes good enough to reach a decision without having to make the access!

If the data is a RDF graph accessed by semantic Web services, each intensional data access willnot give a single data point, but a complex subgraph.

(13)

Introducing UnSAID

Uncertainty and Structure in the Access to Intensional Data Jointly deal with Uncertainty, Structure, and the fact that access to data is limited and has acost, to solve a user’s knowledge need Lazy evaluationwhenever possible

Evolving probabilistic, structured view of the current knowledge of the world

Solve at each step the problem: What is the best access to do next given my current knowledge of the world and the knowledge need Knowledge acquisition plan (recursive, dynamic, adaptive) that minimizes access cost, and provides probabilistic guarantees

Contributions to this project are welcome!

(14)

Introducing UnSAID

Uncertainty and Structure in the Access to Intensional Data Jointly deal with Uncertainty, Structure, and the fact that access to data is limited and has acost, to solve a user’s knowledge need Lazy evaluationwhenever possible

Evolving probabilistic, structured view of the current knowledge of the world

Solve at each step the problem: What is the best access to do next given my current knowledge of the world and the knowledge need Knowledge acquisition plan (recursive, dynamic, adaptive) that minimizes access cost, and provides probabilistic guarantees

Contributions to this project are welcome!

Références

Documents relatifs

In this talk, I will describe the key secular trends that characterize the field of Big Data with respect to enterprise analytics.. I will describe some of the open challenges

The key issues here are: (i) being able to associate a cost model for the data market, i.e., associate a cost to raw data and to processed data according on the amount of data

We believe that three aspects are key for dealing with intensive Big Data management: (i) dealing with efficient simple storage systems that focus efficient read and write

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB) Variety: Very diverse forms of data (text, multimedia, graphs,..

In conjunction with the demand for more and more frequent data from many different sources and for each of these entities, this results in generating vast

In SUPERSEDE, an integration- oriented RDF graph is used to represent and integrate the data related to monitoring and user feedback, as well as crossing it with contextual data

A layered architecture, DMEs and KVSs The idea is simple: A DME (the data consumer, DC) wants to use a data manipulation task at another DME (the data producer, DP).. For

Michael was responsi- ble for a number of National and European funded research projects in the area of Data Mining, Machine Learning, and Big Data. Distribution of this paper