• Aucun résultat trouvé

November21st,2016 AntoineAmarilli ,SilviuManiu UncertainDataManagementSourcesofUncertainData

N/A
N/A
Protected

Academic year: 2022

Partager "November21st,2016 AntoineAmarilli ,SilviuManiu UncertainDataManagementSourcesofUncertainData"

Copied!
23
0
0

Texte intégral

(1)

Uncertain Data Management

Uncertain Data Management Sources of Uncertain Data

Antoine Amarilli1, Silviu Maniu2

1Télécom ParisTech

2LRI

November 21st, 2016

(2)

Uncertain Data Management

Uncertain Data Management

Database systems usually assume that data iscorrectandcomplete

Incomplete andmissing data Imprecise data

Noisy data

Untrustworthy data

Whichapplications produce uncertain data nowadays?

(3)

Uncertain Data Management

Uncertain Data Management

Database systems usually assume that data iscorrectandcomplete Incomplete andmissing data

Imprecise data Noisy data

Untrustworthy data

Whichapplications produce uncertain data nowadays?

(4)

Uncertain Data Management

Uncertain Data Management

Database systems usually assume that data iscorrectandcomplete Incomplete andmissing data

Imprecise data Noisy data

Untrustworthy data

Whichapplications produce uncertain data nowadays?

(5)

Uncertain Data Management

Never-Ending Language Learning

NELL: Read the Web

(6)

Uncertain Data Management

Information extraction

DeepDive: extract facts from journal articles

����������������

�������������� ��������� ���������� �����

�����������������������������������������������

����

���

���

��� �������� ��������� ��������� ���� ���������

��� ��

���� ����

������� ������� �������

�������� ��������� ��� ��

�������� ��������� ��� �������

������������������� ����������

��������������

�����

��� ���������

�������� ���������

(7)

Uncertain Data Management

Many sources of uncertainty

Errors in sources:

Entity disambiguation:

“The place and function of Venusin Ovid...”

“Computed backscattering function of Venusand the moon...” Natural language parsing:

“such cities asNew York”, “such cities as Mohenjo-Daro”

“the classification of such cities as urban” Anaphora resolution:

“Obama told Hollande thathewas not a spying target” Noise:

Incompleteness

(8)

Uncertain Data Management

Many sources of uncertainty

Errors in sources:

Entity disambiguation:

“The place and function of Venusin Ovid...”

“Computed backscattering function of Venusand the moon...”

Natural language parsing:

“such cities asNew York”, “such cities as Mohenjo-Daro”

“the classification of such cities as urban” Anaphora resolution:

“Obama told Hollande thathewas not a spying target” Noise:

Incompleteness

(9)

Uncertain Data Management

Many sources of uncertainty

Errors in sources:

Entity disambiguation:

“The place and function of Venusin Ovid...”

“Computed backscattering function of Venusand the moon...”

Natural language parsing:

“such cities asNew York”, “such cities as Mohenjo-Daro”

“the classification of such cities asurban”

Anaphora resolution:

“Obama told Hollande thathewas not a spying target” Noise:

Incompleteness

(10)

Uncertain Data Management

Many sources of uncertainty

Errors in sources:

Entity disambiguation:

“The place and function of Venusin Ovid...”

“Computed backscattering function of Venusand the moon...”

Natural language parsing:

“such cities asNew York”, “such cities as Mohenjo-Daro”

“the classification of such cities asurban”

Anaphora resolution:

“Obama told Hollande thathewas not a spying target”

Noise:

Incompleteness

(11)

Uncertain Data Management

Many sources of uncertainty

Errors in sources:

Entity disambiguation:

“The place and function of Venusin Ovid...”

“Computed backscattering function of Venusand the moon...”

Natural language parsing:

“such cities asNew York”, “such cities as Mohenjo-Daro”

“the classification of such cities asurban”

Anaphora resolution:

Incompleteness

(12)

Uncertain Data Management

Many sources of uncertainty

Errors in sources:

Entity disambiguation:

“The place and function of Venusin Ovid...”

“Computed backscattering function of Venusand the moon...”

Natural language parsing:

“such cities asNew York”, “such cities as Mohenjo-Daro”

“the classification of such cities asurban”

Anaphora resolution:

“Obama told Hollande thathewas not a spying target”

Noise:

(13)

Uncertain Data Management

Crowdsourcing

Amazon Mechanical Turk

Users are untrustworthy!

(14)

Uncertain Data Management

Crowdsourcing

Amazon Mechanical Turk

(15)

Uncertain Data Management

Sentiment analysis

n Most positiven-grams Most negativen-grams

1 engaging; best; powerful; love; beautiful bad; dull; boring; fails; worst; stupid; painfully 2 excellent performances; A masterpiece; masterful

film; wonderful movie; marvelous performances

worst movie; very bad; shapeless mess; worst thing; instantly forgettable; complete failure 3 an amazing performance; wonderful all-ages tri-

umph; a wonderful movie; most visually stunning for worst movie; A lousy movie; a complete fail- ure; most painfully marginal; very bad sign 5 nicely acted and beautifully shot; gorgeous im-

agery, effective performances; the best of the year; a terrific American sports movie; refresh- ingly honest and ultimately touching

silliest and most incoherent movie; completely crass and forgettable movie; just another bad movie. A cumbersome and cliche-ridden movie;

a humorless, disjointed mess 8 one of the bestfilms of the year; A love forfilms

shines through each frame; created a masterful piece of artistry right here; A masterfulfilm from a masterfilmmaker,

A trashy, exploitative, thoroughly unpleasant ex- perience ; this sloppy drama is an empty ves- sel.; quickly drags on becoming boring and pre- dictable.; be the worst special-effects creation of the year

(16)

Uncertain Data Management

Schema mappings

(a)

(b)

(c)

(17)

Uncertain Data Management

Scientific data

(18)

Uncertain Data Management

Speech recognition and OCR

Decoding output is uncertain

(19)

Uncertain Data Management

Robotics

(20)

Uncertain Data Management

Other applications

Data integration: combine data across sources Data cleaning: fix errors in stale/outdated data Machine learning: predictions are uncertain Data mining: trends extracted from large datasets Computational biology: genomic data management

... and much more!

(21)

Image Credits

Slide 5: http://rtw.ml.cmu.edu/

Slide 7: https://en.wikipedia.org/wiki/Template:Disputed Slide 6: [Zhang, 2015], page 9

Slide 13: https://www.mturk.com/

Slide 15: [Socher et al., 2013], page 10 Slide 16: [Dong et al., 2009], page 4 Slide 17:

https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CONFNOTES/ATLAS-CONF-2015-041/fig_06b.png

Slide 18: https:

//code.google.com/p/transducersaurus/wiki/CascadeTutorial

(22)

References I

Dong, X. L., Halevy, A., and Yu, C. (2009).

Data integration with uncertainty.

The VLDB Journal—The International Journal on Very Large Data Bases, 18(2):469–500.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., and Potts, C. (2013).

Recursive deep models for semantic compositionality over a sentiment treebank.

InProc. EMNLP.

(23)

References II

Zhang, C. (2015).

DeepDive: A Data Management System for Automatic Knowledge Base Construction.

PhD thesis, University of Winconsin–Madison.

https:

//cs.stanford.edu/people/czhang/zhang.thesis.pdf.

Références

Documents relatifs

 Cas idéal: placez 10 points rouges (classe 1) et 10 points bleus (classe 0) dans deux groupes similaires, distincts et linéairement séparables et apprenez avec l’algorithme

“The place and function of Venus in Ovid...”. “Computed backscattering function of Venus and

Extensional approach: evaluating the query directly Intensional approach: computing the query lineage Tractable rules for query evaluation. Tractable lineage classes

Basics Relational algebra SQL Relational calculus Query fragments Query evaluation.. Uncertain

This book is about the tools and techniques of machine learning used in practical data mining for finding, and describing, structural patterns in data.. As with any burgeoning

Research on this problem in the late 1970s found that these diagnostic rules could be generated by a machine learning algorithm, along with rules for every other disease category,

In these two environments, we faced the rapid development of a field connected with data analysis according to at least two features: the size of available data sets, as both number

The main body of the book is as follows: Part 1, Data Mining and Knowledge Discovery Process (two Chapters), Part 2, Data Understanding (three Chapters), Part 3, Data