Democracy Big Bang: What data management can(not) do for journalism

HAL Id: hal-01968347

https://hal.inria.fr/hal-01968347

Submitted on 2 Jan 2019

Ioana Manolescu

To cite this version:

Ioana Manolescu. Democracy Big Bang: What data management can(not) do for journalism. 34ème Conférence sur la Gestion de Données – Principes, Technologies et Applications, Oct 2018, Bucharest, Romania. 2018. ⟨hal-01968347⟩

Democracy Big Bang

What data management can(not) do for journalism

Ioana Manolescu

Inria Saclay and Ecole polytechnique

24/10/18 Bases de Données Avancées, 2018 1

MOTIVATION

Bad memories: Romania, 1989

Ceauşescu re-elected at the 14th congress!

He had been in power since 1965.

Bad memories: Romania, 1985


Things get better

... kind of

1000 dead (approx.). No one convicted.

Democratic societies crucially need the press

•  To debate and express dissent

•  To analyze, confirm or refute public statements

•  To expose and explain how society functions

Fact-checking (see next)

DATA JOURNALISM AND JOURNALISTIC FACT-CHECKING

Data journalism

Investigative journalism based on complex and/or large data

http://abonnes.lemonde.fr/les-decodeurs/portfolio/2017/04/18/les-fractures-francaises-1-5-le-logement-les-raisons-de-la-crise_5112859_4355770.html

Panama Papers (International Consortium of Investigative Journalists, ICIJ)

Fact-checking (since 1930 approx.)

“The day I became a fact-checker at The New Yorker, I received one set of red pencils […] for underlining passages on page proofs of articles that might contain checkable facts. […] confirmed with the help of reference books from the magazine’s library”

http://www.nytimes.com/2010/08/22/magazine/22FOB-medium-t.html

Fact-checking: verification of facts mentioned in media content

•  To protect media reputation and avoid legal action

Fact-checking (2012 – ongoing)

It's not just checking

•  Most aspects of modern reality are complex

•  Explaining can be as important and useful as checking

–  It also helps analyze the future

Libération Désintox: cost of saving Cyprus

$2-3 bn

$10 bn

$15 bn

Saving Cyprus: how much does it cost?

The article reads:

•  The money is not given but lent

•  The European Stability Mechanism (ESM) will lend $9bn

•  Out of which France contributes 20%, or a bit less than $2bn

However, things are complicated because

•  The initial contribution of France to the ESM ($16 bn) is counted toward French public debt

•  The remaining contribution ($124 bn) is not

The initial statements are mostly false

The complete picture is complicated

I would not have been able to give the answer

Forming a competent opinion is non-trivial
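The "a bit less than $2bn" figure follows directly from the two numbers quoted in the article. The sketch below is only a back-of-the-envelope check using the slide's own figures; the constant and function names are illustrative assumptions, and the calculation deliberately ignores the public-debt accounting subtleties listed above.

```python
# Figures quoted on the slide, in billions of dollars.
LOAN_BN = 9.0        # amount lent by the stability mechanism
FRANCE_SHARE = 0.20  # fraction of the loan attributed to France

def france_part_bn(loan_bn: float, share: float) -> float:
    """Return France's part of the loan, in billions."""
    return loan_bn * share

exposure = france_part_bn(LOAN_BN, FRANCE_SHARE)
print(f"France's part of the loan: ${exposure:.1f} bn")
```

This amount is lent, not spent, which is one reason why comparing it directly with a headline "cost" is misleading.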

Do we need to understand?

"Populism is telling people that there are simple answers to complex problems"

Fact checking vs. fake news detection

•  Fact checking is based on some background information source

•  Fake news detection may or may not use a source

•  E.g., a text classifier (true, fake) trained on major news agency content and, respectively, known fake content (often written in a virulent style)
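To make the source-free approach concrete, here is a minimal sketch of such a (true, fake) text classifier: a multinomial Naive Bayes model with add-one smoothing, written in plain Python. The four training sentences are invented toy examples, not real news or real fake content, and Naive Bayes is chosen here only for brevity; the slides do not say which classifier is meant.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: (text, label) pairs. Returns per-label word counts and document counts."""
    word_counts, doc_counts = {}, Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # class prior
        n_words = sum(counts.values())
        for w in tokenize(text):
            # Add-one (Laplace) smoothing for words unseen in this class.
            score += math.log((counts[w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set (assumption, for illustration only).
TOY_TRAINING_SET = [
    ("ministry publishes quarterly unemployment figures", "true"),
    ("parliament votes budget after long debate", "true"),
    ("shocking secret they do not want you to know", "fake"),
    ("unbelievable scandal exposed share before deleted", "fake"),
]

model = train(TOY_TRAINING_SET)
label = classify("shocking scandal they want deleted", *model)
print(label)
```

Note that such a model only captures style: it says nothing about whether a statement agrees with a reference source, which is the difference the slide draws between fake news detection and fact checking.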

A CONTENT MANAGEMENT PERSPECTIVE

Objectives

1.  Model fact-checking through a data and information management perspective

2.  Identify existing tools and techniques which could be directly applied

– In a special journalistic context (see next)

3.  Devise new models, tools and techniques for fact-checking and data journalism problems

ANR ContentCheck project

•  2016-2019

•  Partners:

–  Inria Saclay: I. Manolescu

–  LIMSI (now Sorbonne Université): Xavier Tannier

–  U. Lyon: Sylvie Cazalens, Philippe Lamarre, Jean-Marc Petit, M. Plantevit

–  U. Rennes 1: François Goasdoué

–  Le Monde: Les Décodeurs

•  Also: WebClaimExplain team with AIST Japan (Julien Leblay)

Fact-checking as a content management problem

Diagram components:

•  Media content and media context

•  Claim to be checked (text or data)

•  Reference information sources 1, 2, …, n, n+1

•  Human actors (journalists, experts, crowd workers)

•  Verification tool (query, match, source search…)

•  Analysis result + proof: « True / rather true / rather false / false. See sources: http://dataref.com… »

Tasks annotated on the diagram:

•  Claim extraction

•  Social network analysis

•  Source search / source selection

•  Reconciliation, reputation

•  Reference source construction, refinement, integration

[WWW2018] "A Content Management Perspective on Fact-Checking", S. Cazalens, J. Leblay, I. Manolescu, X. Tannier (fact-checking track)

[WWW2018 tutorial] "Computational fact-checking: problems, state of the art, and perspectives", J. Leblay, I. Manolescu, X. Tannier

[VLDB2018 tutorial] "Computational fact-checking: a content management perspective", S. Cazalens, J. Leblay, P. Lamarre, I. Manolescu, X. Tannier
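The content management view of fact-checking can be sketched as a handful of data types and one function: a claim goes in, reference sources are consulted, and a verdict plus proof comes out. Everything below is a hypothetical illustration, not the ContentCheck architecture: the class names, the numeric matching rule, the error thresholds, the extra "unverifiable" label, the placeholder URL and the example claim are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str      # statement as found in the media content
    quantity: str  # what the claim is about (hypothetical key)
    value: float   # claimed value

@dataclass
class ReferenceSource:
    name: str
    url: str
    facts: dict    # quantity -> reference value

@dataclass
class AnalysisResult:
    verdict: str                               # slide labels, or "unverifiable"
    proof: list = field(default_factory=list)  # URLs of the sources used

def select_source(claim, sources):
    """Source selection: first reference source covering the claim's quantity."""
    for source in sources:
        if claim.quantity in source.facts:
            return source
    return None

def verify(claim, sources):
    source = select_source(claim, sources)
    if source is None:
        return AnalysisResult("unverifiable")
    reference = source.facts[claim.quantity]
    # Relative error; the thresholds below are arbitrary assumptions.
    error = abs(claim.value - reference) / max(abs(reference), 1e-12)
    if error <= 0.05:
        verdict = "true"
    elif error <= 0.25:
        verdict = "rather true"
    elif error <= 1.0:
        verdict = "rather false"
    else:
        verdict = "false"
    return AnalysisResult(verdict, [source.url])

reference = ReferenceSource(
    name="hypothetical loan data",
    url="https://example.org/cyprus-loan",  # placeholder URL
    facts={"french_share_bn": 1.8},
)
# Made-up claim text, for illustration only.
claim = Claim("Saving Cyprus will cost France $10 bn", "french_share_bn", 10.0)
result = verify(claim, [reference])
print(result.verdict, result.proof)
```

Returning the source URL alongside the verdict reflects the "analysis result + proof" box of the diagram: a label without its sources would not be usable by a journalist.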

Which existing technologies can be used?

•  Database queries?

– Journalists do not, generally, build databases.

– It's seen as "not part of our job"

– Persisting data is novel to some

– SQL unfriendly

– Not always helped by journal information systems

•  Remarkable exception: Ouest France

•  Curse of coverage: they need to cover (almost) any topic

•  Curse of noteworthiness: they write about the hot topics of today (or tomorrow)

•  They work under strong time pressure

Which existing technologies can be used?

•  IR/NLP?

– Better options

•  The outcome of a fact-check has to be explainable

•  Not (only) "an ML algorithm said so"

•  But they don't like to train ML systems, either.

•  They are extremely picky about data sources

– "A majority of Web sources claim it" is not an option

Automatic source selection: FactMinder

[SIGMOD2013demo] F. Goasdoué, K. Karanasos, Y. Katsis, J. Leblay, I. Manolescu, S. Zampetakis: "Fact-checking and analyzing the Web"

•  Plug-in

•  Brings relevant data from document and knowledge bases

Improving access to reference data sources

[BDA2018] T.-D. Cao, I. Manolescu, X. Tannier. "Extracting Linked Data from statistic spreadsheets" (WebDB 2018, SBD 2017)

For the query “Créations d’entreprises en France en 2015” ("Business creations in France in 2015"), we return:

Keyword search across heterogeneous sources

[VLDB2018demo] C. Chanial, R. Dziri, H. Galhardas, J. Leblay, M. Le Nguyen and I. Manolescu: "ConnectionLens: Finding Connections Across Heterogeneous Data Sources"

•  Also today
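One simple way to picture "finding connections" is to load items from several sources into a single graph and look for a path between the nodes that match two keywords. The sketch below uses plain breadth-first search over a tiny hand-made graph; it is not the ConnectionLens algorithm, and the node names and edges are invented for illustration.

```python
from collections import deque

def shortest_connection(graph, start, goal):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Invented toy graph: nodes could come from a spreadsheet, a JSON feed
# and a text corpus; edges link items that mention each other.
TOY_GRAPH = {
    "person:A. Dupont": ["company:Acme SA"],
    "company:Acme SA": ["person:A. Dupont", "contract:2016-17"],
    "contract:2016-17": ["company:Acme SA", "ministry:Transport"],
    "ministry:Transport": ["contract:2016-17"],
}

path = shortest_connection(TOY_GRAPH, "person:A. Dupont", "ministry:Transport")
print(" -> ".join(path))
```

The returned path doubles as an explanation of why the two keywords are related, which matters given the requirement that fact-check outcomes be explainable.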

Temporal beliefs: a structured data model for fact-checking

With F. Goasdoué and L. Duroyon (U. Rennes 1)

Modeling who said what when

•  "On Jan 22, 2017, Le Canard Enchaîné wrote that François Fillon had stated on Jan 21 that Penelope worked for the NA from 2002 to 2008."

•  What has one actor heard of?

•  Who knew this at that time?

•  Who reversed their position on a topic?
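The example sentence nests one dated statement inside another, around a fact that has its own validity period. A minimal sketch of such a structure is given below; the class and function names are assumptions made for illustration and do not reproduce the authors' actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterator, Union

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int  # year
    valid_to: int    # year

@dataclass(frozen=True)
class Statement:
    actor: str
    when: date
    content: Union["Statement", Fact]  # a statement may report another statement

def levels(stmt: Statement) -> Iterator[Statement]:
    """Yield every nesting level of a statement, outermost first."""
    node = stmt
    while isinstance(node, Statement):
        yield node
        node = node.content

def base_fact(stmt: Statement) -> Fact:
    """Return the fact at the bottom of a chain of statements."""
    node = stmt
    while isinstance(node, Statement):
        node = node.content
    return node

def who_said_by(fact, statements, deadline):
    """Actors who stated or reported `fact` on or before `deadline`."""
    return {
        level.actor
        for stmt in statements
        if base_fact(stmt) == fact
        for level in levels(stmt)
        if level.when <= deadline
    }

# The example from the slide.
fact = Fact("Penelope", "worked for", "NA", 2002, 2008)
fillon = Statement("François Fillon", date(2017, 1, 21), fact)
canard = Statement("Le Canard Enchaîné", date(2017, 1, 22), fillon)

print(who_said_by(fact, [canard], date(2017, 1, 21)))
print(who_said_by(fact, [canard], date(2017, 1, 22)))
```

With this structure, "Who knew this at that time?" becomes a filter on the `when` field of each nesting level.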

WRAP-UP

WHERE WE STAND, WHAT CAN WE DO?

Perspectives

•  Any question answering framework which can be plugged on trusted data sources is useful

–  Usually there is neither the time nor the skills to integrate the data

•  No IT infrastructure can be counted upon

–  More "a jumble of tools"

•  ML promises a lot but delivery is hard(er)

–  Also: journalists' ambivalence

•  Educate: the general audience, journalists, other social scientists

•  Strong and increasingly IT-literate community

–  E.g. https://www.poynter.org/channels/fact-checking

NLP is hard

Is this worth it?

“Some people will never be convinced”

•  Correct.

•  “Facts have a liberal bias” (Paul Krugman)

•  Source: https://www.nytimes.com/2017/12/08/opinion/facts-have-a-well-known-liberal-bias.html

•  Scientists and humanities scholars believe in a constructed, logical discourse, and believe humans yield to reason. Businesspeople know this is not true, in general. Businesspeople thus have an advantage in winning political competitions.

George Lakoff, former Berkeley professor

https://georgelakoff.com/2016/11/22/a-minority-president-why-the-polls-failed-and-what-the-majority-can-do/

•  Conspiracy theory believers can hold two obviously contradictory theories [Wood et al., 2012]

THANK YOU / QUESTIONS?

HTTP://CONTENTCHECK.INRIA.FR/
