Multilinguality in Wikidata

(1)

Multilinguality in Wikidata

Lucie-Aimée Kaffee

[email protected]

(2)

About Me

PhD Student

WAIS, University of Southampton Previously worked as a Software

Developer at Wikimedia Deutschland, in the Wikidata team

Interest in (under-resourced) languages From Berlin, Germany

(3)

What we will talk about

Wikidata

Multilinguality in Wikidata My work

(4)

Wikidata

➔ Knowledge base maintained and edited by a community of users

➔ 48,775,926 items

➔ Each entity can have labels in >400 languages

LOD cloud

(5)

(6)

Multilinguality in Wikidata

(7)

A Glimpse into Babel:

An Analysis of Multilinguality in Wikidata

Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, Lydia Pintscher OpenSym 2017

(8)

Q7259

Multilinguality in Wikidata

Q82594 P106

Ada Lovelace سﯾﻼﻓوﻟ ادآ occupation ﺔﻧﮭﻣﻟا computer

scientist بوﺳﺎﺣ مﻟﺎﻋ

@en @ar @en @ar @en @ar

rdfs:label rdfs:label rdfs:label rdfs:label rdfs:label

rdfs:label

(9)

Multilinguality in Wikidata - Why do we care?

● Labels are the access point for humans

● Give language communities access to existing knowledge

● Central storage for translations for (under resourced) languages

● Semantic Web in NLP and NLG

● Reuse: Wikipedia, translation, question answering, chat bots, ...

(10)

Research Questions

● What is the state of Wikidata with regard to multilinguality?

● How does Wikidata's label distribution relate to the real world and Wikipedia's language distribution?

● Is there a difference in the multilinguality of the properties, compared to the

overall multilinguality of the knowledge base?

(11)

(12)

11.04%

(13)

11.04%

6.5%

6%

5%

4%

(14)

Comparison of distribution of languages in Wikidata and first language speakers in the world

(15)

The most spoken language in the world, Chinese, is not well covered in Wikidata.

(16)

Bot edits can make a difference in content coverage (Cebuano and Swedish)

(17)

Dedicated communities change language representation (German and Dutch)

(18)

Q7259

Wikidata Properties

Q82594 P106

Ada Lovelace سﯾﻼﻓوﻟ ادآ occupation ﺔﻧﮭﻣﻟا computer

scientist بوﺳﺎﺣ مﻟﺎﻋ

@en @ar @en @ar @en @ar

rdfs:label rdfs:label rdfs:label rdfs:label rdfs:label

rdfs:label

(19)

Ranking of number of

(20)

Wikipedia articles by language,

(21)

Wikipedia articles by language, all labels in Wikidata,

(22)

Wikipedia articles by language, all labels in Wikidata,

and labels for properties in Wikidata

(23)

German is widely used in Wikipedia and Wikidata

High coverage through active community

(24)

As German, active community that brings high coverage of labels

(25)

High coverage in labels on Wikidata through high number of bot-imported Wikipedia articles, however low number of community edited properties

(26)

Even more extreme than Swedish: Not in top 25 of community-edited properties by language

(27)

Users in Wikidata

(Work in Progress)

(28)

(29)

Native Languages of Wikidata users

(30)

Language coverage of labels does not reflect in languages Wikidata’s users speak

Native Languages of Wikidata users

(31)

From Wikidata to Wikipedia

(32)

Wikipedia is available in 285 languages, but the content is unevenly distributed English

Articles: 5,656,303 Editors: 132,781

Arabic

Articles: 576,376 Editors: 4,809

Esperanto

Articles: 247, 215 Editors: 361

(33)

Content From Wikidata to Wikipedia

(34)

Learning to Generate Wikipedia

Summaries for Underserved Languages from Wikidata

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl

NAACL 2018

(35)

Esperanto

➔ Esperanto is an artificial language

➔ Easy to learn

➔ Engaged Wikipedia community

➔ A good starting point

➔ Arabic is the 5th most spoken language in the world

➔ Content online in Arabic is sparse however

Arabic

en

ar

eo

(36)

Sample Input

Q490900 Floridia

P17 country

Q38 Italy

Q490900 Floridia

P31 instance of

Q747074 comune of Italy Q30025755

Floridia (town)

P36 capital

Q490900 Floridia

...

(37)

Neural Text Generation

Arabic and Esperanto output text Feed-forward architecture encodes Wikidata triples into vector of fixed dimensionality

RNN-based decoder generates text summaries, one token at a time Property placeholder to deal with out of vocabulary words

(38)

Q106693 Group 14 (chemical series)

رﺻﺎﻧﻌﻠﻟ يرودﻟا لودﺟﻟا ﻲﻓ ةدوﺟوﻣﻟا ةدوﺟوﻣﻟا رﺻﺎﻧﻌﻟا ﻲھ نوﺑرﻛﻟا ﺔﻋوﻣﺟﻣ

Karbongrupo estas elemento en grupo 0 de la perioda tabelo laŭ la IUPAC-sistemo .

The carbon group is a periodic table group consisting of carbon, silico-n, germanium, tin, lead, and flerovium.

Q16885 Thelxinoe (natural satellite)

. يرﺗﺷﻣﻟا بﻛوﻛﻟ ﻊﺑﺎﺗ ﺔﯾﻌﺟارﺗ ﺔﻛرﺣﺑ كرﺣﺗﯾ ﻲﻣﺎظﻧ رﯾﻏ ﻲﻌﯾﺑط رﻣﻗ وھ نوﯾﺳﻛﯾﻠﯾﺛ Telksino estas neregula satelito de Jupitero , kiu havas retrogradan orbiton .

Thelxinoe (/θɛlkˈsɪnoʊˌiː/ thelk-SIN-o-ee; Greek: Θελξινόη), also known as Jupiter XLII, is a natural satellite of Jupiter.

(39)

Automatic Evaluation

● Baselines

○ Machine Translation, Information Retrieval Based, Kneser-Ney

● Automatic Evaluation

○ BLEU 1, BLEU 2, BLEU 3, BLEU 4, METEOR, ROUGE

(40)

Results of the automatic evaluation: Our network outperforms all baselines

(41)

Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl

ESWC 2018

(42)

ArticlePlaceholder display

Wikidata triples on Wikipedia in tabular way

Currently deployed on 14

Wikipedias

(43)

(44)

(45)

Enriching ArticlePlaceholder with textual summaries generated from Wikidata triples

Working with Arabic and

Esperanto

(46)

Community Study

➔ Two 15 days online surveys, aimed at readers and editors in Esperanto and Arabic

➔ Aiming to test our work with the actual Wikipedia community, outreach on Wikipedia plattforms

➔ Reader:

◆ Fluency: Is the text understandable and grammatically correct?

◆ Appropriateness: Does the summary ‘feel’ like a Wikipedia article?

➔ Editor:

◆ Editors were asked to edit the article starting from our summary (2-3 sentences)

◆ How much of the text was reused?

(47)

Results of the reader study

Kaffee, Elsahar, Vougiouklis et al.: Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

(48)

Results of the reader study: We generate sentences of comparable fluency, that “feel” like Wikipedia sentences

Kaffee, Elsahar, Vougiouklis et al.: Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

(49)

Results of the editor study: We generate sentences that are highly reused by editors

wholly derived

partially derived wholly derived

non derived

partially derived non derived

(50)

Results of the editor study: We generate sentences that are highly reused by editors

wholly derived

partially derived wholly derived

non derived

partially derived

non derived

78.78%

94.77%

(51)

Our algorithms can always only be as good as information in our data.

(52)

Our algorithms can always only be as good as information in our data. Severe lack of data in Arabic in Wikidata.

(53)

Future Work:

Label Extraction From Wikipedia

For Wikidata

(54)

Example Wikidata Triple

Berlin Capital Of Germany

Q64 P1376 Q183

Berlin Hauptstadt X English

German

(55)

Example Wikidata Triple

Berlin Capital Of Germany

Q64 P1376 Q183

Berlin Hauptstadt X English

German

(56)

(57)

(58)

(59)

(60)

● Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders, Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl, ESWC 2018

● Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata, Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl, NAACL 2018

● The Human Face of the Web of Data: A Cross-Sectional Study of Labels, Lucie-Aimée Kaffee, Elena Simperl, SEMANTiCS 2018

● Property Label Stability in Wikidata, Thomas Pellissier Tanon, Lucie-Aimée Kaffee, Wiki Workshop at The Web Conference 2018

● A Glimpse into Babel: An Analysis of Multilinguality in Wikidata, Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, Lydia Pintscher, OpenSym 2017

● Provenance Information in a Collaborative Knowledge Graph: an Evaluation of Wikidata External References, Alessandro Piscopo, Lucie-Aimée Kaffee, Chris Phethean, Elena Simperl, ISWC 2017

● What do Wikidata and Wikipedia Have in Common?: An Analysis of their Use of External References, Alessandro Piscopo, Pavlos Vougiouklis, Lucie-Aimée Kaffee, Christopher Phethean, Jonathon Hare, Elena Simperl, OpenSym 2017

● Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christoph Gravier, Frederique Laforest, Jonathon Hare, Elena Simperl (under submission)

(61)

Thanks!

[email protected] luciekaffee.github.io

@frimelle

Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for

ArticlePlaceholders, Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl, ESWC 2018 Learning to Generate Wikipedia Summaries for

Underserved Languages from Wikidata, Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl, NAACL 2018

Property Label Stability in Wikidata, Thomas Pellissier Tanon, Lucie-Aimée Kaffee, Wiki Workshop at The Web Conference 2018

A Glimpse into Babel: An Analysis of Multilinguality in Wikidata, Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, Lydia Pintscher, OpenSym 2017