Information Extraction MPRI 2.26.2: Web Data Management

(1)

Information Extraction

MPRI 2.26.2: Web Data Management

Antoine Amarilli Friday, January 11th

(2)

Idea of Information Extraction (IE)

•

^{Going from}unstructured text...

•

^{... to}structured data

(3)

Detour: Natural Language Processing (NLP)

•

Part-of-speech (POS) tagging:annotate each word with its grammaticalnature

The/DET text/NN is/V annotated/ADJ.

•

Word-sense disambiguation (WSD):choose the rightmeaning: bass/1: a type of fish

bass/2: a music instrument

•

Coreference resolution:

Trump told Macron that \rd{he} was not a spying target.

(4)

•

Word-sense disambiguation (WSD):choose the rightmeaning:

bass/1: a type of fish bass/2: a music instrument

•

Trump told Macron that \rd{he} was not a spying target.

(5)

•

Word-sense disambiguation (WSD):choose the rightmeaning:

bass/1: a type of fish bass/2: a music instrument

•

(6)

Detour: Natural Language Processing (NLP), cont’d

•

^Parsing:figuring out thestructureof the sentence:

•

Preprocessing, e.g.:

• Stop-word removal:“the”, “is”, “at”

• Stemming: “suffixed”→“suffix”

→ All these tasks arerelatedto information extraction

→ But we will often try to do IEwithoutsolving these problems

(7)

•

(8)

•

(9)

Named Entity Recognition (NER)

•

Identifyingnamed entitiesin a document

The firstMPRIclass for2019takes place atSophie Germain.

•

Possibly classify names in a simpletype hierarchy: person, address, date, organization, etc.

•

Difficulties:

• Nested entities:“Bank of America”,“Carnegie Hall”

• Boundaries:“All England Lawn Tennis and Croquet Club”

(10)

•

Difficulties:

(11)

•

Difficulties:

(12)

Approaches for NER

•

If the set isfinite, use adictionary

• Forefficientimplementation, represent the dictionary as atrie and run theAho-Corasick algorithm

•

^Forfixed formatidentifiers (e.g., ISBNs, DOIs, emails, GTINs), write aregexpwith captures and run anautomaton

•

^Otherwise,statisticalapproaches

• May use variousfeatures: context, morphology, case, punctuation, part-of-speech tags, previously extracted named entities...

• Use apre-trainedmodel or train it on your data

•

Implemented, e.g., inSpacy,NLTK,OpenNLP, orStanford NER http://nlp.stanford.edu:8080/ner/process

(13)

•

(14)

•

(15)

•

Implemented, e.g., inSpacy,NLTK,OpenNLP, orStanford NER

(16)

Evaluating NER systems

•

We must oftenevaluatesystems to compare their performance

•

We do so against agold standardof correct results

•

Performance is measured along two independentdimensions:

• Precision, the percentage of extracted matches that are correct

• Recall, the percentage of correct matches that are extracted

→ Extracteverythinggives 100% recall (and very bad precision)

→ Extractnothinggives undefined precision and 0% recall

•

Combining these two scores:F1 measure, which is the harmonic mean of precision and recall

•

^Orprecision-recall curveto show the tradeoff

•

^{To avoid}overfitting, evaluate the system on avalidation dataset different from the one on which the system was designed/trained

(17)

•

(18)

•

(19)

•

(20)

Entity Disambiguation

•

Disambiguatewhich entityis being used

→ The place and function of Venus in Ovid

→ Computed backscattering function of Venus and the moon

•

Usually meanschoosingone of severalentitieswith that name

•

^Several^signals:

• Prior:

• Howwell-knownthe entity is

• How well the namefitsthe entity

→ She went to Paris.

• Similarity between thecontextof the word in the text and that of the entity in the knowledge base

• Consistencywith other disambiguated entities https://gate.d5.mpi-inf.mpg.de/webaida/

(21)

Entity Disambiguation

•

Disambiguatewhich entityis being used

→ The place and function of Venus in Ovid

→ Computed backscattering function of Venus and the moon

•

Usually meanschoosingone of severalentitieswith that name

•

^Several^signals:

• Prior:

• Howwell-knownthe entity is

• How well the namefitsthe entity

→ She went to Paris.

• Similarity between thecontextof the word in the text and that of the entity in the knowledge base

• with other disambiguated entities

(22)

Instance extraction

•

Extracting ataxonomywithis-Arelations

• “Pluto is a dog”

• “a dog is an animal”

•

Hearst patterns:

• “Manyscientists, including Einstein, believed...”

• “France, Germany and other countrieshave been plagued with...”

• “Otherforms of government such as constitutional monarchy...”

•

Set expansion:

• Start with a set of entities of the same type (e.g., countries),

• Find alistortable columncontaining several such entities

• Addthe other entities (assume that they have the same type)

•

^Problems_•

False positives: “the classification of such cities as urban”

• Boundaries:“some scientists, such as computer scientists”

• Disambiguation, andsemantic drift

•

Taxonomy induction:cleaning up the resulting taxonomy Example:NELLhttp://rtw.ml.cmu.edu/rtw/kbbrowser/pred:bird

(23)

•

Hearst patterns:

•

Set expansion:

•

^Problems_•

•

(24)

•

Hearst patterns:

•

Set expansion:

•

^Problems_•

•

(25)

•

Hearst patterns:

•

Set expansion:

•

^Problems_•

•

(26)

•

Hearst patterns:

•

Set expansion:

•

^Problems_•

•

9/12

(27)

Fact extraction

•

Generalization ofHearst patterns, e.g.: “X was born in Y”

•

DIPRE: Dual Iterative Pattern Relation Expansion:

• Apply thepatternsto generate morefacts

• Use the newfactsto learn morepatterns

→ Problems:semantic drift; sometimes multiple relations match...

•

Learning fromstructured Web content, e.g., lists and tables (cf Web Data Commons), Wikipedia infoboxes...

•

General technique:Wrapper induction(see next slide)

(28)

Fact extraction

•

(29)

Fact extraction

•

(30)

Fact extraction

•

(31)

Wrapper induction

(32)

(33)

(34)

(35)

Slide credits

•

Course structure inspired by the class by Fabian Suchanekhttps:

//suchanek.name/work/teaching/inf344-2018/index.html

•

^{Slide 4:}https://www.nltk.org/_images/tree.gif