• Aucun résultat trouvé

Information Extraction MPRI 2.26.2: Web Data Management

N/A
N/A
Protected

Academic year: 2022

Partager "Information Extraction MPRI 2.26.2: Web Data Management"

Copied!
35
0
0

Texte intégral

(1)

Information Extraction

MPRI 2.26.2: Web Data Management

Antoine Amarilli Friday, January 11th

(2)

Idea of Information Extraction (IE)

Going fromunstructured text...

... tostructured data

(3)

Detour: Natural Language Processing (NLP)

Part-of-speech (POS) tagging:annotate each word with its grammaticalnature

The/DET text/NN is/V annotated/ADJ.

Word-sense disambiguation (WSD):choose the rightmeaning: bass/1: a type of fish

bass/2: a music instrument

Coreference resolution:

Trump told Macron that \rd{he} was not a spying target.

(4)

Detour: Natural Language Processing (NLP)

Part-of-speech (POS) tagging:annotate each word with its grammaticalnature

The/DET text/NN is/V annotated/ADJ.

Word-sense disambiguation (WSD):choose the rightmeaning:

bass/1: a type of fish bass/2: a music instrument

Coreference resolution:

Trump told Macron that \rd{he} was not a spying target.

(5)

Detour: Natural Language Processing (NLP)

Part-of-speech (POS) tagging:annotate each word with its grammaticalnature

The/DET text/NN is/V annotated/ADJ.

Word-sense disambiguation (WSD):choose the rightmeaning:

bass/1: a type of fish bass/2: a music instrument

Coreference resolution:

(6)

Detour: Natural Language Processing (NLP), cont’d

Parsing:figuring out thestructureof the sentence:

Preprocessing, e.g.:

Stop-word removal:“the”, “is”, “at”

Stemming: “suffixed”“suffix”

→ All these tasks arerelatedto information extraction

→ But we will often try to do IEwithoutsolving these problems

(7)

Detour: Natural Language Processing (NLP), cont’d

Parsing:figuring out thestructureof the sentence:

Preprocessing, e.g.:

Stop-word removal:“the”, “is”, “at”

Stemming: “suffixed”“suffix”

→ All these tasks arerelatedto information extraction

→ But we will often try to do IEwithoutsolving these problems

(8)

Detour: Natural Language Processing (NLP), cont’d

Parsing:figuring out thestructureof the sentence:

Preprocessing, e.g.:

Stop-word removal:“the”, “is”, “at”

Stemming: “suffixed”“suffix”

→ All these tasks arerelatedto information extraction

→ But we will often try to do IEwithoutsolving these problems

(9)

Named Entity Recognition (NER)

Identifyingnamed entitiesin a document

The firstMPRIclass for2019takes place atSophie Germain.

Possibly classify names in a simpletype hierarchy: person, address, date, organization, etc.

Difficulties:

Nested entities:“Bank of America”,“Carnegie Hall”

Boundaries:“All England Lawn Tennis and Croquet Club”

(10)

Named Entity Recognition (NER)

Identifyingnamed entitiesin a document

The firstMPRIclass for2019takes place atSophie Germain.

Possibly classify names in a simpletype hierarchy: person, address, date, organization, etc.

Difficulties:

Nested entities:“Bank of America”,“Carnegie Hall”

Boundaries:“All England Lawn Tennis and Croquet Club”

(11)

Named Entity Recognition (NER)

Identifyingnamed entitiesin a document

The firstMPRIclass for2019takes place atSophie Germain.

Possibly classify names in a simpletype hierarchy: person, address, date, organization, etc.

Difficulties:

Nested entities:“Bank of America”,“Carnegie Hall”

Boundaries:“All England Lawn Tennis and Croquet Club”

(12)

Approaches for NER

If the set isfinite, use adictionary

• Forefficientimplementation, represent the dictionary as atrie and run theAho-Corasick algorithm

Forfixed formatidentifiers (e.g., ISBNs, DOIs, emails, GTINs), write aregexpwith captures and run anautomaton

Otherwise,statisticalapproaches

• May use variousfeatures: context, morphology, case, punctuation, part-of-speech tags, previously extracted named entities...

• Use apre-trainedmodel or train it on your data

Implemented, e.g., inSpacy,NLTK,OpenNLP, orStanford NER http://nlp.stanford.edu:8080/ner/process

(13)

Approaches for NER

If the set isfinite, use adictionary

• Forefficientimplementation, represent the dictionary as atrie and run theAho-Corasick algorithm

Forfixed formatidentifiers (e.g., ISBNs, DOIs, emails, GTINs), write aregexpwith captures and run anautomaton

Otherwise,statisticalapproaches

• May use variousfeatures: context, morphology, case, punctuation, part-of-speech tags, previously extracted named entities...

• Use apre-trainedmodel or train it on your data

Implemented, e.g., inSpacy,NLTK,OpenNLP, orStanford NER http://nlp.stanford.edu:8080/ner/process

(14)

Approaches for NER

If the set isfinite, use adictionary

• Forefficientimplementation, represent the dictionary as atrie and run theAho-Corasick algorithm

Forfixed formatidentifiers (e.g., ISBNs, DOIs, emails, GTINs), write aregexpwith captures and run anautomaton

Otherwise,statisticalapproaches

• May use variousfeatures: context, morphology, case, punctuation, part-of-speech tags, previously extracted named entities...

• Use apre-trainedmodel or train it on your data

Implemented, e.g., inSpacy,NLTK,OpenNLP, orStanford NER http://nlp.stanford.edu:8080/ner/process

(15)

Approaches for NER

If the set isfinite, use adictionary

• Forefficientimplementation, represent the dictionary as atrie and run theAho-Corasick algorithm

Forfixed formatidentifiers (e.g., ISBNs, DOIs, emails, GTINs), write aregexpwith captures and run anautomaton

Otherwise,statisticalapproaches

• May use variousfeatures: context, morphology, case, punctuation, part-of-speech tags, previously extracted named entities...

• Use apre-trainedmodel or train it on your data

Implemented, e.g., inSpacy,NLTK,OpenNLP, orStanford NER

(16)

Evaluating NER systems

We must oftenevaluatesystems to compare their performance

We do so against agold standardof correct results

Performance is measured along two independentdimensions:

Precision, the percentage of extracted matches that are correct

Recall, the percentage of correct matches that are extracted

Extracteverythinggives 100% recall (and very bad precision)

Extractnothinggives undefined precision and 0% recall

Combining these two scores:F1 measure, which is the harmonic mean of precision and recall

Orprecision-recall curveto show the tradeoff

To avoidoverfitting, evaluate the system on avalidation dataset different from the one on which the system was designed/trained

(17)

Evaluating NER systems

We must oftenevaluatesystems to compare their performance

We do so against agold standardof correct results

Performance is measured along two independentdimensions:

Precision, the percentage of extracted matches that are correct

Recall, the percentage of correct matches that are extracted

Extracteverythinggives 100% recall (and very bad precision)

Extractnothinggives undefined precision and 0% recall

Combining these two scores:F1 measure, which is the harmonic mean of precision and recall

Orprecision-recall curveto show the tradeoff

To avoidoverfitting, evaluate the system on avalidation dataset different from the one on which the system was designed/trained

(18)

Evaluating NER systems

We must oftenevaluatesystems to compare their performance

We do so against agold standardof correct results

Performance is measured along two independentdimensions:

Precision, the percentage of extracted matches that are correct

Recall, the percentage of correct matches that are extracted

Extracteverythinggives 100% recall (and very bad precision)

Extractnothinggives undefined precision and 0% recall

Combining these two scores:F1 measure, which is the harmonic mean of precision and recall

Orprecision-recall curveto show the tradeoff

To avoidoverfitting, evaluate the system on avalidation dataset different from the one on which the system was designed/trained

(19)

Evaluating NER systems

We must oftenevaluatesystems to compare their performance

We do so against agold standardof correct results

Performance is measured along two independentdimensions:

Precision, the percentage of extracted matches that are correct

Recall, the percentage of correct matches that are extracted

Extracteverythinggives 100% recall (and very bad precision)

Extractnothinggives undefined precision and 0% recall

Combining these two scores:F1 measure, which is the harmonic mean of precision and recall

Orprecision-recall curveto show the tradeoff

(20)

Entity Disambiguation

Disambiguatewhich entityis being used

The place and function of Venus in Ovid

Computed backscattering function of Venus and the moon

Usually meanschoosingone of severalentitieswith that name

Severalsignals:

Prior:

• Howwell-knownthe entity is

• How well the namefitsthe entity

She went to Paris.

• Similarity between thecontextof the word in the text and that of the entity in the knowledge base

Consistencywith other disambiguated entities https://gate.d5.mpi-inf.mpg.de/webaida/

(21)

Entity Disambiguation

Disambiguatewhich entityis being used

The place and function of Venus in Ovid

Computed backscattering function of Venus and the moon

Usually meanschoosingone of severalentitieswith that name

Severalsignals:

Prior:

• Howwell-knownthe entity is

• How well the namefitsthe entity

She went to Paris.

• Similarity between thecontextof the word in the text and that of the entity in the knowledge base

with other disambiguated entities

(22)

Instance extraction

Extracting ataxonomywithis-Arelations

“Pluto is a dog”

“a dog is an animal”

Hearst patterns:

“Manyscientists, including Einstein, believed...”

“France, Germany and other countrieshave been plagued with...”

“Otherforms of government such as constitutional monarchy...”

Set expansion:

• Start with a set of entities of the same type (e.g., countries),

• Find alistortable columncontaining several such entities

Addthe other entities (assume that they have the same type)

Problems

False positives: “the classification of such cities as urban”

Boundaries:“some scientists, such as computer scientists”

Disambiguation, andsemantic drift

Taxonomy induction:cleaning up the resulting taxonomy Example:NELLhttp://rtw.ml.cmu.edu/rtw/kbbrowser/pred:bird

(23)

Instance extraction

Extracting ataxonomywithis-Arelations

“Pluto is a dog”

“a dog is an animal”

Hearst patterns:

“Manyscientists, including Einstein, believed...”

“France, Germany and other countrieshave been plagued with...”

“Otherforms of government such as constitutional monarchy...”

Set expansion:

• Start with a set of entities of the same type (e.g., countries),

• Find alistortable columncontaining several such entities

Addthe other entities (assume that they have the same type)

Problems

False positives: “the classification of such cities as urban”

Boundaries:“some scientists, such as computer scientists”

Disambiguation, andsemantic drift

Taxonomy induction:cleaning up the resulting taxonomy Example:NELLhttp://rtw.ml.cmu.edu/rtw/kbbrowser/pred:bird

(24)

Instance extraction

Extracting ataxonomywithis-Arelations

“Pluto is a dog”

“a dog is an animal”

Hearst patterns:

“Manyscientists, including Einstein, believed...”

“France, Germany and other countrieshave been plagued with...”

“Otherforms of government such as constitutional monarchy...”

Set expansion:

• Start with a set of entities of the same type (e.g., countries),

• Find alistortable columncontaining several such entities

Addthe other entities (assume that they have the same type)

Problems

False positives: “the classification of such cities as urban”

Boundaries:“some scientists, such as computer scientists”

Disambiguation, andsemantic drift

Taxonomy induction:cleaning up the resulting taxonomy Example:NELLhttp://rtw.ml.cmu.edu/rtw/kbbrowser/pred:bird

(25)

Instance extraction

Extracting ataxonomywithis-Arelations

“Pluto is a dog”

“a dog is an animal”

Hearst patterns:

“Manyscientists, including Einstein, believed...”

“France, Germany and other countrieshave been plagued with...”

“Otherforms of government such as constitutional monarchy...”

Set expansion:

• Start with a set of entities of the same type (e.g., countries),

• Find alistortable columncontaining several such entities

Addthe other entities (assume that they have the same type)

Problems

False positives: “the classification of such cities as urban”

Taxonomy induction:cleaning up the resulting taxonomy Example:NELLhttp://rtw.ml.cmu.edu/rtw/kbbrowser/pred:bird

(26)

Instance extraction

Extracting ataxonomywithis-Arelations

“Pluto is a dog”

“a dog is an animal”

Hearst patterns:

“Manyscientists, including Einstein, believed...”

“France, Germany and other countrieshave been plagued with...”

“Otherforms of government such as constitutional monarchy...”

Set expansion:

• Start with a set of entities of the same type (e.g., countries),

• Find alistortable columncontaining several such entities

Addthe other entities (assume that they have the same type)

Problems

False positives: “the classification of such cities as urban”

Boundaries:“some scientists, such as computer scientists”

Disambiguation, andsemantic drift

Taxonomy induction:cleaning up the resulting taxonomy Example:NELLhttp://rtw.ml.cmu.edu/rtw/kbbrowser/pred:bird

9/12

(27)

Fact extraction

Generalization ofHearst patterns, e.g.: “X was born in Y”

DIPRE: Dual Iterative Pattern Relation Expansion:

• Apply thepatternsto generate morefacts

• Use the newfactsto learn morepatterns

Problems:semantic drift; sometimes multiple relations match...

Learning fromstructured Web content, e.g., lists and tables (cf Web Data Commons), Wikipedia infoboxes...

General technique:Wrapper induction(see next slide)

(28)

Fact extraction

Generalization ofHearst patterns, e.g.: “X was born in Y”

DIPRE: Dual Iterative Pattern Relation Expansion:

• Apply thepatternsto generate morefacts

• Use the newfactsto learn morepatterns

Problems:semantic drift; sometimes multiple relations match...

Learning fromstructured Web content, e.g., lists and tables (cf Web Data Commons), Wikipedia infoboxes...

General technique:Wrapper induction(see next slide)

(29)

Fact extraction

Generalization ofHearst patterns, e.g.: “X was born in Y”

DIPRE: Dual Iterative Pattern Relation Expansion:

• Apply thepatternsto generate morefacts

• Use the newfactsto learn morepatterns

Problems:semantic drift; sometimes multiple relations match...

Learning fromstructured Web content, e.g., lists and tables (cf Web Data Commons), Wikipedia infoboxes...

General technique:Wrapper induction(see next slide)

(30)

Fact extraction

Generalization ofHearst patterns, e.g.: “X was born in Y”

DIPRE: Dual Iterative Pattern Relation Expansion:

• Apply thepatternsto generate morefacts

• Use the newfactsto learn morepatterns

Problems:semantic drift; sometimes multiple relations match...

Learning fromstructured Web content, e.g., lists and tables (cf Web Data Commons), Wikipedia infoboxes...

General technique:Wrapper induction(see next slide)

(31)

Wrapper induction

(32)

Wrapper induction

(33)

Wrapper induction

(34)

Wrapper induction

(35)

Slide credits

Course structure inspired by the class by Fabian Suchanekhttps:

//suchanek.name/work/teaching/inf344-2018/index.html

Slide 4:https://www.nltk.org/_images/tree.gif

Références

Documents relatifs

SVG+JS Javascript manipulation of vector graphics WebGL Javascript API for accelerated 3D using a GPU. WebVR Describe scenes for virtual reality headsets with a-scene WebRTC Voice

JSP Integrating Java and a Web server (e.g., Apache Tomcat) node.js Chrome’s JavaScript engine (V8) plus a Web server Python Web frameworks: Django , CherryPy, Flask. Ruby

DOM (Document Object Model) gives an API for the tree representation of HTML and XML documents (independent from the programming language). Schema languages Specify the structure

• YAML: extends JSON with several features, e.g., explicit types, user-defined types, anchors and references. • Some JSON parsers are permissive and allow,

• While inside an element we can check that the word of its children satisfies the deterministic regular expression used to define it. • Can be implemented with a

descendant::C[@att1='1'] is a step which denotes all the Element nodes named C, descendant of the context node, having an Attribute node att1 with value

• Information extraction: creating structured information out of existing Web Data

• Entities and properties have a label and short description in each language, along with aliases (search engine). • Entities can also have sitelinks to Wikimedia projects (e.g.,