• Aucun résultat trouvé

Enumerating Pattern Matches in Texts and Trees

N/A
N/A
Protected

Academic year: 2022

Partager "Enumerating Pattern Matches in Texts and Trees"

Copied!
121
0
0

Texte intégral

(1)

Enumerating Pattern Matches in Texts and Trees

Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3, Matthias Niewerth4 October 11th, 2018

1Télécom ParisTech

2CNRS CRIStAL

3CNRS CRIL

4Universität Bayreuth

1/14

(2)

Problem: Finding Patterns in Text

We have aAntoine Amarilli Description Name Antoine Amarilli.long textT: Handle: a3nm. Identity Born 1990-02-07.

French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.

More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

We want to find apatternPin the textT:

Example: findemail addresses

• Write the pattern as aregular expression:

P:= ␣ [a-z0-9.] @ [a-z0-9.]

→ How to find the patternPefficiently in the textT?

2/14

(3)

Problem: Finding Patterns in Text

We have aAntoine Amarilli Description Name Antoine Amarilli.long textT: Handle: a3nm. Identity Born 1990-02-07.

French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.netAffiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.

More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

We want to find apatternPin the textT:

Example: findemail addresses

• Write the pattern as aregular expression:

P:= ␣ [a-z0-9.] @ [a-z0-9.]

→ How to find the patternPefficiently in the textT?

2/14

(4)

Problem: Finding Patterns in Text

We have aAntoine Amarilli Description Name Antoine Amarilli.long textT: Handle: a3nm. Identity Born 1990-02-07.

French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.

More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

We want to find apatternPin the textT:

Example: findemail addresses

• Write the pattern as aregular expression:

P:= ␣ [a-z0-9.] @ [a-z0-9.]

→ How to find the patternPefficiently in the textT?

2/14

(5)

Problem: Finding Patterns in Text

We have aAntoine Amarilli Description Name Antoine Amarilli.long textT: Handle: a3nm. Identity Born 1990-02-07.

French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.

More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

We want to find apatternPin the textT:

Example: findemail addresses

• Write the pattern as aregular expression:

P:= ␣ [a-z0-9.] @ [a-z0-9.]

→ How to find the patternPefficiently in the textT?

2/14

(6)

Solution: Automata

Convert theregular expressionPto anautomatonA

P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(7)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(8)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(9)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(10)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 11

start 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(11)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(12)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(13)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(14)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(15)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(16)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(17)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(18)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(19)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(20)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(21)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(22)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(23)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(24)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(25)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(26)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(27)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(28)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(29)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(30)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(31)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(32)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(33)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(34)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(35)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(36)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(37)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(38)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(39)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(40)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(41)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(42)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(43)

Solution: Automata

Convert theregular expressionPto anautomatonA P:= ␣ [a-z0-9.] @ [a-z0-9.]

start 1 2 3 4

• [a-z0-9.] [a-z0-9.] •

@

Then, evaluate the automaton on thetextT

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

ThecomplexityisO(|A| × |T|), i.e.,linearinTandpolynomialinP

This isvery efficientinTandreasonably efficientinP

3/14

(44)

Actual Problem: Extracting all Patterns

This only testsifthe patternoccurs inthe text!

“YES”

Goal: find allsubstringsin the textT which match the patternP

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l A f f i l i a t i o n

→One match:[5,20i

4/14

(45)

Actual Problem: Extracting all Patterns

This only testsifthe patternoccurs inthe text!

“YES”

Goal: find allsubstringsin the textTwhich match the patternP

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l A f f i l i a t i o n

→One match:[5,20i

4/14

(46)

Actual Problem: Extracting all Patterns

This only testsifthe patternoccurs inthe text!

“YES”

Goal: find allsubstringsin the textTwhich match the patternP

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

→One match:[5,20i

4/14

(47)

Actual Problem: Extracting all Patterns

This only testsifthe patternoccurs inthe text!

“YES”

Goal: find allsubstringsin the textTwhich match the patternP

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

→One match:[5,20i

4/14

(48)

Actual Problem: Extracting all Patterns

This only testsifthe patternoccurs inthe text!

“YES”

Goal: find allsubstringsin the textTwhich match the patternP

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n

→One match:[5,20i

4/14

(49)

Formal Problem Statement

Problem description:

Input:

• AtextT

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• ApatternPgiven as a regular expression

P:= ␣ [a-z0-9.] @ [a-z0-9.]

Output:the list ofsubstringsofTthat matchP: [186,200i, [483,500i, . . .

Goal:bevery efficientinT andreasonably efficientinP

5/14

(50)

Formal Problem Statement

Problem description:

Input:

• AtextT

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• ApatternPgiven as a regular expression

P:= ␣ [a-z0-9.] @ [a-z0-9.]

Output:the list ofsubstringsofTthat matchP: [186,200i, [483,500i, . . .

Goal:bevery efficientinT andreasonably efficientinP

5/14

(51)

Formal Problem Statement

Problem description:

Input:

• AtextT

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• ApatternPgiven as a regular expression

P:= ␣ [a-z0-9.] @ [a-z0-9.]

Output:the list ofsubstringsofTthat matchP: [186,200i, [483,500i, . . .

Goal:bevery efficientinT andreasonably efficientinP

5/14

(52)

Formal Problem Statement

Problem description:

Input:

• AtextT

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• ApatternPgiven as a regular expression

P:= ␣ [a-z0-9.] @ [a-z0-9.]

Output:the list ofsubstringsofTthat matchP: [186,200i, [483,500i, . . .

Goal:bevery efficientinT andreasonably efficientinP

5/14

(53)

Formal Problem Statement

Problem description:

Input:

• AtextT

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• ApatternPgiven as a regular expression

P:= ␣ [a-z0-9.] @ [a-z0-9.]

Output:the list ofsubstringsofTthat matchP: [186,200i, [483,500i, . . .

Goal:bevery efficientinTandreasonably efficientinP

5/14

(54)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(55)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT [i l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(56)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT [

i

l

[

i o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(57)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT [

i

l

[i

o

[

i l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(58)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT [

i

l

[i

o

[i

l

[

i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(59)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l [i o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(60)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l [

i

o

[

i l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(61)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l [

i

o

[i

l

[

i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(62)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o [i l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(63)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o [

i

l

[

i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(64)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l [i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(65)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(66)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(67)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(68)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(69)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(70)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(71)

Measuring the Complexity

Naive algorithm:Run the automatonAoneach substringofT

[i

l

[i

o

[i

l

[i

ComplexityisO(|T|2× |A| × |T|)

Can beoptimizedtoO(|T|2× |A|)

Problem:We may need to outputΩ(|T|2)matching substrings:

• Consider thetextT:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider thepatternP:= a

• Thenumber of matchesisΩ(|T|2)

→ We need adifferent wayto measure complexity

6/14

(72)

Enumeration Algorithms

Idea:In real life, we do not want to computeall the matches we just need to be able toenumeratematches quickly

→ Formalization:enumeration algorithms

7/14

(73)

Enumeration Algorithms

Idea:In real life, we do not want to computeall the matches we just need to be able toenumeratematches quickly

→ Formalization:enumeration algorithms

7/14

(74)

Enumeration Algorithms

Idea:In real life, we do not want to computeall the matches we just need to be able toenumeratematches quickly

→ Formalization:enumeration algorithms

7/14

(75)

Enumeration Algorithms

Idea:In real life, we do not want to computeall the matches we just need to be able toenumeratematches quickly

→ Formalization:enumeration algorithms

7/14

(76)

Enumeration Algorithms

Idea:In real life, we do not want to computeall the matches we just need to be able toenumeratematches quickly

→ Formalization:enumeration algorithms

7/14

(77)

Enumeration Algorithms

Idea:In real life, we do not want to computeall the matches we just need to be able toenumeratematches quickly

→ Formalization:enumeration algorithms

7/14

(78)

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name Antoine Amarilli.Handle:a3nm.Identity Born 1990-02-07.French national.Appearance as of 2017.Auth OpenPGP. OpenId. Bitcoin.

Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor ...

TextT

␣ [a-z0-9.]@ [a-z0-9.]

PatternP

Phase 1: Preprocessing

Index structure

Phase 2: Enumeration

[42,57i,[1337,1351i

Results

8/14

Références

Documents relatifs

• The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch 18/22... Phase 1: Preprocessing Data structure

Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science office C201-4 in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex

Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex

Combine this with the inverted index you built yesterday for the Simple English dataset, so that queries over this dataset use a combination of tf-idf and PageRank. Combine this

There is no fixed list of assignment for this lab session, but focus on connecting the systems produced in the first four labs: How to use PageRank to improve the results of

There is no fixed list of assignment for this lab session, but focus on connecting the systems produced in the first three labs: How to use PageRank to improve the results of

We use the following scheme for class hierarchies: An abstract subclass of Robot redefines the start method in order to indicate how the function processURL should be called on

Figure 1.5 Some consumer applications of computer vision: (a) image stitching: merging different views (Szeliski and Shum 1997) c 1997 ACM; (b) exposure bracketing: merging