Enumerating Pattern Matches in Texts and Trees
Antoine Amarilli¹, Pierre Bourhis², Stefan Mengel³, Matthias Niewerth⁴
October 11th, 2018
¹ Télécom ParisTech
² CNRS CRIStAL
³ CNRS CRIL
⁴ Universität Bayreuth
Problem: Finding Patterns in Text
• We have a long text T:
  Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...
• We want to find a pattern P in the text T:
  → Example: find email addresses
• Write the pattern as a regular expression:
  P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
→ How to find the pattern P efficiently in the text T?
Solution: Automata
• Convert the regular expression P to an automaton A:
  P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
  [Figure: automaton A with start state 1 and states 2, 3, 4; edge labels shown: •, [a-z0-9.], @, [a-z0-9.], •]
• Then, evaluate the automaton on the text T:
  E m a i l ␣ a 3 n m @ a 3 n m . n e t ␣ A f f i l i a t i o n
• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P
  → This is very efficient in T and reasonably efficient in P
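The evaluation step can be sketched as a state-set simulation in Python. The transition structure below is an assumption reconstructed from the pattern P and the state numbers on the slide (in particular the ␣ transitions and the "any character" loops on states 1 and 4); the names `step` and `accepts` are hypothetical. Each character is processed once against at most |A| states, which is where the O(|A| × |T|) bound comes from.

```python
# Characters allowed by the class [a-z0-9.]
ALNUM_DOT = set("abcdefghijklmnopqrstuvwxyz0123456789.")


def step(state: int, ch: str) -> set[int]:
    """Transitions of the assumed 4-state automaton for P (hypothetical encoding)."""
    targets: set[int] = set()
    if state == 1:
        targets.add(1)          # loop on any character before the match
        if ch == " ":
            targets.add(2)      # leading space of the pattern
    elif state == 2:
        if ch in ALNUM_DOT:
            targets.add(2)      # part before the @
        if ch == "@":
            targets.add(3)
    elif state == 3:
        if ch in ALNUM_DOT:
            targets.add(3)      # part after the @
        if ch == " ":
            targets.add(4)      # trailing space of the pattern
    elif state == 4:
        targets.add(4)          # loop on any character after the match
    return targets


def accepts(text: str) -> bool:
    """Read the text once, keeping the set of states reachable so far."""
    current = {1}
    for ch in text:
        current = {t for s in current for t in step(s, ch)}
    return 4 in current


print(accepts("Email a3nm@a3nm.net Affiliation"))
```

Note that this only answers whether the pattern occurs somewhere in the text; it does not say where.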
Actual Problem: Extracting all Patterns
• This only tests if the pattern occurs in the text!
  → "YES"
• Goal: find all substrings in the text T which match the pattern P
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
   E  m  a  i  l  ␣  a  3  n  m  @  a  3  n  m  .  n  e  t  ␣  A  f  f  i  l  i  a  t  i  o  n
→ One match: [5, 20⟩
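The difference between testing and extracting can be illustrated with Python's standard `re` module. This is only a sketch: the names `P`, `T`, `occurs` and `spans` are chosen here for illustration, and `finditer` reports non-overlapping leftmost matches rather than all matching substrings, which happens to coincide with the full answer on this text because there is a single match.

```python
import re

# The pattern P, with the visible space ␣ written as an ordinary space.
P = re.compile(r" [a-z0-9.]*@[a-z0-9.]* ")
T = "Email a3nm@a3nm.net Affiliation"

# Boolean test: does P occur anywhere in T?
occurs = P.search(T) is not None

# Extraction: half-open spans (start, end) of the matches.
spans = [m.span() for m in P.finditer(T)]

print(occurs, spans)
```

The span (5, 20) corresponds to the half-open interval [5, 20⟩ on the slide.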
Formal Problem Statement
• Problem description:
  • Input:
    • A text T
      Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...
    • A pattern P given as a regular expression
      P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
  • Output: the list of substrings of T that match P:
    [186, 200⟩, [483, 500⟩, ...
• Goal: be very efficient in T and reasonably efficient in P
Measuring the Complexity
• Naive algorithm: Run the automaton A on each substring of T
  → Complexity is O(|T|² × |A| × |T|)
  → Can be optimized to O(|T|² × |A|)
• Problem: We may need to output Ω(|T|²) matching substrings:
  • Consider the text T:
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  • Consider the pattern P := a∗
  • The number of matches is Ω(|T|²)
→ We need a different way to measure complexity
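A minimal sketch of the naive algorithm, assuming Python's `re` module stands in for the automaton A (so its running time is that of a backtracking matcher, not the bounds stated above). The function name `naive_matches` is hypothetical, and only non-empty substrings are reported, which is a choice made here rather than something the slides specify. The point is the size of the output on the a∗ example.

```python
import re


def naive_matches(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return every half-open span [i, j) with i < j whose substring matches."""
    compiled = re.compile(pattern)
    n = len(text)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n + 1)
        # fullmatch with pos/endpos tests exactly the substring text[i:j]
        if compiled.fullmatch(text, i, j)
    ]


# On the email example there is a single matching substring.
print(naive_matches(r" [a-z0-9.]*@[a-z0-9.]* ", "Email a3nm@a3nm.net Affiliation"))

# On a text of n letters 'a' with pattern a*, every non-empty substring
# matches, so there are n * (n + 1) / 2 results: quadratic in |T|.
n = 10
print(len(naive_matches(r"a*", "a" * n)), n * (n + 1) // 2)
```

Since the output alone can have quadratic size, no algorithm that materializes the full list can run in time linear in |T|.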
Enumeration Algorithms
Idea: In real life, we do not want to compute all the matches; we just need to be able to enumerate matches quickly
→ Formalization: enumeration algorithms
Formalizing Enumeration Algorithms
• Input:
  • Text T
    Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor ...
  • Pattern P
    ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
• Phase 1: Preprocessing
  → Index structure
• Phase 2: Enumeration
  → Results: [42, 57⟩, [1337, 1351⟩
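The two-phase interface can be sketched in Python as a class whose constructor does the preprocessing and whose iterator does the enumeration. This is only an illustration of the shape of an enumeration algorithm: the class name `MatchEnumerator` and its index (the list of start positions at which some match begins) are assumptions made here, and this simple index does not bound the delay between two results independently of T.

```python
import re
from itertools import islice
from typing import Iterator


class MatchEnumerator:
    """Hypothetical two-phase matcher: preprocess once, then enumerate lazily."""

    def __init__(self, pattern: str, text: str) -> None:
        # Phase 1 (preprocessing): build the index structure.
        # Here the index is the list of positions where a match can start.
        self._pattern = re.compile(pattern)
        self._text = text
        self._starts = [
            i for i in range(len(text)) if self._pattern.match(text, i)
        ]

    def __iter__(self) -> Iterator[tuple[int, int]]:
        # Phase 2 (enumeration): produce non-empty matching spans one by one.
        n = len(self._text)
        for i in self._starts:
            for j in range(i + 1, n + 1):
                if self._pattern.fullmatch(self._text, i, j):
                    yield (i, j)


# The caller can stop after a few results without computing all of them.
first_three = list(islice(MatchEnumerator(r"a*", "a" * 1000), 3))
print(first_three)
```

Because results are produced by a generator, a consumer that only needs the first few matches never pays for the quadratically many remaining ones.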