• Aucun résultat trouvé

Enumerating Pattern Matches in Texts and Trees

N/A
N/A
Protected

Academic year: 2022

Partager "Enumerating Pattern Matches in Texts and Trees"

Copied!
225
0
0

Texte intégral

(1)Enumerating Pattern Matches in Texts and Trees. Antoine Amarilli1 , Pierre Bourhis2 , Stefan Mengel3 , Matthias Niewerth4 October 24th, 2019 1 Télécom. Paris. 2 CNRS. CRIStAL. 3 CNRS. CRIL. 4 Universität. Bayreuth 1/28.

(2) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 2/28.

(3) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T: → Example: find email addresses. 2/28.

(4) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 2/28.

(5) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. → How to find the pattern P efficiently in the text T?. 2/28.

(6) Solution: Automata. • Convert the regular expression P to an automaton A. 3/28.

(7) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 3/28.

(8) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. 3/28.

(9) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. 3/28.

(10) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(11) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(12) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(13) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(14) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(15) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(16) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(17) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(18) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(19) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(20) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(21) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(22) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(23) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(24) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(25) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(26) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(27) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(28) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(29) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(30) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(31) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(32) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(33) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(34) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(35) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(36) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(37) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(38) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(39) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(40) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(41) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/28.

(42) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. • The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P 3/28.

(43) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. • The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P → This is very efficient in T and reasonably efficient in P. 3/28.

(44) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. 4/28.

(45) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P. 4/28.

(46) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation. 4/28.

(47) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation → One match: [5, 20i. 4/28.

(48) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation → One match: [5, 20i. 4/28.

(49) Formal Problem Statement. • Problem description:. 5/28.

(50) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 5/28.

(51) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 5/28.

(52) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • Output: the list of substrings of T that match P: [186, 200i, [483, 500i, . . .. 5/28.

(53) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • Output: the list of substrings of T that match P: [186, 200i, [483, 500i, . . .. • Goal: be very efficient in T and reasonably efficient in P 5/28.

(54) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. 6/28.

(55) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [i l. o. l. 6/28.

(56) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l i o. l. 6/28.

(57) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l. o i l. 6/28.

(58) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l. o. l i. 6/28.

(59) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [i o. l. 6/28.

(60) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [ o i l. 6/28.

(61) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [ o. l i. 6/28.

(62) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o [i l. 6/28.

(63) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o [ l i. 6/28.

(64) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l [i. 6/28.

(65) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|). 6/28.

(66) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). 6/28.

(67) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. 6/28.

(68) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. 6/28.

(69) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗. 6/28.

(70) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗ • The number of matches is Ω(|T|2 ). 6/28.

(71) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗ • The number of matches is Ω(|T|2 ). → We need a different way to measure complexity 6/28.

(72) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/28.

(73) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/28.

(74) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/28.

(75) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/28.

(76) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/28.

(77) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. → Formalization: enumeration algorithms 7/28.

(78) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/28.

(79) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/28.

(80) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T. Phase 1: Preprocessing Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/28.

(81) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 1: Preprocessing Index structure. Phase 2: Enumeration. 8/28.

(82) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. . [42, 57i, Results 8/28.

(83) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. . [42, 57i,[1337, 1351i Results 8/28.

(84) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. Two ways to measure performance:. • Total time for phase 1 • Delay between two results in phase 2. ... as a function of the text and pattern. . [42, 57i,[1337, 1351i Results 8/28.

(85) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 9/28.

(86) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 9/28.

(87) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm?. 9/28.

(88) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring. 9/28.

(89) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring → i.e. O(|T|2 × |A|), e.g., if only the beginning and end match. 9/28.

(90) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring → i.e. O(|T|2 × |A|), e.g., if only the beginning and end match. → Can we do better? 9/28.

(91) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds:. 10/28.

(92) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T • Delay constant (independent from T). 10/28.

(93) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. 10/28.

(94) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. • Our contribution is:. 10/28.

(95) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. • Our contribution is: Theorem We can enumerate all matches of a pattern P on a text T with: • Preprocessing in O(|T| × Poly(P)) • Delay polynomial in P and independent from T 10/28.

(96) Automaton Formalism. • We use automata that read letters and capture variables. 11/28.

(97) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P. •∗ α a∗ β •∗. 11/28.

(98) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. 11/28.

(99) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text. 11/28.

(100) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. 11/28.

(101) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. 11/28.

(102) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. • Challenge: Because of nondeterminism we can have many different runs of A producing the same tuple!. 11/28.

(103) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A. 12/28.

(104) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. 12/28.

(105) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. •. 1 α. a. 2 β 3. •. 12/28.

(106) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a. a. a. b. a. •. 1 α. a. 2 β 3. •. 12/28.

(107) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. 12/28.

(108) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match. 12/28.

(109) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , match hα : 0, β : 3i a •. 1. 1. a. 2. β 3. 1. 3. 1. 2 β. •. a. α. α. α 2. a. α. β 3. 1. 2. 1 α. 2. 3. 1 α. 2. α 2 β. β. β. β 3. a. b. 3. 3. → Each path in the product DAG corresponds to a match. 12/28.

(110) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match → Challenge: Enumerate paths but avoid duplicate matches and do not waste time to ensure constant delay. 12/28.

(111) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue. α 2. 2 β. 3. 3. 13/28.

(112) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue. α 2. 2 β. 3. 3. 13/28.

(113) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. α 2. 2 β. 3. 3. 13/28.

(114) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i. β 3. 3. 13/28.

(115) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 13/28.

(116) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 13/28.

(117) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 13/28.

(118) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. → We must have preprocessed the DAG to make sure that we can always finish the run. 13/28.

(119) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress. 14/28.

(120) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/28.

(121) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/28.

(122) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/28.

(123) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/28.

(124) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/28.

(125) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. 14/28.

(126) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A: 14/28.

(127) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. Preprocessing in linear time in T and polynomial in A: • Challenge: → Compute for each state the next position where we can reach some state that can assign a variable. 14/28.

(128) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. Preprocessing in linear time in T and polynomial in A: • Challenge: → Compute for each state the next position where we can reach some state that can assign a variable → Compute at each position i the transitive closure to all positions j such that j is the next position of some state at i (there are ≤ |A|). 14/28.

(129) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. 15/28.

(130) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. • Idea: Explore a decision tree on the variables (built on the fly). 15/28.

(131) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. 0. α?. 1 β?. β? 0 γ? 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly). 15/28.

(132) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. 0. α?. 1 β?. β? 0 γ? 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly) At each decision tree node, find the reachable states which • have all required variables (1) and no forbidden variables (0) 15/28.

(133) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. 0. α?. 1 β?. β? 0 γ? 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly) At each decision tree node, find the reachable states which • have all required variables (1) and no forbidden variables (0) → Assumption: we don’t see the same variable twice on a path 15/28.

(134) Extension: From Text to Trees.

(135) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. 16/28.

(136) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. an h2 header and. an image in the same section?. 16/28.

(137) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. • Results:. an h2 header and. an image in the same section?. 16/28.

(138) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results:. 16/28.

(139) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results: hα : 4, β : 6i, hα : 4, β : 7i. 16/28.

(140) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton.... 17/28.

(141) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown:. 17/28.

(142) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T • Delay constant in T. 17/28.

(143) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T and exponential in P • Delay constant in T and exponential in P. • Again, this only measures the complexity in T! We show:. 17/28.

(144) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T and exponential in P • Delay constant in T and exponential in P. • Again, this only measures the complexity in T! We show: Theorem [Amarilli et al., 2019] • Preprocessing in O(|T| × Poly(P)) • Delay polynomial in P and independent from T 17/28.

(145) Proof Idea for Trees: Structure Similar structure to the previous proof, but with a circuit:. Phase 1: Preprocessing Tree. Data structure. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 18/28.

(146) Proof Idea for Trees: Structure Similar structure to the previous proof, but with a circuit:. • Preprocessing: Compute a circuit representation of the answers • Enumeration: Apply a generic algorithm on the circuit Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 18/28.

(147) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). 19/28.

(148) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6”. 19/28.

(149) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons. 19/28.

(150) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i. 19/28.

(151) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×. α:4. ∪. β :6. β :7 19/28.

(152) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. ∪. β :6. β :7 19/28.

(153) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. • Union gate. ∪. ∪. :. → union of sets of tuples. β :6. β :7 19/28.

(154) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. :. ∪. → union of sets of tuples. β :6. β :7. • Product gate. ×. :. → relational product 19/28.

(155) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. → union of sets of tuples.  hβ : 6i β :6. :. ∪. β :7. • Product gate. ×. :. → relational product 19/28.

(156) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪.  hβ : 6i β :6. . hβ : 7i. β :7. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/28.

(157) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×  α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/28.

(158) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/28.

(159) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:.  hα : 4, β : 6i hα : 4, β : 7i × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/28.

(160) Proof Idea for Trees: Results Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern Theorem For any tree automaton A with capture variables α1 , . . . , αk , given a tree T, we can build in O(|T| × |A|) a set circuit capturing exactly the set of tuples {hα1 : n1 , . . . , αk : nk i} in the output of A on T 20/28.

(161) Proof Idea for Trees: Results Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem Given a set circuit satisfying some conditions, we can enumerate all tuples that it captures with linear preprocessing and constant delay  E.g., for hα : 4, β : 6i, hα : 4, β : 7i : enumerate hα : 4, β : 6i then hα : 4, β : 7i 20/28.

(162) Extension: Supporting Updates.

(163) Updates. ×. Phase 1: Preprocessing. α:4 β :6. ∪ β :7. Data structure Tree T. • The input data can be modified after the preprocessing 21/28.

(164) Updates. ×. Phase 1: Preprocessing. α:4 β :6. ∪ β :7. Data structure Tree T. • The input data can be modified after the preprocessing 21/28.

(165) Updates. ∪. Phase 1: Preprocessing. α:4 β :6. × β :7. Data structure Tree T. • The input data can be modified after the preprocessing 21/28.

(166) Updates. ∪. Phase 1: Preprocessing. α:4. ×. β :6. β :7. Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch 21/28.

(167) Updates. ∪. Phase 1: Preprocessing. α:4. ×. β :6. β :7. Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch → Can we do better? 21/28.

(168) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(T). O(1). [Kazana and Segoufin, 2013]. 22/28.

(169) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T). O(log2 T) O(log2 T). [Kazana and Segoufin, 2013]. 22/28.

(170) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text. 22/28.

(171) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T) O(1) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text [Niewerth and Segoufin, 2018]. text. 22/28.

(172) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T) O(T) O(T). O(log2 T) O(log T) O(1) O(1). O(log2 T) O(log T) O(log T) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text [Niewerth and Segoufin, 2018] [Amarilli et al., 2019]. text trees. 22/28.

(173) Extension: Connection to Circuits.

(174) Connections with Boolean circuits. • Mapping DAGs and set circuits can be seen as variants of Boolean circuits. 23/28.

(175) Connections with Boolean circuits. • Mapping DAGs and set circuits can be seen as variants of Boolean circuits. • The answers to enumerate are their satisfying assignments. 23/28.

(176) Connections with Boolean circuits. • Mapping DAGs and set circuits can be seen as variants of Boolean circuits. • The answers to enumerate are their satisfying assignments • These circuits fall in restricted circuit classes that allow for efficient enumeration. 23/28.

(177) Connections with Boolean circuits. • Mapping DAGs and set circuits can be seen as variants of Boolean circuits. • The answers to enumerate are their satisfying assignments • These circuits fall in restricted circuit classes that allow for efficient enumeration. → Task: Given a Boolean circuit, how to efficiently enumerate its satisfying valuations?. 23/28.

(178) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬. ∨. ¬. ∧. x. y. 24/28.

(179) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬ • Valuation: function from variables to {0, 1}. ∨. ¬. ∧. x. y. Example: ν = {x 7→ 0, y 7→ 1}.... 24/28.

(180) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬ • Valuation: function from variables to {0, 1}. ∨. ¬. ∧. x. y. Example: ν = {x 7→ 0, y 7→ 1}.... 24/28.

(181) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬ • Valuation: function from variables to {0, 1}. ∨. ¬. ∧. x. y. Example: ν = {x 7→ 0, y 7→ 1}.... 24/28.

(182) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬ • Valuation: function from variables to {0, 1}. ∨. ¬. ∧. x. y. Example: ν = {x 7→ 0, y 7→ 1}... mapped to 1. 24/28.

(183) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬ • Valuation: function from variables to {0, 1}. ∨. ¬. ∧. x. y. Example: ν = {x 7→ 0, y 7→ 1}... mapped to 1. • Assignment: set of variables mapped to 1 Example: Sν = {y}; more concise than ν. 24/28.

(184) Boolean circuits. • Directed acyclic graph of gates • Output gate: • Variable gates: x • Internal gates: ∨ ∧ ¬ • Valuation: function from variables to {0, 1}. ∨. ¬. ∧. x. y. Example: ν = {x 7→ 0, y 7→ 1}... mapped to 1. • Assignment: set of variables mapped to 1 Example: Sν = {y}; more concise than ν. Our task: Enumerate all satisfying assignments of an input circuit 24/28.

(185) Circuit restrictions. d-DNNF:. •. ∨. are all deterministic:. The inputs are mutually exclusive (= no valuation ν makes two inputs simultaneously evaluate to 1). ∨ ∧ x. ∧ ¬. ∧ y. ¬ z. x. ∧ y. z 25/28.

(186) Circuit restrictions. d-DNNF:. •. ∨. are all deterministic:. The inputs are mutually exclusive (= no valuation ν makes two inputs simultaneously evaluate to 1). •. ∧. are all decomposable:. The inputs are independent (= no variable x has a path to two different inputs). ∨ ∧ x. ∧ ¬. ∧ y. ¬ z. x. ∧ y. z 25/28.

(187) Circuit restrictions. v-tree: ∧-gates follow a tree on the variables. d-DNNF:. •. ∨. are all deterministic:. •. The inputs are mutually exclusive (= no valuation ν makes two inputs simultaneously evaluate to 1). •. ∨ •. x ∧. ∧ y. ∧. are all decomposable:. The inputs are independent (= no variable x has a path to two different inputs). x. ¬. ∧ y. ¬ z. x. z. ∧ y. z 25/28.

(188) Main results Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments with preprocessing linear in |C| + |T| and delay linear in each assignment. 26/28.

(189) Main results Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments with preprocessing linear in |C| + |T| and delay linear in each assignment Also: restrict to assignments of constant size k ∈ N (at most k variables are set to 1): Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments of size ≤ k with preprocessing linear in |C| + |T| and constant delay. 26/28.

(190) Main results Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments with preprocessing linear in |C| + |T| and delay linear in each assignment Also: restrict to assignments of constant size k ∈ N (at most k variables are set to 1): Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments of size ≤ k with preprocessing linear in |C| + |T| and constant delay Subtleties: Must complete to a set circuit; memory usage problems 26/28.

(191) Summary and Future Work.

(192) Summary and Future Work. Summary:. • Problem: given a text T and a pattern P,. enumerate efficiently all matches of P on T. 27/28.

(193) Summary and Future Work. Summary:. • Problem: given a text T and a pattern P,. enumerate efficiently all matches of P on T. • Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. 27/28.

(194) Summary and Future Work. Summary:. • Problem: given a text T and a pattern P,. enumerate efficiently all matches of P on T. • Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions:. • Enumeration on trees rather than words • Handling updates to the underlying data • Enumerating satisfying valuations of a circuit 27/28.

(195) Ongoing research and future work With P. Bourhis, R. Dupré, M. Niewerth, S. Mengel: Efficient implementation of the approach https://github.com/PoDMR/enum-spanner-rs. 28/28.

(196) Ongoing research and future work With P. Bourhis, R. Dupré, M. Niewerth, S. Mengel: Efficient implementation of the approach https://github.com/PoDMR/enum-spanner-rs With L. Jachiet, M. Muñoz, C. Riveros: Can we enumerate for context-free languages?. 28/28.

(197) Ongoing research and future work With P. Bourhis, R. Dupré, M. Niewerth, S. Mengel: Efficient implementation of the approach https://github.com/PoDMR/enum-spanner-rs With L. Jachiet, M. Muñoz, C. Riveros: Can we enumerate for context-free languages? With P. Bourhis, C. Riveros, S. Mengel: Links between automata and circuit classes?. 28/28.

(198) Ongoing research and future work With P. Bourhis, R. Dupré, M. Niewerth, S. Mengel: Efficient implementation of the approach https://github.com/PoDMR/enum-spanner-rs With L. Jachiet, M. Muñoz, C. Riveros: Can we enumerate for context-free languages? With P. Bourhis, C. Riveros, S. Mengel: Links between automata and circuit classes? With P. Bourhis, L. Jachiet, S. Mengel: Can we enumerate with constant memory usage?. 28/28.

Références

Documents relatifs

To this, each unassigned element of V is assigned to an element of U by computing an augmenting path (Algorithm 2) and then by swapping unassigned and assigned edges along the

With a regular expression that “scores” each solution, we can enumerate in score order with linear preprocessing and logarithmic delay..

• Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant

Theorem [Amarilli et al., 2019a, Amarilli et al., 2019b] We can enumerate all matches of a nondeterministic tree automaton on a tree with. • Preprocessing linear in the tree

We introduce estimation and test procedures through divergence minimization for models satisfying linear constraints with unknown parameter.. Several statistical examples

(a) The Fourier transform of voltage fluctuations measured at the driving frequency co (solid curve) and at its erst harmonic (dashed curve) in a wire.. The aperiodic

A proof of Theorem 3 concerning cyclic variation-diminishing

Keywords: large deviations, hashing with linear probing, parking problem, Brownian motion, Airy distribution, Łukasiewicz random walk, empirical processes, conditioned sums of