• Aucun résultat trouvé

Enumerating Pattern Matches in Texts and Trees

N/A
N/A
Protected

Academic year: 2022

Partager "Enumerating Pattern Matches in Texts and Trees"

Copied!
160
0
0

Texte intégral

(1)Enumerating Pattern Matches in Texts and Trees. Antoine Amarilli1 , Pierre Bourhis2 , Stefan Mengel3 , Matthias Niewerth4 October 3rd, 2019 1 Télécom. Paris. 2 CNRS. CRIStAL. 3 CNRS. CRIL. 4 Universität. Bayreuth 1/20.

(2) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 2/20.

(3) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T: → Example: find email addresses. 2/20.

(4) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 2/20.

(5) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. → How to find the pattern P efficiently in the text T?. 2/20.

(6) Solution: Automata. • Convert the regular expression P to an automaton A. 3/20.

(7) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 3/20.

(8) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. 3/20.

(9) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. 3/20.

(10) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(11) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(12) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(13) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(14) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(15) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(16) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(17) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(18) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(19) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(20) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(21) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(22) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(23) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(24) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(25) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(26) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(27) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(28) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(29) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(30) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(31) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(32) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(33) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(34) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(35) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(36) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(37) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(38) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(39) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(40) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(41) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/20.

(42) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. • The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P 3/20.

(43) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. •. [a-z0-9.] [a-z0-9.] ␣. 2. @. 3. ␣. 4. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. • The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P → This is very efficient in T and reasonably efficient in P. 3/20.

(44) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. 4/20.

(45) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P. 4/20.

(46) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation. 4/20.

(47) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation → One match: [5, 20i. 4/20.

(48) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation → One match: [5, 20i. 4/20.

(49) Formal Problem Statement. • Problem description:. 5/20.

(50) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 5/20.

(51) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 5/20.

(52) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • Output: the list of substrings of T that match P: [186, 200i, [483, 500i, . . .. 5/20.

(53) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • Output: the list of substrings of T that match P: [186, 200i, [483, 500i, . . .. • Goal: be very efficient in T and reasonably efficient in P 5/20.

(54) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. 6/20.

(55) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [i l. o. l. 6/20.

(56) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l i o. l. 6/20.

(57) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l. o i l. 6/20.

(58) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l. o. l i. 6/20.

(59) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [i o. l. 6/20.

(60) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [ o i l. 6/20.

(61) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [ o. l i. 6/20.

(62) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o [i l. 6/20.

(63) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o [ l i. 6/20.

(64) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l [i. 6/20.

(65) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|). 6/20.

(66) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). 6/20.

(67) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. 6/20.

(68) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. 6/20.

(69) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗. 6/20.

(70) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗ • The number of matches is Ω(|T|2 ). 6/20.

(71) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗ • The number of matches is Ω(|T|2 ). → We need a different way to measure complexity 6/20.

(72) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/20.

(73) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/20.

(74) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/20.

(75) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/20.

(76) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/20.

(77) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. → Formalization: enumeration algorithms 7/20.

(78) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/20.

(79) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/20.

(80) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T. Phase 1: Preprocessing Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/20.

(81) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 1: Preprocessing Index structure. Phase 2: Enumeration. 8/20.

(82) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. . [42, 57i, Results 8/20.

(83) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. . [42, 57i,[1337, 1351i Results 8/20.

(84) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. Two ways to measure performance:. • Total time for phase 1 • Delay between two results in phase 2. ... as a function of the text and pattern. . [42, 57i,[1337, 1351i Results 8/20.

(85) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 9/20.

(86) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 9/20.

(87) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm?. 9/20.

(88) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring. 9/20.

(89) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring → i.e. O(|T|2 × |A|), e.g., if only the beginning and end match. 9/20.

(90) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom Paris, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring → i.e. O(|T|2 × |A|), e.g., if only the beginning and end match. → Can we do better? 9/20.

(91) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds:. 10/20.

(92) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T • Delay constant (independent from T). 10/20.

(93) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. 10/20.

(94) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. • Our contribution is:. 10/20.

(95) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. • Our contribution is: Theorem We can enumerate all matches of a pattern P on a text T with: • Preprocessing in O(|T| × Poly(P)) • Delay polynomial in P and independent from T 10/20.

(96) Automaton Formalism. • We use automata that read letters and capture variables. 11/20.

(97) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P. •∗ α a∗ β •∗. 11/20.

(98) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. 11/20.

(99) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text. 11/20.

(100) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. 11/20.

(101) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. 11/20.

(102) Automaton Formalism. • We use automata:= that read letters and capture variables → Example: P • 1. •∗ α a∗ β •∗ a α. 2. • β. 3. • Semantics of the automaton A:. • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. • Challenge: Because of nondeterminism we can have many different runs of A producing the same tuple!. 11/20.

(103) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A. 12/20.

(104) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. 12/20.

(105) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. •. 1 α. a. 2 β 3. •. 12/20.

(106) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a. a. a. b. a. •. 1 α. a. 2 β 3. •. 12/20.

(107) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. 12/20.

(108) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match. 12/20.

(109) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , match hα : 0, β : 3i a •. 1. 1. a. 2. β 3. 1. 3. 1. 2 β. •. a. α. α. α 2. a. α. β 3. 1. 2. 1 α. 2. 3. 1 α. 2. α 2 β. β. β. β 3. a. b. 3. 3. → Each path in the product DAG corresponds to a match. 12/20.

(110) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match → Challenge: Enumerate paths but avoid duplicate matches and do not waste time to ensure constant delay. 12/20.

(111) Extension: From Text to Trees.

(112) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. 13/20.

(113) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. an h2 header and. an image in the same section?. 13/20.

(114) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. • Results:. an h2 header and. an image in the same section?. 13/20.

(115) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results:. 13/20.

(116) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results: hα : 4, β : 6i, hα : 4, β : 7i. 13/20.

(117) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton.... 14/20.

(118) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown:. 14/20.

(119) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T • Delay constant in T. 14/20.

(120) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T and exponential in P • Delay constant in T and exponential in P. • Again, this only measures the complexity in T! We show:. 14/20.

(121) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T and exponential in P • Delay constant in T and exponential in P. • Again, this only measures the complexity in T! We show: Theorem [Amarilli et al., 2019] • Preprocessing in O(|T| × Poly(P)) • Delay polynomial in P and independent from T 14/20.

(122) Proof Idea for Trees: Structure Similar structure to the previous proof, but with a circuit:. Phase 1: Preprocessing Tree. Data structure. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 15/20.

(123) Proof Idea for Trees: Structure Similar structure to the previous proof, but with a circuit:. • Preprocessing: Compute a circuit representation of the answers • Enumeration: Apply a generic algorithm on the circuit Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 15/20.

(124) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). 16/20.

(125) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6”. 16/20.

(126) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons. 16/20.

(127) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i. 16/20.

(128) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×. α:4. ∪. β :6. β :7 16/20.

(129) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. ∪. β :6. β :7 16/20.

(130) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. • Union gate. ∪. ∪. :. → union of sets of tuples. β :6. β :7 16/20.

(131) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. :. ∪. → union of sets of tuples. β :6. β :7. • Product gate. ×. :. → relational product 16/20.

(132) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. → union of sets of tuples.  hβ : 6i β :6. :. ∪. β :7. • Product gate. ×. :. → relational product 16/20.

(133) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪.  hβ : 6i β :6. . hβ : 7i. β :7. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 16/20.

(134) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×  α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 16/20.

(135) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 16/20.

(136) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:.  hα : 4, β : 6i hα : 4, β : 7i × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 16/20.

(137) Proof Idea for Trees: Results Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem For any tree automaton A with capture variables α1 , . . . , αk , given a tree T, we can build in O(|T| × |A|) a set circuit capturing exactly the set of tuples {hα1 : n1 , . . . , αk : nk i in the output of A on T 17/20.

(138) Proof Idea for Trees: Results Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem Given a set circuit satisfying some conditions, we can enumerate all tuples that it captures with linear preprocessing and constant delay  E.g., for hα : 4, β : 6i, hα : 4, β : 7i : enumerate hα : 4, β : 6i then hα : 4, β : 7i 17/20.

(139) Extension: Supporting Updates.

(140) Updates. ×. Phase 1: Preprocessing. α:4 β :6. ∪ β :7. Data structure Tree T. • The input data can be modified after the preprocessing 18/20.

(141) Updates. ×. Phase 1: Preprocessing. α:4 β :6. ∪ β :7. Data structure Tree T. • The input data can be modified after the preprocessing 18/20.

(142) Updates. ∪. Phase 1: Preprocessing. α:4 β :6. × β :7. Data structure Tree T. • The input data can be modified after the preprocessing 18/20.

(143) Updates. ∪. Phase 1: Preprocessing. α:4. ×. β :6. β :7. Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch 18/20.

(144) Updates. ∪. Phase 1: Preprocessing. α:4. ×. β :6. β :7. Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch → Can we do better? 18/20.

(145) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(T). O(1). [Kazana and Segoufin, 2013]. 19/20.

(146) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T). O(log2 T) O(log2 T). [Kazana and Segoufin, 2013]. 19/20.

(147) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text. 19/20.

(148) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T) O(1) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text [Niewerth and Segoufin, 2018]. text. 19/20.

(149) Results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T) O(T) O(T). O(log2 T) O(log T) O(1) O(1). O(log2 T) O(log T) O(log T) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text [Niewerth and Segoufin, 2018] [Amarilli et al., 2019]. text trees. 19/20.

(150) Summary and Future Work.

(151) Summary and Future Work Summary:. • Problem: given a text T and a pattern P,. enumerate efficiently all matches of P on T. 20/20.

(152) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. 20/20.

(153) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees. 20/20.

(154) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data. 20/20.

(155) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes. 20/20.

(156) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant order? 20/20.

(157) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes results in a relevant order? • Enumerating Testing how well our methods perform in practice • → Implementation: https://github.com/PoDMR/enum-spanner-rs 20/20.

(158) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes results in a relevant order? • Enumerating Testing how well our methods perform in practice • → Implementation: https://github.com/PoDMR/enum-spanner-rs Thanks for your attention!. 20/20.

(159) References i. Amarilli, A., Bourhis, P., Mengel, S., and Niewerth, M. (2019). Enumeration on Trees with Tractable Combined Complexity and Efficient Updates. In PODS. Bagan, G. (2006). MSO queries on tree decomposable structures are computable with linear delay. In CSL. Florenzano, F., Riveros, C., Ugarte, M., Vansummeren, S., and Vrgoc, D. (2018). Constant delay algorithms for regular document spanners. In PODS..

(160) References ii. Kazana, W. and Segoufin, L. (2013). Enumeration of monadic second-order queries on trees. TOCL, 14(4). Losemann, K. and Martens, W. (2014). MSO queries on trees: Enumerating answers under updates. In CSL-LICS. Niewerth, M. and Segoufin, L. (2018). Enumeration of MSO queries on strings with constant delay and logarithmic updates. In PODS. To appear..

(161)

Références

Documents relatifs

The proof of Theorem 1 reveals indeed a possibility to find a maximal sharp bipolar- valued characterisation of an outranking (resp. outranked) kernel under the condition that

As a polynomial delay algorithm for an enumeration algorithms yields a polynomial time algorithm for the corresponding decision problem, it follows that ECSP(A , −) can only have

The construction of attribute sets is inspired by Next Closure, the most common batch algorithm for computing pseudo-intents, in that it uses the lectic order to ensure the

The disjunction φ of general solved formulas extracted from the final working formula is either the formula false or true or a formula having at least one free variable,

We have extended this theory by giving a complete first-order axiomatization T of the evaluated trees [6] which are combination of finite or infinite trees with construction

Main results Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments with preprocessing linear in |C| + |T| and delay linear in each

Theorem [Amarilli et al., 2019a, Amarilli et al., 2019b] We can enumerate all matches of a nondeterministic tree automaton on a tree with. • Preprocessing linear in the tree

The algorithm has several special optimizations that take into account algorithmic features of this problem: it employs a specific order of filling Latin square elements based on