• Aucun résultat trouvé

Enumerating Pattern Matches in Texts and Trees

N/A
N/A
Protected

Academic year: 2022

Partager "Enumerating Pattern Matches in Texts and Trees"

Copied!
230
0
0

Texte intégral

(1)Enumerating Pattern Matches in Texts and Trees. Antoine Amarilli1 , Pierre Bourhis2 , Stefan Mengel3 , Matthias Niewerth4 November 12th, 2018 1 Télécom. ParisTech. 2 CNRS. CRIStAL. 3 CNRS. CRIL. 4 Universität. Bayreuth 1/25.

(2) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 2/25.

(3) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T: → Example: find email addresses. 2/25.

(4) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 2/25.

(5) Problem: Finding Patterns in Text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. → How to find the pattern P efficiently in the text T?. 2/25.

(6) Solution: Automata. • Convert the regular expression P to an automaton A. 3/25.

(7) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 3/25.

(8) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. 3/25.

(9) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. 3/25.

(10) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(11) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(12) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(13) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(14) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(15) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(16) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(17) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(18) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(19) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(20) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(21) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(22) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(23) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(24) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(25) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(26) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(27) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(28) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(29) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(30) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(31) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(32) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(33) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(34) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(35) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(36) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(37) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(38) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(39) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(40) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(41) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. 3/25.

(42) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. • The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P 3/25.

(43) Solution: Automata. • Convert the regular expression P to an automaton A P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ • start. 1. [a-z0-9.] [a-z0-9.]. •. @. 4. 2. 3. • Then, evaluate the automaton on the text T. Email␣a3nm@a3nm.net␣Affiliation. • The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P → This is very efficient in T and reasonably efficient in P. 3/25.

(44) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. 4/25.

(45) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P. 4/25.

(46) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation. 4/25.

(47) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation → One match: [5, 20i. 4/25.

(48) Actual Problem: Extracting all Patterns. • This only tests if the pattern occurs in the text! → “YES”. • Goal: find all substrings in the text T which match the pattern P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30. Email␣a3nm@a3nm.net␣Affiliation → One match: [5, 20i. 4/25.

(49) Formal Problem Statement. • Problem description:. 5/25.

(50) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 5/25.

(51) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 5/25.

(52) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • Output: the list of substrings of T that match P: [186, 200i, [483, 500i, . . .. 5/25.

(53) Formal Problem Statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • Output: the list of substrings of T that match P: [186, 200i, [483, 500i, . . .. • Goal: be very efficient in T and reasonably efficient in P 5/25.

(54) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. 6/25.

(55) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [i l. o. l. 6/25.

(56) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l i o. l. 6/25.

(57) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l. o i l. 6/25.

(58) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T [ l. o. l i. 6/25.

(59) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [i o. l. 6/25.

(60) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [ o i l. 6/25.

(61) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l [ o. l i. 6/25.

(62) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o [i l. 6/25.

(63) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o [ l i. 6/25.

(64) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l [i. 6/25.

(65) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|). 6/25.

(66) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). 6/25.

(67) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. 6/25.

(68) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. 6/25.

(69) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗. 6/25.

(70) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗ • The number of matches is Ω(|T|2 ). 6/25.

(71) Measuring the Complexity. • Naive algorithm: Run the automaton A on each substring of T l. o. l. → Complexity is O(|T|2 × |A| × |T|) → Can be optimized to O(|T|2 × |A|). • Problem: We may need to output Ω(|T| ) matching substrings: 2. • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. • Consider the pattern P := a∗ • The number of matches is Ω(|T|2 ). → We need a different way to measure complexity 6/25.

(72) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/25.

(73) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/25.

(74) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/25.

(75) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/25.

(76) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. 7/25.

(77) Enumeration Algorithms Idea: In real life, we do not want to compute all the matches we just need to be able to enumerate matches quickly. → Formalization: enumeration algorithms 7/25.

(78) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/25.

(79) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/25.

(80) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T. Phase 1: Preprocessing Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. 8/25.

(81) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 1: Preprocessing Index structure. Phase 2: Enumeration. 8/25.

(82) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. . [42, 57i, Results 8/25.

(83) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. . [42, 57i,[1337, 1351i Results 8/25.

(84) Formalizing Enumeration Algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T. Index structure. ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣ Pattern P. Phase 2: Enumeration. Two ways to measure performance:. • Total time for phase 1 • Delay between two results in phase 2. ... as a function of the text and pattern. . [42, 57i,[1337, 1351i Results 8/25.

(85) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 9/25.

(86) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. 9/25.

(87) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm?. 9/25.

(88) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring. 9/25.

(89) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring → i.e. O(|T|2 × |A|), e.g., if only the beginning and end match. 9/25.

(90) Complexity of Enumeration Algorithms. • Recall the inputs to our problem: • A text T. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣. • What is the delay of the naive algorithm? → it is the maximal time to find the next matching substring → i.e. O(|T|2 × |A|), e.g., if only the beginning and end match. → Can we do better? 9/25.

(91) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds:. 10/25.

(92) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T • Delay constant (independent from T). 10/25.

(93) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. 10/25.

(94) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. • Our contribution is:. 10/25.

(95) Results for Enumerating Pattern Matches. • Existing work has shown the best possible bounds in T: Theorem [Florenzano et al., 2018] We can enumerate all matches of a pattern P on a text T with: • Preprocessing linear in T and exponential in P • Delay constant (independent from T) and exponential in P. → Problem: They only measure the complexity as a function of T!. • Our contribution is: Theorem We can enumerate all matches of a pattern P on a text T with: • Preprocessing in O(|T| × Poly(P)) • Delay polynomial in P and independent from T 10/25.

(96) Automaton Formalism. • We use automata that read letters and capture variables. 11/25.

(97) Automaton Formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗. 11/25.

(98) Automaton Formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. 11/25.

(99) Automaton Formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text. 11/25.

(100) Automaton Formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. 11/25.

(101) Automaton Formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. 11/25.

(102) Automaton Formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. • Challenge: Because of nondeterminism we can have many different runs of A producing the same tuple!. 11/25.

(103) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A. 12/25.

(104) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. 12/25.

(105) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. •. 1 α. a. 2 β 3. •. 12/25.

(106) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a. a. a. b. a. •. 1 α. a. 2 β 3. •. 12/25.

(107) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. 12/25.

(108) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match. 12/25.

(109) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , match hα : 0, β : 3i a •. 1. 1. a. 2. β 3. 1. 3. 1. 2 β. •. a. α. α. α 2. a. α. β 3. 1. 2. 1 α. 2. 3. 1 α. 2. α 2 β. β. β. β 3. a. b. 3. 3. → Each path in the product DAG corresponds to a match. 12/25.

(110) Proof Idea: Product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match → Challenge: Enumerate paths but avoid duplicate matches and do not waste time to ensure constant delay. 12/25.

(111) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue. α 2. 2 β. 3. 3. 13/25.

(112) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue. α 2. 2 β. 3. 3. 13/25.

(113) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. α 2. 2 β. 3. 3. 13/25.

(114) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i. β 3. 3. 13/25.

(115) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 13/25.

(116) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 13/25.

(117) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 13/25.

(118) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. → We must have preprocessed the DAG to make sure that we can always finish the run. 13/25.

(119) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress. 14/25.

(120) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/25.

(121) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/25.

(122) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/25.

(123) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/25.

(124) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 14/25.

(125) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. 14/25.

(126) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A: 14/25.

(127) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A: → Compute for each state the next position where we can reach some state that can assign a variable. 14/25.

(128) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A:. → Compute for each state the next position where we can reach some state that can assign a variable → Compute at each position i the transitive closure to all positions j such that j is the next position of some state at i (there are ≤ |A|)14/25.

(129) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. 15/25.

(130) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. • Idea: Explore a decision tree on the variables (built on the fly). 15/25.

(131) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α β. 0. α?. 1 β?. β? 0 γ?. γ 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly). 15/25.

(132) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α β. 0. α?. 1 β?. β? 0 γ?. γ 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly) tree node, find the reachable states which • Athaveeachall decision required variables (1) and no forbidden variables (0) 15/25.

(133) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α β. 0. α?. 1 β?. β? 0 γ?. γ 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly) tree node, find the reachable states which • Athaveeachall decision required variables (1) and no forbidden variables (0) → Assumption: we don’t see the same variable twice on a path 15/25.

(134) Extension: From Text to Trees.

(135) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. 16/25.

(136) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. an h2 header and. an image in the same section?. 16/25.

(137) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. an h2 header and. an image in the same section?. • Results: 16/25.

(138) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results: 16/25.

(139) Pattern Matching on Trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results: hα : 4, β : 6i, hα : 4, β : 7i 16/25.

(140) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton.... 17/25.

(141) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown:. 17/25.

(142) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T • Delay constant in T. 17/25.

(143) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T and exponential in P • Delay constant in T and exponential in P. • Again, this only measures the complexity in T!. 17/25.

(144) Definitions and Results on Trees. • Tree patterns P can be written as a kind of tree automaton... • Existing work has studied this problem and shown: Theorem [Bagan, 2006] We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with: • Preprocessing linear in T and exponential in P • Delay constant in T and exponential in P. • Again, this only measures the complexity in T!. → We are working on proving the following: Conjecture. • Preprocessing in O(|T| × Poly(P)) • Delay polynomial in P and independent from T 17/25.

(145) Proof Idea for Trees: Structure Similar structure to the previous proof, but with a circuit:. Phase 1: Preprocessing Tree. Data structure. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 18/25.

(146) Proof Idea for Trees: Structure Similar structure to the previous proof, but with a circuit:. • Preprocessing: Compute a circuit representation of the answers • Enumeration: Apply a generic algorithm on the circuit Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 18/25.

(147) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). 19/25.

(148) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6”. 19/25.

(149) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons. 19/25.

(150) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i. 19/25.

(151) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×. α:4. ∪. β :6. β :7 19/25.

(152) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. ∪. β :6. β :7 19/25.

(153) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. • Union gate. ∪. ∪. :. → union of sets of tuples. β :6. β :7 19/25.

(154) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. :. ∪. → union of sets of tuples. β :6. β :7. • Product gate. ×. :. → relational product 19/25.

(155) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. → union of sets of tuples.  hβ : 6i β :6. :. ∪. β :7. • Product gate. ×. :. → relational product 19/25.

(156) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪.  hβ : 6i β :6. . hβ : 7i. β :7. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/25.

(157) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×  α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/25.

(158) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/25.

(159) Proof Idea for Trees: Set Circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:.  hα : 4, β : 6i hα : 4, β : 7i × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 19/25.

(160) Proof Idea for Trees: Results Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem For any tree automaton A with capture variables α1 , . . . , αk , given a tree T, we can build in O(|T| × |A|) a set circuit capturing exactly the set of tuples {hα1 : n1 , . . . , αk : nk i in the output of A on T 20/25.

(161) Proof Idea for Trees: Results Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem Given a set circuit satisfying some conditions, we can enumerate all tuples that it captures with linear preprocessing and constant delay  E.g., for hα : 4, β : 6i, hα : 4, β : 7i : enumerate hα : 4, β : 6i then hα : 4, β : 7i 20/25.

(162) Extension: Supporting Updates.

(163) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing 21/25.

(164) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing 21/25.

(165) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing 21/25.

(166) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch 21/25.

(167) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch → Can we do better? 21/25.

(168) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(T). O(1). [Kazana and Segoufin, 2013]. 22/25.

(169) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T). O(log2 T) O(log2 T). [Kazana and Segoufin, 2013]. 22/25.

(170) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text. 22/25.

(171) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T) O(1) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text [Niewerth and Segoufin, 2018]. text. 22/25.

(172) Relabelings. • Special kind of updates: relabelings that change the label of a node. 23/25.

(173) Relabelings. • Special kind of updates: relabelings that change the label of a node. • Example: relabel node 7 to <video>. 23/25.

(174) Relabelings. • Special kind of updates: relabelings that change the label of a node. • Example: relabel node 7 to <video>. 23/25.

(175) Relabelings. • Special kind of updates: relabelings that change the label of a node. • Example: relabel node 7 to <video> • The tree’s structure never changes. 23/25.

(176) New results on dynamic trees. • If we allow only relabeling updates, we can show: Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T). O(log2 T) O(log2 T). [Kazana and Segoufin, 2013]. 24/25.

(177) New results on dynamic trees. • If we allow only relabeling updates, we can show: Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(1) O(log T). [Kazana and Segoufin, 2013] [Amarilli et al., 2018]. trees. 24/25.

(178) New results on dynamic trees. • If we allow only relabeling updates, we can show: Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(1) O(log T). [Kazana and Segoufin, 2013] [Amarilli et al., 2018]. trees. • Current proof uses hybrid circuits but we want to simplify it. 24/25.

(179) New results on dynamic trees. • If we allow only relabeling updates, we can show: Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(1) O(log T). [Kazana and Segoufin, 2013] [Amarilli et al., 2018]. trees. • Current proof uses hybrid circuits but we want to simplify it • Remaining open questions:. → Does this hold for more general updates (insert/delete, etc.)? → Can we also achieve tractable combined complexity? 24/25.

(180) Summary and Future Work.

(181) Summary and Future Work Summary:. • Problem: given a text T and a pattern P,. enumerate efficiently all matches of P on T. 25/25.

(182) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. 25/25.

(183) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees. 25/25.

(184) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data. 25/25.

(185) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes. 25/25.

(186) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant order? 25/25.

(187) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant order? • Testing how well our methods perform in practice 25/25.

(188) Summary and Future Work Summary:. • Problem: given a text T and a pattern P, •. enumerate efficiently all matches of P on T Result: we can do this with reasonable complexity in P and with linear preprocessing and constant delay in T. Extensions and future work:. • Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant order? • Testing how well our methods perform in practice Thanks for your attention! 25/25.

(189) References i. Amarilli, A., Bourhis, P., and Mengel, S. (2018). Enumeration on trees under relabelings. In ICDT. Bagan, G. (2006). MSO queries on tree decomposable structures are computable with linear delay. In CSL. Florenzano, F., Riveros, C., Ugarte, M., Vansummeren, S., and Vrgoc, D. (2018). Constant delay algorithms for regular document spanners. In PODS..

(190) References ii. Kazana, W. and Segoufin, L. (2013). Enumeration of monadic second-order queries on trees. TOCL, 14(4). Losemann, K. and Martens, W. (2014). MSO queries on trees: Enumerating answers under updates. In CSL-LICS. Niewerth, M. and Segoufin, L. (2018). Enumeration of MSO queries on strings with constant delay and logarithmic updates. In PODS. To appear..

(191) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ}.

(192) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n.

(193) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. α:n.

(194) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪. α:n ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪.

(195) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ×. α:n. ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪.

(196) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ×. α:n. ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪.

(197) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪ ×. ×. α:n. ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪.

(198) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit.

(199) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit Base case: variable. α:n. :.

Références

Documents relatifs

The construction of attribute sets is inspired by Next Closure, the most common batch algorithm for computing pseudo-intents, in that it uses the lectic order to ensure the

As a polynomial delay algorithm for an enumeration algorithms yields a polynomial time algorithm for the corresponding decision problem, it follows that ECSP(A , −) can only have

The proof of Theorem 1 reveals indeed a possibility to find a maximal sharp bipolar- valued characterisation of an outranking (resp. outranked) kernel under the condition that

The algorithm has several special optimizations that take into account algorithmic features of this problem: it employs a specific order of filling Latin square elements based on

The disjunction φ of general solved formulas extracted from the final working formula is either the formula false or true or a formula having at least one free variable,

We have extended this theory by giving a complete first-order axiomatization T of the evaluated trees [6] which are combination of finite or infinite trees with construction

Main results Theorem Given a d-DNNF circuit C with a v-tree T, we can enumerate its satisfying assignments with preprocessing linear in |C| + |T| and delay linear in each

• Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant