Enumerating Pattern Matches in Texts and Trees


Antoine Amarilli (1), Pierre Bourhis (2), Stefan Mengel (3), Matthias Niewerth (4)
October 24th, 2018
(1) Télécom ParisTech, (2) CNRS, CRIStAL, (3) CNRS, CRIL, (4) Universität Bayreuth

Problem: Finding Patterns in Text

• We have a long text T:
  Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...
• We want to find a pattern P in the text T
  → Example: find email addresses
• Write the pattern as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
  → How to find the pattern P efficiently in the text T?
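As a small illustration, the pattern P can be written with Python's standard `re` module. This is only a sketch: the symbol ␣ is assumed to stand for an ordinary space character, and the function name `occurs` is made up for the example.

```python
import re

# P := ␣ [a-z0-9.]* @ [a-z0-9.]* ␣, with ␣ taken to be a plain space (assumption).
P = re.compile(r" [a-z0-9.]*@[a-z0-9.]* ")


def occurs(text):
    """Return True if some substring of `text` matches P."""
    return P.search(text) is not None
```

Calling `occurs("Email a3nm@a3nm.net Affiliation")` answers only yes or no; it does not say where the pattern occurs.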

Solution: Automata

• Convert the regular expression P to an automaton A:
  P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
  [Figure: automaton A with a start arrow and states 1 to 4; the surviving transition labels are •, [a-z0-9.] (twice), and @.]
• Then, evaluate the automaton on the text T:
  Email␣a3nm@a3nm.net␣Affiliation
• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P
  → This is very efficient in T and reasonably efficient in P
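The evaluation step can be sketched as a state-set simulation: keep the set of states the automaton may be in, and update it once per letter, which costs O(|A|) per letter and O(|A| × |T|) overall. The transition function below is an assumption reconstructed from P (states 1 and 4 loop on any letter so that the match may start and end anywhere), and ␣ is taken to be a plain space.

```python
import string

# Letters allowed by the character class [a-z0-9.]
CLASS = set(string.ascii_lowercase + string.digits + ".")


def step(state, ch):
    """Successor states of `state` when reading the letter `ch` (assumed automaton)."""
    out = set()
    if state == 1:
        out.add(1)              # skip any letter before the match
        if ch == " ":
            out.add(2)          # opening space of the pattern
    elif state == 2:
        if ch in CLASS:
            out.add(2)          # part before the @
        if ch == "@":
            out.add(3)
    elif state == 3:
        if ch in CLASS:
            out.add(3)          # part after the @
        if ch == " ":
            out.add(4)          # closing space of the pattern
    elif state == 4:
        out.add(4)              # skip any letter after the match
    return out


def occurs(text):
    """Run the automaton on `text`, tracking all reachable states at once."""
    states = {1}
    for ch in text:
        states = {r for q in states for r in step(q, ch)}
    return 4 in states
```

Tracking a set of states rather than a single run avoids backtracking, which is what keeps the cost linear in the text.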

Actual Problem: Extracting all Patterns

• This only tests if the pattern occurs in the text! → “YES”
• Goal: find all substrings in the text T which match the pattern P
  Example text, with letters at positions 0 to 30:
  Email␣a3nm@a3nm.net␣Affiliation
  → One match: [5, 20⟩
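The interval notation can be checked with a short sketch: [5, 20⟩ is read here as a half-open interval covering positions 5 to 19, which corresponds to the Python slice `T[5:20]`. The helper name `is_match` is hypothetical, and ␣ is assumed to be a plain space.

```python
import re

P = re.compile(r" [a-z0-9.]*@[a-z0-9.]* ")
T = "Email a3nm@a3nm.net Affiliation"   # 31 letters, positions 0 to 30


def is_match(text, i, j):
    """True if the substring at positions i .. j-1 matches P in its entirety."""
    return P.fullmatch(text[i:j]) is not None
```

With this convention, `is_match(T, 5, 20)` holds because `T[5:20]` is the address together with its two surrounding spaces.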

Formal Problem Statement

• Problem description:
  • Input:
    • A text T:
      Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. test@example.com More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...
    • A pattern P given as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
  • Output: the list of substrings of T that match P: [186, 200⟩, [483, 500⟩, ...
• Goal: be very efficient in T and reasonably efficient in P

Measuring the Complexity

• Naive algorithm: run the automaton A on each substring of T
  [Figure: the interval markers [ and ⟩ sweep over each substring of the three-letter text l o l.]
  → Complexity is O(|T|² × |A| × |T|): there are O(|T|²) substrings, and each run costs O(|A| × |T|)
  → Can be optimized to O(|T|² × |A|)
• Problem: we may need to output Ω(|T|²) matching substrings:
  • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  • Consider the pattern P := a∗
  • The number of matches is Ω(|T|²)
  → We need a different way to measure complexity
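The naive algorithm and the quadratic blow-up can be sketched as follows. Python's `re.fullmatch` stands in for running the automaton A on a substring (its running time is not guaranteed to be O(|A| × |T|)), and only non-empty substrings are counted, which is a simplifying assumption of this example.

```python
import re


def naive_matches(pattern, text):
    """All non-empty spans (i, j) such that text[i:j] matches `pattern` entirely."""
    regex = re.compile(pattern)
    n = len(text)
    return [
        (i, j)
        for i in range(n)                 # every start position
        for j in range(i + 1, n + 1)      # every end position after it
        if regex.fullmatch(text[i:j])
    ]
```

On a text of n letters a with the pattern `a*`, every one of the n(n+1)/2 non-empty substrings matches, so the output alone has quadratic size whatever algorithm produces it.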

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matches; we just need to be able to enumerate matches quickly.
→ Formalization: enumeration algorithms

Formalizing Enumeration Algorithms

• Inputs:
  • Text T: Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor ...
  • Pattern P: ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
• Phase 1: Preprocessing, which builds an index structure from T and P
• Phase 2: Enumeration, which produces the results one after the other from the index structure: [42, 57⟩, [1337, 1351⟩
• Two ways to measure performance, as a function of the text and pattern:
  • Total time for phase 1
  • Delay between two results in phase 2
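The two phases can be illustrated on a deliberately simple case. The sketch below fixes the pattern to non-empty runs of the letter a, an assumption made only for this example; it is not the general algorithm of the talk. Phase 1 is linear in the text, and phase 2 is a generator whose delay between two results is constant.

```python
def preprocess(text):
    """Phase 1: build the index in time linear in the text.

    run_end[i] is the end of the longest run of 'a' starting at position i,
    and `starts` lists the positions where at least one match begins.
    """
    n = len(text)
    run_end = [0] * (n + 1)
    run_end[n] = n
    for i in range(n - 1, -1, -1):
        run_end[i] = run_end[i + 1] if text[i] == "a" else i
    starts = [i for i in range(n) if run_end[i] > i]
    return starts, run_end


def enumerate_matches(index):
    """Phase 2: yield each match (i, j) with a constant delay between results."""
    starts, run_end = index
    for i in starts:                          # every start has at least one match
        for j in range(i + 1, run_end[i] + 1):
            yield (i, j)
```

Storing `starts` during preprocessing is what keeps the delay constant: the enumeration never scans over positions where no match begins.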

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:
  • A text T:
    Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...
  • A pattern P given as a regular expression: P := ␣ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣
• What is the delay of the naive algorithm?
  → It is the maximal time to find the next matching substring
  → i.e., O(|T|² × |A|), e.g., if only the beginning and end match
  → Can we do better?
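The delay of the naive algorithm can be observed by counting how many substrings are tested between two consecutive results. This is a sketch under assumptions: `re.fullmatch` stands in for running the automaton, the count of tested substrings stands in for time, and the function name is hypothetical.

```python
import re


def naive_with_delays(pattern, text):
    """Yield (span, tests), where `tests` counts substrings tried since the last result."""
    regex = re.compile(pattern)
    n = len(text)
    tests = 0
    for i in range(n):
        for j in range(i + 1, n + 1):
            tests += 1
            if regex.fullmatch(text[i:j]):
                yield (i, j), tests
                tests = 0               # restart the count after each result
```

On the text `abbba` with the pattern `a`, only the first and the last letter match: the first result arrives after 1 test, and the second only after the 14 remaining substrings have been tried, a gap that grows quadratically with the length of the text.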

Results for Enumerating Pattern Matches

• Existing work has shown the best possible bounds in T:
  Theorem [Florenzano et al., 2018]. We can enumerate all matches of a pattern P on a text T with:
  • Preprocessing linear in T and exponential in P
  • Delay constant (independent from T) and exponential in P
  → Problem: They only measure the complexity as a function of T!
• Our contribution is:
  Theorem. We can enumerate all matches of a pattern P on a text T with:
  • Preprocessing in O(|T| × Poly(P))
  • Delay polynomial in P and independent from T

(102) Automaton Formalism

• We use automata that read letters and capture variables
  → Example: P := •∗ α a∗ β •∗
  [Figure: automaton for P with states 1, 2, 3, letter transitions labeled • and a, and variable transitions labeled α and β]
• Semantics of the automaton A:
  • Reads letters from the text
  • Guesses variables at positions in the text
  → Output: tuples ⟨α : i, β : j⟩ such that A has an accepting run reading α at position i and β at j
• Assumption: There is no run for which A reads the same capture variable twice at the same position
• Challenge: Because of nondeterminism we can have many different runs of A producing the same tuple!

(110) Proof Idea: Product DAG

Compute a product DAG of the text T and of the automaton A.
Example: Text T := aaaba and P := •∗ α a∗ β •∗
[Figure: product DAG of T and A, built from copies of the states 1, 2, 3 connected by letter edges and by α- and β-edges; example match shown: ⟨α : 0, β : 3⟩]
→ Each path in the product DAG corresponds to a match
→ Challenge: Enumerate paths while avoiding duplicate matches, and without wasting time, so as to ensure constant delay
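The product DAG idea can be illustrated with a small Python sketch. It is only a naive illustration under stated assumptions, not the constant-delay algorithm of the talk: the transition tables below are an assumed encoding of the example automaton (wildcard loops on states 1 and 3, a loop on a in state 2), the names `matches`, `LETTER_EDGES` and `VARIABLE_EDGES` are hypothetical, and duplicates are filtered with a plain `seen` set, which does not bound the delay.

```python
from typing import Iterator

# Assumed encoding of the example automaton for P := .* alpha a* beta .*
# (state numbers follow the slide; the exact transitions are an assumption).
ANY = None  # stands for the wildcard letter
LETTER_EDGES = {1: [(ANY, 1)], 2: [("a", 2)], 3: [(ANY, 3)]}
VARIABLE_EDGES = {1: [("alpha", 2)], 2: [("beta", 3)], 3: []}
INITIAL, FINAL = 1, 3

Match = tuple[tuple[str, int], ...]


def matches(text: str) -> Iterator[Match]:
    """Walk the product DAG whose nodes are pairs (position, state)."""
    seen: set[Match] = set()  # naive duplicate filter, not constant delay

    def walk(pos: int, state: int, captured: Match) -> Iterator[Match]:
        # A path ending in the final state after the last letter is a match.
        if pos == len(text) and state == FINAL and captured not in seen:
            seen.add(captured)
            yield captured
        # Variable edges stay at the same text position.
        for variable, target in VARIABLE_EDGES[state]:
            yield from walk(pos, target, captured + ((variable, pos),))
        # Letter edges consume one letter and move to the next position.
        if pos < len(text):
            for letter, target in LETTER_EDGES[state]:
                if letter is ANY or letter == text[pos]:
                    yield from walk(pos + 1, target, captured)

    yield from walk(0, INITIAL, ())


result = list(matches("aaaba"))
print((("alpha", 0), ("beta", 3)) in result)  # the match shown on the slide
```

The depth-first walk may explore many paths that lead to no match, which is exactly the wasted time that the challenge on the slide asks to avoid.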

(111) Extension: From Text to Trees.

(116) Pattern Matching on Trees

• The data T is no longer text but is now a tree:
  [Figure: HTML tree with nodes <body> 1, <div> 2, <section> 3, <h2> 4, <p> 5, <img> 6, <img> 7]
• The pattern P asks about the structure of the tree:
  Is there α: an h2 header and β: an image in the same section?
• Results: ⟨α : 4, β : 6⟩, ⟨α : 4, β : 7⟩
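As a rough illustration of what the pattern asks, the following Python sketch evaluates it naively on the example tree. The parent/child shape of the tree is an assumption (the extracted slide only lists the nodes), and the names `TREE`, `descendants` and `matches` are hypothetical; this brute-force search is not the enumeration algorithm discussed in the talk.

```python
from typing import Iterator

# Assumed shape of the example tree: node id -> (label, list of children).
TREE = {
    1: ("body", [2, 3]),
    2: ("div", []),
    3: ("section", [4, 5]),
    4: ("h2", []),
    5: ("p", [6, 7]),
    6: ("img", []),
    7: ("img", []),
}


def descendants(node: int) -> Iterator[int]:
    """Yield all strict descendants of a node, in document order."""
    for child in TREE[node][1]:
        yield child
        yield from descendants(child)


def matches() -> Iterator[tuple[int, int]]:
    """Yield pairs (alpha, beta): an h2 and an img below the same section."""
    for node, (label, _) in TREE.items():
        if label != "section":
            continue
        below = list(descendants(node))
        for alpha in below:
            if TREE[alpha][0] != "h2":
                continue
            for beta in below:
                if TREE[beta][0] == "img":
                    yield (alpha, beta)


print(list(matches()))
```

On a tree with nested sections this naive search could report the same pair several times, which is the same duplicate problem as in the text case.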

(121) Definitions and Results on Trees

• Tree patterns P can be written as a kind of tree automaton...
• Existing work has studied this problem and shown:
  Theorem [Bagan, 2006]. We can find all matches on a tree T of a tree pattern P (with constantly many capture variables) with:
  • Preprocessing linear in T and exponential in P
  • Delay constant in T and exponential in P
• Again, this only measures the complexity in T!
  → We are working on proving the following:
  Conjecture. We can find all matches on a tree T of a tree pattern P with:
  • Preprocessing in O(|T| × Poly(P))
  • Delay polynomial in P and independent from T

(123) Proof Idea for Trees: Structure

Similar structure to the previous proof, but with a circuit:
• Preprocessing: Compute a circuit representation of the answers
• Enumeration: Apply a generic algorithm on the circuit
[Figure: two-phase pipeline. Phase 1 (Preprocessing) takes the tree and the pattern "there is a section s with α and β below s, h2(α) and img(β)" and builds a circuit with gates ×, ∪, α : 4, β : 6, β : 7. Phase 2 (Enumeration) produces the results {⟨α : 4, β : 6⟩, ⟨α : 4, β : 7⟩}]

(136) Proof Idea for Trees: Set Circuits

A set circuit represents a set of answers to a pattern P(α, β).
• Singleton α : 6 → "the variable α is mapped to node 6"
• Tuple ⟨α : 4, β : 6⟩: tuple of singletons
• The circuit captures a set of tuples, e.g., {⟨α : 4, β : 6⟩, ⟨α : 4, β : 7⟩}

Three kinds of set-valued gates:
• Variable gate α : 4 → captures {⟨α : 4⟩}
• Union gate ∪ → union of sets of tuples
• Product gate × → relational product

[Figure: example circuit. The × gate at the top captures {⟨α : 4, β : 6⟩, ⟨α : 4, β : 7⟩}. Its inputs are the variable gate α : 4, capturing {⟨α : 4⟩}, and a ∪ gate capturing {⟨β : 6⟩, ⟨β : 7⟩}, whose inputs are the variable gates β : 6 and β : 7]
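The semantics of the three gate kinds can be written down as a short Python sketch. The class names `Var`, `Union` and `Product` and the function `captured` are hypothetical names chosen for illustration; the sketch materializes the whole captured set, so it only shows what a circuit represents and says nothing about enumeration delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str
    node: int


@dataclass(frozen=True)
class Union:
    left: object
    right: object


@dataclass(frozen=True)
class Product:
    left: object
    right: object


def captured(gate) -> frozenset:
    """Return the set of tuples captured by a gate.

    A tuple is modelled as a frozenset of (variable, node) pairs.
    """
    if isinstance(gate, Var):
        return frozenset({frozenset({(gate.name, gate.node)})})
    if isinstance(gate, Union):
        return captured(gate.left) | captured(gate.right)
    if isinstance(gate, Product):
        # Relational product: combine every left tuple with every right tuple.
        return frozenset(
            l | r for l in captured(gate.left) for r in captured(gate.right)
        )
    raise TypeError(f"unknown gate: {gate!r}")


# The example circuit from the slide.
CIRCUIT = Product(Var("alpha", 4), Union(Var("beta", 6), Var("beta", 7)))
for answer in sorted(sorted(t) for t in captured(CIRCUIT)):
    print(answer)
```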

(137) Proof Idea for Trees: Results

[Figure: the two-phase pipeline (preprocessing, then enumeration) for the example tree and pattern]
Theorem. For any tree automaton A with capture variables α1, ..., αk, given a tree T, we can build in O(|T| × |A|) a set circuit capturing exactly the set of tuples {⟨α1 : n1, ..., αk : nk⟩ in the output of A on T}

(138) Proof Idea for Trees: Results

[Figure: the two-phase pipeline (preprocessing, then enumeration) for the example tree and pattern]
Theorem. Given a set circuit satisfying some conditions, we can enumerate all tuples that it captures with linear preprocessing and constant delay
E.g., for {⟨α : 4, β : 6⟩, ⟨α : 4, β : 7⟩}: enumerate ⟨α : 4, β : 6⟩ then ⟨α : 4, β : 7⟩
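To give a feel for enumeration on a circuit, here is a minimal Python generator, written as a sketch only. The slide does not spell out the conditions on the circuit; this sketch simply assumes that the inputs of a union capture disjoint sets and that the inputs of a product use disjoint variables, and it makes no claim of constant delay. The nested-tuple encoding and the name `enumerate_gate` are hypothetical.

```python
from typing import Iterator

# Gates as nested tuples (an assumed encoding):
#   ("var", name, node), ("union", g1, g2, ...), ("product", left, right)
CIRCUIT = (
    "product",
    ("var", "alpha", 4),
    ("union", ("var", "beta", 6), ("var", "beta", 7)),
)


def enumerate_gate(gate: tuple) -> Iterator[dict]:
    """Lazily yield the tuples captured by a gate, one dict per tuple."""
    kind = gate[0]
    if kind == "var":
        yield {gate[1]: gate[2]}
    elif kind == "union":
        # Assumed disjoint inputs, so no duplicate can be produced here.
        for child in gate[1:]:
            yield from enumerate_gate(child)
    elif kind == "product":
        # Assumed disjoint variables, so merging the dicts loses nothing.
        for left in enumerate_gate(gate[1]):
            for right in enumerate_gate(gate[2]):
                yield {**left, **right}
    else:
        raise ValueError(f"unknown gate kind: {kind!r}")


for answer in enumerate_gate(CIRCUIT):
    print(answer)
```

Because the generator is lazy, the first answer is produced without materializing the whole result set, but the time between two answers can still depend on the depth of the circuit.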

(139) Extension: Supporting Updates.

(144) Updates

[Figure: Phase 1 (Preprocessing) turns the tree T into a data structure]
• The input data can be modified after the preprocessing
• If this happens, we must rerun the preprocessing from scratch
→ Can we do better?

(148) Known results on dynamic trees

All these results are on data complexity in T (for a fixed pattern):

Work                                          Data    Preproc.   Delay        Updates
[Bagan, 2006], [Kazana and Segoufin, 2013]    trees   O(T)       O(1)         O(T)
[Losemann and Martens, 2014]                  trees   O(T)       O(log² T)    O(log² T)
[Losemann and Martens, 2014]                  text    O(T)       O(log T)     O(log T)
[Niewerth and Segoufin, 2018]                 text    O(T)       O(1)         O(log T)

(152) Relabelings

• Special kind of updates: relabelings that change the label of a node
• Example: relabel node 7 to <video>
• The tree's structure never changes

(156) New results on dynamic trees

• If we allow only relabeling updates, we can show:

Work                                          Data    Preproc.   Delay        Updates
[Bagan, 2006], [Kazana and Segoufin, 2013]    trees   O(T)       O(1)         O(T)
[Losemann and Martens, 2014]                  trees   O(T)       O(log² T)    O(log² T)
[Amarilli et al., 2018]                       trees   O(T)       O(1)         O(log T)

• Current proof uses hybrid circuits but we want to simplify it
• Remaining open questions:
  → Does this hold for more general updates (insert/delete, etc.)?
  → Can we also achieve tractable combined complexity?

(157) Summary and Future Work.

(163) Summary and Future Work

Summary:
• Problem: given a text T and a pattern P, enumerate efficiently all matches of P on T
• Result: we can do this with polynomial complexity in P and with linear preprocessing and constant delay in T

Extensions and future work:
• Extending the results from text to trees
• Supporting updates on the input data
• Testing how well our methods perform in practice

Thanks for your attention!

(164) References i.

Amarilli, A., Bourhis, P., and Mengel, S. (2018). Enumeration on trees under relabelings. In ICDT.
Bagan, G. (2006). MSO queries on tree decomposable structures are computable with linear delay. In CSL.
Florenzano, F., Riveros, C., Ugarte, M., Vansummeren, S., and Vrgoc, D. (2018). Constant delay algorithms for regular document spanners. In PODS.

(165) References ii.

Kazana, W. and Segoufin, L. (2013). Enumeration of monadic second-order queries on trees. TOCL, 14(4).
Losemann, K. and Martens, W. (2014). MSO queries on trees: Enumerating answers under updates. In CSL-LICS.
Niewerth, M. and Segoufin, L. (2018). Enumeration of MSO queries on strings with constant delay and logarithmic updates. In PODS. To appear.
