• Aucun résultat trouvé

Enumerating Pattern Matches in Words and Trees

N/A
N/A
Protected

Academic year: 2022

Partager "Enumerating Pattern Matches in Words and Trees"

Copied!
229
0
0

Texte intégral

(1)Enumerating Pattern Matches in Words and Trees. Antoine Amarilli1 , Pierre Bourhis2 , Stefan Mengel3 , Matthias Niewerth4 October 8th, 2018 1 Télécom. ParisTech. 2 CNRS. CRIStAL. 3 CNRS. CRIL. 4 Universität. Bayreuth 1/36.

(2) Problem: Finding patterns in text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 2/36.

(3) Problem: Finding patterns in text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T: → Example: find email addresses. 2/36.

(4) Problem: Finding patterns in text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T: → Example: find email addresses. 2/36.

(5) Problem: Finding patterns in text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T: → Example: find email addresses. 2/36.

(6) Problem: Finding patterns in text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣+ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣+. 2/36.

(7) Problem: Finding patterns in text. • We have a long text T:. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • We want to find a pattern P in the text T:. → Example: find email addresses • Write the pattern as a regular expression: P := ␣+ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣+. → How to find the pattern P efficiently in the text T? 2/36.

(8) Solution: automata. • Convert the pattern from a regular expression to an automaton. 3/36.

(9) Solution: automata. • Convert the pattern from a regular expression to an automaton P := ␣+ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣+. 3/36.

(10) Solution: automata. • Convert the pattern from a regular expression to an automaton P := ␣+ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣+ [a-z0-9.] [a-z0-9.] start. 1. 2. @. 3. 4. 3/36.

(11) Solution: automata. • Convert the pattern from a regular expression to an automaton P := ␣+ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣+ [a-z0-9.] [a-z0-9.] start. 1. 2. @. 3. 4. • Then, evaluate the automaton on the text T. ... Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer .... 3/36.

(12) Solution: automata. • Convert the pattern from a regular expression to an automaton P := ␣+ [a-z0-9.]∗ @ [a-z0-9.]∗ ␣+ [a-z0-9.] [a-z0-9.] start. 1. 2. @. 3. 4. • Then, evaluate the automaton on the text T. ... Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer .... • How efficient is this? 3/36.

(13) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T. 4/36.

(14) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics:. 4/36.

(15) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T. 4/36.

(16) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. 4/36.

(17) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... 4/36.

(18) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic... • Data complexity is linear in T. 4/36.

(19) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... • Data complexity is linear in T • Combined complexity is polynomial in T and A. 4/36.

(20) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... • Data complexity is linear in T • Combined complexity is polynomial in T and A. • In general.... 4/36.

(21) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... • Data complexity is linear in T • Combined complexity is polynomial in T and A. • In general... • Compute a deterministic automaton from A.... 4/36.

(22) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... • Data complexity is linear in T • Combined complexity is polynomial in T and A. • In general... • Compute a deterministic automaton from A... → Linear data complexity but exponential combined complexity. 4/36.

(23) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... • Data complexity is linear in T • Combined complexity is polynomial in T and A. • In general... • Compute a deterministic automaton from A... → Linear data complexity but exponential combined complexity. • Better: Read T while remembering the current set of states (like determinizing A, but on the fly) 4/36.

(24) Complexity of automaton evaluation. • Task: testing if an automaton A accepts a text T • We measure complexity according to two metrics: • Data complexity: in the text T • Combined complexity: in T and A. • If the automaton A is deterministic.... • Data complexity is linear in T • Combined complexity is polynomial in T and A. • In general... • Compute a deterministic automaton from A... → Linear data complexity but exponential combined complexity. • Better: Read T while remembering the current set of states (like determinizing A, but on the fly) → Linear data complexity and polynomial combined complexity 4/36.

(25) Actual problem: Extracting all patterns. • This only tests if the pattern exactly matches the whole text! → “YES”. 5/36.

(26) Actual problem: Extracting all patterns. • This only tests if the pattern exactly matches the whole text! → “YES”. • We want to actually find all pattern matches!. → Find all pairs of positions that are the endpoints of a match. 5/36.

(27) Actual problem: Extracting all patterns. • This only tests if the pattern exactly matches the whole text! → “YES”. • We want to actually find all pattern matches!. → Find all pairs of positions that are the endpoints of a match. • Generalization: patterns that can capture a tuple of positions → Find the email addresses without leading/trailing spaces → Find all pairs of a name followed by an email address. 5/36.

(28) Patterns with capture variables. • Write the pattern P as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. 6/36.

(29) Patterns with capture variables. • Write the pattern P as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. • Semantics: a match of P maps α and β to positions of T. 6/36.

(30) Patterns with capture variables. • Write the pattern P as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. • Semantics: a match of P maps α and β to positions of T ... Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer .... 6/36.

(31) Patterns with capture variables. • Write the pattern P as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. • Semantics: a match of P maps α and β to positions of T ... Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer ... → One match: hα : 20, β : 32i. 6/36.

(32) Formal problem statement. • Problem description:. 7/36.

(33) Formal problem statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... 7/36.

(34) Formal problem statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. 7/36.

(35) Formal problem statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. • Output: the list of matches of P on T hα : 187, β : 199i, .... 7/36.

(36) Formal problem statement. • Problem description: • Input:. • A text T Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure. More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git .... • A pattern P given as a regular expression with capture variables P := •∗ ␣+ α [a-z0-9.]∗ @ [a-z0-9.]∗ β ␣+ •∗. • Output: the list of matches of P on T hα : 187, β : 199i, .... • We measure the complexity of the problem:. • In data complexity, as a function of T • In combined complexity, as a function of P and T 7/36.

(37) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. 8/36.

(38) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern αβ l o l. 8/36.

(39) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern α l β o l. 8/36.

(40) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern α l o β l. 8/36.

(41) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern α l o l β. 8/36.

(42) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern β l α o l. 8/36.

(43) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l αβ o l. 8/36.

(44) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l α o β l. 8/36.

(45) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l α o l β. 8/36.

(46) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern β l o α l. 8/36.

(47) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l β o α l. 8/36.

(48) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o αβ l. 8/36.

(49) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o α l β. 8/36.

(50) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern β l o l α. 8/36.

(51) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l β o l α. 8/36.

(52) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o β l α. 8/36.

(53) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l αβ. 8/36.

(54) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l → For k capture variables, data complexity.... 8/36.

(55) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). 8/36.

(56) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). • Hope: If T is big, we want data complexity to be in O(|T|). 8/36.

(57) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). • Hope: If T is big, we want data complexity to be in O(|T|) • Challenge: This is impossible, there can be too many matches:. 8/36.

(58) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). • Hope: If T is big, we want data complexity to be in O(|T|) • Challenge: This is impossible, there can be too many matches: • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. 8/36.

(59) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). • Hope: If T is big, we want data complexity to be in O(|T|) • Challenge: This is impossible, there can be too many matches: • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • Consider the regexp with captures P := •∗ α a∗ β •∗. 8/36.

(60) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). • Hope: If T is big, we want data complexity to be in O(|T|) • Challenge: This is impossible, there can be too many matches: • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • Consider the regexp with captures P := •∗ α a∗ β •∗ • The number of matches is O(|T|2 ). 8/36.

(61) Measuring the complexity. • Naive algorithm: Consider all ways to assign capture variables and test for each of them if it satisfies the pattern l o l. → For k capture variables, data complexity... O(|T|k+1 ). • Hope: If T is big, we want data complexity to be in O(|T|) • Challenge: This is impossible, there can be too many matches: • Consider the text T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa • Consider the regexp with captures P := •∗ α a∗ β •∗ • The number of matches is O(|T|2 ). → We need a different way to measure complexity 8/36.

(62) Enumeration algorithms Idea: In real life, we do not want to compute all the answers we just need to be able to enumerate answers quickly. 9/36.

(63) Enumeration algorithms Idea: In real life, we do not want to compute all the answers we just need to be able to enumerate answers quickly. 9/36.

(64) Enumeration algorithms Idea: In real life, we do not want to compute all the answers we just need to be able to enumerate answers quickly. 9/36.

(65) Enumeration algorithms Idea: In real life, we do not want to compute all the answers we just need to be able to enumerate answers quickly. 9/36.

(66) Enumeration algorithms Idea: In real life, we do not want to compute all the answers we just need to be able to enumerate answers quickly. 9/36.

(67) Enumeration algorithms Idea: In real life, we do not want to compute all the answers we just need to be able to enumerate answers quickly. → Formalization: enumeration algorithms 9/36.

(68) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. 10/36.

(69) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. 10/36.

(70) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T. Phase 1: Preprocessing Data structure. •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. 10/36.

(71) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. Phase 1: Preprocessing Data structure. Phase 2: Enumeration. 10/36.

(72) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. Phase 1: Preprocessing Data structure. Phase 2: Enumeration. . hα : 42, β : 57i, Results. 10/36.

(73) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. Data structure. 0011 State. Phase 2: Enumeration. . hα : 42, β : 57i, Results. 10/36.

(74) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. Data structure. 0011 State. Phase 2: Enumeration. . hα : 42, β : 57i, Results. 10/36.

(75) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. Data structure. ⊥ State. Phase 2: Enumeration. . hα : 42, β : 57i, hα : 1337, β : 1351i Results. 10/36.

(76) Formalizing enumeration algorithms. Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor .... Phase 1: Preprocessing. Text T •∗ ␣+ α[a-z0-9.]∗ @ [a-z0-9.]∗ β␣+ •∗ Pattern P. Data structure. ⊥ State. Phase 2: Enumeration. Two ways to measure performance:. • Total time for phase 1 • Delay between two results in phase 2. ... in combined and data complexity. . hα : 42, β : 57i, hα : 1337, β : 1351i Results. 10/36.

(77) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. 11/36.

(78) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... 11/36.

(79) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is.... 11/36.

(80) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A. 11/36.

(81) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is.... 11/36.

(82) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. 11/36.

(83) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. • In terms of delay.... 11/36.

(84) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. • In terms of delay... • Combined complexity is.... 11/36.

(85) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. • In terms of delay... • Combined complexity is... polynomial: check if A accepts T. 11/36.

(86) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. • In terms of delay... • Combined complexity is... polynomial: check if A accepts T • Data complexity is.... 11/36.

(87) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. • In terms of delay... • Combined complexity is... polynomial: check if A accepts T • Data complexity is... polynomial in T: time to find the next match. 11/36.

(88) Complexity of enumeration algorithms. • Recall the inputs to our problem:. • The text T • The regexp with captures P → Assumption: there is a constant number k of capture variables. • What is the performance of the naive algorithm? • In terms of preprocessing.... • Combined complexity is... polynomial: convert P to an automaton A • Data complexity is... constant: nothing to do on T. • In terms of delay... • Combined complexity is... polynomial: check if A accepts T • Data complexity is... polynomial in T: time to find the next match. → Can we do better? 11/36.

(89) Results for enumerating pattern matches. • Existing work has shown the best possible bounds:. 12/36.

(90) Results for enumerating pattern matches. • Existing work has shown the best possible bounds: Theorem [Florenzano et al., 2018] We can find all matches of a regexp with captures P on text T with: • Preprocessing linear in T • Delay constant in T. 12/36.

(91) Results for enumerating pattern matches. • Existing work has shown the best possible bounds: Theorem [Florenzano et al., 2018] We can find all matches of a regexp with captures P on text T with: • Preprocessing linear in T (data) • Delay constant in T (data). → Problem: They only measure data complexity! The combined complexity is exponential with their approach!. 12/36.

(92) Results for enumerating pattern matches. • Existing work has shown the best possible bounds: Theorem [Florenzano et al., 2018] We can find all matches of a regexp with captures P on text T with: • Preprocessing linear in T (data) • Delay constant in T (data). → Problem: They only measure data complexity! The combined complexity is exponential with their approach!. • Our contribution is:. Theorem We can find all matches of a regexp with captures P on text T with: • Preprocessing linear in T (data) and polynomial in T and P (combined) • Delay constant in T (data) and polynomial in T and P (combined) 12/36.

(93) Automaton formalism. • We use automata that read letters and capture variables. 13/36.

(94) Automaton formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗. 13/36.

(95) Automaton formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. 13/36.

(96) Automaton formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text. 13/36.

(97) Automaton formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. 13/36.

(98) Automaton formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. 13/36.

(99) Automaton formalism. • We use automata that read letters and capture variables → Example: P := •∗ α a∗ β •∗ a • 1. α. 2. • β. 3. of the automaton A: • Semantics • Reads letters from the text • Guesses variables at positions in the text → Output: tuples hα : i, β : ji such that A has an accepting run reading α at position i and β at j. • Assumption: There is no run for which A reads. the same capture variable twice at the same position. • Challenge: Because of nondeterminism we can have many different runs of A producing the same tuple!. 13/36.

(100) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A. 14/36.

(101) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. 14/36.

(102) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ ,. •. 1 α. a. 2 β 3. •. 14/36.

(103) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a. a. a. b. a. •. 1 α. a. 2 β 3. •. 14/36.

(104) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. 14/36.

(105) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match. 14/36.

(106) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , match hα : 0, β : 3i a •. 1. 1. a. 2. β 3. 1. 3. 1. 2 β. •. a. α. α. α 2. a. α. β 3. 1. 2. 1 α. 2. 3. 1 α. 2. α 2 β. β. β. β 3. a. b. 3. 3. → Each path in the product DAG corresponds to a match. 14/36.

(107) Proof idea: product DAG Compute a product DAG of the text T and of the automaton A Example: Text T := aaaba and P := •∗ α a∗ β •∗ , a •. 1. 1. a. 2. β 3. 1 α. α 2. a. 3. 1 α. 2 β. •. a. α. β 3. 1. 2. 1 α. 2 β. 3. a. b. α 2. β 3. 1 α 2 β. β 3. 3. → Each path in the product DAG corresponds to a match → Challenge: Enumerate paths but avoid duplicate matches and do not waste time to ensure constant delay. 14/36.

(108) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue. α 2. 2 β. 3. 3. 15/36.

(109) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue. α 2. 2 β. 3. 3. 15/36.

(110) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. α 2. 2 β. 3. 3. 15/36.

(111) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i. β 3. 3. 15/36.

(112) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 15/36.

(113) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 15/36.

(114) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. 15/36.

(115) Proof idea: on-the-fly computation to avoid duplicates. i. i+1. 1. 1 α. 2. 2. • We are at a position i and set of states in blue • Partition tuples based on the set S of variables assigned at the current position. each S, consider the set of states • For where we can be at i + 1 when reading S at i • Example: S = {α}. β 3. 3. → We must have preprocessed the DAG to make sure that we can always finish the run. 15/36.

(116) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress. 16/36.

(117) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 16/36.

(118) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 16/36.

(119) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 16/36.

(120) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 16/36.

(121) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. 16/36.

(122) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. 16/36.

(123) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A: 16/36.

(124) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A: → Compute for each state the next position where we can reach some state that can assign a variable. 16/36.

(125) Proof idea: jump pointers to save time. • Issue: When we can’t assign variables, we do not make progress ··· ··· ··· α. α. α. α. α ···. • Idea: Directly jump to the reachable states. at the next position where we can assign a variable. • Challenge: Preprocessing in linear time in T and polynomial in A:. → Compute for each state the next position where we can reach some state that can assign a variable → Compute at each position i the transitive closure to all positions j such that j is the next position of some state at i (there are ≤ |A|)16/36.

(126) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. 17/36.

(127) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α. γ. β. • Idea: Explore a decision tree on the variables (built on the fly). 17/36.

(128) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α β. 0. α?. 1 β?. β? 0 γ?. γ 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly). 17/36.

(129) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α β. 0. α?. 1 β?. β? 0 γ?. γ 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly) tree node, find the reachable states which • Athaveeachall decision required variables (1) and no forbidden variables (0) 17/36.

(130) Proof idea: flashlight search. • Issue: Finding which variable sets we can assign at position i? i+1. i α. β. γ. α. α β. 0. α?. 1 β?. β? 0 γ?. γ 0. 0 γ?. 1 γ? 1 0. 1 0. 1 γ? 1 0. 1. • Idea: Explore a decision tree on the variables (built on the fly) tree node, find the reachable states which • Athaveeachall decision required variables (1) and no forbidden variables (0) → Assumption: we don’t see the same variable twice on a path 17/36.

(131) Extension: From Text to Trees.

(132) Pattern matching on trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. 18/36.

(133) Pattern matching on trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. an h2 header and. an image in the same section?. 18/36.

(134) Pattern matching on trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree: Is there. an h2 header and. an image in the same section?. • Results: 18/36.

(135) Pattern matching on trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results: 18/36.

(136) Pattern matching on trees. • The data T is no longer text but is now a tree: <body>1 <div>2. <section>3 <h2>4. <p>5 <img>6. <img>7. • The pattern P asks about the structure of the tree:. Is there α: an h2 header and β: an image in the same section?. • Results: hα : 4, β : 6i, hα : 4, β : 7i 18/36.

(137) Definitions and Results on Trees. • Tree patterns can be written as a tree automaton with captures. 19/36.

(138) Definitions and Results on Trees. • Tree patterns can be written as a tree automaton with captures • Like for text, we can enumerate the matches of tree automata.... 19/36.

(139) Definitions and Results on Trees. • Tree patterns can be written as a tree automaton with captures • Like for text, we can enumerate the matches of tree automata... Theorem [Bagan, 2006] We can find all matches on a tree T of a tree automaton A (with constantly many capture variables) with: • Preprocessing linear in T (data) • Delay constant in T (data). 19/36.

(140) Definitions and Results on Trees. • Tree patterns can be written as a tree automaton with captures • Like for text, we can enumerate the matches of tree automata... Theorem [Bagan, 2006] We can find all matches on a tree T of a tree automaton A (with constantly many capture variables) with: • Preprocessing linear in T (data) • Delay constant in T (data). • Again, this is only in data complexity!. 19/36.

(141) Definitions and Results on Trees. • Tree patterns can be written as a tree automaton with captures • Like for text, we can enumerate the matches of tree automata... Theorem [Bagan, 2006] We can find all matches on a tree T of a tree automaton A (with constantly many capture variables) with: • Preprocessing linear in T (data) • Delay constant in T (data). • Again, this is only in data complexity! • We conjecture the following bounds for this task (ongoing work): Conjecture • Preprocessing linear in T (data) and polynomial in A and T (combined) • Delay constant in T (data) and polynomial in A and T (combined) 19/36.

(142) Proof idea for trees: structure Similar structure to the previous proof, but with a circuit:. Phase 1: Preprocessing Tree. Data structure. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 20/36.

(143) Proof idea for trees: structure Similar structure to the previous proof, but with a circuit:. • Preprocessing: Compute a circuit representation of the answers • Enumeration: Apply a generic algorithm on the circuit Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern 20/36.

(144) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). 21/36.

(145) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6”. 21/36.

(146) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons. 21/36.

(147) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i. 21/36.

(148) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×. α:4. ∪. β :6. β :7 21/36.

(149) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. ∪. β :6. β :7 21/36.

(150) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. α:4. :. → captures hα : 4i . α:4. • Union gate. ∪. ∪. :. → union of sets of tuples. β :6. β :7 21/36.

(151) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. :. ∪. → union of sets of tuples. β :6. β :7. • Product gate. ×. :. → relational product 21/36.

(152) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪. → union of sets of tuples.  hβ : 6i β :6. :. ∪. β :7. • Product gate. ×. :. → relational product 21/36.

(153) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:. • Variable gate. ×. :. α:4. → captures hα : 4i . α:4. • Union gate. ∪.  hβ : 6i β :6. . hβ : 7i. β :7. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 21/36.

(154) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: ×  α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 21/36.

(155) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates: × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 21/36.

(156) Proof idea for trees: set circuits A set circuit represents a set of answers to a pattern P(α, β). • Singleton α : 6 → “the variable α is mapped to node 6” • Tuple hα : 4, β : 6i: tuple of singletons  • The circuit captures a set of tuples, e.g., hα : 4, β : 6i, hα : 4, β : 7i Three kinds of set-valued gates:.  hα : 4, β : 6i hα : 4, β : 7i × . . hα : 4i α:4. hβ : 6i, hβ : 7i. β :6. . hβ : 7i. β :7. :. α:4. → captures hα : 4i . • Union gate. ∪.  hβ : 6i. • Variable gate. :. ∪. → union of sets of tuples. • Product gate. ×. :. → relational product 21/36.

(157) Proof idea for trees: set circuit construction Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem For any tree automaton A with capture variables α1 , . . . , αk , given a tree T, we can build in O(|T| × |A|) a set circuit capturing exactly the set of tuples {hα1 : n1 , . . . , αk : nk i in the output of A on T 22/36.

(158) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ}.

(159) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. 23/36.

(160) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. α:n. 23/36.

(161) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪. α:n ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪. 23/36.

(162) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ×. α:n. ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪. 23/36.

(163) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ×. α:n. ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪. 23/36.

(164) Proof idea for trees: set circuit construction (details). • Automaton: “Select all node pairs (α, β)” • States: {∅, α, β, αβ} n. ∅. α. β. αβ. ∪. ∪. ∪. ∪ ×. ×. α:n. ∅. α. β. αβ. ∅. α. β. αβ. ∪. ∪. ∪. ∪. ∪. ∪. ∪. ∪. 23/36.

(165) Proof idea for trees: enumeration on set circuits Phase 1: Preprocessing Tree. × α:4 β :6. ∪ β :7. Phase 2: Enumeration. . hα : 4, β : 6i, hα : 4, β : 7i. Results. ∃s section(s)∧ s α∧s β∧ h2(α) ∧ img(β). Pattern. Theorem Given a set circuit satisfying some conditions, we can enumerate all tuples that it captures with linear preprocessing and constant delay  E.g., for hα : 4, β : 6i, hα : 4, β : 7i : enumerate hα : 4, β : 6i then hα : 4, β : 7i 24/36.

(166) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit. 25/36.

(167) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit Base case: variable. α:n. :. 25/36.

(168) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit Base case: variable. α:n. : enumerate hα : ni and stop. 25/36.

(169) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit Base case: variable. α:n. : enumerate hα : ni and stop. ∪-gate ∪ g1. g2. Concatenation: enumerate T(g1 ) and then enumerate T(g2 ). 25/36.

(170) Proof idea for trees: general enumeration approach → Enumerate the set T(g) captured by each gate g → Do it by top-down induction on the circuit Base case: variable. g1. α:n. : enumerate hα : ni and stop. ∪-gate. ×-gate. ∪. × g2. Concatenation: enumerate T(g1 ) and then enumerate T(g2 ). g1. g2. Lexicographic product: for every t1 in T(g1 ): for every t2 in T(g2 ): output t1 + t2 25/36.

(171) Proof idea for trees: circuit conditions Enumeration relies on some conditions on the input circuit (d-DNNF):. ×. α:4. ∪. β :6. β :7. 26/36.

(172) Proof idea for trees: circuit conditions Enumeration relies on some conditions on the input circuit (d-DNNF):. •. ∪. are all deterministic:. For any two inputs g1 and g2 of a ∪-gate, the captured sets T(g1 ) and T(g2 ) are disjoint (they have no tuple in common) → Avoids duplicate tuples. ×. α:4. ∪. β :6. β :7. 26/36.

(173) Proof idea for trees: circuit conditions Enumeration relies on some conditions on the input circuit (d-DNNF):. •. ∪. are all deterministic:. For any two inputs g1 and g2 of a ∪-gate, the captured sets T(g1 ) and T(g2 ) are disjoint (they have no tuple in common) → Avoids duplicate tuples. •. ×. ×. α:4. ∪. are all decomposable:. For any two inputs g1 and g2 of a ×-gate, no variable has a path to both g1 and g2. β :6. β :7. → Avoids duplicate singletons. 26/36.

(174) Proof idea for trees: circuit conditions Enumeration relies on some conditions on the input circuit (d-DNNF):. •. ∪. are all deterministic:. For any two inputs g1 and g2 of a ∪-gate, the captured sets T(g1 ) and T(g2 ) are disjoint (they have no tuple in common) → Avoids duplicate tuples. •. ×. ×. α:4. ∪. are all decomposable:. For any two inputs g1 and g2 of a ×-gate, no variable has a path to both g1 and g2. β :6. β :7. → Avoids duplicate singletons. • Also an additional upwards-determinism condition 26/36.

(175) Proof idea for trees: circuit conditions Enumeration relies on some conditions on the input circuit (d-DNNF):. •. ∪. are all deterministic:. For any two inputs g1 and g2 of a ∪-gate, the captured sets T(g1 ) and T(g2 ) are disjoint (they have no tuple in common) → Avoids duplicate tuples. •. ×. ×. α:4. ∪. are all decomposable:. For any two inputs g1 and g2 of a ×-gate, no variable has a path to both g1 and g2. β :6. β :7. → Avoids duplicate singletons. • Also an additional upwards-determinism condition • Our circuit satisfies these thanks to automaton determinism. 26/36.

(176) Proof idea for trees: enumeration subtleties. ∪ β :6. × α:4. ∪. • We must not waste time in gates capturing ∅. 27/36.

(177) Proof idea for trees: enumeration subtleties. ∪ β :6. × α:4. ∪. • We must not waste time in gates capturing ∅ → Label them during the preprocessing. 27/36.

(178) Proof idea for trees: enumeration subtleties × ×. ∪ β :6. × α:4. ∪. × ×. × ×. α:4. • We must not waste time in gates capturing ∅ → Label them during the preprocessing. • We must not waste time because of gates capturing. . hi. 27/36.

(179) Proof idea for trees: enumeration subtleties × ×. ∪ β :6. × α:4. ∪. × ×. × ×. α:4. • We must not waste time in gates capturing ∅ → Label them during the preprocessing. • We must not waste time because of gates capturing. . hi. → Homogenization to set them aside. 27/36.

(180) Proof idea for trees: enumeration subtleties × ×. ∪ β :6. × α:4. ∪. ∪ ×. ×. ∪ ×. ×. ··· α:4. ∪ ∪. ···. ∪ ··· ···. • We must not waste time in gates capturing ∅ → Label them during the preprocessing. • We must not waste time because of gates capturing. . hi. → Homogenization to set them aside. • We must not waste time in hierarchies of ∪-gates 27/36.

(181) Proof idea for trees: enumeration subtleties × ×. ∪ β :6. × α:4. ∪. ∪ ×. ×. ∪ ×. ×. ··· α:4. ∪ ∪. ···. ∪ ··· ···. • We must not waste time in gates capturing ∅ → Label them during the preprocessing. • We must not waste time because of gates capturing. . hi. → Homogenization to set them aside. • We must not waste time in hierarchies of ∪-gates. → Precompute a reachability index (uses upwards-determinism) 27/36.

(182) Extension: Handling Updates.

(183) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing 28/36.

(184) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing 28/36.

(185) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing 28/36.

(186) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch 28/36.

(187) Updates. Phase 1: Preprocessing Data structure Tree T. • The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch → Can we do better? 28/36.

(188) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(T). O(1). [Kazana and Segoufin, 2013]. 29/36.

(189) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T). O(log2 T) O(log2 T). [Kazana and Segoufin, 2013]. 29/36.

(190) Known results on dynamic trees. All these results are on data complexity in T (for a fixed pattern): Work. Data. Preproc. Delay. Updates. [Bagan, 2006],. trees. O(T). O(1). O(T). [Losemann and Martens, 2014] trees. O(T) O(T). O(log2 T) O(log2 T) O(log T) O(log T). [Kazana and Segoufin, 2013] [Losemann and Martens, 2014] text. 29/36.

Références

Documents relatifs

• Extending the results from text to trees • Supporting updates on the input data • Understanding the connections with circuit classes • Enumerating results in a relevant

• The input data can be modified after the preprocessing • If this happen, we must rerun the preprocessing from scratch 18/22... Phase 1: Preprocessing Data structure

Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex

Contact Email and XMPP a3nm@a3nm.net Affiliation Associate professor of computer science (office C201-4) in the DIG team of Télécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex

Combine this with the inverted index you built yesterday for the Simple English dataset, so that queries over this dataset use a combination of tf-idf and PageRank. Combine this

There is no fixed list of assignment for this lab session, but focus on connecting the systems produced in the first four labs: How to use PageRank to improve the results of

There is no fixed list of assignment for this lab session, but focus on connecting the systems produced in the first three labs: How to use PageRank to improve the results of

We use the following scheme for class hierarchies: An abstract subclass of Robot redefines the start method in order to indicate how the function processURL should be called on