Contrˆ ole de version incertain dans l’´ edition collaborative ouverte de documents arborescents
M. Lamine BA, Talel Abdessalem & Pierre Senellart
BDA’13 – 22-25 October 2013 (Nantes, France)
http://dbweb.enst.fr/
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (I)
◮ Large-scale, open and collaborative editing platforms, e.g., wikipedia
◮ Unreliable contributors, Novice vs. Experts, etc.
◮ Malicious edits, Vandalism acts,Contradictions
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (II)
Version control is used in large-scale web collaborative editing platforms in order to integratecontributions from different sources and to support fixing errors with the possibility to query previousdata versions
◮ No notion of more relevant versions or contributions which will fit to user-preferences, but just the concept of last validrevisions
◮ Deterministicversion control models [Kerstin et al.(2009), Al Koc et al.(2011)] in the literature
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (III)
v0 (a0, c0) v1 (a1, c1) v2 (a2, c2) v4 (a1, c4)
v3 (a3, c3) derivation derivation
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (III)
v0 (a0, c0) v1 (a1, c1) v2 (a2, c2) v4 (a1, c4)
Q1:v2 |¬a1
Q2:v2 |¬c1
Q3:all vi|Pr(vi)6=0
v3 (a3, c3) derivation derivation
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (III)
v0 (a0, c0) v1 (a1, c1) v2 (a2, c2) v4 (a1, c4)
Q1:v2 |¬a1
Q2:v2 |¬c1
Q3:all vi|Pr(vi)6=0
v3 (a3, c3) derivation derivation
☛ Uncertain version control in Open Collaborative Contexts
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (IV)
d2) Person
Name
“Obama”
Origin
“Tanzania”
d4) Person
Name
“Obama
Origin
“Tanzania”
Function
“US. President”
d1) Person
Name
“Obama”
d3) Person
Name
“B. Obama”
Origin
“Kenya”
e2
e3
e4
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (IV)
d2) Person
Name
“Obama”
Origin
“Tanzania”
d4) Person
Name
“Obama
Origin
“Tanzania”
Function
“US. President”
d1) Person
Name
“Obama”
d3) Person
Name
“B. Obama”
Origin
“Kenya”
d5) Person
Name
“B. Obama”
Origin
“Kenya”
Function
“US. President”
F={e1,e3,e4}
e2
e3
e4
e4
Motivations Versioning on the Web is Uncertain
Versioning on the Web is Uncertain (IV)
d2) Person
Name
“Obama”
Origin
“Tanzania”
d4) Person
Name
“Obama
Origin
“Tanzania”
Function
“US. President”
d1) Person
Name
“Obama”
d3) Person
Name
“B. Obama”
Origin
“Kenya”
d5) Person
Name
“B. Obama”
Origin
“Kenya”
Function
“US. President”
F={e1,e3,e4}
e2
e3
e4
e4
c
P) Person
Name
e1
“Obama”
¬e3
“B. Obama”
e3
Origin e2∨e3
“Tanzania”
¬e3
“Kenya”
e3
Function e4
“US. President”
Mapping Probabilistic Encoding
Uncertain Tree-Structured Data Model
Plan
Uncertain Tree-Structured Data Model Uncertain Multi-Version XML Document Conclusion and Further work
Uncertain Tree-Structured Data Model
Uncertain Tree-Structured Data
Probabilistic XML [Kimelfeld & Senellart.(2013)]
c P) r
p1 e1∨e2
t1 p2
¬e2
t2
d1) r
p1
t1 p2
t2 v1= (e1∧ ¬e2)
d2) r
p1
t1 v2= (¬e1∧e2) v′
2= (e1∧e2)
d3) r
p2
t2
v3= (¬e1∧ ¬e2)
Pr(e1) = 0.2; Pr(e2) = 0.8 Pr(d1) = Pr(e1)×Pr(¬e2)
Pr(d2) = (Pr(¬e1)×Pr(e2)) + (Pr(e1)×Pr(e2)) Pr(d3) = Pr(¬e1)×Pr(¬e2)
Uncertain Tree-Structured Data Model
Uncertain Tree-Structured Data
Probabilistic XML [Kimelfeld & Senellart.(2013)]
c P) r
p1 e1∨e2
t1 p2
¬e2
t2
d1) r
p1
t1 p2
t2 v1= (e1∧ ¬e2)
d2) r
p1
t1 v2= (¬e1∧e2) v′
2= (e1∧e2)
d3) r
p2
t2
v3= (¬e1∧ ¬e2)
Pr(e1) = 0.2; Pr(e2) = 0.8 Pr(d1) = 0.04
Pr(d2) = 0.80 Pr(d3) = 0.16
Uncertain Tree-Structured Data Model
Uncertain Tree-Structured Data
Probabilistic XML [Kimelfeld & Senellart.(2013)]
c P) r
p1 e1∨e2
t1 p2
¬e2
t2
d1) r
p1
t1 p2
t2 v1= (e1∧ ¬e2)
d2) r
p1
t1 v2= (¬e1∧e2) v′
2= (e1∧e2)
d3) r
p2
t2
v3= (¬e1∧ ¬e2)
➤ Pc=<T,C(E),fie,Pr>
➤ [[Pc]] =<D,Pr>; Σ{Pr(d) |d ∈D}= 1
Uncertain Multi-Version XML Document
Plan
Uncertain Tree-Structured Data Model Uncertain Multi-Version XML Document
Uncertain Version Control Model Probabilistic XML Encoding
Updating Uncertain Multi-version XML documents Performance Analysis
Conclusion and Further work
Uncertain Multi-Version XML Document Uncertain Version Control Model
Uncertain Multi-Version XML Document
Uncertain Version Control Model
◮ Probability space (PSV) over Uncertain versions of XML documents
◮ Randomderivation graph (DG) over the document versions produced Intuition: states in DG are complex variablesei’s calledversion control events based on simpler variables b1. . .bm managing uncertainty in data
<G,Ω> defines a multi-version XML document with uncertain data
◮ G is DG over a set of versioning eventsV ∪e0 withV ={e1. . .en}
◮ Ω : 2V\{e0}→D a mapping computing the PSV according to sets of valid events
Uncertain Multi-Version XML Document Uncertain Version Control Model
Uncertain Multi-Version XML Document
Possible versions and Probabilities
Possible versions
◮ Version control events come with edit scripts updating content
◮ ∀i,∀F⊆2V\{ei}, the possible version Ω({ei} ∪F) = [Ω(F)]∆i Probability of possible versions
◮ Assume a prior probability distribution over simple variables b1. . .bm
◮ The probability of a given possible version Ω(F) is the probability of W
F⊆V,Ω(F)F
Uncertain Multi-Version XML Document Probabilistic XML Encoding
Uncertain Multi-Version XML Document
Probabilistic XML Encoding
◮ Compact representation of all possible versions (PSV) in Ω mapping Intuition: Represent PSV usingpropositional formulasof simple variables
b1. . .bm attached to nodes in a global treeT containing all provided data
A probabilistic XML encoding of <G,Ω>is a couple <G,Pc>with
◮ Pc=<T,C(V),fie,Pr>is a probabilistic XML document
Thm1: JPcKdefines the same probability distribution overD as Ω, i.e.,
<G,JPcK>=<G,Ω>
Uncertain Multi-Version XML Document Updating Uncertain Multi-version XML documents
Uncertain Multi-Version XML Document
Updating Uncertain Versions (I)
◮ Uncertaineditions in ∆ over nodes of a possible tree version
An update is an uncertain version control event defined based on a triple
<ei,ej,∆> (ei ∈G andej 6∈G) updOP(<ei,ej, ∆>)[<G,Ω>]
− G :=G ∪({ej}, {(ei, ej)})
− Extension of Ω to a Ω′ mapping with for each F ∈2(V\{e0})∪{ej} Ω′(F) = [Ω(F\ {ej})]∆ ifej ∈F and Ω′(F) = Ω(F) otherwise
Uncertain Multi-Version XML Document Updating Uncertain Multi-version XML documents
Uncertain Multi-Version XML Document
Updating Uncertain Versions (II)
updPrXML(<ei,ej,∆>)[<Pc,Ω>]
− G :=G ∪({ej}, {(ei, ej)})
− For all ins(x, i)in ∆
◮ Set(x,fie(x)∨(ej))ifx already inPc
◮ Insertx inPcandSet(x,(ej))otherwise
− For all del(x)in ∆
◮ Set(x,fie(x)∧ ¬(ej))
Thm2: updOP(<ei,ej,∆>)≡updPrXML(<ei,ej,∆>) Thm3: updPrXMLruns inO(1) whilePcgrows linearly according to |∆|
Uncertain Multi-Version XML Document Performance Analysis
Uncertain Multi-version XML documents
Performance Analysis (Metrics, datasets and baseline)
◮ Estimation of the commit time and checkout cost of the model
Baseline Systems
☞Versioning tools SubVersion and Git
− Use of their Java implementations based on the APIs SvnKit and JGit
Real Datasets
History of commits over two large file systems (shared tree-structured data)
☞ Linux kernel development
☞ Cassandra project
◮ Implementation of our system (PrXML) in Java
◮ Measures are obtained with all accesses in RAM Disk
Uncertain Multi-Version XML Document Performance Analysis
Evaluation of the model
Performance Analysis (Commit time)
0 50 100 150 200 250 300 101
102 103 104
Commit (Linux kernel)
Committime(ms)
Subversion Git PrXML
0 200 400 600
101 102 103 104
Commit (Cassandra project)
Committime(ms)
Subversion Git PrXML
Uncertain Multi-Version XML Document Performance Analysis
Evaluation of the model
Performance Analysis (Checkout time)
0 50 100 150 200 250 300 100
200 300 400
Revision (Linux kernel)
Checkouttime(ms)
Subversion Git PrXML
0 200 400 600
100 200 300 400
Revision (Cassandra project)
Checkouttime(ms)
Subversion Git PrXML
Conclusion and Further work
Plan
Uncertain Tree-Structured Data Model Uncertain Multi-Version XML Document Conclusion and Further work
Conclusion and Further work
Conclusion and Further work (i)
◮ Design of a probabilistic version control approach for uncertain tree-structured documents
◮ Both logical description and an efficient probabilistic compact XML encoding
◮ Set-up of the most used version control operation, i.e., update and with an efficient mapping algorithms
◮ Both theoretical and practical complexity analysis of the proposed model
◮ Extension of our model in [Ba et al.(DChanges, 2013)] with the merging operationover uncertain versions
Conclusion and Further work
Conclusion and Further work (ii)
◮ Support of more complex versioning operations such as copying, renaming etc.
◮ Study the impact of introducing some constraints over the order of nodes in uncertain version control
Conclusion and Further work
Merci!
Conclusion and Further work
References
Kerstin Altmanninger, Martina Seidl, and Manuel Wimmer,A survey on model versioning approaches, IJWIS5(2009).
M. Lamine Ba, Talel Abdessalem, and Pierre Senellart,Towards a version control model with uncertain data, PIKM, 2011.
,Merging uncertain multi-version XML documents, Proc. DChanges (Florence, Italy), sep 2013.
,Uncertain version control in open collaborative editing of tree-structured documents, Proc. DocEng (Florence, Italy), sep 2013.
Evgeny Kharlamov, Werner Nutt, and Pierre Senellart,Updating Probabilistic XML, Updates in XML, 2010.
Benny Kimelfeld and Pierre Senellart,Probabilistic XML: Models and complexity, Advances in Probabilistic Databases for Uncertain Information Management (Zongmin Ma and Li Yan, eds.), Springer-Verlag, 2013.
Al Koc and Abdullah Uz Tansel,A survey of version control systems, ICEME, 2011.