Contents lists available atScienceDirect
Theoretical Computer Science
www.elsevier.com/locate/tcs
Computing possible and certain answers over order-incomplete data
Antoine Amarilli
a, Mouhamadou Lamine Ba
b, Daniel Deutch
c, Pierre Senellart
a,
d,
e,
∗aLTCI,TélécomParisTech,UniversitéParis-Saclay,Paris,France bUniversitéAliouneDiopdeBambey,Bambey,Senegal
cBlavatnikSchoolofComputerScience,TelAvivUniversity,TelAviv,Israel dDIENS,ENS,CNRS,PSLUniversity,Paris,France
eInria,Paris,France
a rt i c l e i n f o a b s t ra c t
Articlehistory:
Received17January2018
Receivedinrevisedform16April2019 Accepted3May2019
Availableonline24May2019 Keywords:
Certainanswer Possibleanswer Partialorder Uncertaindata
Thispaper studiesthe complexityofqueryevaluationfordatabaseswhose relationsare partiallyordered;theproblemcommonlyariseswhencombiningortransformingordered data from multiple sources. We focus on queries in a useful fragment of SQL, namely positiverelationalalgebrawithaggregates,whosebagsemanticsweextendtothepartially orderedsetting.Oursemanticsleadstothestudyoftwomaincomputationalproblems:the possibilityandcertaintyofqueryanswers.Weshowthattheseproblemsarerespectively NP-complete and coNP-complete, but identify tractable cases depending on the query operatorsorinput partial orders. Wefurther introduceaduplicate eliminationoperator andstudyitseffectonthecomplexityresults.
©2019ElsevierB.V.Allrightsreserved.
1. Introduction
Many applications need to combine and transform ordered data frommultiple sources. Examples include sequences of readings from multiple sensors, or log entries fromdifferent applications or machines, that need to be combined to form acomplete pictureof events;rankings ofrestaurantsandhotels based onvarious criteria(relevance,preference, or customer ratings);and concurrent editsof shared documents, wherethe order ofcontributions made by differentusers needs to be merged. Evenifthe orderof itemsfromeach individual source isusually known, theorder ofitemsacross sourcesisoftenuncertain.Forinstance,evenwhensensorreadingsorlogentriesareprovidedwithtimestamps,thesemay be ill-synchronizedacross sensorsormachines;rankings ofhotelsandrestaurantsmaybe biasedby differentpreferences of differentusers; concurrent contributions to documents maybe ordered inmultiple reasonable ways. We saythat the resultinginformationisorder-incomplete.
This paper studies query evaluationover order-incomplete data in a relational setting [1]. We focus on the running example of restaurants andhotels froma travel website,ranked according to a proprietary function. An example query wouldaskfortheorderedlistofrestaurant–hotelpairssuchthattherestaurantandhotelareinthesamedistrict,orsuch thattherestaurantfeaturesaparticularcuisine,andmayfurtherapplyorder-dependentoperatorstotheresult,e.g.,limiting theoutputtothetop-ksuchpairs,oraggregatingarelevancescore.Toevaluatesuchqueries,theinitialorderonthehotels
*
Correspondingauthor.E-mailaddresses:antoine.amarilli@telecom-paristech.fr(A. Amarilli),mouhamadoulamine.ba@uadb.edu.sn(M.L. Ba),danielde@post.tau.ac.il(D. Deutch), pierre@senellart.com(P. Senellart).
https://doi.org/10.1016/j.tcs.2019.05.013
0304-3975/©2019ElsevierB.V.Allrightsreserved.
andrestaurantsmustbepreservedthroughtransformations.Furthermore,aswe donot knowhowtheproprietaryorderis defined, theresultoftransformationsmaybecome uncertain; hence,we needtorepresentall possibleresultsthat canbe obtaineddependingontheunderlyingorder.
Ourapproachistohandlethisuncertaintythroughtheclassicalnotionsofpossibleandcertainanswers.Wesaythatthere isa certainanswer tothe querywhenthereisonly onepossible orderonqueryresults,oronlyone accumulationresult, whichis obtainednomatter theorder ontheinput andinintermediate results.Inthis case,it isusefultocompute the certain answer, so that the usercan then browse through the ordered queryresults (as is typically done when thereis nouncertainty,usingconstructssuch asSQL’sORDER BY).Certainanswerscan ariseeveninnon-trivialcaseswherethe combinationofinputdataadmitsmanypossibleorders:consideruserqueriesthatselectonlyasmallinterestingsubsetof thedata(forwhichtheorderinghappenstobecertain),orashortsummaryobtainedthroughaccumulationoverlargedata.
Inmanyothercases,thedifferentordersoninputdataortheuncertaintycausedbythequerymayleadtoseveralpossible answers.Inthiscase,itisstillofinterest(andnon-trivial)toverifywhetherananswerispossible,e.g.,tocheckwhethera givenrankingofhotel–restaurantpairsisconsistentwithacombinationofotherrankings(thelatterdonethroughaquery).
Thus,westudytheproblemsofdecidingwhetheragivenansweriscertain,andwhetheritispossible.
Ourmaincontributionsmaybesummarizedasfollows.
ModelandProblemDefinition(Sections2,3). Ourworkfocusesonbagsemantics,whereatuplemayappearmultipletimes.
Notethatinthecontext of(partially)orderedrelations,thismeansthatmultiplecopiesofthesametuplemayappearin different“positions” intheorder. Forexample,ifwe integratemultiple rankingsof restaurants,then thesame restaurant appearsmultipletimesindifferentpositions.Wecapturethismodelbyanotionofpo-relations(partiallyordered)relations.
A po-relationis essentially a relationaccompanied with a partialorder over its tuples; a technicalsubtlety is that each tupleisassociatedwithanidentifier(presumablyinternalandautomaticallygenerated),sothatwehaveawayofreferring toeachtupleoccurrenceinthepartialorder(seefurtherdiscussionintheproblemdefinitionbelow).
Wethenintroduceaquerylanguageforpartiallyordereddata.Ourlanguagedesignisguidedbythegoalofsupporting SQLevaluationinpresenceofsuchdata,andassuchwefocusondefiningasemanticsforanimportantfragmentofSQL – namelypositiverelationalalgebrawithaggregates.Thesemantics is“faithful” toSQLinthesensethat,ifweignoreorder, thenwegetthestandardSQLsemantics.ThenotioncorrespondingtoaggregationinourcontextisLISP-likeaccumulation, whosesemanticsweextendtoaccountforpartialorders.
Weviewpartiallyorderedrelationsasaconciserepresentationofasetofpossibleworlds,namely,thelinearextensions ofthepartialorders overtheunderlyingtuplesoftherelation.Forexample,alinearextension isaranked listofrestaurant cuisines,whereacuisinemayappearmultipletimesinthelist.Notethatineachsuchlinearextension,thetuplesappear withouttheir respectiveinternalidentifierswhich,asmentionedearlier,wereonlypresentinthepo-relationasatechnical tool.
Ourdefinitionsleadtoapossibleworldssemanticsforqueryevaluation, andtotwo naturalproblems:whetheracan- didate answer – i.e., a ranked list of tuples (restaurants, cuisines, etc.) – is possible, i.e., is obtained for some possible world, andwhetheritis certain,i.e., isobtainedfor everypossible world. Hereagain, note that ina candidateanswer a tuplemayappearmultipletimes,andnaturallyitappearswithoutidentifiers(which,asmentionedabove,areinternaland are unknownto theuser).We formally definethesetwo problemsforoursettings,andthen embarkona studyoftheir complexity.
ComplexityintheGeneralCase(Section4). Wefirststudythepossibilityandcertaintyproblemswithoutanyrestrictionson theinputdatabase. Asusualindatamanagement,giventhatqueries aretypicallymuchsmallerthandatabases,westudy thedatacomplexity oftheproblems,i.e.,thecomplexitywhenthequeryisfixed.Forourgeneraldefinitionofpo-relations, weshowthatdecidingwhetherananswerispossibleisNP-complete,evenwithoutaccumulation,andevenforsomevery simplequeries andinput relations.In a particularcasewherewe assume noduplicates – i.e.,where tuplesare uniquely identified,whichmeansweareessentially backtothesetsemantics–possibilityofan answerisinPTIMEwithoutaccu- mulation,butisagainNP-completewithaccumulation.Asforcertainty,theproblemcanbedecidedinpolynomialtimein thecasewithnoaccumulation,butitiscoNP-completeforquerieswithaccumulation(evenifweassumenoduplicatesin theinput).Facedby thegeneralintractabilityofthepossibilityandcertaintyproblems,intherestofthepaperwesearch forrestrictedcasesforwhichtractabilityholds.
TractableCasesforPossibilityWithoutAccumulation(Section5). EventhoughpossibilityisNP-hardevenwithoutaccumulation, weidentifyrealisticcaseswhereitisinfacttractable.Inparticular,weshowthatiftheinputrelationsaretotallyordered thenpossibilityisinPTIMEforqueriesusingasubsetofouroperators(allexceptthedirectproduct).Assumingmoresevere restrictions onthequerylanguage,we furthershow tractabilitywhensome oftherelationsare (almost)orderedandthe restare(almost)unordered,asformalizedviaanewlyintroducednotionofia-width.
TractableCaseswithAccumulation(Section6). Withaccumulation,thecertaintyproblembecomesintractableaswell.Yetwe showthatifaccumulationiscapturedbyafinitecancellativemonoid(inparticular,ifitisperformedinafinitegroup),then certaintycanagainbedecidedinpolynomialtime.Further,werevisit thetractability resultsforpossibilityfromSection5 andshowthattheyextendtoquerieswithaccumulationundercertainrestrictionsontheaccumulationfunction.
LanguageExtensions(Section7). Wethenstudytwoextensionstoourlanguage,whicharethecounterpartsofcommonSQL operators. Thefirstisgroup-by,whichallowsustogrouptuplesforaccumulation(asisdone foraggregationinSQLwith GROUP BY);werevisitourcomplexityresultsinitspresence.Thesecondisduplicateelimination:keepingasinglerepresen- tativeofidenticaltuples,asinSQLwithSELECT DISTINCT.Inpresenceoforder,itischallengingtodesignasemantics forthisoperator,andwediscussbothsemanticandcomplexityissuesthatarisefromdifferentpossibledefinitions.
WecompareourmodelandresultstorelatedworkinSection8,andconcludeinSection9.
Thisarticleisan extendedversion oftheconferencepaper [2].Incontrastwiththeconferencepaper [2],allproofsare includedhere.WealsodiscoveredabugintheproofofTheorem 22ofthatpaper [2],thatalsoimpactsTheorems 19and 30 of [2].Consequently,theseresultsareomittedinthepresentpaper.
2. Datamodelandquerylanguage
WedenotebyN thesetofnonnegativenaturalnumbersandbyN>0 thesetofpositive naturalnumbers,i.e.,N>0:=
N\ {0}. We fix a countable set of values D that includes N andinfinitely many values not in N. A tuple t overD of arity a
(
t)
isan element ofDa(t),denoted v1, . . . ,
va(t): for1≤i≤a(
t)
,we write t.
i to referto vi.The simplestnotion ofordered relationsarethen listrelations[3,4]:a listrelationofarityn∈N isanordered listoftuplesoverDofarityn (wherethesametuplemayappearmultipletimes).Listrelationsimposeasingleorderovertuples,butwhenonecombines (e.g.,unions)them,theremaybemultipleplausiblewaystoordertheresults.Wethusintroducepartiallyorderedrelations(po-relations).Apo-relation
=
(
ID,
T, <)
ofarityn∈N consistsofafinite set ofidentifiers ID (chosen fromsome infinitesetclosed underthe Cartesianproduct, e.g., wecan usetuples ofnatural numbers), astrictpartialorder<
onID,anda(generallynon-injective) mappingT fromID toDn.Thedomainofisthe subset ofvaluesofDthatoccur intheimage ofT.Theactualidentifiersin IDdonotmatter,butweneedthemto refer to occurrences of thesame tuplevalue. Hence, we always consider po-relationsuptoisomorphism,where
(
ID,
T, <)
and(
ID,
T, <
)
areisomorphiciffthereisabijectionϕ
:ID→IDsuchthat T( ϕ (
id))
=T(
id)
forallid∈ID,andϕ (
id1)<
ϕ (
id2)
iffid1<
id2forallid1,
id2∈ID.A specialcaseofpo-relations areunorderedpo-relations(or bagrelations), where
<
is empty:we denotethem(
ID,
T)
. Anotherspecialcaseisthatoftotallyorderedpo-relations,where<
isatotalorder.The pointofpo-relationsistorepresentsetsoflistrelations.Formally,alinearextension
<
of<
isatotalorderonID such that<
⊆<
,i.e.,foreach x<
y wehavex<
y.Thepossibleworlds pw()
ofarethendefinedasfollows:foreach linearextension
<
of<
,writingIDasid1<
· · ·<
id|ID|,thelistrelation(
T(
id1), . . . ,
T(
id|I D|))
isinpw()
.AsT isgenerally not injective,twodifferentlinearextensionsmayyield thesamelistrelation.Notethat eachsuch linearextension “strips away” theidentifiersandincludesonlythetuples.Forinstance,ifisunordered,thenpw
()
consistsofallpermutations ofthetuplesof;andif
istotallyorderedthenpw
()
containsexactlyonepossibleworld.Po-relationscanthusmodeluncertaintyovertheorderoftuples.However,note thattheycannotmodeluncertaintyon tuplevalues.Specifically,letusdefinetheunderlyingbagrelationofapo-relation
=
(
ID,
T, <)
as(
ID,
T)
.Unlikeorder,this underlyingbagrelationisalwayscertain.Weextendsomeclassicalnotionsfrompartialordertheorytopo-relations.
Letting
=
(
ID,
T, <)
be a po-relation, an orderideal ofis asubset S⊆ID such that, forall x
,
y∈ID,if x<
y and y∈S then x∈S.An antichain[5] ofis a set A⊆IDof pairwise incomparable tupleidentifiers.The width of
is the size ofits largestantichain,andthewidthofa po-databaseis themaximalwidthofits po-relations.Inparticular, totally orderedpo-relationshavewidth 1,andunorderedpo-relationshaveawidthequaltotheirnumberoftuples;thewidthof apo-relationcanbecomputedinpolynomialtime [6].
Achainpartitionof
isapartitionID=
1 · · ·
n suchthattherestrictionof
<
toeachi isatotalorder:wecall each
i achain.Notethat
<
mayincludecomparabilityrelationsacrosschains,i.e.,relatingelementsini toelementsin
j fori=j.Thewidthofthechainpartitionisn.ByDilworth’stheorem [7,6],thewidth wof
isthesmallestpossible widthofa chainpartitionof
;furthermore,given
,wecan computeinpolynomial timeboth itswidth w andachain partitionof
ofwidthw.
2.1. PosRA:Querieswithoutaccumulation
We now define a bag semantics for positiverelationalalgebra operators, to manipulate po-relations withqueries. The positiverelationalalgebra,written PosRA,isastandardquerylanguageforrelationaldata [1].Wewillextend PosRAwithac- cumulationinSection2.3,andaddfurtheroperationsinSection7.Each PosRAoperatorappliestopo-relationsandcomputes anewpo-relation;wepresenttheminturn.
The selectionoperatorrestrictstherelationtoasubsetofits tuples,andtheorderistherestrictionoftheinputorder.
The tuplepredicatesallowed inselectionsare Boolean combinationsofequalities andinequalities, whichinvolveconstant valuesinDandtupleattributeswrittenas
.
i fori∈N>0.Forinstance,theselectionσ
.1=“a”∧.2=.3selectstupleswhosefirst attributeisequaltotheconstant “a”andwhosesecondattributeisdifferentfromtheirthirdattribute.selection: For any po-relation
=
(
ID,
T, <)
and tuple predicateψ
, we define the selectionσ
ψ()
··=(
ID,
T|ID, <
|ID)
, whereID··= {id∈ID|ψ(
T(
id))
holds}.Restname District Gagnaire 8 TourArgent 5 (a)Restauranttable
Hotelname District Mercure 5 Balzac 8 Mercure 12
(b)Hoteltable
Hotelname District Balzac 8 Mercure 5 Mercure 12
(c)Hotel2table Fig. 1.Running example: Paris restaurants and hotels.
Theprojectionoperatorchangestuplevaluesintheusualway,butkeepstheoriginaltupleorderingintheresult,andretains allcopiesofduplicatetuples(followingourbagsemantics).
projection: Forapo-relation
=
(
ID,
T, <)
andattributes A1, . . . ,
An,wedefinetheprojectionA1,...,An
()
··=(
ID,
T, <)
, whereTmapseachid∈IDtoA1,...,An
(
T(
id))
:= T(
id).
A1, . . . ,
T(
id).
An.As forunion,we impose theminimal orderconstraintsthat are compatiblewiththoseofthe inputs.We usethe parallel composition[8] oftwopartialorders
<
and<
ondisjointsetsIDandID,i.e.,thepartialorder<
··=(<
∪<
)
onID∪ID. Notethat<
isthesameorderas<
onIDandas<
onID,andthatallelementsfromIDareincomparabletoallelements fromID.union: Let
=
(
ID,
T, <)
and=
(
ID,
T, <
)
betwopo-relationsofthesamearity.WeassumethattheidentifiersofhavebeenrenamedifnecessarytoensurethatIDandIDaredisjoint.Wethendefine
∪
··=
(
ID∪ID,
T, (<
∪<
))
,whereT mapsid∈IDtoT(
id)
andid∈IDto T(
id)
.Theunionresult
∪
doesnotdependon howwerenamed
,i.e., itisuniqueuptoisomorphism.Ourdefinitionalso impliesthat
∪
isdifferentfrom
,asper bagsemantics.In particular,when
and
haveonlyone possibleworld,
∪
usuallydoesnot.
We next introduce two possible product operators. First, as in [9], the direct product
<
DIR··=(<
×DIR<
)
of two partial orders<
and<
on sets ID and ID is defined by(
id1,
id1) <
DIR(
id2,
id2)
iff id1<
id2 and id1<
id2 for each(
id1,
id1), (
id2,
id2)
∈ID×ID. We define the directproduct operator over po-relations accordingly: two identifiers in the productarecomparableonlyifbothcomponentsofbothidentifierscompareinthesameway.direct product: Foranypo-relations
=
(
ID,
T, <)
and=
(
ID,
T, <
)
,rememberingthatthesetofpossibleidentifiersis closedunderproduct,welet×DIR
··=
(
ID×ID,
T, <
×DIR<
)
,whereT mapseach(
id,
id)
∈ID×IDtothe concatenationT(
id),
T(
id)
.Again,thedirectproductresultoftenhasmultiplepossibleworldsevenwheninputsdonot.
Thesecondproductoperatorusesthelexicographicproduct(orordinalproduct[9])
<
LEX··=(<
×LEX<
)
oftwopartialor- ders<
and<
,definedby(
id1,
id1) <
LEX(
id2,
id2)
iffeitherid1<
id2,orid1=id2andid1<
id2,forall(
id1,
id1), (
id2,
id2)
∈ ID×ID.lexicographic product: Forany po-relations
=
(
ID,
T, <)
and=
(
ID,
T, <
)
,we define×LEX
as
(
ID×ID,
T, <
×LEX
<
)
withT definedlikeforthedirectproduct.Last,wedefinetheconstantexpressionsthatweallow.
constant expressions: •foranytuplet,thesingletonpo-relation[t]hasonlyonetuplewithvaluet;
•foranyn∈N,thepo-relation[≤n]hasarity 1 andhaspw
(
[≤n])
= {(
1, . . . ,
n)
}.Wehavenowdefinedasemanticsonpo-relationsforeach PosRAoperator.WedefineaPosRAqueryintheexpectedway, asaquerybuiltfromtheseoperators andfromrelationnames.Callingschema asetS ofrelationnamesandarities,with anattributenameforeachpositionofeachrelation,wedefineapo-database Dashavingapo-relationofthecorrectarity foreachrelationnameR inS.Forapo-database D anda PosRAqueryQ,wedenoteby|Q|thenumberofsymbolsofQ, andwedenoteby Q
(
D)
thepo-relationobtainedbyevaluating Q over D.Example1. Thepo-database D inFig.1containsinformationaboutrestaurantsandhotelsinParis:each po-relationhasa totalorder(fromtoptobottom)accordingtocustomerratingsfromagiventravelwebsite.Forbrevity,wedonotrepresent identifiersinpo-relations,andwealsodeviateslightlyfromourformalismbyadoptingthenamedperspectiveinexamples, i.e.,givingnamestoattributes.
Let Q ··=Restaurant×DIR
( σ
district=“12”(
Hotel))
.Its result Q(
D)
hastwo possibleworlds,wherewe abbreviatehoteland restaurantnames:Fig. 2.Example1.
•
(
G,
8,
M,
5,
G,
8,
B,
8,
TA,
5,
M,
5,
TA,
5,
B,
8)
;•
(
G,
8,
M,
5,TA,
5,
M,
5,G,
8,
B,
8,TA,
5,
B,
8).In asense, theselistrelationsofhotel–restaurant pairs areconsistent withthe orderin D: wedo notknowhow toorder two pairs, exceptwhen both thehotel andrestaurantcomparein thesame way.The po-relation Q
(
D)
is representedin Fig.2asaHassediagram,againwritingtuplevaluesinsteadoftupleidentifiersforbrevity:notethat,followingtheusual conventionforHassediagramsinpartialordertheory,theorderinFig.2isdrawninthereversedirectionofthatofFig.1, i.e.,frombottomtotop.ConsidernowthequeryQ··=
( σ
Restaurant.district=Hotel.district(
Q))
,whereprojectsoutHotel
.
district.Thepossibleworlds of Q(
D)
are(
G,
B,
8,
TA,
M,
5)
and(
TA,
M,
5,
G,
B,
8)
,intuitively reflecting two different opinionson the order of restaurant–hotel pairs in the same district. Defining Q similarly to Q but replacing ×DIR by ×LEX in Q, we have pw(
Q(
D))
=(
G,
B,
8,TA,
M,
5).It is easy to show that we can efficiently evaluate PosRAqueries on po-relations, which we will use throughout the sequel.
Proposition2.ForanyfixedPosRAqueryQ ,givenapo-databaseD,wecanconstructthepo-relation Q
(
D)
intimeO|D||Q| ,i.e., inpolynomialtimeinthesizeof D.
Proof. Weshowtheclaimbyasimpleinductiononthequery Q,notingthat|Q|isatleastk+1,wherekisthenumber ofoperatorsinQ.
•If Q isarelationnameR,then Q
(
D)
isobtainedinlineartime.•If Q isaconstantexpression,then Q
(
D)
isobtainedinconstanttime.•If Q=
σ
ψ(
Q)
orQ =k1...kp
(
Q)
,then Q(
D)
isobtainedintimelinearin|Q(
D)|
,andweconcludebytheinduction hypothesis.•If Q =Q1∪Q2 or Q=Q1×LEXQ2 orQ =Q1×DIRQ2,then Q
(
D)
isobtainedintime linearin|Q1(
D)
|× |Q2(
D)
|, andweconcludeagainbytheinductionhypothesis. 2NotethatProposition 2computestheresultofaqueryasapo-relation
.However,wecannot efficientlycomputethe completesetpw
()
ofpossibleworldsof,evenifallrelationsoftheinputpo-databasearetotallyordered.Forinstance, considerthequeryQ :=R∪S,andapo-database Dinterpreting RandSastotallyorderedrelationswithdisjointdomains andwithntupleseach. Itiseasy toseethat thequeryresult Q
(
D)
has2nn
possibleworlds, whichisexponential in D.
Thisintractabilityisthereasonwhywillwestudythepossibilityandcertaintyproblemsinthesequel.
2.2. IncomparabilityofPosRAOperators
Beforeextendingourquerylanguagewithaccumulation,weaddressthenaturalquestionofwhetheranyofouroperators issubsumedbytheothers.Weshowthatthisisnotthecase.
Theorem3.NoPosRAoperatorcanbeexpressedthroughacombinationoftheothers.
We proveTheorem3intherestofthissubsection.Weconsidereach operatorinturn,andshowthatitcannot beex- pressedthroughacombinationoftheothers.Wefirstconsiderconstantexpressionsandshowdifferencesinexpressiveness evenwhensettingtheinputpo-databasetobeempty.
•For [t],consider the query [0].The value 0 is not inthe database, andcannot be produced by the [≤n] constant expression,andsothisqueryhasnoequivalentthatdoesnotusethe[t]constantexpression.
•For[≤n],observethat[≤2]isapo-relationwithanon-emptyorder,whileanyqueryinvolvingtheotheroperatorswill haveemptyorder(noneofourunaryandbinaryoperators turnsunorderedpo-relationsintoanorderedone,andthe [t]constantexpressionproducesanunorderedpo-relation).
Movingontounaryandbinaryoperators,alloperatorsbutproductsareeasilyshowntobenon-expressible.
selection. For anyconstant a not in N, consider the po-database Da consistingof a single unorderedpo-relation with nameRformedoftwounarytuples0anda.LetQ=
σ
.1=“0”(
R)
.Then,Q(
Da)
isthepo-relationconsistingonly ofthetuplea.No PosRA querywithoutselection hasthesamesemantics,asnoother operator thanselection cancreateapo-relationcontainingtheconstanta foranyinput Da,unlessitalsocontainstheconstant 0.projection.
istheonlyoperatorthatcandecreasethearityofaninputpo-relation.
union. [0]∪ [1](overtheemptypo-database)cannotbesimulatedbyanycombinationofoperators,ascanbesimply shownby induction: no other operatorwill produce a po-relationwhich hasthe two elements 0 and 1 inthe sameattribute.
Observethatproductoperatorsaretheonlyonesthatcanincreasearity,sotakentogethertheyarenon-redundantwith theother operators.Hence, itonlyremains toshow that eachof×DIR and×LEX isnot redundant.Todothis, letuscall PosRALEX thefragmentof PosRAthatdisallowsthe×DIR operator,butallowsallotheroperators(including×LEX).Wealso define PosRADIR thatdisallows×LEXbutnot×DIR.
We willfirst show that the×DIR product isnot redundant, which we willdousing thenotion ofwidth. Specifically, considerthequery QDIR=R×DIRR andaninputpo-database Dn where Rismappedto[≤n](aninputrelationofwidth 1) foran arbitrary Rn. Itis then clearthat thepo-relation Q
(
Dn)
has widthn.We will showthat this querycannot be capturedin PosRALEX,because PosRALEX queriescanonlymakewidthincreaseinawaythat dependsonthewidthofthe inputpo-relations,butnotontheirsize.Lemma4.Letk≥2andQ beaPosRALEXquery.Foranypo-database D ofwidth≤k,thepo-relationQ
(
D)
haswidth≤k|Q|+1. Proof. We first show by induction on the PosRALEX query Q that the width of the query output can be bounded as a function ofk. Forthe basecase, the input po-relations have width ≤k, andall constant po-relations (singletons and constantchains)havewidth 1.Letusshowtheinductionstep.• Giventwopo-relations
1and
2withwidthrespectivelyk1andk2,theirunion
:=
1∪
2 clearlyhaswidthatmost k1+k2.Indeed,anyantichainin
mustbetheunionofanantichainof
1 andofanantichainof
2.
• Givenapo-relation
1withwidthk1,applyingaprojectionorselectionto
1 cannotincreasethewidth.
• Giventwopo-relations
1=
(
ID1,
T1, <
1)
and2=
(
ID2,
T2, <
2)
withwidthrespectivelyk1 andk2,theirproduct··=
1×LEX
2haswidthatmostk1·k2.Toshowthis,write
=
(
ID,
T, <)
,consideranyset A⊆IDofcardinality>
k1·k2, andletusarguethat A isnot anantichain.Bythedefinitionof×LEX,wecanseeeach identifierof A asanelement ofID1×ID2.Now,oneofthefollowingmusthold.1. LettingS1bethesetofidentifiersu∈ID1 forwhichwehave
(
u,
v)
∈A forsome v∈ID2,itisthecasethat|S1|>
k1. 2. Thereexistsusuchthat,lettingS2(
u)
··= {v|(
u,
v)
∈A},wehave|S2(
u)
|>
k2.Informally,whenputting
>
k1·k2 valuesinbuckets(thevalue oftheirfirstcomponent),either>
k1 differentbuckets areused,orthereisabucketcontaining>
k2 elements.Inthefirstcase,asS1⊆ID1,as|S1|
>
k1,andas1 haswidthk1,weknowthat S1 cannotbeanantichain,soitmust containtwocomparableelementsu
<
1u.Hence,consideringanyv,
v∈ID2suchthat w=(
u,
v)
andw=(
u,
v)
are in A,wehavebythedefinitionof×LEX that w<
w,sothat A isnotanantichain.Inthesecondcase,asS2(
u)
⊆ID2, as|S2(
u)| >
k2,andas2 haswidthk2,weknowthatS2
(
u)
cannotbeanantichain,soitmustcontaintwocomparable elements v<
2v.Hence,considering w=(
u,
v)
andw=(
u,
v)
whicharein A,wehave w<
w,andagain Aisnot anantichain.Hence,nosetofcardinality>
k1·k2 ofisanantichain,so
haswidth≤k1·k2asclaimed.
Second,weexplainwhytheboundonthewidthofthequeryoutputcanbechosenasinthelemmastatement.Specifi- cally,lettingobethenumberofproductoperatorsin Q plusthenumberofunionoperators,weshowthatwecanbound thewidthof Q
(
D)
by ko+1.Indeed,theoutputofqueries withoutproductorunionoperators havewidthatmostk (be- causek≥1).Further,asprojectionsandselectionsdonotchangethewidth,theonlyoperatorstoconsiderareproductand union.Fortheunionoperator,if Q1 haso1 suchoperatorsand Q2 haso2 suchoperators,boundinginductivelythewidth of Q1(
D)
byko1+1 and Q2(
D)
byko2+1,forQ :=Q1∪Q2,thenumberofunionandproductoperatorsiso1+o2+1,and thenewboundisko1+1+ko2+1,whichis≤ko1+1+o2+1 becausek≥2,i.e., itis≤k(o1+o2+1)+1.Forthe×LEX operator,we proceedinthesamewayanddirectlyobtainthek(o1+o2+1)+1 bound.Hence,wecanindeedboundthewidthof Q(
D)
by k|Q|+1 asgiveninthestatement,whichconcludestheproof. 2We haveshownLemma4: PosRALEX queriescan onlymake thewidthincrease asafunction ofthequeryandofthe width of the input po-relations. Hence, the query QDIR cannot be captured in PosRALEX, and the ×DIR product is not redundant.
Conversely, let us show that the ×LEX product is not redundant. To do this, we introduce the concatenation of po- relations.
Definition5.Theconcatenation
∪CAT
oftwopo-relations
and
istheseriescompositionoftheirtwopartialorders.
Notethat pw
(
∪CAT)
= {L∪CATL|L∈pw(),
L∈pw(
)}
,whereL∪CATListheconcatenationoftwolistrelationsin theusualsense.Weshowthatconcatenationcanbecapturedin PosRALEX.
Lemma6.Foranyarityn∈NanddistinguishedrelationnamesR andR,thereisaqueryQnwithout×DIRoperatorsuchthat,for anytwopo-relations
and
ofarity n,lettingD bethedatabasemappingR to
andRto
,thequeryresultQn
(
D)
is∪CAT
.
Proof. Foranyn∈N andnamesR andR,considerthefollowingquery:
Qn
(
R,
R) ··=
3...n+2σ
.1=.2[≤
2] ×
LEX(( [
1] ×
LEXR) ∪ ( [
2] ×
LEXR))
Itiseasilyverifiedthat Qnsatisfiestheclaimedproperty. 2
Bycontrast,weshowthatconcatenationcannotbecapturedin PosRADIR.
Lemma7.Foranyarityn∈N>0anddistinguishedrelationnames R and R,thereisnoPosRADIRquery Qn suchthat,forany po-relations
and
ofarity n,lettingD bethepo-databasethatmapsR to
andRto
,thequeryresultQn
(
D)
is∪CAT
.
ToproveLemma7,wefirstintroducethefollowingconcept.
Definition8. Let v∈D.We call a po-relation
=
(
ID,
T, <)
v-impartial if, for any two identifiers id1 andid2 and1≤ i≤a()
suchthatexactlyone ofT(
id1).
i, T(
id2).
i is v,thefollowingholds:id1 andid2 areincomparable,namely,neither id1<
id2norid2<
id1hold.Lemma9.Letv∈D\Nbeavalue.ForanyPosRADIRqueryQ ,foranypo-databaseD ofv-impartialpo-relations,thepo-relation Q
(
D)
isv-impartial.Proof. Let D be a po-database of v-impartialpo-relations.We show by inductionon the query Q that v-impartiality is preserved.Thebasecasesarethefollowing.
•Forthebaserelations,theclaimisvacuousbyourhypothesison D.
•Forthesingletonconstantexpressions,theclaimistrivialastheycontainlessthantwotuples.
•Forthe[≤i]constantexpressions,theclaimisimmediateasv∈
/
N. Wenowprovetheinductionstep.•Forselection, theclaimisshownby noticingthat,forany v-impartialpo-relation
,letting
bethe imageof
by anyselection,
isitself v-impartial. Indeed,consideringtwo identifiersid1 andid2 in
and1≤i≤a
()
satisfying thecondition,asisv-impartial,id1andid2 areincomparablein
,sotheyarealsoincomparablein
.
•Forprojection,theclaimisalsoimmediateasthepropertytoproveismaintainedwhenreordering,copyingordeleting attributes. Indeed,considering againtwo identifiers id1 andid2 of
and1≤i≤a
(
)
,therespectivepreimages id1 andid2 inofid1 andid2 satisfy thesamecondition forsome different1≤i≤a
()
whichistheattributeinthat wasprojectedtogiveattribute iin
,soweagainusetheimpartialityoftheoriginalpo-relationtoconclude.
•Forunion,letting
··=
∪
,andwriting
=
(
ID,
T, <
)
,assumebycontradictiontheexistenceoftwoidentifiers id1,
id2∈and1≤i≤a
(
)
such that exactly one of T(
id1).
i and T(
id2).
i is v but (without loss ofgenerality) id1<
id2 in.Itiseasilyseenthat,asid1andid2arenotincomparable,theymustcomefromthesamerelation;but then,asthatrelationwas v-impartial,wehaveacontradiction.
•For ×DIR, consider
··=
×DIR
where
and
are v-impartial, and write
=
(
ID,
T, <
)
as above. As- sume that there are two identifiers id1 and id2 of ID and 1≤i≤a(
)
that violate the v-impartiality of. Let
(
id1,
id1), (
id2,
id2)
∈ID×ID be the pairs of identifiers used to create id1 and id2. We distinguish on whether 1≤i≤a()
ora() <
i≤a()
+a(
)
.In thefirstcase, we deducethat exactlyone of T(
id1).
i and T(
id2).
i is v,so that inparticularid1=id2.Thus,bythedefinitionoftheorderin×DIR,itiseasilyseen that,becauseid1 andid2 are comparablein,id1 andid2 mustcompareinthesamewayin
,contradicting thev-impartialityof
.Thesecond caseissymmetric. 2
WenowconcludewiththeproofofLemma7.
Proof. Let usassume by wayof contradiction that there is n∈N>0 and a PosRADIR query Qn that captures ∪CAT. Let v=v be two distinct valuesin D\N, andconsiderthe singletonpo-relation
containing one identifierof value t and
containing oneidentifier ofvalue t,wheret (resp.t) are tuples ofarityn containingn timesthe value v (resp. v).
Considerthepo-database D mapping Rto
andRto
.Write
··=Qn
(
D)
.Byourassumption,as=
(
ID,
T, <
)
is∪CAT
,itmustcontainanidentifierid∈ID suchthat T
(
id)
=t andanidentifierid∈ID suchthat T(
id)
=t.Now, asand
are(vacuously) v-impartial,Lemma9impliesthat
isv-impartial.Hence,asn
>
0,takingi=1,ast=tand exactlyoneoft.
1 andt.
1 isv,theidentifiersidandidareincomparablein<
,sothereisapossibleworldofwhere idprecedesid.Thiscontradictsthefactthat,asweshouldhave
=
∪CAT
,thepo-relation
shouldhaveexactlyone possibleworld,namely,
(
t,
t)
. 2Thisestablishesthatthe×LEX operatorcannotbe expressedusingtheothers,andshowsthatnoneofouroperatorsis redundant,whichconcludestheproofofTheorem3.
2.3. PosRAacc:Querieswithaccumulation
We now enrich PosRAwith order-aware accumulation as the outermost operation, inspired by rightaccumulation and iterationinlistprogramming, andaggregationinrelationaldatabases.Recallthata monoid
(
M,
⊕, ε )
consistsofaset M (notnecessarilyfinite),anassociativeoperation ⊕:M×M→M,andanelementε
∈Mwhichisneutralfor⊕,i.e.,for allm∈M,wehaveε
⊕m=m⊕ε
=m.Wewilluseamonoidasthestructureinwhichweperformaccumulation.Wecan nowdefineaccumulationonagivenlistrelation.Definition10. For k∈N, let h:Dk×N>0→M be a function called an arity-k accumulationmap, which maps pairs consistingofank-tupleandapositiontoavalueinthemonoidM.Wecallaccumh,⊕ anarity-k accumulationoperator;its resultaccumh,⊕
(
L)
onanarity-klistrelationL=(
t1, . . . ,
tn)
ish(
t1,
1)
⊕· · ·⊕h(
tn,
n)
,anditisε
ifLisempty.Forcomplexity purposes,wealwaysrequireaccumulationoperatorstobePTIME-evaluable,i.e.,wecanevaluatetheaccumulationmapand themonoidoperatorintimepolynomialintheirinputs,andwecancomputeaccumh,⊕(
L)
inpolynomialtimeonanyinput listrelationL.Intuitively,theaccumulationoperatormapseachoccurrenceofa tupleinthelistwithhtoM,whereaccumulationis performedwith⊕.(RememberthattheinputLtotheaccumulationisalistrelation,soeachtupleoccurrencehasaspecific position.)ThemaphmayuseitssecondargumenttotakeintoaccounttheabsolutepositionoftuplesinL.Inwhatfollows, weomitthearityofaccumulationwhenclearfromcontext.
Wewilloftenlookatspecialcasesforaccumulation,especiallywhenderivingcomplexity results.Herearetherestric- tionsthatwewillconsider.
Definition11.Wesaythatanaccumulationoperatorisposition-invariantifitsaccumulationmapignores thesecondinput, sothateffectivelyitsonlyinputisthetupleitself
Wesaythatanaccumulationoperatorisfiniteifitsmonoid
(M,
⊕,ε )
isfinite.For anymonoid
(
M,
⊕, ε )
, we calla∈M cancellable if, forall b,
c∈M, we havethat a⊕b=a⊕c impliesb=c, andb⊕a=c⊕a impliesb=c.We callM acancellativemonoid [10] ifall its elements are cancellable.We saythat an accumulationoperatoriscancellativeifitsmonoidis.Notethat,inparticular, agroup isalwayscancellative,butthereare somecancellativemonoids whicharenot groups, e.g.,themonoidofconcatenation.
Wecan nowdefine thelanguage PosRAaccthat containsall queries oftheform Q =accumh,⊕
(
Q)
,whereaccumh,⊕ is anaccumulationoperatorand Qisa PosRA query.ThepossibleresultsofQ onapo-database D,denoted Q(
D)
,istheset ofresultsobtainedbyapplyingaccumulationtoeachpossibleworldofQ(
D)
,namely:Definition12.Forapo-relation
,wedefineaccumh,⊕
()
··= {accumh,⊕(
L)
|L∈pw()
}.Ofcourse,accumulationhasexactlyoneresultwhenevertheaccumulationoperatoraccumh,⊕ doesnotdependonthe orderofinputtuples:thiscovers,e.g.,thestandardsum,min,max,etc.Hence,wefocusonaccumulationoperatorswhich dependontheorderoftuples,e.g., the monoidMofstrings with⊕beingtheconcatenation operation.Inthis case,there maybemorethanoneaccumulationresult.
Example13. As a first example, let Ratings
(
user,
restaurant,
rating)
be an unordered po-relation describing the numeri- cal ratings given by users to restaurants, where each user rated each restaurant at most once. Let Relevance(
user)
be a po-relation giving a partially-known ordering of users to indicate the relevance of their reviews. We wish to com- pute a total rating for each restaurant which is given by the sum of its reviews weighted by a PTIME-computable weight function w. Specifically, w(
i)
gives a nonnegative weight to the rating of the i-th most relevant user. Consider Q1··=accumh1,+( σ
ψ(
Relevance×LEXRatings))
whereweseth1(
t,
n)
··=t.
rating×w(
n)
,andwhereψ
isthetuplepredicate:restaurant=“Gagnaire”∧Ratings
.
user=Relevance.
user.ThequeryQ1 givesthetotalratingof“Gagnaire”,andeachpossibleworld ofRelevancemayleadtoadifferentaccumulationresult.Thisaccumulationoperatoriscancellative,butitisneither position-invariantnorfinite.
As a second example, consider an unordered po-relation HotelCity
(
hotel,
city)
indicating in which city each hotel is located, and consider a po-relation City(
city)
which is (partially) ranked by a criterion such as interest level, proxim- ity, etc. Now consider thequery Q2··=accumh2,concat(
hotel(
Q2))
,with Q2 ··=σ
City.city=HotelCity.city(
City×LEXHotelCity)
and h2(
t,
n)
··=t.Here,theoperator“concat”denotesstandardstringconcatenation.Q2 concatenatesthehotelnamesaccording tothepreferenceorderonthecitywheretheyarelocated,allowinganypossibleorderbetweenhotelsofthesamecityand betweenhotelsinincomparablecities.Thisaccumulationoperatoriscancellativeandposition-invariant,butitisnotfinite.3. Possibilityandcertainty
Evaluatinga PosRAor PosRAaccqueryQ onapo-databaseD yieldsasetofpossibleresults:for PosRAacc,ityieldsanexplicit setofaccumulationresults,andfor PosRA,ityieldsapo-relationthatrepresentsasetofpossibleworlds(listrelations).The uncertaintyontheresultmaycomefromuncertaintyontheorderoftheinputrelations(i.e.,iftheyarepo-relationswith multiplepossibleworlds),butitmayalsobecausedbythequery,e.g.,theunionoftwonon-emptytotallyorderedrelations isnottotallyordered.Insomecases,however,thereisonlyonepossibleresulttothequery,i.e., acertainanswer.Inother cases,wemaywishtoexaminemultiplepossibleanswers.Wethusdefinethecorrespondingproblems.
Definition14 (PossibilityandCertainty). Let Q be a PosRAquery, D be a po-database, and L a list relation.The possibil- ityproblem (POSS) asksif L∈pw
(
Q(
D))
, i.e., if L is a possible resultof Q on D. The certaintyproblem (CERT) asksif pw(
Q(
D))
= {L},i.e.,ifListheonlypossibleresultofQ onD.Likewise,ifQ isa PosRAaccquerywithanaccumulationmonoidM,foraresultv∈M,thePOSSproblemaskswhether v∈Q
(
D)
,andCERTaskswhetherQ(
D)
= {v}.For PosRAacc, ourdefinition followsthe usual notion ofpossible andcertain answers indata integration [11] andin- complete information[12].For PosRA,we askforpossibilityorcertaintyofanentireoutputlistrelationoftupleswithout identifiers:indeed,asweexplainedabove,theidentifiersareonlyinternallygeneratedandthusexpectedtobeunknownto the user.Theseproblemscorrespondtoinstancepossibilityandcertainty[13]. Wenowjustifythat thesenotions areuseful anddiscussmore“local”alternatives.
First,asweexemplifybelow,theoutputofaquerymaybecertain evenforacomplexqueryanduncertaininput.Itis importanttoidentifysuchcasesandpresenttheuserwiththecertainanswerinfull,likeorder-byqueryresultsincurrent DBMSs.OurCERTproblemisusefulforthistask,becausewecanuseittodecideifacertainoutputexists:andifitisthe case,thenwecancomputethecertainoutputinpolynomialtime,bychoosinganarbitrarylinearextensionandcomputing the corresponding possibleworld.However, CERTisa challenging problemtosolve,because ofduplicatevalues(see the
“Technicaldifficulties”paragraphbelow).
Example15.Considerthepo-databaseD ofFig.1withrelationsRestaurantandHotel2.Tofindrecommendedpairsofhotels andrestaurantsinthesamedistrict,wecanwrite Q ··=
σ
Restaurant.district=Hotel2.district(
Restaurant×DIRHotel2)
.EvaluatingQ(
D)
yieldsthelistrelation(
G,
8,
B,
8,TA,
5,
M,
5)asauniquepossibleworld:itisacertainresult.Wemayalsoobtainacertainresultincaseswhentheinputrelationsarelarger.Imagineforexamplethatwejoinhotels andrestaurantstofindpairsofahotelandarestaurantlocatedinthathotel.Theresultcanbecertainiftherelativeranking ofthehotelsandoftheirrestaurantsagree.
Ifthereisnocertainanswer,wecaninsteadtrytodecidewhethersomelistrelationsareapossibleanswer.Thiscanbe useful,e.g.,tocheckifalistrelation(obtainedfromanothersource)isconsistentwithaqueryresult.Forexample,wemay wish to check ifawebsite’s rankingofhotel–restaurant pairs isconsistent withthepreferences expressedin itsrankings forhotelsandrestaurants,todetectwhenapairisrankedhigherthanitscomponentswouldwarrant:thiscanbedoneby checkingiftherankingonthepairsisapossibleresultofthequerythatunifiesthehotelrankingandrestaurantranking.
When thereisnooverall certainanswer, orwhenwewantto checkthepossibility ofsome aggregate propertyofthe relation,we canusea PosRAaccquery.Inparticular,inaddition totheapplicationsofExample13,accumulationallowsus toencode alternativenotionsofPOSSandCERTforPosRAqueries,andtoexpressthemasPOSSandCERTfor PosRAacc. Forexample,insteadofpossibilityorcertainty forafullrelation,wecanexpresspossibilityorcertaintyoftheposition1 of particulartuplesofinterest.
One particular application of accumulation is to model position-basedselection queries. Consider for instance a top-k operator, definedonlist relations,whichretrieves alistrelationofthe firstk tuples.Letusextendthetop-k operator to po-relationsintheexpectedway:thesetoftop-kresultsonapo-relation
isthesetoftop-kresultsonthelistrelationsof pw
()
.Wecanimplementtop-kasaccumh3,concat withh3(
t,
n)
being(
t)
forn≤kandε
otherwise,andwithconcat being1 Rememberthattheexistenceofatupleisnotorder-dependent,soitistrivialtocheckinoursetting.