• Aucun résultat trouvé

order-incomplete data Computing possible and certain answers over Theoretical Computer Science

N/A
N/A
Protected

Academic year: 2022

Partager "order-incomplete data Computing possible and certain answers over Theoretical Computer Science"

Copied!
35
0
0

Texte intégral

(1)

Contents lists available atScienceDirect

Theoretical Computer Science

www.elsevier.com/locate/tcs

Computing possible and certain answers over order-incomplete data

Antoine Amarilli

a

, Mouhamadou Lamine Ba

b

, Daniel Deutch

c

, Pierre Senellart

a

,

d

,

e

,

aLTCI,TélécomParisTech,UniversitéParis-Saclay,Paris,France bUniversitéAliouneDiopdeBambey,Bambey,Senegal

cBlavatnikSchoolofComputerScience,TelAvivUniversity,TelAviv,Israel dDIENS,ENS,CNRS,PSLUniversity,Paris,France

eInria,Paris,France

a rt i c l e i n f o a b s t ra c t

Articlehistory:

Received17January2018

Receivedinrevisedform16April2019 Accepted3May2019

Availableonline24May2019 Keywords:

Certainanswer Possibleanswer Partialorder Uncertaindata

Thispaper studiesthe complexityofqueryevaluationfordatabaseswhose relationsare partiallyordered;theproblemcommonlyariseswhencombiningortransformingordered data from multiple sources. We focus on queries in a useful fragment of SQL, namely positiverelationalalgebrawithaggregates,whosebagsemanticsweextendtothepartially orderedsetting.Oursemanticsleadstothestudyoftwomaincomputationalproblems:the possibilityandcertaintyofqueryanswers.Weshowthattheseproblemsarerespectively NP-complete and coNP-complete, but identify tractable cases depending on the query operatorsorinput partial orders. Wefurther introduceaduplicate eliminationoperator andstudyitseffectonthecomplexityresults.

©2019ElsevierB.V.Allrightsreserved.

1. Introduction

Many applications need to combine and transform ordered data frommultiple sources. Examples include sequences of readings from multiple sensors, or log entries fromdifferent applications or machines, that need to be combined to form acomplete pictureof events;rankings ofrestaurantsandhotels based onvarious criteria(relevance,preference, or customer ratings);and concurrent editsof shared documents, wherethe order ofcontributions made by differentusers needs to be merged. Evenifthe orderof itemsfromeach individual source isusually known, theorder ofitemsacross sourcesisoftenuncertain.Forinstance,evenwhensensorreadingsorlogentriesareprovidedwithtimestamps,thesemay be ill-synchronizedacross sensorsormachines;rankings ofhotelsandrestaurantsmaybe biasedby differentpreferences of differentusers; concurrent contributions to documents maybe ordered inmultiple reasonable ways. We saythat the resultinginformationisorder-incomplete.

This paper studies query evaluationover order-incomplete data in a relational setting [1]. We focus on the running example of restaurants andhotels froma travel website,ranked according to a proprietary function. An example query wouldaskfortheorderedlistofrestaurant–hotelpairssuchthattherestaurantandhotelareinthesamedistrict,orsuch thattherestaurantfeaturesaparticularcuisine,andmayfurtherapplyorder-dependentoperatorstotheresult,e.g.,limiting theoutputtothetop-ksuchpairs,oraggregatingarelevancescore.Toevaluatesuchqueries,theinitialorderonthehotels

*

Correspondingauthor.

E-mailaddresses:antoine.amarilli@telecom-paristech.fr(A. Amarilli),mouhamadoulamine.ba@uadb.edu.sn(M.L. Ba),danielde@post.tau.ac.il(D. Deutch), pierre@senellart.com(P. Senellart).

https://doi.org/10.1016/j.tcs.2019.05.013

0304-3975/©2019ElsevierB.V.Allrightsreserved.

(2)

andrestaurantsmustbepreservedthroughtransformations.Furthermore,aswe donot knowhowtheproprietaryorderis defined, theresultoftransformationsmaybecome uncertain; hence,we needtorepresentall possibleresultsthat canbe obtaineddependingontheunderlyingorder.

Ourapproachistohandlethisuncertaintythroughtheclassicalnotionsofpossibleandcertainanswers.Wesaythatthere isa certainanswer tothe querywhenthereisonly onepossible orderonqueryresults,oronlyone accumulationresult, whichis obtainednomatter theorder ontheinput andinintermediate results.Inthis case,it isusefultocompute the certain answer, so that the usercan then browse through the ordered queryresults (as is typically done when thereis nouncertainty,usingconstructssuch asSQL’sORDER BY).Certainanswerscan ariseeveninnon-trivialcaseswherethe combinationofinputdataadmitsmanypossibleorders:consideruserqueriesthatselectonlyasmallinterestingsubsetof thedata(forwhichtheorderinghappenstobecertain),orashortsummaryobtainedthroughaccumulationoverlargedata.

Inmanyothercases,thedifferentordersoninputdataortheuncertaintycausedbythequerymayleadtoseveralpossible answers.Inthiscase,itisstillofinterest(andnon-trivial)toverifywhetherananswerispossible,e.g.,tocheckwhethera givenrankingofhotel–restaurantpairsisconsistentwithacombinationofotherrankings(thelatterdonethroughaquery).

Thus,westudytheproblemsofdecidingwhetheragivenansweriscertain,andwhetheritispossible.

Ourmaincontributionsmaybesummarizedasfollows.

ModelandProblemDefinition(Sections2,3). Ourworkfocusesonbagsemantics,whereatuplemayappearmultipletimes.

Notethatinthecontext of(partially)orderedrelations,thismeansthatmultiplecopiesofthesametuplemayappearin different“positions” intheorder. Forexample,ifwe integratemultiple rankingsof restaurants,then thesame restaurant appearsmultipletimesindifferentpositions.Wecapturethismodelbyanotionofpo-relations(partiallyordered)relations.

A po-relationis essentially a relationaccompanied with a partialorder over its tuples; a technicalsubtlety is that each tupleisassociatedwithanidentifier(presumablyinternalandautomaticallygenerated),sothatwehaveawayofreferring toeachtupleoccurrenceinthepartialorder(seefurtherdiscussionintheproblemdefinitionbelow).

Wethenintroduceaquerylanguageforpartiallyordereddata.Ourlanguagedesignisguidedbythegoalofsupporting SQLevaluationinpresenceofsuchdata,andassuchwefocusondefiningasemanticsforanimportantfragmentofSQL – namelypositiverelationalalgebrawithaggregates.Thesemantics is“faithful” toSQLinthesensethat,ifweignoreorder, thenwegetthestandardSQLsemantics.ThenotioncorrespondingtoaggregationinourcontextisLISP-likeaccumulation, whosesemanticsweextendtoaccountforpartialorders.

Weviewpartiallyorderedrelationsasaconciserepresentationofasetofpossibleworlds,namely,thelinearextensions ofthepartialorders overtheunderlyingtuplesoftherelation.Forexample,alinearextension isaranked listofrestaurant cuisines,whereacuisinemayappearmultipletimesinthelist.Notethatineachsuchlinearextension,thetuplesappear withouttheir respectiveinternalidentifierswhich,asmentionedearlier,wereonlypresentinthepo-relationasatechnical tool.

Ourdefinitionsleadtoapossibleworldssemanticsforqueryevaluation, andtotwo naturalproblems:whetheracan- didate answer – i.e., a ranked list of tuples (restaurants, cuisines, etc.) – is possible, i.e., is obtained for some possible world, andwhetheritis certain,i.e., isobtainedfor everypossible world. Hereagain, note that ina candidateanswer a tuplemayappearmultipletimes,andnaturallyitappearswithoutidentifiers(which,asmentionedabove,areinternaland are unknownto theuser).We formally definethesetwo problemsforoursettings,andthen embarkona studyoftheir complexity.

ComplexityintheGeneralCase(Section4). Wefirststudythepossibilityandcertaintyproblemswithoutanyrestrictionson theinputdatabase. Asusualindatamanagement,giventhatqueries aretypicallymuchsmallerthandatabases,westudy thedatacomplexity oftheproblems,i.e.,thecomplexitywhenthequeryisfixed.Forourgeneraldefinitionofpo-relations, weshowthatdecidingwhetherananswerispossibleisNP-complete,evenwithoutaccumulation,andevenforsomevery simplequeries andinput relations.In a particularcasewherewe assume noduplicates – i.e.,where tuplesare uniquely identified,whichmeansweareessentially backtothesetsemantics–possibilityofan answerisinPTIMEwithoutaccu- mulation,butisagainNP-completewithaccumulation.Asforcertainty,theproblemcanbedecidedinpolynomialtimein thecasewithnoaccumulation,butitiscoNP-completeforquerieswithaccumulation(evenifweassumenoduplicatesin theinput).Facedby thegeneralintractabilityofthepossibilityandcertaintyproblems,intherestofthepaperwesearch forrestrictedcasesforwhichtractabilityholds.

TractableCasesforPossibilityWithoutAccumulation(Section5). EventhoughpossibilityisNP-hardevenwithoutaccumulation, weidentifyrealisticcaseswhereitisinfacttractable.Inparticular,weshowthatiftheinputrelationsaretotallyordered thenpossibilityisinPTIMEforqueriesusingasubsetofouroperators(allexceptthedirectproduct).Assumingmoresevere restrictions onthequerylanguage,we furthershow tractabilitywhensome oftherelationsare (almost)orderedandthe restare(almost)unordered,asformalizedviaanewlyintroducednotionofia-width.

TractableCaseswithAccumulation(Section6). Withaccumulation,thecertaintyproblembecomesintractableaswell.Yetwe showthatifaccumulationiscapturedbyafinitecancellativemonoid(inparticular,ifitisperformedinafinitegroup),then certaintycanagainbedecidedinpolynomialtime.Further,werevisit thetractability resultsforpossibilityfromSection5 andshowthattheyextendtoquerieswithaccumulationundercertainrestrictionsontheaccumulationfunction.

(3)

LanguageExtensions(Section7). Wethenstudytwoextensionstoourlanguage,whicharethecounterpartsofcommonSQL operators. Thefirstisgroup-by,whichallowsustogrouptuplesforaccumulation(asisdone foraggregationinSQLwith GROUP BY);werevisitourcomplexityresultsinitspresence.Thesecondisduplicateelimination:keepingasinglerepresen- tativeofidenticaltuples,asinSQLwithSELECT DISTINCT.Inpresenceoforder,itischallengingtodesignasemantics forthisoperator,andwediscussbothsemanticandcomplexityissuesthatarisefromdifferentpossibledefinitions.

WecompareourmodelandresultstorelatedworkinSection8,andconcludeinSection9.

Thisarticleisan extendedversion oftheconferencepaper [2].Incontrastwiththeconferencepaper [2],allproofsare includedhere.WealsodiscoveredabugintheproofofTheorem 22ofthatpaper [2],thatalsoimpactsTheorems 19and 30 of [2].Consequently,theseresultsareomittedinthepresentpaper.

2. Datamodelandquerylanguage

WedenotebyN thesetofnonnegativenaturalnumbersandbyN>0 thesetofpositive naturalnumbers,i.e.,N>0:=

N\ {0}. We fix a countable set of values D that includes N andinfinitely many values not in N. A tuple t overD of arity a

(

t

)

isan element ofDa(t),denoted v1

, . . . ,

va(t): for1ia

(

t

)

,we write t

.

i to referto vi.The simplestnotion ofordered relationsarethen listrelations[3,4]:a listrelationofarityn∈N isanordered listoftuplesoverDofarityn (wherethesametuplemayappearmultipletimes).Listrelationsimposeasingleorderovertuples,butwhenonecombines (e.g.,unions)them,theremaybemultipleplausiblewaystoordertheresults.

Wethusintroducepartiallyorderedrelations(po-relations).Apo-relation

=

(

ID

,

T

, <)

ofarityn∈N consistsofafinite set ofidentifiers ID (chosen fromsome infinitesetclosed underthe Cartesianproduct, e.g., wecan usetuples ofnatural numbers), astrictpartialorder

<

onID,anda(generallynon-injective) mappingT fromID toDn.Thedomainof

isthe subset ofvaluesofDthatoccur intheimage ofT.Theactualidentifiersin IDdonotmatter,butweneedthemto refer to occurrences of thesame tuplevalue. Hence, we always consider po-relationsuptoisomorphism,where

(

ID

,

T

, <)

and

(

ID

,

T

, <

)

areisomorphiciffthereisabijection

ϕ

:IDIDsuchthat T

( ϕ (

id

))

=T

(

id

)

forallidID,and

ϕ (

id1

)<

ϕ (

id2

)

iffid1

<

id2forallid1

,

id2ID.

A specialcaseofpo-relations areunorderedpo-relations(or bagrelations), where

<

is empty:we denotethem

(

ID

,

T

)

. Anotherspecialcaseisthatoftotallyorderedpo-relations,where

<

isatotalorder.

The pointofpo-relationsistorepresentsetsoflistrelations.Formally,alinearextension

<

of

<

isatotalorderonID such that

<

<

,i.e.,foreach x

<

y wehavex

<

y.Thepossibleworlds pw

()

of

arethendefinedasfollows:foreach linearextension

<

of

<

,writingIDasid1

<

· · ·

<

id|ID|,thelistrelation

(

T

(

id1

), . . . ,

T

(

id|I D|

))

isinpw

()

.AsT isgenerally not injective,twodifferentlinearextensionsmayyield thesamelistrelation.Notethat eachsuch linearextension “strips away” theidentifiersandincludesonlythetuples.Forinstance,if

isunordered,thenpw

()

consistsofallpermutations ofthetuplesof

;andif

istotallyorderedthenpw

()

containsexactlyonepossibleworld.

Po-relationscanthusmodeluncertaintyovertheorderoftuples.However,note thattheycannotmodeluncertaintyon tuplevalues.Specifically,letusdefinetheunderlyingbagrelationofapo-relation

=

(

ID

,

T

, <)

as

(

ID

,

T

)

.Unlikeorder,this underlyingbagrelationisalwayscertain.

Weextendsomeclassicalnotionsfrompartialordertheorytopo-relations.

Letting

=

(

ID

,

T

, <)

be a po-relation, an orderideal of

is asubset SID such that, forall x

,

yID,if x

<

y and yS then xS.An antichain[5] of

is a set AIDof pairwise incomparable tupleidentifiers.The width of

is the size ofits largestantichain,andthewidthofa po-databaseis themaximalwidthofits po-relations.Inparticular, totally orderedpo-relationshavewidth 1,andunorderedpo-relationshaveawidthequaltotheirnumberoftuples;thewidthof apo-relationcanbecomputedinpolynomialtime [6].

Achainpartitionof

isapartitionID=

1 · · ·

n suchthattherestrictionof

<

toeach

i isatotalorder:wecall each

i achain.Notethat

<

mayincludecomparabilityrelationsacrosschains,i.e.,relatingelementsin

i toelementsin

j fori=j.Thewidthofthechainpartitionisn.ByDilworth’stheorem [7,6],thewidth wof

isthesmallestpossible widthofa chainpartitionof

;furthermore,given

,wecan computeinpolynomial timeboth itswidth w andachain partitionof

ofwidthw.

2.1. PosRA:Querieswithoutaccumulation

We now define a bag semantics for positiverelationalalgebra operators, to manipulate po-relations withqueries. The positiverelationalalgebra,written PosRA,isastandardquerylanguageforrelationaldata [1].Wewillextend PosRAwithac- cumulationinSection2.3,andaddfurtheroperationsinSection7.Each PosRAoperatorappliestopo-relationsandcomputes anewpo-relation;wepresenttheminturn.

The selectionoperatorrestrictstherelationtoasubsetofits tuples,andtheorderistherestrictionoftheinputorder.

The tuplepredicatesallowed inselectionsare Boolean combinationsofequalities andinequalities, whichinvolveconstant valuesinDandtupleattributeswrittenas

.

i fori∈N>0.Forinstance,theselection

σ

.1=“a”.2=.3selectstupleswhosefirst attributeisequaltotheconstant “a”andwhosesecondattributeisdifferentfromtheirthirdattribute.

selection: For any po-relation

=

(

ID

,

T

, <)

and tuple predicate

ψ

, we define the selection

σ

ψ

()

··=

(

ID

,

T|ID

, <

|ID

)

, whereID··= {idID|

ψ(

T

(

id

))

holds}.

(4)

Restname District Gagnaire 8 TourArgent 5 (a)Restauranttable

Hotelname District Mercure 5 Balzac 8 Mercure 12

(b)Hoteltable

Hotelname District Balzac 8 Mercure 5 Mercure 12

(c)Hotel2table Fig. 1.Running example: Paris restaurants and hotels.

Theprojectionoperatorchangestuplevaluesintheusualway,butkeepstheoriginaltupleorderingintheresult,andretains allcopiesofduplicatetuples(followingourbagsemantics).

projection: Forapo-relation

=

(

ID

,

T

, <)

andattributes A1

, . . . ,

An,wedefinetheprojection

A1,...,An

()

··=

(

ID

,

T

, <)

, whereTmapseachidIDto

A1,...,An

(

T

(

id

))

:= T

(

id

).

A1

, . . . ,

T

(

id

).

An.

As forunion,we impose theminimal orderconstraintsthat are compatiblewiththoseofthe inputs.We usethe parallel composition[8] oftwopartialorders

<

and

<

ondisjointsetsIDandID,i.e.,thepartialorder

<

··=

(<

<

)

onIDID. Notethat

<

isthesameorderas

<

onIDandas

<

onID,andthatallelementsfromIDareincomparabletoallelements fromID.

union: Let

=

(

ID

,

T

, <)

and

=

(

ID

,

T

, <

)

betwopo-relationsofthesamearity.Weassumethattheidentifiersof

havebeenrenamedifnecessarytoensurethatIDandIDaredisjoint.Wethendefine

··=

(

IDID

,

T

, (<

<

))

,whereT mapsidIDtoT

(

id

)

andidIDto T

(

id

)

.

Theunionresult

doesnotdependon howwerenamed

,i.e., itisuniqueuptoisomorphism.Ourdefinitionalso impliesthat

isdifferentfrom

,asper bagsemantics.In particular,when

and

haveonlyone possibleworld,

usuallydoesnot.

We next introduce two possible product operators. First, as in [9], the direct product

<

DIR··=

(<

×DIR

<

)

of two partial orders

<

and

<

on sets ID and ID is defined by

(

id1

,

id1

) <

DIR

(

id2

,

id2

)

iff id1

<

id2 and id1

<

id2 for each

(

id1

,

id1

), (

id2

,

id2

)

ID×ID. We define the directproduct operator over po-relations accordingly: two identifiers in the productarecomparableonlyifbothcomponentsofbothidentifierscompareinthesameway.

direct product: Foranypo-relations

=

(

ID

,

T

, <)

and

=

(

ID

,

T

, <

)

,rememberingthatthesetofpossibleidentifiersis closedunderproduct,welet

×DIR

··=

(

ID×ID

,

T

, <

×DIR

<

)

,whereT mapseach

(

id

,

id

)

ID×IDtothe concatenationT

(

id

),

T

(

id

)

.

Again,thedirectproductresultoftenhasmultiplepossibleworldsevenwheninputsdonot.

Thesecondproductoperatorusesthelexicographicproduct(orordinalproduct[9])

<

LEX··=

(<

×LEX

<

)

oftwopartialor- ders

<

and

<

,definedby

(

id1

,

id1

) <

LEX

(

id2

,

id2

)

iffeitherid1

<

id2,orid1=id2andid1

<

id2,forall

(

id1

,

id1

), (

id2

,

id2

)

ID×ID.

lexicographic product: Forany po-relations

=

(

ID

,

T

, <)

and

=

(

ID

,

T

, <

)

,we define

×LEX

as

(

ID×ID

,

T

, <

×LEX

<

)

withT definedlikeforthedirectproduct.

Last,wedefinetheconstantexpressionsthatweallow.

constant expressions:foranytuplet,thesingletonpo-relation[t]hasonlyonetuplewithvaluet;

foranyn∈N,thepo-relation[≤n]hasarity 1 andhaspw

(

[≤n]

)

= {

(

1

, . . . ,

n

)

}.

Wehavenowdefinedasemanticsonpo-relationsforeach PosRAoperator.WedefineaPosRAqueryintheexpectedway, asaquerybuiltfromtheseoperators andfromrelationnames.Callingschema asetS ofrelationnamesandarities,with anattributenameforeachpositionofeachrelation,wedefineapo-database Dashavingapo-relationofthecorrectarity foreachrelationnameR inS.Forapo-database D anda PosRAqueryQ,wedenoteby|Q|thenumberofsymbolsofQ, andwedenoteby Q

(

D

)

thepo-relationobtainedbyevaluating Q over D.

Example1. Thepo-database D inFig.1containsinformationaboutrestaurantsandhotelsinParis:each po-relationhasa totalorder(fromtoptobottom)accordingtocustomerratingsfromagiventravelwebsite.Forbrevity,wedonotrepresent identifiersinpo-relations,andwealsodeviateslightlyfromourformalismbyadoptingthenamedperspectiveinexamples, i.e.,givingnamestoattributes.

Let Q ··=Restaurant×DIR

( σ

district=“12”

(

Hotel

))

.Its result Q

(

D

)

hastwo possibleworlds,wherewe abbreviatehoteland restaurantnames:

(5)

Fig. 2.Example1.

(

G

,

8

,

M

,

5

,

G

,

8

,

B

,

8

,

TA

,

5

,

M

,

5

,

TA

,

5

,

B

,

8

)

;

(

G

,

8

,

M

,

5,TA

,

5

,

M

,

5,G

,

8

,

B

,

8,TA

,

5

,

B

,

8).

In asense, theselistrelationsofhotel–restaurant pairs areconsistent withthe orderin D: wedo notknowhow toorder two pairs, exceptwhen both thehotel andrestaurantcomparein thesame way.The po-relation Q

(

D

)

is representedin Fig.2asaHassediagram,againwritingtuplevaluesinsteadoftupleidentifiersforbrevity:notethat,followingtheusual conventionforHassediagramsinpartialordertheory,theorderinFig.2isdrawninthereversedirectionofthatofFig.1, i.e.,frombottomtotop.

ConsidernowthequeryQ··=

( σ

Restaurant.district=Hotel.district

(

Q

))

,where

projectsoutHotel

.

district.Thepossibleworlds of Q

(

D

)

are

(

G

,

B

,

8

,

TA

,

M

,

5

)

and

(

TA

,

M

,

5

,

G

,

B

,

8

)

,intuitively reflecting two different opinionson the order of restaurant–hotel pairs in the same district. Defining Q similarly to Q but replacing ×DIR by ×LEX in Q, we have pw

(

Q

(

D

))

=

(

G

,

B

,

8,TA

,

M

,

5).

It is easy to show that we can efficiently evaluate PosRAqueries on po-relations, which we will use throughout the sequel.

Proposition2.ForanyfixedPosRAqueryQ ,givenapo-databaseD,wecanconstructthepo-relation Q

(

D

)

intimeO

|D||Q| ,i.e., inpolynomialtimeinthesizeof D.

Proof. Weshowtheclaimbyasimpleinductiononthequery Q,notingthat|Q|isatleastk+1,wherekisthenumber ofoperatorsinQ.

If Q isarelationnameR,then Q

(

D

)

isobtainedinlineartime.

If Q isaconstantexpression,then Q

(

D

)

isobtainedinconstanttime.

If Q=

σ

ψ

(

Q

)

orQ =

k1...kp

(

Q

)

,then Q

(

D

)

isobtainedintimelinearin|Q

(

D

)|

,andweconcludebytheinduction hypothesis.

If Q =Q1Q2 or Q=Q1×LEXQ2 orQ =Q1×DIRQ2,then Q

(

D

)

isobtainedintime linearin|Q1

(

D

)

|× |Q2

(

D

)

|, andweconcludeagainbytheinductionhypothesis. 2

NotethatProposition 2computestheresultofaqueryasapo-relation

.However,wecannot efficientlycomputethe completesetpw

()

ofpossibleworldsof

,evenifallrelationsoftheinputpo-databasearetotallyordered.Forinstance, considerthequeryQ :=RS,andapo-database Dinterpreting RandSastotallyorderedrelationswithdisjointdomains andwithntupleseach. Itiseasy toseethat thequeryresult Q

(

D

)

has2n

n

possibleworlds, whichisexponential in D.

Thisintractabilityisthereasonwhywillwestudythepossibilityandcertaintyproblemsinthesequel.

2.2. IncomparabilityofPosRAOperators

Beforeextendingourquerylanguagewithaccumulation,weaddressthenaturalquestionofwhetheranyofouroperators issubsumedbytheothers.Weshowthatthisisnotthecase.

Theorem3.NoPosRAoperatorcanbeexpressedthroughacombinationoftheothers.

We proveTheorem3intherestofthissubsection.Weconsidereach operatorinturn,andshowthatitcannot beex- pressedthroughacombinationoftheothers.Wefirstconsiderconstantexpressionsandshowdifferencesinexpressiveness evenwhensettingtheinputpo-databasetobeempty.

For [t],consider the query [0].The value 0 is not inthe database, andcannot be produced by the [≤n] constant expression,andsothisqueryhasnoequivalentthatdoesnotusethe[t]constantexpression.

For[≤n],observethat[≤2]isapo-relationwithanon-emptyorder,whileanyqueryinvolvingtheotheroperatorswill haveemptyorder(noneofourunaryandbinaryoperators turnsunorderedpo-relationsintoanorderedone,andthe [t]constantexpressionproducesanunorderedpo-relation).

(6)

Movingontounaryandbinaryoperators,alloperatorsbutproductsareeasilyshowntobenon-expressible.

selection. For anyconstant a not in N, consider the po-database Da consistingof a single unorderedpo-relation with nameRformedoftwounarytuples0anda.LetQ=

σ

.1=“0”

(

R

)

.Then,Q

(

Da

)

isthepo-relationconsistingonly ofthetuplea.No PosRA querywithoutselection hasthesamesemantics,asnoother operator thanselection cancreateapo-relationcontainingtheconstanta foranyinput Da,unlessitalsocontainstheconstant 0.

projection.

istheonlyoperatorthatcandecreasethearityofaninputpo-relation.

union. [0]∪ [1](overtheemptypo-database)cannotbesimulatedbyanycombinationofoperators,ascanbesimply shownby induction: no other operatorwill produce a po-relationwhich hasthe two elements 0 and 1 inthe sameattribute.

Observethatproductoperatorsaretheonlyonesthatcanincreasearity,sotakentogethertheyarenon-redundantwith theother operators.Hence, itonlyremains toshow that eachof×DIR and×LEX isnot redundant.Todothis, letuscall PosRALEX thefragmentof PosRAthatdisallowsthe×DIR operator,butallowsallotheroperators(including×LEX).Wealso define PosRADIR thatdisallows×LEXbutnot×DIR.

We willfirst show that the×DIR product isnot redundant, which we willdousing thenotion ofwidth. Specifically, considerthequery QDIR=R×DIRR andaninputpo-database Dn where Rismappedto[≤n](aninputrelationofwidth 1) foran arbitrary Rn. Itis then clearthat thepo-relation Q

(

Dn

)

has widthn.We will showthat this querycannot be capturedin PosRALEX,because PosRALEX queriescanonlymakewidthincreaseinawaythat dependsonthewidthofthe inputpo-relations,butnotontheirsize.

Lemma4.Letk2andQ beaPosRALEXquery.Foranypo-database D ofwidthk,thepo-relationQ

(

D

)

haswidthk|Q|+1. Proof. We first show by induction on the PosRALEX query Q that the width of the query output can be bounded as a function ofk. Forthe basecase, the input po-relations have width ≤k, andall constant po-relations (singletons and constantchains)havewidth 1.Letusshowtheinductionstep.

Giventwopo-relations

1and

2withwidthrespectivelyk1andk2,theirunion

:=

1

2 clearlyhaswidthatmost k1+k2.Indeed,anyantichainin

mustbetheunionofanantichainof

1 andofanantichainof

2.

Givenapo-relation

1withwidthk1,applyingaprojectionorselectionto

1 cannotincreasethewidth.

Giventwopo-relations

1=

(

ID1

,

T1

, <

1

)

and

2=

(

ID2

,

T2

, <

2

)

withwidthrespectivelyk1 andk2,theirproduct

··=

1×LEX

2haswidthatmostk1·k2.Toshowthis,write

=

(

ID

,

T

, <)

,consideranyset AIDofcardinality

>

k1·k2, andletusarguethat A isnot anantichain.Bythedefinitionof×LEX,wecanseeeach identifierof A asanelement ofID1×ID2.Now,oneofthefollowingmusthold.

1. LettingS1bethesetofidentifiersuID1 forwhichwehave

(

u

,

v

)

A forsome vID2,itisthecasethat|S1|

>

k1. 2. Thereexistsusuchthat,lettingS2

(

u

)

··= {v|

(

u

,

v

)

A},wehave|S2

(

u

)

|

>

k2.

Informally,whenputting

>

k1·k2 valuesinbuckets(thevalue oftheirfirstcomponent),either

>

k1 differentbuckets areused,orthereisabucketcontaining

>

k2 elements.

Inthefirstcase,asS1ID1,as|S1|

>

k1,andas

1 haswidthk1,weknowthat S1 cannotbeanantichain,soitmust containtwocomparableelementsu

<

1u.Hence,consideringanyv

,

vID2suchthat w=

(

u

,

v

)

andw=

(

u

,

v

)

are in A,wehavebythedefinitionof×LEX that w

<

w,sothat A isnotanantichain.Inthesecondcase,asS2

(

u

)

ID2, as|S2

(

u

)| >

k2,andas

2 haswidthk2,weknowthatS2

(

u

)

cannotbeanantichain,soitmustcontaintwocomparable elements v

<

2v.Hence,considering w=

(

u

,

v

)

andw=

(

u

,

v

)

whicharein A,wehave w

<

w,andagain Aisnot anantichain.Hence,nosetofcardinality

>

k1·k2 of

isanantichain,so

haswidth≤k1·k2asclaimed.

Second,weexplainwhytheboundonthewidthofthequeryoutputcanbechosenasinthelemmastatement.Specifi- cally,lettingobethenumberofproductoperatorsin Q plusthenumberofunionoperators,weshowthatwecanbound thewidthof Q

(

D

)

by ko+1.Indeed,theoutputofqueries withoutproductorunionoperators havewidthatmostk (be- causek1).Further,asprojectionsandselectionsdonotchangethewidth,theonlyoperatorstoconsiderareproductand union.Fortheunionoperator,if Q1 haso1 suchoperatorsand Q2 haso2 suchoperators,boundinginductivelythewidth of Q1

(

D

)

byko1+1 and Q2

(

D

)

byko2+1,forQ :=Q1Q2,thenumberofunionandproductoperatorsiso1+o2+1,and thenewboundisko1+1+ko2+1,whichisko1+1+o2+1 becausek2,i.e., itisk(o1+o2+1)+1.Forthe×LEX operator,we proceedinthesamewayanddirectlyobtainthek(o1+o2+1)+1 bound.Hence,wecanindeedboundthewidthof Q

(

D

)

by k|Q|+1 asgiveninthestatement,whichconcludestheproof. 2

We haveshownLemma4: PosRALEX queriescan onlymake thewidthincrease asafunction ofthequeryandofthe width of the input po-relations. Hence, the query QDIR cannot be captured in PosRALEX, and the ×DIR product is not redundant.

Conversely, let us show that the ×LEX product is not redundant. To do this, we introduce the concatenation of po- relations.

(7)

Definition5.Theconcatenation

∪CAT

oftwopo-relations

and

istheseriescompositionoftheirtwopartialorders.

Notethat pw

(

∪CAT

)

= {L∪CATL|Lpw

(),

Lpw

(

)}

,whereL∪CATListheconcatenationoftwolistrelationsin theusualsense.

Weshowthatconcatenationcanbecapturedin PosRALEX.

Lemma6.Foranyarityn∈NanddistinguishedrelationnamesR andR,thereisaqueryQnwithout×DIRoperatorsuchthat,for anytwopo-relations

and

ofarity n,lettingD bethedatabasemappingR to

andRto

,thequeryresultQn

(

D

)

is

CAT

.

Proof. Foranyn∈N andnamesR andR,considerthefollowingquery:

Qn

(

R

,

R

) ··=

3...n+2

σ

.1=.2

[≤

2

] ×

LEX

(( [

1

] ×

LEXR

)( [

2

] ×

LEXR

))

Itiseasilyverifiedthat Qnsatisfiestheclaimedproperty. 2

Bycontrast,weshowthatconcatenationcannotbecapturedin PosRADIR.

Lemma7.Foranyarityn∈N>0anddistinguishedrelationnames R and R,thereisnoPosRADIRquery Qn suchthat,forany po-relations

and

ofarity n,lettingD bethepo-databasethatmapsR to

andRto

,thequeryresultQn

(

D

)

is

CAT

.

ToproveLemma7,wefirstintroducethefollowingconcept.

Definition8. Let vD.We call a po-relation

=

(

ID

,

T

, <)

v-impartial if, for any two identifiers id1 andid2 and1≤ ia

()

suchthatexactlyone ofT

(

id1

).

i, T

(

id2

).

i is v,thefollowingholds:id1 andid2 areincomparable,namely,neither id1

<

id2norid2

<

id1hold.

Lemma9.LetvD\Nbeavalue.ForanyPosRADIRqueryQ ,foranypo-databaseD ofv-impartialpo-relations,thepo-relation Q

(

D

)

isv-impartial.

Proof. Let D be a po-database of v-impartialpo-relations.We show by inductionon the query Q that v-impartiality is preserved.Thebasecasesarethefollowing.

Forthebaserelations,theclaimisvacuousbyourhypothesison D.

Forthesingletonconstantexpressions,theclaimistrivialastheycontainlessthantwotuples.

Forthe[≤i]constantexpressions,theclaimisimmediateasv

/

N. Wenowprovetheinductionstep.

Forselection, theclaimisshownby noticingthat,forany v-impartialpo-relation

,letting

bethe imageof

by anyselection,

isitself v-impartial. Indeed,consideringtwo identifiersid1 andid2 in

and1≤ia

()

satisfying thecondition,as

isv-impartial,id1andid2 areincomparablein

,sotheyarealsoincomparablein

.

Forprojection,theclaimisalsoimmediateasthepropertytoproveismaintainedwhenreordering,copyingordeleting attributes. Indeed,considering againtwo identifiers id1 andid2 of

and1≤ia

(

)

,therespectivepreimages id1 andid2 in

ofid1 andid2 satisfy thesamecondition forsome different1≤ia

()

whichistheattributein

that wasprojectedtogiveattribute iin

,soweagainusetheimpartialityoftheoriginalpo-relationtoconclude.

Forunion,letting

··=

,andwriting

=

(

ID

,

T

, <

)

,assumebycontradictiontheexistenceoftwoidentifiers id1

,

id2

and1≤ia

(

)

such that exactly one of T

(

id1

).

i and T

(

id2

).

i is v but (without loss ofgenerality) id1

<

id2 in

.Itiseasilyseenthat,asid1andid2arenotincomparable,theymustcomefromthesamerelation;but then,asthatrelationwas v-impartial,wehaveacontradiction.

For ×DIR, consider

··=

×DIR

where

and

are v-impartial, and write

=

(

ID

,

T

, <

)

as above. As- sume that there are two identifiers id1 and id2 of ID and 1≤ia

(

)

that violate the v-impartiality of

. Let

(

id1

,

id1

), (

id2

,

id2

)

ID×ID be the pairs of identifiers used to create id1 and id2. We distinguish on whether 1≤ia

()

ora

() <

ia

()

+a

(

)

.In thefirstcase, we deducethat exactlyone of T

(

id1

).

i and T

(

id2

).

i is v,so that inparticularid1=id2.Thus,bythedefinitionoftheorderin×DIR,itiseasilyseen that,becauseid1 andid2 are comparablein

,id1 andid2 mustcompareinthesamewayin

,contradicting thev-impartialityof

.Thesecond caseissymmetric. 2

WenowconcludewiththeproofofLemma7.

Proof. Let usassume by wayof contradiction that there is n∈N>0 and a PosRADIR query Qn that captures ∪CAT. Let v=v be two distinct valuesin D\N, andconsiderthe singletonpo-relation

containing one identifierof value t and

(8)

containing oneidentifier ofvalue t,wheret (resp.t) are tuples ofarityn containingn timesthe value v (resp. v).

Considerthepo-database D mapping Rto

andRto

.Write

··=Qn

(

D

)

.Byourassumption,as

=

(

ID

,

T

, <

)

is

∪CAT

,itmustcontainanidentifieridID suchthat T

(

id

)

=t andanidentifieridID suchthat T

(

id

)

=t.Now, as

and

are(vacuously) v-impartial,Lemma9impliesthat

isv-impartial.Hence,asn

>

0,takingi=1,ast=tand exactlyoneoft

.

1 andt

.

1 isv,theidentifiersidandidareincomparablein

<

,sothereisapossibleworldof

where idprecedesid.Thiscontradictsthefactthat,asweshouldhave

=

∪CAT

,thepo-relation

shouldhaveexactlyone possibleworld,namely,

(

t

,

t

)

. 2

Thisestablishesthatthe×LEX operatorcannotbe expressedusingtheothers,andshowsthatnoneofouroperatorsis redundant,whichconcludestheproofofTheorem3.

2.3. PosRAacc:Querieswithaccumulation

We now enrich PosRAwith order-aware accumulation as the outermost operation, inspired by rightaccumulation and iterationinlistprogramming, andaggregationinrelationaldatabases.Recallthata monoid

(

M

,

, ε )

consistsofaset M (notnecessarilyfinite),anassociativeoperation ⊕:M×MM,andanelement

ε

Mwhichisneutralfor⊕,i.e.,for allmM,wehave

ε

m=m

ε

=m.Wewilluseamonoidasthestructureinwhichweperformaccumulation.Wecan nowdefineaccumulationonagivenlistrelation.

Definition10. For k∈N, let h:Dk×N>0M be a function called an arity-k accumulationmap, which maps pairs consistingofank-tupleandapositiontoavalueinthemonoidM.Wecallaccumh,⊕ anarity-k accumulationoperator;its resultaccumh,⊕

(

L

)

onanarity-klistrelationL=

(

t1

, . . . ,

tn

)

ish

(

t1

,

1

)

⊕· · ·⊕h

(

tn

,

n

)

,anditis

ε

ifLisempty.Forcomplexity purposes,wealwaysrequireaccumulationoperatorstobePTIME-evaluable,i.e.,wecanevaluatetheaccumulationmapand themonoidoperatorintimepolynomialintheirinputs,andwecancomputeaccumh,⊕

(

L

)

inpolynomialtimeonanyinput listrelationL.

Intuitively,theaccumulationoperatormapseachoccurrenceofa tupleinthelistwithhtoM,whereaccumulationis performedwith⊕.(RememberthattheinputLtotheaccumulationisalistrelation,soeachtupleoccurrencehasaspecific position.)ThemaphmayuseitssecondargumenttotakeintoaccounttheabsolutepositionoftuplesinL.Inwhatfollows, weomitthearityofaccumulationwhenclearfromcontext.

Wewilloftenlookatspecialcasesforaccumulation,especiallywhenderivingcomplexity results.Herearetherestric- tionsthatwewillconsider.

Definition11.Wesaythatanaccumulationoperatorisposition-invariantifitsaccumulationmapignores thesecondinput, sothateffectivelyitsonlyinputisthetupleitself

Wesaythatanaccumulationoperatorisfiniteifitsmonoid

(M,

⊕,

ε )

isfinite.

For anymonoid

(

M

,

, ε )

, we callaM cancellable if, forall b

,

cM, we havethat ab=ac impliesb=c, andba=ca impliesb=c.We callM acancellativemonoid [10] ifall its elements are cancellable.We saythat an accumulationoperatoriscancellativeifitsmonoidis.

Notethat,inparticular, agroup isalwayscancellative,butthereare somecancellativemonoids whicharenot groups, e.g.,themonoidofconcatenation.

Wecan nowdefine thelanguage PosRAaccthat containsall queries oftheform Q =accumh,

(

Q

)

,whereaccumh, is anaccumulationoperatorand Qisa PosRA query.ThepossibleresultsofQ onapo-database D,denoted Q

(

D

)

,istheset ofresultsobtainedbyapplyingaccumulationtoeachpossibleworldofQ

(

D

)

,namely:

Definition12.Forapo-relation

,wedefineaccumh,⊕

()

··= {accumh,⊕

(

L

)

|Lpw

()

}.

Ofcourse,accumulationhasexactlyoneresultwhenevertheaccumulationoperatoraccumh,⊕ doesnotdependonthe orderofinputtuples:thiscovers,e.g.,thestandardsum,min,max,etc.Hence,wefocusonaccumulationoperatorswhich dependontheorderoftuples,e.g., the monoidMofstrings withbeingtheconcatenation operation.Inthis case,there maybemorethanoneaccumulationresult.

Example13. As a first example, let Ratings

(

user

,

restaurant

,

rating

)

be an unordered po-relation describing the numeri- cal ratings given by users to restaurants, where each user rated each restaurant at most once. Let Relevance

(

user

)

be a po-relation giving a partially-known ordering of users to indicate the relevance of their reviews. We wish to com- pute a total rating for each restaurant which is given by the sum of its reviews weighted by a PTIME-computable weight function w. Specifically, w

(

i

)

gives a nonnegative weight to the rating of the i-th most relevant user. Consider Q1··=accumh1,+

( σ

ψ

(

Relevance×LEXRatings

))

whereweseth1

(

t

,

n

)

··=t

.

rating×w

(

n

)

,andwhere

ψ

isthetuplepredicate:

restaurant=“Gagnaire”Ratings

.

user=Relevance

.

user.ThequeryQ1 givesthetotalratingof“Gagnaire”,andeachpossible

(9)

world ofRelevancemayleadtoadifferentaccumulationresult.Thisaccumulationoperatoriscancellative,butitisneither position-invariantnorfinite.

As a second example, consider an unordered po-relation HotelCity

(

hotel

,

city

)

indicating in which city each hotel is located, and consider a po-relation City

(

city

)

which is (partially) ranked by a criterion such as interest level, proxim- ity, etc. Now consider thequery Q2··=accumh2,concat

(

hotel

(

Q2

))

,with Q2 ··=

σ

City.city=HotelCity.city

(

City×LEXHotelCity

)

and h2

(

t

,

n

)

··=t.Here,theoperator“concat”denotesstandardstringconcatenation.Q2 concatenatesthehotelnamesaccording tothepreferenceorderonthecitywheretheyarelocated,allowinganypossibleorderbetweenhotelsofthesamecityand betweenhotelsinincomparablecities.Thisaccumulationoperatoriscancellativeandposition-invariant,butitisnotfinite.

3. Possibilityandcertainty

Evaluatinga PosRAor PosRAaccqueryQ onapo-databaseD yieldsasetofpossibleresults:for PosRAacc,ityieldsanexplicit setofaccumulationresults,andfor PosRA,ityieldsapo-relationthatrepresentsasetofpossibleworlds(listrelations).The uncertaintyontheresultmaycomefromuncertaintyontheorderoftheinputrelations(i.e.,iftheyarepo-relationswith multiplepossibleworlds),butitmayalsobecausedbythequery,e.g.,theunionoftwonon-emptytotallyorderedrelations isnottotallyordered.Insomecases,however,thereisonlyonepossibleresulttothequery,i.e., acertainanswer.Inother cases,wemaywishtoexaminemultiplepossibleanswers.Wethusdefinethecorrespondingproblems.

Definition14 (PossibilityandCertainty). Let Q be a PosRAquery, D be a po-database, and L a list relation.The possibil- ityproblem (POSS) asksif Lpw

(

Q

(

D

))

, i.e., if L is a possible resultof Q on D. The certaintyproblem (CERT) asksif pw

(

Q

(

D

))

= {L},i.e.,ifListheonlypossibleresultofQ onD.

Likewise,ifQ isa PosRAaccquerywithanaccumulationmonoidM,foraresultvM,thePOSSproblemaskswhether vQ

(

D

)

,andCERTaskswhetherQ

(

D

)

= {v}.

For PosRAacc, ourdefinition followsthe usual notion ofpossible andcertain answers indata integration [11] andin- complete information[12].For PosRA,we askforpossibilityorcertaintyofanentireoutputlistrelationoftupleswithout identifiers:indeed,asweexplainedabove,theidentifiersareonlyinternallygeneratedandthusexpectedtobeunknownto the user.Theseproblemscorrespondtoinstancepossibilityandcertainty[13]. Wenowjustifythat thesenotions areuseful anddiscussmore“local”alternatives.

First,asweexemplifybelow,theoutputofaquerymaybecertain evenforacomplexqueryanduncertaininput.Itis importanttoidentifysuchcasesandpresenttheuserwiththecertainanswerinfull,likeorder-byqueryresultsincurrent DBMSs.OurCERTproblemisusefulforthistask,becausewecanuseittodecideifacertainoutputexists:andifitisthe case,thenwecancomputethecertainoutputinpolynomialtime,bychoosinganarbitrarylinearextensionandcomputing the corresponding possibleworld.However, CERTisa challenging problemtosolve,because ofduplicatevalues(see the

“Technicaldifficulties”paragraphbelow).

Example15.Considerthepo-databaseD ofFig.1withrelationsRestaurantandHotel2.Tofindrecommendedpairsofhotels andrestaurantsinthesamedistrict,wecanwrite Q ··=

σ

Restaurant.district=Hotel2.district

(

Restaurant×DIRHotel2

)

.EvaluatingQ

(

D

)

yieldsthelistrelation

(

G

,

8

,

B

,

8,TA

,

5

,

M

,

5)asauniquepossibleworld:itisacertainresult.

Wemayalsoobtainacertainresultincaseswhentheinputrelationsarelarger.Imagineforexamplethatwejoinhotels andrestaurantstofindpairsofahotelandarestaurantlocatedinthathotel.Theresultcanbecertainiftherelativeranking ofthehotelsandoftheirrestaurantsagree.

Ifthereisnocertainanswer,wecaninsteadtrytodecidewhethersomelistrelationsareapossibleanswer.Thiscanbe useful,e.g.,tocheckifalistrelation(obtainedfromanothersource)isconsistentwithaqueryresult.Forexample,wemay wish to check ifawebsite’s rankingofhotel–restaurant pairs isconsistent withthepreferences expressedin itsrankings forhotelsandrestaurants,todetectwhenapairisrankedhigherthanitscomponentswouldwarrant:thiscanbedoneby checkingiftherankingonthepairsisapossibleresultofthequerythatunifiesthehotelrankingandrestaurantranking.

When thereisnooverall certainanswer, orwhenwewantto checkthepossibility ofsome aggregate propertyofthe relation,we canusea PosRAaccquery.Inparticular,inaddition totheapplicationsofExample13,accumulationallowsus toencode alternativenotionsofPOSSandCERTforPosRAqueries,andtoexpressthemasPOSSandCERTfor PosRAacc. Forexample,insteadofpossibilityorcertainty forafullrelation,wecanexpresspossibilityorcertaintyoftheposition1 of particulartuplesofinterest.

One particular application of accumulation is to model position-basedselection queries. Consider for instance a top-k operator, definedonlist relations,whichretrieves alistrelationofthe firstk tuples.Letusextendthetop-k operator to po-relationsintheexpectedway:thesetoftop-kresultsonapo-relation

isthesetoftop-kresultsonthelistrelationsof pw

()

.Wecanimplementtop-kasaccumh3,concat withh3

(

t

,

n

)

being

(

t

)

fornkand

ε

otherwise,andwithconcat being

1 Rememberthattheexistenceofatupleisnotorder-dependent,soitistrivialtocheckinoursetting.

Références

Documents relatifs

— This result follows at once by combining Theorem 2.2.1 with general theorems of Tumarkin relating the boundary values of meromorphic functions of bounded type and the distribution

With the same query and database, moving to more accurate strategie, that is, from the naive (resp. semi-naive, lazy) evaluation to the semi-naive (lazy, aware) one, users can

Since their computation is a coNP-hard problem, re- cent research has focused on developing evaluation algorithms with correctness guarantees, that is, techniques computing a sound

Logistic Operator Selection with storage and frozen product transport capacities using multicriteria decision2 Regarding the quality analysis, it was developed a model based

The natural way to obtain a calculus on string relations where one can freely compose string operations and relational operators is to start with a decidable structure on strings,

STORMER, E., Types of yon Neumann algebras associated with extremal invariant states... AUTOMORPHISMS AND INVARIANT STATES OF

EPITA — École Pour l’Informatique et les Techniques Avancées. June

EPITA — École Pour l’Informatique et les Techniques Avancées. June