Cleaning data with Llunatic

25  Download (0)

Full text


(will be inserted by the editor)

Cleaning Data with Llunatic

Floris Geerts · Giansalvatore Mecca · Paolo Papotti · Donatello Santoro

October 5, 2019

Abstract Data-cleaning (or data-repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicting values and are specialized toward specific classes of con- straints. In this paper we develop a generalchase-based re- pairing framework, referred to as LLUNATIC, in which re- pairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In LLUNATIC, various repairing strategies can be slotted in, without the need for changing the under- lying implementation. Furthermore, LLUNATIC is the first data repairing system which is DBMS-based. We report ex- perimental results that confirm its good scalability and show that various instantiations of the framework result in repairs of good quality.

1 Introduction

In theconstraint-based approach to data quality, a database is said to be dirty if it contains inconsistencies with respect Floris Geerts

University of Antwerp Giansalvatore Mecca Universit`a della Basilicata Paolo Papotti

Eurecom Donatello Santoro Universit`a della Basilicata

to some set of constraints [7, 24, 37]. The correspondingdata- repairingprocess consists in removing these inconsistencies in order to clean the database. The modeling and repairing of dirty data represents a crucial activity in many real-life infor- mation systems. Indeed, unclean data often incurs economic loss and erroneous decisions [21, 24, 37]. For these reasons, several constraint-based data quality approaches have recent- ly been put forward in the database community. These can be distinguished based on the following three facets:

–Facet 1: Data-quality constraints. A plenitude of languages has been devised to capture various aspects of dirty data as inconsistencies of constraints. These constraint languages range from standard database dependency languages such as functional dependencies, to conditional functional depen- dencies [24, 25], to editing-rules [29] and fixing rules [52], among others.

– Facet 2: Conflict resolution. Repairing strategies for in- consistencies (violations of constraints) are based on extra information indicating how to modify the dirty data. In most cases, values are changed into “preferred” values. Preferred values can be found from, e.g., master data [43], tuple-cer- tainty and value-accuracy [30], freshness and currency [28], just to name a few.

– Facet 3: Selecting types of repairs. Repairing strategies also differ in the kind of repairs that they compute. Since the computation of all possible repairs is infeasible in prac- tice, conditions are imposed on the computed repairs to re- strict the search space. These conditions include, e.g., var- ious notions of (cost-based) minimality [9, 11] and certain fixes [29]. Alternatively, sampling techniques are put in place to randomly select repairs [9].

We refer to [24] and [37] for recent overviews of con- straint-based approaches to data quality. We note, however, that there is currentlyno uniform framework to handle all of the three facets in a flexible and efficient way. To remedy


this situation, in this paper we describe a flexible and ef- ficient repairing system referred to as LLUNATIC. A system overview of LLUNATICis shown in Figure 1. We provide de- tailed running examples and descriptions of all ingredients of the system later in the paper. We here briefly highlight the key components of LLUNATIC.

The LLUNATICsystem consists of two core components:

(i)aninitial labeled instanceand(ii)adisk-based chase en- gineover labeled instances. Furthermore, (iii) the constraint language is fixed to a generalization ofequality-generating dependencies.

(i)The initial labeled instance is a generalization of a stan- dard database instance in which additional information is stored alongside the values in the input dirty instance. This information is providedby the user at the start of the repair- ing process. Intuitively, when conflicts need to be resolved at some point, this information allows inferring themost pre- ferred way of modifying the data. More precisely, each lo- cation in the database will be adorned with a set of prefer- ence labels each consisting of a preference level and a value, where preference levels are elements of apartial order. The partial order will allow us to define a notion ofmost pre- ferred value, which will be used to resolve a conflict. This value can either be a normal domain value or, when insuf- ficient information is encoded in the partial order, a special value which we refer to as a llun (llun stands for the reverse of null), hence the name LLUNATIC. The use of partial order information is what allows users toplug-in any kind of pref- erence information(cfr. Facet 2) into the repairing process (e.g., in a conflict, prefer the more recent value). We illus- trate in the next sections how information from constraints, master data, user-input, and special attributes (holding pref- erence information) can be encoded in the partial order and initial labeled instance.

(ii)The second component is the chase engine. A key in- sight underlying LLUNATICis that most data-repairing meth- ods behave like the well-knownchase procedure[2, 4], i.e., as long as violations (conflicts) of data-quality constraints exist, some data updates are triggered to resolve these vio- lations. Since we use labeled instances, wecompletely over- haulthe standard chase procedure such that itworks on la- beled instances. That is, it uses partial order information to resolve conflicts. This requires a revision of the formaliza- tion of the chase to ensure that it will generate repairs, which we develop in this paper. Furthermore, we provide adisk- based implementationof the revised chase procedure, with great attention for optimizations to ensure scalability. To our knowledge, LLUNATICis the first repairing framework that works with data residing on disk. We allow a fine-grained control of the chase by the user by means of a cost man- ager. This manager allows users to effectively reason about the trade-offs between performance and accuracy and let the chase only generate certain kinds of repairs (cfr. Facet 3).

<latexit sha1_base64="6s6N2kxdjpnChcXgatKhAvm+Y/w=">AAABXnicZY09T8MwEIbPBUoplAZYkFgisnRAUZylayUWJlQk+iGRKnLca2XViS3bqVRF/Ses8J/Y+ClQyAJ9pkf33t2baSmsi6IP0jg4PGoet07ap2ed8653cTm2qjQcR1xJZaYZsyhFgSMnnMSpNsjyTOIkW93v8skajRWqeHYbjbOcLQuxEJy571HqeVWSLfxHJezGnzPHtqkXRGH0g78vtJYAaoap10/mipc5Fo5LZu0LjbWbVcw4wSVu20lpUTO+YkusnMjR3uVOGxXvmuj/v/syjkMahfQpDga9urMFN3ALPaDQhwE8wBBGwGENr/AG7/BJmqRDur+rDVLfXMEfyPUXln9VlA==</latexit><latexit sha1_base64="cP7tAvxT5aNz33BRk6qSFjKgl+E=">AAABXnicZY3BTsJAEIZnURFRpOrFxEtjLxxM0+2FKwkXj5hYILGk2S4D2bDtbna3JITwJl71nbz5KIr2onynL/PPzJ9rKayLog/SODo+aZ62ztrnF53Lrnd1PbaqMhwTrqQy05xZlKLExAkncaoNsiKXOMlXw30+WaOxQpXPbqNxVrBlKRaCM/c9yjxvm+YLfyiRlf6cObbLvCAKox/8Q6G1BFAzyrx+Ole8KrB0XDJrX2is3WzLjBNc4q6dVhY14yu2xK0TBdqHwmmj4n0T/f/3UMZxSKOQPsXBoFd3tuAO7qEHFPowgEcYQQIc1vAKb/AOn6RJOqT7u9og9c0N/IHcfgFqT1Vl</latexit><latexit sha1_base64="m3BiODBAceU86sdITJfDXLHwG0c=">AAABWnicZY07T8MwFIWvw6sPHuGxsVRk6YCiOEvXSiwMHYpE2kqkVI65VFad2LIdJBT1f7DCv0Lix5BCFuiZvnvuvedkWgrrouiTeDu7e/sHrXane3h0fOKfnk2sKg3HhCupzCxjFqUoMHHCSZxpgyzPJE6z1c1mP31BY4Uq7t2rxnnOloV4Fpy52nqsUst7o1FZ1CNfL/wgCqMf9baBNhBAo/HCH6RPipc5Fo5LZu0DjbWbV8zUaRLXnbS0qBlfsSVWTuRor3OnjYo3TfR/7jZM4pBGIb2Lg2G/6WzBJVxBHygMYAi3MIYEOBh4g3f4gC/ikTbp/p56pPk5hz8iF99K4VSt</latexit>

<latexit sha1_base64="f2qnTv8+yOa88254sQymyk+EAVM=">AAABf3icZY29TsMwFIWvy18JfwFGlogOdEBRnKXqVokFJIYikbYSqSrHvY2sOo5lu0go6nPwNKzwDLwNbckC/aZzz733nExLYV0UfZPGzu7e/kHz0Ds6Pjk9888vBrZcGI4JL2VpRhmzKIXCxAkncaQNsiKTOMzmd+v98BWNFaV6dm8axwXLlZgJztzKmvg05agcGqHyoEqzWfCghBNMLtPU28yPLEOJ05VvHVMclxO/FYXRhmBb0Fq0oKY/8TvptOSLYtXDJbP2hcbajStmnOASl166sKgZn7McKycKtLeF06aM1030f+62GMQhjUL6FLd67bqzCVdwDW2g0IEe3EMfEuDwDh/wCV+EkBsSkuj3tEHqn0v4A+n+ANDMYhQ=</latexit> <latexit sha1_base64="lmQwmEv7mZU0Hnr7XNc1+CM8Stc=">AAABfnicZY3NTsJAFIXv4B/WH6ou3TQ2MSwE225wSUQTl5hYILGETIdLnTD9yczUxDR9DZ/Grb6Db2PBbpSzOvfce88XZoIr7TjfpLG1vbO719w3Dg6PjlvmyelIpblk6LNUpHISUoWCJ+hrrgVOMok0DgWOw+VgtR+/olQ8TZ70W4bTmEYJX3BGdRXNTCdgmGiUPImsIggX1h1Xy85t1Tgvg8BYR4OXarTuk6hilDPTdrrOWtamcWtjQ63hzOwF85TlcYVhgir17HqZnhZUas4ElkaQK8woW9IIC81jVFexzmTqrUju/95NM/K6rtN1Hz27366ZTTiHC2iDCz3owwMMwQcG7/ABn/BFgFySDrn+PW2Q+ucM/ojc/ADOKGEz</latexit>

<latexit sha1_base64="gIlaopSFQOzUEl37efDJadNoK74=">AAABb3icZY3LSsNAFIZP6q3WW9SFC0Gi2VSQkMmm24IblxFMWzAlTKYnYegkGWYmBQld+zRu9Vl8DN/AVrPRfquPc/n/VAquje9/Wp2t7Z3dve5+7+Dw6PjEPj0b6apWDCNWiUpNUqpR8BIjw43AiVRIi1TgOJ3fr/fjBSrNq/LJvEicFjQvecYZNatRYl/HDEuDipe508Rp5oQKM1QKZ86Cihr1MrFd3/N/cDaFtOJCS5jYg3hWsbpY5TJBtX4mgTTThirDmcBlL641SsrmNMfG8AL1XWGkqoJ1E/mfuymjwCO+Rx4Dd9hvO7twCTfQBwIDGMIDhBABg1d4g3f4gC/rwrqynN/TjtX+nMMfrNtvxdJc/A==</latexit> <latexit sha1_base64="6WtA/L/PReFrmhcuUQQvOnJoe2o=">AAABanicZY27TsMwFIZPyq2UWygTKkNEGDqgKM7StVIXxiKRthKpIsecRlYdO7IdJBR14WlY4W14Bx6CFrJAv+nTufx/VgpubBh+Oq2d3b39g/Zh5+j45PTMPe9OjKo0w5gpofQsowYFlxhbbgXOSo20yAROs+Vos58+ozZcyQf7UuK8oLnkC86oXY9S9yphKC1qLnOvTrKFN1LSWE25tGaVun4YhD9420Ia8aFhnLqD5EmxqlhHMkGNeSRRaec11ZYzgatOUhksKVvSHGvLCzS3hS21ijZN5H/utkyigIQBuY/8Yb/pbEMPrqEPBAYwhDsYQwwMXuEN3uEDvpyuc+n0fk9bTvNzAX9wbr4BRtJbEw==</latexit>

<latexit sha1_base64="uVDMb8RkXZkOYd9f66kxK5/ebAI=">AAABanicZY09TsNAEIVnw18IPzGhQqGwMEUKZNlu0kaCggYpSDiJhCNrvUysVdb2aneNhKw0nIYWbsMdOAQOuIG86ps3M+8lUnBtPO+TtLa2d3b32vudg8Oj46510pvoolQMQ1aIQs0SqlHwHEPDjcCZVEizROA0WV6v99NnVJoX+YN5kTjPaJrzBWfU1FZsnUcMc4OK56ldRcnCvqO6Hu0baugqthzP9X5kb4LfgAONxrE1jJ4KVmZ1JBNU60c/kGZeUWU4E7jqRKVGSdmSplgZnqG+yoxURbBu8v/nbsIkcH3P9e8DZzRoOtvQhwsYgA9DGMEtjCEEBq/wBu/wAV+kR85I//e0RZqfU/gjcvkNv31agQ==</latexit>

<latexit sha1_base64="1rs2/RYTwkXRCFXWtTX7ZGGe4CA=">AAABenicZY3LSsNAFIZP6q3WW9Slm9BsKtaQBKHbght3RjRtwZQwmZ6GoZNkmJkIEvoSPo1bfQvfxYWNBkH7rz7O+c/5EsGZ0q77YbQ2Nre2d9q7nb39g8Mj8/hkpIpSUgxpwQs5SYhCznIMNdMcJ0IiyRKO42RxXe/HTygVK/IH/SxwmpE0Z3NGiV6NYrMfUcw1SpanVhUlcysgUjPCL2/lDKV1L5D+tpexabuO+x1rHbwGbGgSxOYgmhW0zFYKyolSj54v9LSqFZTjshOVCgWhC5JipVmGqp9pIQu/Nnn//67DyHc81/HufHvYa5xtOIMu9MCDAQzhBgIIgcILvMIbvMOn0TXOjYufastobk7hT4yrL5+MYXI=</latexit> <latexit sha1_base64="ihNnnHDV/rxn/Lvwmc9SFFkMXO0=">AAABa3icZY3LSsNAFIZP6q3WS6PdeYFgELqQkMmm20I3boQKpi2YEibjaRg6yYSZiSChK5/GrT6ND+E7mGo22n/1cS7fnxSCa+P7n1Zra3tnd6+93zk4PDru2ienEy1LxTBkUkg1S6hGwXMMDTcCZ4VCmiUCp8lytN5Pn1FpLvMH81LgPKNpzhecUVOPYvsyYpgbVDxPnSpKFs5IauPc0ZymqFax7fqe/xNnE0gDLjQZx/YgepKszGonE1TrRxIUZl5RZTgTuOpEpcaCsmVtrwzPUN9kplAyWDeR/95NmAQe8T1yH7jDftPZhnO4gj4QGMAQbmEMITB4hTd4hw/4snrWmXXxe9qymp8e/Il1/Q1R6lr5</latexit><latexit sha1_base64="VQZaGv7joES+1JNiUpmPVJI8+7E=">AAABeXicZY3LSsNAFIZP6q3GW9Slm2gWBpWSiYtuC25cVjBtwZQwmRzD0EkyzEwKUvoQPo1bfQyfxY1Rg6D9Vx/n/Od8qRRcmyB4tzpr6xubW91te2d3b//AOTwa6apWDCNWiUpNUqpR8BIjw43AiVRIi1TgOJ3dfO3Hc1SaV+W9eZI4LWhe8kfOqGlGiXMZMywNKl7mbqRRnWtXqmrOM8zi2NYS2W9ZJ44X9ILvuKtAWvCgzTBx+nFWsbpoDExQrR9IKM10QZXhTODSjmuNkrIZzXFheIH6qjCNPVw2JvL/7yqMwh4JeuQu9AZ+6+zCCZyBDwT6MIBbGEIEDJ7hBV7hDT6sU8u3Ln6qHau9OYY/sa4/ASxxYSE=</latexit>

<latexit sha1_base64="m3pCVPcBHT7AUFK43+zTWfUYKN4=">AAABZ3icZY3LSsNAFIbP1Futl0YFUdxEu+lCQ5JNtwU3LiuYtmBKmIwnZehkMsxMChKKT+NWn8dH8C00mo32W32c/5zzp0pwY33/g7Q2Nre2d9q7nb39g8Ouc3Q8NkWpGUasEIWeptSg4BIjy63AqdJI81TgJF3c1vlkidrwQj7YZ4WznM4lzzij9nuUOOdVnGZuZFDfcKlK62alZHW0Spye7/k/uOsSNNKDhlHiDOKngpU5SssENeYxCJWdVVRbzgSuOnFpUFG2oHOsLM/RXOdW6SKsm4L/f9dlHHqB7wX3YW/YbzrbcAFX0IcABjCEOxhBBAxe4BXe4B0+SZeckrPf1RZpbk7gD+TyC4SwWaQ=</latexit>

Fig. 1: System overview of LLUNATIC.

(iii)The use of labeled instances and more specifically the partial order information on preference levels allows us to model a variety of constraint formalisms in a uniform way (cfr. Facet 1). More precisely, LLUNATICsupports variable and constantequality-generating dependencies(egds). Con- stant egds are an extension of classical egds [4] which can enforce the presence of certain constants. We provide ample examples of egds and how labeled instances and the chase interact with those constraints in the next sections. In the following it should therefore be understood that when we mention data-quality constraints (or dependencies) we al- ways mean egds. We further emphasize that we assume that the edgs are given, either by a domain expert or automati- cally discovered from the data (see e.g., [10, 26, 35, 45, 46]).

In summary, we propose LLUNATICas a data repairing framework which addresses all three facets in a flexible way.

We remark that this paper is based on our previous work [31]. Due to the complexity of the formalization used in [31]

we believe that some of the key insights behind our approach were unclear. We therefore completely changed the under- lying formalization: Everything is now modeled in terms of labeled instances. Not only does this result in a more elegant way of describing how the partial order information is used to resolve conflicts during the chase, it also provides a more clear semantics of repairs. Moreover, labeled instances and the integration of the partial order in the chase process may be of interest in its own right. The current formalization is closer to the standard chase and it is now clearer how our ideas can be adopted in other contexts as well where now only the standard chase is available. We only consider vari- able and constant egds in this paper. In [32], we extended our approach to more powerful constraints (tuple-generating de- pendencies, tgds). The labeled instance formalization can be extended to this more general setting but for the sake of clar- ity of exposition, we do not consider tgds in this paper. We remark that we provide many more details compared to [31, 32] and also include a more extensive experimental evalua- tion than in our previous work.

Organization of the paper. We start with preliminaries in Section 2. Motivation for using labeled instances and the


chase for repairing can be found in Section 3. We provide a formalization of our approach in Section 4. We detail the chase procedure and how to use LLUNATIC in Section 5.

Optimizations and implementation details are described in Section 6. We discuss how ideas from other approaches can be integrated in Section 7. In Section 8 we report our exper- imental findings. Finally, related work and future directions of research are discussed in Sections 9 and 10, respectively.

We conclude the paper in Section 11.

2 Preliminaries

We fix a countably infinite domain of constant values, de- noted by CONSTS, and a countably infinite set of labeled nulls, denoted by NULLS, distinct from CONSTS.Labeled nulls will be denoted by ⊥0,⊥1,⊥2,. . ., and are used to denote different unknown values, i.e.,⊥i is assumed to be different from⊥j, fori6=j.Furthermore, we fix a countably infinite set, TIDS, oftuple identifiersdistinct fromCONSTS


Database instances and cells.AschemaR is a finite set {R1, . . . ,Rk}ofrelation symbols. Fori∈[1,k], the relation symbolRiinRhas a fixed set of attributes,Tid,A1, . . . ,Ani, whereTidis a special attribute whose domain is TIDS. All other attributes have CONSTSNULLS as domain. Fori∈ [1,k], aninstance of Riis a finite subsetIiof TIDS×(CONSTS

NULLS)ni. A (database)instanceI= (I1, . . . ,Ik)ofR con- sists of instancesIi ofRi, fori∈[1,k]. Atuple tinIis an element in one of the instancesIi inI. Lett be a tuple in Ii andAj an attribute inRi. We denote byt[Aj]the value of tuplet in attributeAj. We assume that every tuple inI has a uniqueTid-value. Ift is a tuple inIwitht[Tid] =tid, then we also refer to this (unique) tuple byttid. Given an instance Ii over Ri, a cellin Ii is a location specified by a tuple id/attribute pairhtid,Aji(orhtid,Tidi), where tidis an identifier for a tuple in Ii andAj is an attribute in Ri. The value of a cell htid,Aji inIi is the value of tuple ttid in attribute Aj, i.e.,ttid[Aj]. Similarly, the value of the cell htid,TidiinIiistid. We denote bycells(Ii)the set of allcells inIi. Similarly, for an instanceI= (I1, . . . ,Ik)ofRwe define cells(I) =S{cells(Ii)|i∈[1,k]}.

Example 1 An instanceI of relationD(Tid,NPI,Name,Sur- name,Spec,Hospital)consisting of tuplest1,t2andt3(with ids 1, 2 and 3, respectively) containing information about doctors is shown in Figure 2. Also depicted is a master data instance J, containing correct data, of schemaM(Tid,NPI, Name,Surname,Spec,Hospital)consisting of a single tuple tm. In instanceIwe also depict the cells incells(I). For cells corresponding to theTid-attribute, we simply denote the cell bycti, whereiis the tuple identifier for the tuple. For exam- ple,ct1=h1,Tidi. For cells corresponding to other attributes, we represent cells byci j. For example,c10=h1,NPIi,c11=

h1,Namei, and so on. We ignore the attributeConf(idence) for the moment. We do not include cells for the master data instance. We show below that we can eliminate master data instances using an encoding in data-quality constraints. ♦ Data-quality constraints.We will useequality-generating dependencies(oregdsfor short) to express data-quality con- straints (also calledquality rulesin the literature). In fact, we use two types of egds:variableandconstantegds. Un- like the seminal paper [4], our egds need not to be typed and can contain constants. A variable egd is defined as fol- lows. A relational atom overR is a formula of the form R(s)¯ withR∈Rand ¯sis a tuple of (not necessarily distinct) constants and variables. Then, avariable egd over R is a formulae of the form∀x(φ(¯ x)¯ →xi=xj), whereφ(x)¯ is a conjunction of relational atoms overRwith variables ¯x, andxiandxj are variables in ¯x. A constant egd eis of the form∀x(φ(¯ x)¯ →x=a)wherexis a variable in ¯xandais a constant inCONSTS. It is common to write an egdewith- out writing the universal quantification. So, in what follows φ(x)¯ →xi=xj corresponds to∀x(φ¯ (x)¯ →xi=xj). Simi- larly for constant egds.

To prevent any interaction of the egds with the tuple identifiers we assume that: (i) every relational atomR(¯s)in φ(x)¯ carries a variable in its first position, i.e.,s1is a vari- able, and this variable does not occur anywhere else in ¯sand also does not occur in any other relational atom inφ(x); and¯ (ii) ifφ(x)¯ →xi=xj or φ(x)¯ →x=a, then neitherxi,xj norxcan be a variable that occurs in the first position of a relational atom inφ(x). These conditions basically indicate¯ that the egds do not pose constraints on the tuple identifiers.

In the constraint-based data quality approach, dependen- cies are used to assess the cleanliness of data [7, 24, 37].

More specifically, an instanceIofRsatisfies an egd e, vari- able or constant, denoted byI|=e, if it satisfieseaccording to satisfaction of first-order logic with (i) a Herbrand inter- pretation of the constants ine, and (ii) the universe of dis- course of the first-order structure being CONSTSNULLS

(and TIDSfor theTid-attributes) [2]. In particular, nulls are interpreted as constants and hence⊥=a is false fora∈

CONSTS. Similarly,⊥i=⊥jis false for two different nulls.

Example 2 Examples of variable egds include functional and (variable) conditional functional dependencies. Constant con- ditional functional dependencies can be expressed as con- stant egds. In Figure 2, the variable egdse1–e4correspond to the functional dependency expressing that attributeNPIis a key for relationD(we again ignore the attributeConfand also do not take into account theTid-attribute). Furthermore, the constant egde5corresponds to the constant conditional functional dependency which requires the standardization of the name “Greg” into “Gregory”. Finally, constant egdse6– e9originate from a so-called editing rule, which states that



Tid NPI Name Surname Spec Conf Hospital

ct1 c10 c11 c12 c13 c14 c15

1 111 Robert Chase surg 0.9 PPTH

ct2 c20 c21 c22 c23 c24 c25

2 111 Frank Chase urol 0.1 1

ct3 c30 c31 c32 c33 c34 c35

3 222 2 House diag 1 1

M(aster Data)

Tid NPI Name Surname Spec Hospital

m 222 Greg House diag PPTH

e1=D(tid,npi,nm,sur,spec,hosp)∧D(tid0,npi,nm0,sur0,spec0,hosp0)nm=nm0 e2=D(tid,npi,nm,sur,spec,hosp)∧D(tid0,npi,nm0,sur0,spec0,hosp0)sur=sur0 e3=D(tid,npi,nm,sur,spec,hosp)∧D(tid0,npi,nm0,sur0,spec0,hosp0)spec=spec0 e4=D(tid,npi,nm,sur,spec,hosp)∧D(tid0,npi,nm0,sur0,spec0,hosp0)hosp=hosp0 e5=D(tid,npi,Greg,sur,spec,hosp)nm=Gregory

e6=D(tid,222,nm,sur,spec,hosp)nm=Greg e7=D(tid,222,nm,sur,spec,hosp)sur=House e8=D(tid,222,nm,sur,spec,hosp)spec=diag e9=D(tid,222,nm,sur,spec,hosp)hosp=PPTH

Fig. 2: Running example: Instances and equality-generating dependencies.

any tuplesinIwhich shares the sameNPI-value with a tu- plet in the master dataJmust agree on all common other attributes with t. Such editing rules are easily seen to be equivalent to a set of constant egds by introducing a con- stant egd for each tuple in the master data and each attribute (except for theTid-attribute) in the master data’s schema. In our example,t3[NPI] =tm[NPI] =222 and the editing rule is translated into the four constant egdse6–e9basically requir- ing the four remaining attributes oft3to take the correspond- ing values fromtm. In this way, master data and editing rules are represented by constant egds. It is readily verified that I|={e2,e7,e8}butI does not satisfy any of the remaining egds. For example,t1[Name] =Robert andt2[Name] =Frank should be equal according to the egde1. Hence,Iis a dirty instance relative to these egds. ♦

The previous example shows that egds are expressive enough to capture a wide variety of existing data-quality formalisms: Functional dependencies [2], conditional func- tional dependencies [25], and editing rules [29]. Further- more, one can also verify that egds can express fixing rules (without negative patterns) [52], and certain classes of denial constraints (basically, denial constraints which are logically equivalent to an egd) [7, 24]. Motivated by this, we focus on egds. Not supported in LLUNATICare, for example, match- ing dependencies [23], metric dependencies [42], differen- tial dependencies [49], and general denial constraints [7, 24].

We defer to future work to include a larger variety of data- quality constraints in LLUNATIC.

3 LLUNATIC: Finding repairs using the chase

With the constraint formalism fixed, we next turn to there- pairingorcleaningof the data. Intuitively, repairing a dirty instance Isuch that I6|=Σ for some setΣ of egds means finding a clean instanceJsuch thatJ|=Σ andIandJare

“closely related”. As already mentioned in the Introduction, in recent years, repairing methods have been proposed for several classes of constraints. These methods typically con- sider only specific types of constraints and different inter- pretations of “closely related”. Furthermore, each of these

methods differs in how conflicting values (such as Robert and Frank in the previous example) are resolved. We refer to the Related Work Section 9 for more details.

With LLUNATIC we aim to provide a single-nodescal- able algorithmic framework for finding repairsof instances for a set of egds, hereby covering different classes of con- straints in auniform way. Moreover,different conflict reso- lution strategiesshould beeasy to incorporatein the frame- work, without the need of changing the underlying algo- rithm (and thus implementation). To this aim, LLUNATIC

uses ageneralization of the standard chase procedureby in- corporatingpreference information on values in cellsand by producing achase treeconsisting ofchase sequences, where each sequence leads to a repair.

We next recall the standard chase procedure and then identify why it needs to be revised in order to become a true workhorse for repairing. In particular, we argue the need for better conflict resolution (Section 3.2), support for constant egds (Section 3.3), backward repairs (Section 3.4) and user provided repairs (Section 3.5). With the help of a motivating example, we informally introduce the main concepts used in LLUNATICto address these issues. A formal account will be given in Sections 4 and 5.

3.1 The standard chase

When Σ consists of variable egds only, the chase proce- dure (or, simply the chase) provides an elegant repairing method [4]. It works as follows. Consider a dirty instance I6|=Σ and variable egde:φ(x)¯ →xi=xj inΣ. A homo- morphismhfromφ(x)¯ toI= (I1, . . . ,Ik)is a mapping which assigns to every variablexin ¯xa value inCONSTSNULLS

(and TIDSfor the variables in Tid-attributes) such that ev- ery relational atomRi(s)¯ inφ(x)¯ maps onto a tupleh(s)¯ ∈Ii, wherehis the identity onCONSTS. Whenh(xi)6=h(xj), we say thate can be applied toIwith homomorphism h. The result of applying e onIwith his defined as follows: Ifh(xi) andh(xj)are two different constants in CONSTS, then the result of applyingeonIwithh is “failure”, and we write I→e,h . Otherwise, the result is a new instanceI0, defined as follows. Whenh(xi)∈CONSTSandh(xj)∈NULLS, the null


valueh(xj)is replaced everywhere inIby the constant value h(xi), resulting inI0. Whenh(xi)andh(xj)are both null val- ues, then one is replaced everywhere by the other, resulting inI0 1. In both cases we writeI→e,hI0. Then, for a setΣ of variable egds, achase sequence ofIwithΣis a sequence of the formIiei,hiIi+1 withi=0,1, . . .,I0=I,ei∈Σ andhi a homomorphism fromeitoIi. Afinite chase ofIwithΣis a finite chase sequenceIiei,hiIi+1,i∈[0,m−1], such that eitherIm= or no egd eexists inΣ for which there is a homomorphismhsuch thatecan be applied toImwithh.

We callImtheresultof such a finite chase and whenIm6= , the instanceImis called theresult of a successful chase. It is known that if Imis the result of a successful chase ofI withΣ, then Im|=Σ andIm is thus clean [4]. The repair Im has many other nice theoretical properties (universality, independence of the order in which egds are applied, ...), see e.g., [22]. Our revised chase does not inherit these prop- erties. Our primary goal, however, is using the chase as a practical way of generating repairs.

3.2 Avoiding failure by conflict resolution

Clearly, when repairingdata, the standard chase will often be unsuccessful because different constants may need to be equated. This is not surprising. After all, the chase was orig- inally designed to reason about dependencies, where the in- put instance does not contain constants, and not for repairing data [4]. As an example, we chase our running example with variable egds.

Example 3 Consider the variable egdse1,e2,e3 ande4 in Figure 2. SinceI|=e2, onlye1,e3 ande4are applicable.

It is readily verified that there is a homomorphismhfore4 such that Ie4,hI0 withI0 obtained from I by replacing the null value⊥1int2[Hospital]andt3[Hospital]by PPTH, i.e., the value fromt1[Hospital]. However, chasingI0further with e1ande3results in a failure. Indeed,e1requirest1[Name] = Robert to be equal tot2[Name] =Frank which are two dif- ferent constants. Similarly,e3requirest1[Spec] =surg to be equal tot2[Spec] =urol. ♦

Instead of simply returning failure ( ) it is desirable, from a data repairing perspective, for the chase to (i) ei- ther report the reasons for failure, or (ii) resolve conflicts between constant values based on some additional informa- tion. In LLUNATIC this is achieved as follows. Let Ibe a database instance.

– The initial step consists of adorning the cells in I with preference levelsfrom apartially ordered set(P,P)and

1 Typically, to make this step deterministic, an ordering on null val- ues is assumed and the smaller null value is replaced by the larger one.

We assume that0<1<2<3<· · ·.

combining these with values of the cells inI. As a result, each cell initially contains apreference labelof the form hp,viwherep∈Pandvis the value in the cell, resulting in theinitial labeled instanceI.Looking ahead, the ini- tial labeled instance will be changed during the (revised) chase process into a labeled instance in which cells may havemultiplepreference labels. Preference labels allow comparing values based on their preference levels and the order between these levels according toP. We use this information to resolve conflicts,when cells have multiple preference labels after chasing, by taking the most pre- ferred value, i.e., the value with the highest preference level. The partial order(P,P)is fixed at the beginning of the repairing process. For preference levelspandp0, we denote byp≺Pp0, ifpPp0andp06Pp.

Example 4 In Figure 3a we show an initial labeled instance Iobtained fromI(cfr. Figure 2) by putting in each cell ofI a single preference level together with the value of that cell.

From here on, we do not depict the attributeTidsince it is only used for identifying tuples and the tuple identifiers do not interact with the egds. InIwe find preference levels p1 andp2 assigned to null values⊥1and⊥2in cellsc25, c35andc31, respectively. We also find preference levelsp0.1, p0.9andp1, associated with urol, surg and diag, in cellsc23, c13andc33, respectively. These preference levels encode the confidence of these values according to theConf attribute.

By imposing p0.1P p0.9P p1we encode that the higher the confidence is, the more preferred the value associated with these levels in the preference labels is. This is an im- portant example showing how external information (in this case confidence) can be encoded in preference levels and la- bels. We generalize the use of attributes that encode some ordering (such asConf) to assign preference levels to values in another attribute (such asSpec) later in Section 4.5 in the form of apartial order specification. All other preference levels in I are chosen arbitrarily, except that we impose that p1Pp15. This is to indicate that when the value in the preference labels of cellsc25andc35(i.e., the value⊥1) needs to be equated later on with the value in the label ofc15 (i.e., PPTH) due to egde4, that PPTH is preferred over⊥1. We here capture the semantics that constants are preferred to null values, just as in the standard chase. From the labeled instanceIwe can obtain a normal instanceinst(I)simply by picking the values in the preference labels in each cell with the highest preference value (Figure 3b). InI, each cell carries a single preference label holding the original value of that cell in the instanceIgiven in Figure 2. Hence,inst(I) = Iin this case. The partial orderPused is shown in Figure 4 and comes into play when chasingI. In this partial order we more generally assume thatpiPpfor any preference level pi associated with null value⊥iand any preference value p associated with a constant inI. Furthermore, we



NPI Name Surname Spec Conf Hospital

c10 c11 c12 c13 c14 c15

t1 {hp10,111i} {hp11,Roberti} {hp12,Chasei} {hp0.9,surgi} {hp14,0.9i} {hp15,PPTHi}

c20 c21 c22 c23 c24 c25

t2 {hp20,111i} {hp21,Franki} {hp22,Chasei} {hp0.1,uroli} {hp24,0.1i} {hp1,⊥1i}

c30 c31 c32 c33 c34 c35

t3 {hp30,222i} {hp2,⊥2i} {hp32,Housei} {hp1,diagi} {hp34,1i} {hp1,⊥1i}

(a) Initial labeled instanceI


NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15 t1 111 Robert Chase surg 0.9 PPTH

c20 c21 c22 c23 c24 c25 t2 111 Frank Chase urol 0.1 1

c30 c31 c32 c33 c34 c35

t3 222 2 House diag 1 1

(b) Corresponding instanceinst(I).


NPI Name Surname Spec Conf Hospital

c10 c11 c12 c13 c14 c15


{hp10,111i} {hp11,Roberti, {hp12,Chasei} {hp0.9,surgi, {hp14,0.9i} {hp15,PPTHi, hp21,Franki} hp0.1,uroli} hp1,⊥1i}

c20 c21 c22 c23 c24 c25


{hp20,111i} {hp11,Roberti,

{hp22,Chasei} {hp0.9,surgi,

{hp24,0.1i} {hp15,PPTHi, hp21,Franki} hp0.1,uroli} hp1,⊥1i}

c30 c31 c32 c33 c34 c35


{hp30,222i} {hp2,⊥2i} {hp32,Housei} {hp1,diagi} {hp34,1i} {hp15,PPTHi, hp1,⊥1i}

(c) Labeled instanceI1?chased using variable egds inΣ.


NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15

t1 111 `0 Chase surg 0.9 PPTH c20 c21 c22 c23 c24 c25

t2 111 `0 Chase surg 0.1 PPTH c30 c31 c32 c33 c34 c35

t3 222 2 House diag 1 PPTH (d) Corresponding instanceinst(I1?).


NPI Name Surname Spec Conf Hospital

c10 c11 c12 c13 c14 c15







{hp15,PPTHi, hp21,Franki} hp0.1,uroli} hp1,⊥1i,


c20 c21 c22 c23 c24 c25







{hp15,PPTHi, hp21,Franki} hp0.1,uroli} hp1,⊥1i}

c30 c31 c32 c33 c34 c35




{hp32,Housei} {hp1,diagi} {hp34,1i}


hpau,Gregoryi, hp1,⊥1i}


(e) Labeled instanceI2?chased usingΣ.


NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15

t1 111 `0 Chase surg 0.9 PPTH c20 c21 c22 c23 c24 c25

t2 111 `0 Chase surg 0.1 PPTH c30 c31 c32 c33 c34 c35 t3 222 `1 House diag 1 PPTH

(f) Corresponding instanceinst(I2?).


NPI Name Surname Spec Conf Hospital

c10 c11 c12 c13 c14 c15

t1 {hp10,111i} {hp11,Roberti} {hp12,Chasei} {hp0.9,surgi} {hp14,0.9i} {hp15,PPTHi}

c20 c21 c22 c23 c24 c25

t2 {hp20,111i, {hp21,Franki} {hp22,Chasei} {hp0.9,uroli} {hp24,0.1i} {hp1,1i,

hp×,×i} hpau,PPTHi}

c30 c31 c32 c33 c34 c35




{hp32,Housei} {hp1,diagi} {hp34,1i}


hpau,Gregoryi, hpau,PPTHi}


(g) Labeled instanceI3?chased usingΣwith a backward repair.


NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15

t1 111 Robert Chase surg 0.9 PPTH c20 c21 c22 c23 c24 c25

t2 `2 Frank Chase urol 0.1 PPTH c30 c31 c32 c33 c34 c35

t3 222 `1 House diag 1 PPTH (h) Corresponding instanceinst(I3?).


NPI Name Surname Spec Conf Hospital

c10 c11 c12 c13 c14 c15


{hp10,111i} {hp11,Roberti} {hp12,Chasei} {hp0.9,surgi} {hp14,0.9i} {hp15,PPTHi}

c20 c21 c22 c23 c24 c25

t2 {hp20,111i,

{hp21,Franki} {hp22,Chasei} {hp0.9,uroli} {hp24,0.1i} {hp1,⊥1i

hp×,×i, {hpau,PPTHi}


c30 c31 c32 c33 c34 c35



{hp2,⊥2i, {hp32,Housei, {hp1,diagi,


{hp1,⊥1i, hpau,Gregoryi, hpau,PPTHi}

hpau,Gregi, hp>,Gregoryi}

(i) Labeled instanceI4?chased withΣand a user repair.


NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15 t1 111 Robert Chase surg 0.9 PPTH

c20 c21 c22 c23 c24 c25 t2 112 Frank Chase urol 0.1 PPTH

c30 c31 c32 c33 c34 c35 t3 222 Gregory House diag 1 PPTH

(j) Corresponding instanceinst(I4?).

Fig. 3: Running example: Labeled instances and their corresponding (standard) instances at different times during the LLU-

NATICchase process.




Related subjects :