(will be inserted by the editor)

### Cleaning Data with Llunatic

Floris Geerts · Giansalvatore Mecca · Paolo Papotti · Donatello Santoro

October 5, 2019

Abstract Data-cleaning (or data-repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicting values and are specialized toward specific classes of con- straints. In this paper we develop a generalchase-based re- pairing framework, referred to as LLUNATIC, in which re- pairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In LLUNATIC, various repairing strategies can be slotted in, without the need for changing the under- lying implementation. Furthermore, LLUNATIC is the first data repairing system which is DBMS-based. We report ex- perimental results that confirm its good scalability and show that various instantiations of the framework result in repairs of good quality.

1 Introduction

In theconstraint-based approach to data quality, a database is said to be dirty if it contains inconsistencies with respect Floris Geerts

University of Antwerp Giansalvatore Mecca Universit`a della Basilicata Paolo Papotti

Eurecom Donatello Santoro Universit`a della Basilicata

to some set of constraints [7, 24, 37]. The correspondingdata- repairingprocess consists in removing these inconsistencies in order to clean the database. The modeling and repairing of dirty data represents a crucial activity in many real-life infor- mation systems. Indeed, unclean data often incurs economic loss and erroneous decisions [21, 24, 37]. For these reasons, several constraint-based data quality approaches have recent- ly been put forward in the database community. These can be distinguished based on the following three facets:

–Facet 1: Data-quality constraints. A plenitude of languages has been devised to capture various aspects of dirty data as inconsistencies of constraints. These constraint languages range from standard database dependency languages such as functional dependencies, to conditional functional depen- dencies [24, 25], to editing-rules [29] and fixing rules [52], among others.

– Facet 2: Conflict resolution. Repairing strategies for in- consistencies (violations of constraints) are based on extra information indicating how to modify the dirty data. In most cases, values are changed into “preferred” values. Preferred values can be found from, e.g., master data [43], tuple-cer- tainty and value-accuracy [30], freshness and currency [28], just to name a few.

– Facet 3: Selecting types of repairs. Repairing strategies also differ in the kind of repairs that they compute. Since the computation of all possible repairs is infeasible in prac- tice, conditions are imposed on the computed repairs to re- strict the search space. These conditions include, e.g., var- ious notions of (cost-based) minimality [9, 11] and certain fixes [29]. Alternatively, sampling techniques are put in place to randomly select repairs [9].

We refer to [24] and [37] for recent overviews of con- straint-based approaches to data quality. We note, however, that there is currentlyno uniform framework to handle all of the three facets in a flexible and efficient way. To remedy

this situation, in this paper we describe a flexible and ef- ficient repairing system referred to as LLUNATIC. A system overview of LLUNATICis shown in Figure 1. We provide de- tailed running examples and descriptions of all ingredients of the system later in the paper. We here briefly highlight the key components of LLUNATIC.

The LLUNATICsystem consists of two core components:

(i)aninitial labeled instanceand(ii)adisk-based chase en- gineover labeled instances. Furthermore, (iii) the constraint language is fixed to a generalization ofequality-generating dependencies.

(i)The initial labeled instance is a generalization of a stan- dard database instance in which additional information is stored alongside the values in the input dirty instance. This information is providedby the user at the start of the repair- ing process. Intuitively, when conflicts need to be resolved at some point, this information allows inferring themost pre- ferred way of modifying the data. More precisely, each lo- cation in the database will be adorned with a set of prefer- ence labels each consisting of a preference level and a value, where preference levels are elements of apartial order. The partial order will allow us to define a notion ofmost pre- ferred value, which will be used to resolve a conflict. This value can either be a normal domain value or, when insuf- ficient information is encoded in the partial order, a special value which we refer to as a llun (llun stands for the reverse of null), hence the name LLUNATIC. The use of partial order information is what allows users toplug-in any kind of pref- erence information(cfr. Facet 2) into the repairing process (e.g., in a conflict, prefer the more recent value). We illus- trate in the next sections how information from constraints, master data, user-input, and special attributes (holding pref- erence information) can be encoded in the partial order and initial labeled instance.

(ii)The second component is the chase engine. A key in- sight underlying LLUNATICis that most data-repairing meth- ods behave like the well-knownchase procedure[2, 4], i.e., as long as violations (conflicts) of data-quality constraints exist, some data updates are triggered to resolve these vio- lations. Since we use labeled instances, wecompletely over- haulthe standard chase procedure such that itworks on la- beled instances. That is, it uses partial order information to resolve conflicts. This requires a revision of the formaliza- tion of the chase to ensure that it will generate repairs, which we develop in this paper. Furthermore, we provide adisk- based implementationof the revised chase procedure, with great attention for optimizations to ensure scalability. To our knowledge, LLUNATICis the first repairing framework that works with data residing on disk. We allow a fine-grained control of the chase by the user by means of a cost man- ager. This manager allows users to effectively reason about the trade-offs between performance and accuracy and let the chase only generate certain kinds of repairs (cfr. Facet 3).

<latexit sha1_base64="6s6N2kxdjpnChcXgatKhAvm+Y/w=">AAABXnicZY09T8MwEIbPBUoplAZYkFgisnRAUZylayUWJlQk+iGRKnLca2XViS3bqVRF/Ses8J/Y+ClQyAJ9pkf33t2baSmsi6IP0jg4PGoet07ap2ed8653cTm2qjQcR1xJZaYZsyhFgSMnnMSpNsjyTOIkW93v8skajRWqeHYbjbOcLQuxEJy571HqeVWSLfxHJezGnzPHtqkXRGH0g78vtJYAaoap10/mipc5Fo5LZu0LjbWbVcw4wSVu20lpUTO+YkusnMjR3uVOGxXvmuj/v/syjkMahfQpDga9urMFN3ALPaDQhwE8wBBGwGENr/AG7/BJmqRDur+rDVLfXMEfyPUXln9VlA==</latexit><latexit sha1_base64="cP7tAvxT5aNz33BRk6qSFjKgl+E=">AAABXnicZY3BTsJAEIZnURFRpOrFxEtjLxxM0+2FKwkXj5hYILGk2S4D2bDtbna3JITwJl71nbz5KIr2onynL/PPzJ9rKayLog/SODo+aZ62ztrnF53Lrnd1PbaqMhwTrqQy05xZlKLExAkncaoNsiKXOMlXw30+WaOxQpXPbqNxVrBlKRaCM/c9yjxvm+YLfyiRlf6cObbLvCAKox/8Q6G1BFAzyrx+Ole8KrB0XDJrX2is3WzLjBNc4q6dVhY14yu2xK0TBdqHwmmj4n0T/f/3UMZxSKOQPsXBoFd3tuAO7qEHFPowgEcYQQIc1vAKb/AOn6RJOqT7u9og9c0N/IHcfgFqT1Vl</latexit><latexit sha1_base64="m3BiODBAceU86sdITJfDXLHwG0c=">AAABWnicZY07T8MwFIWvw6sPHuGxsVRk6YCiOEvXSiwMHYpE2kqkVI65VFad2LIdJBT1f7DCv0Lix5BCFuiZvnvuvedkWgrrouiTeDu7e/sHrXane3h0fOKfnk2sKg3HhCupzCxjFqUoMHHCSZxpgyzPJE6z1c1mP31BY4Uq7t2rxnnOloV4Fpy52nqsUst7o1FZ1CNfL/wgCqMf9baBNhBAo/HCH6RPipc5Fo5LZu0DjbWbV8zUaRLXnbS0qBlfsSVWTuRor3OnjYo3TfR/7jZM4pBGIb2Lg2G/6WzBJVxBHygMYAi3MIYEOBh4g3f4gC/ikTbp/p56pPk5hz8iF99K4VSt</latexit>

<latexit sha1_base64="f2qnTv8+yOa88254sQymyk+EAVM=">AAABf3icZY29TsMwFIWvy18JfwFGlogOdEBRnKXqVokFJIYikbYSqSrHvY2sOo5lu0go6nPwNKzwDLwNbckC/aZzz733nExLYV0UfZPGzu7e/kHz0Ds6Pjk9888vBrZcGI4JL2VpRhmzKIXCxAkncaQNsiKTOMzmd+v98BWNFaV6dm8axwXLlZgJztzKmvg05agcGqHyoEqzWfCghBNMLtPU28yPLEOJ05VvHVMclxO/FYXRhmBb0Fq0oKY/8TvptOSLYtXDJbP2hcbajStmnOASl166sKgZn7McKycKtLeF06aM1030f+62GMQhjUL6FLd67bqzCVdwDW2g0IEe3EMfEuDwDh/wCV+EkBsSkuj3tEHqn0v4A+n+ANDMYhQ=</latexit> <latexit sha1_base64="lmQwmEv7mZU0Hnr7XNc1+CM8Stc=">AAABfnicZY3NTsJAFIXv4B/WH6ou3TQ2MSwE225wSUQTl5hYILGETIdLnTD9yczUxDR9DZ/Grb6Db2PBbpSzOvfce88XZoIr7TjfpLG1vbO719w3Dg6PjlvmyelIpblk6LNUpHISUoWCJ+hrrgVOMok0DgWOw+VgtR+/olQ8TZ70W4bTmEYJX3BGdRXNTCdgmGiUPImsIggX1h1Xy85t1Tgvg8BYR4OXarTuk6hilDPTdrrOWtamcWtjQ63hzOwF85TlcYVhgir17HqZnhZUas4ElkaQK8woW9IIC81jVFexzmTqrUju/95NM/K6rtN1Hz27366ZTTiHC2iDCz3owwMMwQcG7/ABn/BFgFySDrn+PW2Q+ucM/ojc/ADOKGEz</latexit>

<latexit sha1_base64="gIlaopSFQOzUEl37efDJadNoK74=">AAABb3icZY3LSsNAFIZP6q3WW9SFC0Gi2VSQkMmm24IblxFMWzAlTKYnYegkGWYmBQld+zRu9Vl8DN/AVrPRfquPc/n/VAquje9/Wp2t7Z3dve5+7+Dw6PjEPj0b6apWDCNWiUpNUqpR8BIjw43AiVRIi1TgOJ3fr/fjBSrNq/LJvEicFjQvecYZNatRYl/HDEuDipe508Rp5oQKM1QKZ86Cihr1MrFd3/N/cDaFtOJCS5jYg3hWsbpY5TJBtX4mgTTThirDmcBlL641SsrmNMfG8AL1XWGkqoJ1E/mfuymjwCO+Rx4Dd9hvO7twCTfQBwIDGMIDhBABg1d4g3f4gC/rwrqynN/TjtX+nMMfrNtvxdJc/A==</latexit> <latexit sha1_base64="6WtA/L/PReFrmhcuUQQvOnJoe2o=">AAABanicZY27TsMwFIZPyq2UWygTKkNEGDqgKM7StVIXxiKRthKpIsecRlYdO7IdJBR14WlY4W14Bx6CFrJAv+nTufx/VgpubBh+Oq2d3b39g/Zh5+j45PTMPe9OjKo0w5gpofQsowYFlxhbbgXOSo20yAROs+Vos58+ozZcyQf7UuK8oLnkC86oXY9S9yphKC1qLnOvTrKFN1LSWE25tGaVun4YhD9420Ia8aFhnLqD5EmxqlhHMkGNeSRRaec11ZYzgatOUhksKVvSHGvLCzS3hS21ijZN5H/utkyigIQBuY/8Yb/pbEMPrqEPBAYwhDsYQwwMXuEN3uEDvpyuc+n0fk9bTvNzAX9wbr4BRtJbEw==</latexit>

<latexit sha1_base64="uVDMb8RkXZkOYd9f66kxK5/ebAI=">AAABanicZY09TsNAEIVnw18IPzGhQqGwMEUKZNlu0kaCggYpSDiJhCNrvUysVdb2aneNhKw0nIYWbsMdOAQOuIG86ps3M+8lUnBtPO+TtLa2d3b32vudg8Oj46510pvoolQMQ1aIQs0SqlHwHEPDjcCZVEizROA0WV6v99NnVJoX+YN5kTjPaJrzBWfU1FZsnUcMc4OK56ldRcnCvqO6Hu0baugqthzP9X5kb4LfgAONxrE1jJ4KVmZ1JBNU60c/kGZeUWU4E7jqRKVGSdmSplgZnqG+yoxURbBu8v/nbsIkcH3P9e8DZzRoOtvQhwsYgA9DGMEtjCEEBq/wBu/wAV+kR85I//e0RZqfU/gjcvkNv31agQ==</latexit>

<latexit sha1_base64="1rs2/RYTwkXRCFXWtTX7ZGGe4CA=">AAABenicZY3LSsNAFIZP6q3WW9Slm9BsKtaQBKHbght3RjRtwZQwmZ6GoZNkmJkIEvoSPo1bfQvfxYWNBkH7rz7O+c/5EsGZ0q77YbQ2Nre2d9q7nb39g8Mj8/hkpIpSUgxpwQs5SYhCznIMNdMcJ0IiyRKO42RxXe/HTygVK/IH/SxwmpE0Z3NGiV6NYrMfUcw1SpanVhUlcysgUjPCL2/lDKV1L5D+tpexabuO+x1rHbwGbGgSxOYgmhW0zFYKyolSj54v9LSqFZTjshOVCgWhC5JipVmGqp9pIQu/Nnn//67DyHc81/HufHvYa5xtOIMu9MCDAQzhBgIIgcILvMIbvMOn0TXOjYufastobk7hT4yrL5+MYXI=</latexit> <latexit sha1_base64="ihNnnHDV/rxn/Lvwmc9SFFkMXO0=">AAABa3icZY3LSsNAFIZP6q3WS6PdeYFgELqQkMmm20I3boQKpi2YEibjaRg6yYSZiSChK5/GrT6ND+E7mGo22n/1cS7fnxSCa+P7n1Zra3tnd6+93zk4PDru2ienEy1LxTBkUkg1S6hGwXMMDTcCZ4VCmiUCp8lytN5Pn1FpLvMH81LgPKNpzhecUVOPYvsyYpgbVDxPnSpKFs5IauPc0ZymqFax7fqe/xNnE0gDLjQZx/YgepKszGonE1TrRxIUZl5RZTgTuOpEpcaCsmVtrwzPUN9kplAyWDeR/95NmAQe8T1yH7jDftPZhnO4gj4QGMAQbmEMITB4hTd4hw/4snrWmXXxe9qymp8e/Il1/Q1R6lr5</latexit><latexit sha1_base64="VQZaGv7joES+1JNiUpmPVJI8+7E=">AAABeXicZY3LSsNAFIZP6q3GW9Slm2gWBpWSiYtuC25cVjBtwZQwmRzD0EkyzEwKUvoQPo1bfQyfxY1Rg6D9Vx/n/Od8qRRcmyB4tzpr6xubW91te2d3b//AOTwa6apWDCNWiUpNUqpR8BIjw43AiVRIi1TgOJ3dfO3Hc1SaV+W9eZI4LWhe8kfOqGlGiXMZMywNKl7mbqRRnWtXqmrOM8zi2NYS2W9ZJ44X9ILvuKtAWvCgzTBx+nFWsbpoDExQrR9IKM10QZXhTODSjmuNkrIZzXFheIH6qjCNPVw2JvL/7yqMwh4JeuQu9AZ+6+zCCZyBDwT6MIBbGEIEDJ7hBV7hDT6sU8u3Ln6qHau9OYY/sa4/ASxxYSE=</latexit>

<latexit sha1_base64="m3pCVPcBHT7AUFK43+zTWfUYKN4=">AAABZ3icZY3LSsNAFIbP1Futl0YFUdxEu+lCQ5JNtwU3LiuYtmBKmIwnZehkMsxMChKKT+NWn8dH8C00mo32W32c/5zzp0pwY33/g7Q2Nre2d9q7nb39g8Ouc3Q8NkWpGUasEIWeptSg4BIjy63AqdJI81TgJF3c1vlkidrwQj7YZ4WznM4lzzij9nuUOOdVnGZuZFDfcKlK62alZHW0Spye7/k/uOsSNNKDhlHiDOKngpU5SssENeYxCJWdVVRbzgSuOnFpUFG2oHOsLM/RXOdW6SKsm4L/f9dlHHqB7wX3YW/YbzrbcAFX0IcABjCEOxhBBAxe4BXe4B0+SZeckrPf1RZpbk7gD+TyC4SwWaQ=</latexit>

Fig. 1: System overview of LLUNATIC.

(iii)The use of labeled instances and more specifically the partial order information on preference levels allows us to model a variety of constraint formalisms in a uniform way (cfr. Facet 1). More precisely, LLUNATICsupports variable and constantequality-generating dependencies(egds). Con- stant egds are an extension of classical egds [4] which can enforce the presence of certain constants. We provide ample examples of egds and how labeled instances and the chase interact with those constraints in the next sections. In the following it should therefore be understood that when we mention data-quality constraints (or dependencies) we al- ways mean egds. We further emphasize that we assume that the edgs are given, either by a domain expert or automati- cally discovered from the data (see e.g., [10, 26, 35, 45, 46]).

In summary, we propose LLUNATICas a data repairing framework which addresses all three facets in a flexible way.

We remark that this paper is based on our previous work [31]. Due to the complexity of the formalization used in [31]

we believe that some of the key insights behind our approach were unclear. We therefore completely changed the under- lying formalization: Everything is now modeled in terms of labeled instances. Not only does this result in a more elegant way of describing how the partial order information is used to resolve conflicts during the chase, it also provides a more clear semantics of repairs. Moreover, labeled instances and the integration of the partial order in the chase process may be of interest in its own right. The current formalization is closer to the standard chase and it is now clearer how our ideas can be adopted in other contexts as well where now only the standard chase is available. We only consider vari- able and constant egds in this paper. In [32], we extended our approach to more powerful constraints (tuple-generating de- pendencies, tgds). The labeled instance formalization can be extended to this more general setting but for the sake of clar- ity of exposition, we do not consider tgds in this paper. We remark that we provide many more details compared to [31, 32] and also include a more extensive experimental evalua- tion than in our previous work.

Organization of the paper. We start with preliminaries in Section 2. Motivation for using labeled instances and the

chase for repairing can be found in Section 3. We provide a formalization of our approach in Section 4. We detail the chase procedure and how to use LLUNATIC in Section 5.

Optimizations and implementation details are described in Section 6. We discuss how ideas from other approaches can be integrated in Section 7. In Section 8 we report our exper- imental findings. Finally, related work and future directions of research are discussed in Sections 9 and 10, respectively.

We conclude the paper in Section 11.

2 Preliminaries

We fix a countably infinite domain of constant values, de-
noted by CONSTS, and a countably infinite set of labeled
nulls, denoted by NULLS, distinct from CONSTS.Labeled
nulls will be denoted by ⊥_{0},⊥_{1},⊥_{2},. . ., and are used to
denote different unknown values, i.e.,⊥_{i} is assumed to be
different from⊥_{j}, fori6=j.Furthermore, we fix a countably
infinite set, TIDS, oftuple identifiersdistinct fromCONSTS

andNULLS.

Database instances and cells.AschemaR is a finite set
{R_{1}, . . . ,R_{k}}ofrelation symbols. Fori∈[1,k], the relation
symbolR_{i}inRhas a fixed set of attributes,Tid,A_{1}, . . . ,A_{n}_{i},
whereTidis a special attribute whose domain is TIDS. All
other attributes have CONSTS∪NULLS as domain. Fori∈
[1,k], aninstance of R_{i}is a finite subsetI_{i}of TIDS×(CONSTS

∪NULLS)^{n}^{i}. A (database)instanceI= (I_{1}, . . . ,I_{k})ofR con-
sists of instancesI_{i} ofR_{i}, fori∈[1,k]. Atuple tinIis an
element in one of the instancesI_{i} inI. Lett be a tuple in
I_{i} andA_{j} an attribute inR_{i}. We denote byt[A_{j}]the value
of tuplet in attributeA_{j}. We assume that every tuple inI
has a uniqueTid-value. Ift is a tuple inIwitht[Tid] =tid,
then we also refer to this (unique) tuple byt_{tid}. Given an
instance I_{i} over R_{i}, a cellin I_{i} is a location specified by
a tuple id/attribute pairhtid,A_{j}i(orhtid,Tidi), where tidis
an identifier for a tuple in I_{i} andA_{j} is an attribute in R_{i}.
The value of a cell htid,A_{j}i inI_{i} is the value of tuple t_{tid}
in attribute Aj, i.e.,ttid[A_{j}]. Similarly, the value of the cell
htid,TidiinI_{i}istid. We denote bycells(I_{i})the set of allcells
inI_{i}. Similarly, for an instanceI= (I_{1}, . . . ,I_{k})ofRwe define
cells(I) =^{S}{cells(I_{i})|i∈[1,k]}.

Example 1 An instanceI of relationD(Tid,NPI,Name,Sur-
name,Spec,Hospital)consisting of tuplest_{1},t_{2}andt_{3}(with
ids 1, 2 and 3, respectively) containing information about
doctors is shown in Figure 2. Also depicted is a master data
instance J, containing correct data, of schemaM(Tid,NPI,
Name,Surname,Spec,Hospital)consisting of a single tuple
t_{m}. In instanceIwe also depict the cells incells(I). For cells
corresponding to theTid-attribute, we simply denote the cell
byc^{t}_{i}, whereiis the tuple identifier for the tuple. For exam-
ple,c^{t}_{1}=h1,Tidi. For cells corresponding to other attributes,
we represent cells byc_{i j}. For example,c_{10}=h1,NPIi,c_{11}=

h1,Namei, and so on. We ignore the attributeConf(idence)
for the moment. We do not include cells for the master data
instance. We show below that we can eliminate master data
instances using an encoding in data-quality constraints. ♦
Data-quality constraints.We will useequality-generating
dependencies(oregdsfor short) to express data-quality con-
straints (also calledquality rulesin the literature). In fact,
we use two types of egds:variableandconstantegds. Un-
like the seminal paper [4], our egds need not to be typed
and can contain constants. A variable egd is defined as fol-
lows. A relational atom overR is a formula of the form
R(s)¯ withR∈Rand ¯sis a tuple of (not necessarily distinct)
constants and variables. Then, avariable egd over R is a
formulae of the form∀x(φ(¯ x)¯ →x_{i}=x_{j}), whereφ(x)¯ is
a conjunction of relational atoms overRwith variables ¯x,
andx_{i}andx_{j} are variables in ¯x. A constant egd eis of the
form∀x(φ(¯ x)¯ →x=a)wherexis a variable in ¯xandais
a constant inCONSTS. It is common to write an egdewith-
out writing the universal quantification. So, in what follows
φ(x)¯ →x_{i}=x_{j} corresponds to∀x(φ¯ (x)¯ →x_{i}=x_{j}). Simi-
larly for constant egds.

To prevent any interaction of the egds with the tuple
identifiers we assume that: (i) every relational atomR(¯s)in
φ(x)¯ carries a variable in its first position, i.e.,s_{1}is a vari-
able, and this variable does not occur anywhere else in ¯sand
also does not occur in any other relational atom inφ(x); and¯
(ii) ifφ(x)¯ →x_{i}=x_{j} or φ(x)¯ →x=a, then neitherx_{i},x_{j}
norxcan be a variable that occurs in the first position of a
relational atom inφ(x). These conditions basically indicate¯
that the egds do not pose constraints on the tuple identifiers.

In the constraint-based data quality approach, dependen- cies are used to assess the cleanliness of data [7, 24, 37].

More specifically, an instanceIofRsatisfies an egd e, vari- able or constant, denoted byI|=e, if it satisfieseaccording to satisfaction of first-order logic with (i) a Herbrand inter- pretation of the constants ine, and (ii) the universe of dis- course of the first-order structure being CONSTS∪NULLS

(and TIDSfor theTid-attributes) [2]. In particular, nulls are interpreted as constants and hence⊥=a is false fora∈

CONSTS. Similarly,⊥_{i}=⊥_{j}is false for two different nulls.

Example 2 Examples of variable egds include functional and
(variable) conditional functional dependencies. Constant con-
ditional functional dependencies can be expressed as con-
stant egds. In Figure 2, the variable egdse_{1}–e_{4}correspond
to the functional dependency expressing that attributeNPIis
a key for relationD(we again ignore the attributeConfand
also do not take into account theTid-attribute). Furthermore,
the constant egde_{5}corresponds to the constant conditional
functional dependency which requires the standardization of
the name “Greg” into “Gregory”. Finally, constant egdse_{6}–
e_{9}originate from a so-called editing rule, which states that

D(octors)

Tid NPI Name Surname Spec Conf Hospital

c^{t}_{1} c10 c11 c12 c13 c14 c15

1 111 Robert Chase surg 0.9 PPTH

c^{t}_{2} c20 c21 c22 c23 c24 c_{25}

2 111 Frank Chase urol 0.1 ⊥1

c^{t}_{3} c_{30} c_{31} c_{32} c_{33} c_{34} c_{35}

3 222 ⊥2 House diag 1 ⊥1

M(aster Data)

Tid NPI Name Surname Spec Hospital

m 222 Greg House diag PPTH

e_{1}=D(tid,npi,nm,sur,spec,hosp)∧D(tid^{0},npi,nm^{0},sur^{0},spec^{0},hosp^{0})→nm=nm^{0}
e_{2}=D(tid,npi,nm,sur,spec,hosp)∧D(tid^{0},npi,nm^{0},sur^{0},spec^{0},hosp^{0})→sur=sur^{0}
e_{3}=D(tid,npi,nm,sur,spec,hosp)∧D(tid^{0},npi,nm^{0},sur^{0},spec^{0},hosp^{0})→spec=spec^{0}
e_{4}=D(tid,npi,nm,sur,spec,hosp)∧D(tid^{0},npi,nm^{0},sur^{0},spec^{0},hosp^{0})→hosp=hosp^{0}
e_{5}=D(tid,npi,Greg,sur,spec,hosp)→nm=Gregory

e_{6}=D(tid,222,nm,sur,spec,hosp)→nm=Greg
e_{7}=D(tid,222,nm,sur,spec,hosp)→sur=House
e_{8}=D(tid,222,nm,sur,spec,hosp)→spec=diag
e_{9}=D(tid,222,nm,sur,spec,hosp)→hosp=PPTH

Fig. 2: Running example: Instances and equality-generating dependencies.

any tuplesinIwhich shares the sameNPI-value with a tu-
plet in the master dataJmust agree on all common other
attributes with t. Such editing rules are easily seen to be
equivalent to a set of constant egds by introducing a con-
stant egd for each tuple in the master data and each attribute
(except for theTid-attribute) in the master data’s schema. In
our example,t3[NPI] =tm[NPI] =222 and the editing rule is
translated into the four constant egdse_{6}–e_{9}basically requir-
ing the four remaining attributes oft_{3}to take the correspond-
ing values fromt_{m}. In this way, master data and editing rules
are represented by constant egds. It is readily verified that
I|={e_{2},e_{7},e_{8}}butI does not satisfy any of the remaining
egds. For example,t1[Name] =Robert andt2[Name] =Frank
should be equal according to the egde_{1}. Hence,Iis a dirty
instance relative to these egds. ♦

The previous example shows that egds are expressive enough to capture a wide variety of existing data-quality formalisms: Functional dependencies [2], conditional func- tional dependencies [25], and editing rules [29]. Further- more, one can also verify that egds can express fixing rules (without negative patterns) [52], and certain classes of denial constraints (basically, denial constraints which are logically equivalent to an egd) [7, 24]. Motivated by this, we focus on egds. Not supported in LLUNATICare, for example, match- ing dependencies [23], metric dependencies [42], differen- tial dependencies [49], and general denial constraints [7, 24].

We defer to future work to include a larger variety of data- quality constraints in LLUNATIC.

3 LLUNATIC: Finding repairs using the chase

With the constraint formalism fixed, we next turn to there- pairingorcleaningof the data. Intuitively, repairing a dirty instance Isuch that I6|=Σ for some setΣ of egds means finding a clean instanceJsuch thatJ|=Σ andIandJare

“closely related”. As already mentioned in the Introduction, in recent years, repairing methods have been proposed for several classes of constraints. These methods typically con- sider only specific types of constraints and different inter- pretations of “closely related”. Furthermore, each of these

methods differs in how conflicting values (such as Robert and Frank in the previous example) are resolved. We refer to the Related Work Section 9 for more details.

With LLUNATIC we aim to provide a single-nodescal- able algorithmic framework for finding repairsof instances for a set of egds, hereby covering different classes of con- straints in auniform way. Moreover,different conflict reso- lution strategiesshould beeasy to incorporatein the frame- work, without the need of changing the underlying algo- rithm (and thus implementation). To this aim, LLUNATIC

uses ageneralization of the standard chase procedureby in- corporatingpreference information on values in cellsand by producing achase treeconsisting ofchase sequences, where each sequence leads to a repair.

We next recall the standard chase procedure and then identify why it needs to be revised in order to become a true workhorse for repairing. In particular, we argue the need for better conflict resolution (Section 3.2), support for constant egds (Section 3.3), backward repairs (Section 3.4) and user provided repairs (Section 3.5). With the help of a motivating example, we informally introduce the main concepts used in LLUNATICto address these issues. A formal account will be given in Sections 4 and 5.

3.1 The standard chase

When Σ consists of variable egds only, the chase proce-
dure (or, simply the chase) provides an elegant repairing
method [4]. It works as follows. Consider a dirty instance
I6|=Σ and variable egde:φ(x)¯ →x_{i}=x_{j} inΣ. A homo-
morphismhfromφ(x)¯ toI= (I_{1}, . . . ,I_{k})is a mapping which
assigns to every variablexin ¯xa value inCONSTS∪NULLS

(and TIDSfor the variables in Tid-attributes) such that ev-
ery relational atomR_{i}(s)¯ inφ(x)¯ maps onto a tupleh(s)¯ ∈I_{i},
wherehis the identity onCONSTS. Whenh(x_{i})6=h(x_{j}), we
say thate can be applied toIwith homomorphism h. The
result of applying e onIwith his defined as follows: Ifh(x_{i})
andh(x_{j})are two different constants in CONSTS, then the
result of applyingeonIwithh is “failure”, and we write
I→^{e,h} . Otherwise, the result is a new instanceI^{0}, defined as
follows. Whenh(x_{i})∈CONSTSandh(x_{j})∈NULLS, the null

valueh(x_{j})is replaced everywhere inIby the constant value
h(x_{i}), resulting inI^{0}. Whenh(x_{i})andh(x_{j})are both null val-
ues, then one is replaced everywhere by the other, resulting
inI^{0} ^{1}. In both cases we writeI→^{e,h}I^{0}. Then, for a setΣ of
variable egds, achase sequence ofIwithΣis a sequence of
the formI_{i}^{e}→^{i}^{,h}^{i}I_{i+1} withi=0,1, . . .,I_{0}=I,e_{i}∈Σ andh_{i}
a homomorphism frome_{i}toI_{i}. Afinite chase ofIwithΣis
a finite chase sequenceI_{i}^{e}→^{i}^{,h}^{i}I_{i+1},i∈[0,m−1], such that
eitherI_{m}= or no egd eexists inΣ for which there is a
homomorphismhsuch thatecan be applied toI_{m}withh.

We callI_{m}theresultof such a finite chase and whenI_{m}6= ,
the instanceI_{m}is called theresult of a successful chase. It
is known that if I_{m}is the result of a successful chase ofI
withΣ, then I_{m}|=Σ andI_{m} is thus clean [4]. The repair
I_{m} has many other nice theoretical properties (universality,
independence of the order in which egds are applied, ...),
see e.g., [22]. Our revised chase does not inherit these prop-
erties. Our primary goal, however, is using the chase as a
practical way of generating repairs.

3.2 Avoiding failure by conflict resolution

Clearly, when repairingdata, the standard chase will often be unsuccessful because different constants may need to be equated. This is not surprising. After all, the chase was orig- inally designed to reason about dependencies, where the in- put instance does not contain constants, and not for repairing data [4]. As an example, we chase our running example with variable egds.

Example 3 Consider the variable egdse_{1},e_{2},e_{3} ande_{4} in
Figure 2. SinceI|=e_{2}, onlye_{1},e_{3} ande_{4}are applicable.

It is readily verified that there is a homomorphismhfore_{4}
such that I^{e}→^{4}^{,h}I^{0} withI^{0} obtained from I by replacing the
null value⊥_{1}int_{2}[Hospital]andt_{3}[Hospital]by PPTH, i.e.,
the value fromt_{1}[Hospital]. However, chasingI^{0}further with
e1ande3results in a failure. Indeed,e1requirest1[Name] =
Robert to be equal tot_{2}[Name] =Frank which are two dif-
ferent constants. Similarly,e_{3}requirest_{1}[Spec] =surg to be
equal tot2[Spec] =urol. ♦

Instead of simply returning failure ( ) it is desirable, from a data repairing perspective, for the chase to (i) ei- ther report the reasons for failure, or (ii) resolve conflicts between constant values based on some additional informa- tion. In LLUNATIC this is achieved as follows. Let Ibe a database instance.

– The initial step consists of adorning the cells in I with preference levelsfrom apartially ordered set(P,P)and

1 Typically, to make this step deterministic, an ordering on null val- ues is assumed and the smaller null value is replaced by the larger one.

We assume that⊥0<⊥1<⊥2<⊥3<· · ·.

combining these with values of the cells inI. As a result,
each cell initially contains apreference labelof the form
hp,viwherep∈Pandvis the value in the cell, resulting
in theinitial labeled instanceI^{◦}.Looking ahead, the ini-
tial labeled instance will be changed during the (revised)
chase process into a labeled instance in which cells may
havemultiplepreference labels. Preference labels allow
comparing values based on their preference levels and the
order between these levels according to_{P}. We use this
information to resolve conflicts,when cells have multiple
preference labels after chasing, by taking the most pre-
ferred value, i.e., the value with the highest preference
level. The partial order(P,_{P})is fixed at the beginning of
the repairing process. For preference levelspandp^{0}, we
denote byp≺_{P}p^{0}, ifp_{P}p^{0}andp^{0}6_{P}p.

Example 4 In Figure 3a we show an initial labeled instance
I^{◦}obtained fromI(cfr. Figure 2) by putting in each cell ofI
a single preference level together with the value of that cell.

From here on, we do not depict the attributeTidsince it is
only used for identifying tuples and the tuple identifiers do
not interact with the egds. InI^{◦}we find preference levels
p⊥_{1} andp⊥_{2} assigned to null values⊥1and⊥2in cellsc_{25},
c_{35}andc_{31}, respectively. We also find preference levelsp_{0.1},
p_{0.9}andp_{1}, associated with urol, surg and diag, in cellsc_{23},
c_{13}andc_{33}, respectively. These preference levels encode the
confidence of these values according to theConf attribute.

By imposing p_{0.1}≺_{P} p_{0.9}≺_{P} p_{1}we encode that the higher
the confidence is, the more preferred the value associated
with these levels in the preference labels is. This is an im-
portant example showing how external information (in this
case confidence) can be encoded in preference levels and la-
bels. We generalize the use of attributes that encode some
ordering (such asConf) to assign preference levels to values
in another attribute (such asSpec) later in Section 4.5 in the
form of apartial order specification. All other preference
levels in I^{◦} are chosen arbitrarily, except that we impose
that p⊥_{1} ≺_{P}p_{15}. This is to indicate that when the value in
the preference labels of cellsc_{25}andc_{35}(i.e., the value⊥_{1})
needs to be equated later on with the value in the label ofc_{15}
(i.e., PPTH) due to egde_{4}, that PPTH is preferred over⊥_{1}.
We here capture the semantics that constants are preferred to
null values, just as in the standard chase. From the labeled
instanceI^{◦}we can obtain a normal instanceinst(I^{◦})simply
by picking the values in the preference labels in each cell
with the highest preference value (Figure 3b). InI^{◦}, each cell
carries a single preference label holding the original value of
that cell in the instanceIgiven in Figure 2. Hence,inst(I^{◦}) =
Iin this case. The partial order_{P}used is shown in Figure 4
and comes into play when chasingI^{◦}. In this partial order
we more generally assume thatp⊥_{i}≺Ppfor any preference
level p_{⊥}_{i} associated with null value⊥_{i}and any preference
value p associated with a constant inI^{◦}. Furthermore, we

D(octors)

NPI Name Surname Spec Conf Hospital

c_{10} c_{11} c_{12} c_{13} c_{14} c_{15}

t_{1} {hp_{10},111i} {hp_{11},Roberti} {hp_{12},Chasei} {hp_{0.9},surgi} {hp_{14},0.9i} {hp_{15},PPTHi}

c_{20} c_{21} c_{22} c_{23} c_{24} c_{25}

t_{2} {hp20,111i} {hp21,Franki} {hp22,Chasei} {hp0.1,uroli} {hp24,0.1i} {hp⊥1,⊥1i}

c30 c31 c32 c33 c34 c35

t3 {hp30,222i} {hp⊥2,⊥2i} {hp32,Housei} {hp1,diagi} {hp34,1i} {hp⊥1,⊥1i}

(a) Initial labeled instanceI^{◦}

D(octors)

NPI Name Surname Spec Conf Hospital
c_{10} c_{11} c_{12} c_{13} c_{14} c_{15}
t_{1} 111 Robert Chase surg 0.9 PPTH

c_{20} c_{21} c_{22} c_{23} c_{24} c_{25}
t_{2} 111 Frank Chase urol 0.1 ⊥1

c30 c31 c32 c33 c34 c35

t3 222 ⊥2 House diag 1 ⊥1

(b) Corresponding instanceinst(I^{◦}).

D(octors)

NPI Name Surname Spec Conf Hospital

c_{10} c_{11} c_{12} c_{13} c_{14} c_{15}

t_{1}

{hp_{10},111i} {hp11,Roberti, {hp_{12},Chasei} {hp0.9,surgi, {hp_{14},0.9i} {hp15,PPTHi,
hp_{21},Franki} hp_{0.1},uroli} hp_{⊥}_{1},⊥_{1}i}

c20 c21 c22 c23 c24 c25

t_{2}

{hp_{20},111i} {hp_{11},Roberti,

{hp_{22},Chasei} {hp_{0.9},surgi,

{hp_{24},0.1i} {hp_{15},PPTHi,
hp21,Franki} hp0.1,uroli} hp⊥_{1},⊥1i}

c_{30} c_{31} c_{32} c_{33} c_{34} c_{35}

t_{3}

{hp_{30},222i} {hp_{⊥}_{2},⊥_{2}i} {hp_{32},Housei} {hp_{1},diagi} {hp_{34},1i} {hp_{15},PPTHi,
hp⊥_{1},⊥_{1}i}

(c) Labeled instanceI_{1}^{?}chased using variable egds inΣ.

D(octors)

NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15

t1 111 `0 Chase surg 0.9 PPTH c20 c21 c22 c23 c24 c25

t2 111 `0 Chase surg 0.1 PPTH c30 c31 c32 c33 c34 c35

t3 222 ⊥2 House diag 1 PPTH
(d) Corresponding instanceinst(I_{1}^{?}).

D(octors)

NPI Name Surname Spec Conf Hospital

c_{10} c_{11} c_{12} c_{13} c_{14} c_{15}

t1

{hp_{10},111i}

{hp11,Roberti,

{hp_{12},Chasei}

{hp0.9,surgi,

{hp_{14},0.9i}

{hp15,PPTHi,
hp_{21},Franki} hp_{0.1},uroli} hp_{⊥}_{1},⊥_{1}i,

hpau,PPTHi}

c_{20} c_{21} c_{22} c_{23} c_{24} c_{25}

t2

{hp_{20},111i}

{hp11,Roberti,

{hp22,Chasei}

{hp0.9,surgi,

{hp24,0.1i}

{hp_{15},PPTHi,
hp21,Franki} hp0.1,uroli} hp⊥1,⊥1i}

c30 c31 c32 c33 c34 c_{35}

t_{3}

{hp_{30},222i}

{hp⊥_{2},⊥_{2}i,

{hp_{32},Housei} {hp_{1},diagi} {hp_{34},1i}

{hp_{15},PPTHi,

hpau,Gregoryi, hp⊥_{1},⊥_{1}i}

hpau,Gregi}

(e) Labeled instanceI_{2}^{?}chased usingΣ.

D(octors)

NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15

t1 111 `0 Chase surg 0.9 PPTH c20 c21 c22 c23 c24 c25

t2 111 `0 Chase surg 0.1 PPTH
c30 c31 c32 c33 c34 c_{35}
t3 222 `_{1} House diag 1 PPTH

(f) Corresponding instanceinst(I_{2}^{?}).

D(octors)

NPI Name Surname Spec Conf Hospital

c10 c11 c12 c13 c14 c15

t1 {hp10,111i} {hp11,Roberti} {hp12,Chasei} {hp0.9,surgi} {hp14,0.9i} {hp15,PPTHi}

c_{20} c_{21} c_{22} c_{23} c_{24} c_{25}

t_{2} {hp20,111i, {hp_{21},Franki} {hp_{22},Chasei} {hp_{0.9},uroli} {hp_{24},0.1i} {hp⊥1,⊥1i,

hp×,×i} hpau,PPTHi}

c_{30} c_{31} c_{32} c_{33} c_{34} c_{35}

t_{3}

{hp30,222i}

{hp⊥_{2},⊥_{2}i,

{hp32,Housei} {hp1,diagi} {hp34,1i}

{hp⊥_{1},⊥_{1}i,

hpau,Gregoryi, hpau,PPTHi}

hpau,Gregi}

(g) Labeled instanceI_{3}^{?}chased usingΣwith a backward repair.

D(octors)

NPI Name Surname Spec Conf Hospital c10 c11 c12 c13 c14 c15

t1 111 Robert Chase surg 0.9 PPTH c20 c21 c22 c23 c24 c25

t2 `2 Frank Chase urol 0.1 PPTH c30 c31 c32 c33 c34 c35

t3 222 `1 House diag 1 PPTH
(h) Corresponding instanceinst(I_{3}^{?}).

D(octors)

NPI Name Surname Spec Conf Hospital

c_{10} c_{11} c_{12} c_{13} c_{14} c_{15}

t_{1}

{hp10,111i} {hp11,Roberti} {hp12,Chasei} {hp0.9,surgi} {hp14,0.9i} {hp_{15},PPTHi}

c20 c21 c22 c23 c24 c25

t2 {hp20,111i,

{hp21,Franki} {hp22,Chasei} {hp0.9,uroli} {hp24,0.1i} {hp⊥1,⊥1i

hp×,×i, {hp_{au},PPTHi}

hp>,112i}

c_{30} c_{31} c_{32} c_{33} c_{34} c_{35}

t_{3}

{hp30,222i}

{hp⊥_{2},⊥_{2}i, {hp_{32},Housei, {hp_{1},diagi,

{hp34,1i}

{hp⊥_{1},⊥_{1}i,
hpau,Gregoryi, hpau,PPTHi}

hp_{au},Gregi,
hp>,Gregoryi}

(i) Labeled instanceI_{4}^{?}chased withΣand a user repair.

D(octors)

NPI Name Surname Spec Conf Hospital
c10 c11 c12 c13 c14 c_{15}
t_{1} 111 Robert Chase surg 0.9 PPTH

c_{20} c_{21} c_{22} c_{23} c_{24} c_{25}
t_{2} 112 Frank Chase urol 0.1 PPTH

c_{30} c_{31} c_{32} c_{33} c_{34} c_{35}
t_{3} 222 Gregory House diag 1 PPTH

(j) Corresponding instanceinst(I_{4}^{?}).

Fig. 3: Running example: Labeled instances and their corresponding (standard) instances at different times during the LLU-

NATICchase process.