
HAL Id: hal-02274497

https://hal.inria.fr/hal-02274497v2

Submitted on 3 Sep 2019


A Factorized Version Space Algorithm for "Human-In-the-Loop" Data Exploration

Luciano Di Palma, Yanlei Diao, Anna Liu

To cite this version: Luciano Di Palma, Yanlei Diao, Anna Liu. A Factorized Version Space Algorithm for "Human-In-the-Loop" Data Exploration. ICDM - 19th IEEE International Conference on Data Mining, Nov 2019, Beijing, China. ⟨hal-02274497v2⟩

A Factorized Version Space Algorithm for "Human-In-the-Loop" Data Exploration

Luciano Di Palma (LIX, Ecole Polytechnique, Palaiseau, France) luciano.di-palma@polytechnique.edu

Yanlei Diao (LIX, Ecole Polytechnique, Palaiseau, France) yanlei.diao@polytechnique.edu

Anna Liu (Dept. of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, USA) anna@math.umass.edu

Abstract—While active learning (AL) has been recently applied to help the user explore a large database to retrieve data instances of interest, existing methods often require a large number of instances to be labeled in order to achieve good accuracy. To address this slow convergence problem, our work augments version space-based AL algorithms, which have strong theoretical results on convergence but are very costly to run, with additional insights obtained in the user labeling process. These insights lead to a novel algorithm that factorizes the version space to perform active learning in a set of subspaces. Our work offers theoretical results on optimality and approximation for this algorithm, as well as optimizations for better performance. Evaluation results show that our factorized version space algorithm significantly outperforms other version space algorithms, as well as a recent factorization-aware algorithm, for large database exploration.

Index Terms—Active learning, version space, data exploration

I. INTRODUCTION

In the setting of interactive data exploration (IDE), a user who comes to explore a large database is often driven by the goal of understanding some particular phenomenon. Such goals are treated as building a classification model over all objects in the database from no or very few labeled instances [1]–[4]; examples include classifying all the models in a car database as relevant or irrelevant to the interest of a customer, or classifying all the observations in a digital sky survey as relevant or irrelevant to the interest of a scientist. For IDE, active learning [5] has been explored to derive an accurate model with minimum user labeling effort while offering interactive performance in presenting the next data instances for the user to label [2], [3].

In IDE, however, existing active learning techniques [5]–[8] often fail to provide satisfactory performance when such models need to be built over large databases. For example, our evaluation results show that on a Sloan Digital Sky Survey (SDSS) dataset of 1.9 million data instances, the state-of-the-art technique for IDE [3] requires the user to label 200-300 instances in order to learn a classification model over 6 attributes with an F-score¹ of 95%. Similar results can be observed for other active learning techniques [5]–[7]. Asking the user to label a large number of instances to achieve accuracy, referred to as slow convergence, is undesirable.

¹F-score is a better measure for exploring a large database than classification error given that the user interest, i.e., the positive class, often covers only a small fraction of the database; classifying all instances to the negative class has low classification error but poor F-score.

In this work, we aim to design new techniques to overcome the slow convergence problem by exploring two ideas:

Version Space Algorithms: First, we would like to leverage Version Space (VS) algorithms [5], [7], [9], [10] because they present a strong theoretical foundation for convergence. These algorithms model all possible configurations of a classifier that can correctly classify the current set of labeled data as a set of hypotheses forming a version space V, and aim to seek the next instance such that its acquired label will enable V to be reduced. The best-known example is the bisection rule [9], [10], which, among all unlabeled instances, looks for the one whose label allows V to be reduced by half, or as close to half as possible. It is shown theoretically to have near-optimal performance for convergence.

Implementing the bisection rule, however, is prohibitively expensive due to the exponential size of a version space.

To reduce cost, various approximations have been proposed.

Some of the most popular techniques are Simple Margin [7], Query-by-Disagreement [5], Query-by-Committee (QBC) [11], and ALuMA [6]. The first two methods often suffer from suboptimal performance since they are very rough approximations of the bisection rule. On the other hand, QBC and ALuMA can better approximate the bisection rule by sampling the version space, which, however, is very costly to run on large databases. For example, ALuMA runs Hit-and-Run sampling with thousands of steps to sample a single hypothesis, and repeats this procedure to sample 1000 hypotheses for estimating the instance's reduction power of the version space. As can be seen, new techniques are needed to enable VS algorithms to achieve both fast convergence and high efficiency on large databases.

Factorization. To make VS algorithms practical for large databases, our second idea is to augment them with additional insights obtained in the user labeling process. In particular, we observe that when a user labels a data instance, her decision making process often can be broken into a set of smaller questions, and the answers to these questions can be combined to derive the final answer. For example, when a customer decides whether a car model is of interest, she may have a set of questions in mind: "Is gas mileage good enough? Is the vehicle spacious enough? Is the color a preferred one?" While the user may have high-level intuition that each question is related to a subset of attributes (e.g., the space depends on the body type, length, width, and height), she is not able to specify these questions precisely. It is because she may not know the exact threshold value or the exact relation between a question and related attributes (e.g., how the space can be specified as a function of body type, length, width, and height). The above insight allows us to design new version space algorithms that leverage the high-level intuition a user has for breaking up the decision making process, formally called a factorization structure, to combat the slow convergence problem.

More specifically, we make the following contributions:

1. A Factorized Version Space Algorithm (Section III): We propose a novel algorithm that leverages the factorization structure to create subspaces and factorizes the version space accordingly to perform active learning in the subspaces.

Compared to recent work [3] that also used factorization for active learning, our work explores it in the new setting of VS algorithms and completely eliminates the strong assumptions made in [3] such as convex and conjunctive properties of user interest patterns, resulting in significant performance improvement.

2. Theoretical results (Section IV): We also provide theoretical results on the optimality of our factorized VS algorithm.

3. Optimizations (Section V): We further provide optimizations for sampling over a large version space.

4. Evaluation (Section VI): Evaluation results show that 1) for low dimensional problems, our optimized VS algorithm, without factorization, already outperforms existing VS algorithms including Simple Margin [7], Query-by-Disagreement [5], and ALuMA [6]; 2) for higher dimensional problems, our factorized VS algorithm outperforms the VS algorithms [5]–[7], as well as DSM [3], a factorization-aware algorithm, often by a wide margin while maintaining interactive speed. For example, for a complex user interest pattern tested, our algorithm achieves an F-score of over 90% after 100 iterations, while DSM is still at 40% and all other VS algorithms are at 10% or lower.

II. RELATED WORK

In this section, we present the most relevant results in Active Learning and Data Exploration Systems.

Active Learning. The recent results on active learning are surveyed in [5]. Our work focuses on a common form called pool-based active learning. In this setting, there is a small set of labeled data L and a large pool of unlabeled data U available. In active learning, an example is chosen from the pool in a greedy fashion, according to a utility measure used to evaluate all instances in the pool (or, if U is large, a subsample thereof). In our setting of database exploration, the labeled data L is what the user has provided thus far. The pool U is a subsample of size m of the unlabeled part of the database. The utility measure varies with the classifier and algorithm used.

We focus on version space algorithms below.

Version Space Algorithms. Version Space (VS) algorithms are a particular class of active learning procedures. In such a procedure, the learner starts with a set of possible classifiers (hypotheses), which we denote by H. The version space V is defined as the subset of classifiers h ∈ H consistent with the labeled data, i.e., h(x) = y for all labeled points (x, y). As new labeled data is obtained, the set V will shrink, until we are left with a single hypothesis. Various strategies have been developed to select new instances for labeling.

Generalized Binary Search [9], [10]: Also called the Version Space Bisection rule, this algorithm searches for a point x for which the classifiers in V disagree the most; that is, the sets V_{x,y} = {h ∈ V : h(x) = y} have roughly the same size for all possible labels y. It has strong theoretical guarantees on convergence: the expected number of iterations needed to reach 100% accuracy is at most a constant factor larger than that of the optimal algorithm on average. Implementing the bisection rule, however, is prohibitively expensive: for each unlabeled instance x in the database, one must evaluate h(x) for each hypothesis h in the version space, which is exponential in size, O(m^d), where m is the number of unlabeled instances and d is the VC dimension [12]. A number of approximations of the bisection rule have been introduced in the literature:

Simple Margin [7]: As a rough approximation of the bisection rule for SVM classifiers, it leverages a heuristic that data points close to the SVM's decision boundary closely bisect the version space V. However, it can suffer from suboptimal performance, especially when V is very asymmetric.

Query By Disagreement [5]: This algorithm approximates the version space V by a positively biased and a negatively biased hypothesis, and selects a data point for which these two biased hypotheses disagree. Again, it also suffers from suboptimal performance since the selected point is only guaranteed to "cut" V, but possibly not by a large amount.

Query by Committee (QBC) [11] and ALuMA [6]: QBC [11] attempts to estimate how much each point reduces the version space V through a sampling technique. It is a much better approximation of the bisection rule, but works by taking a single pass over the entire dataset, hence it is not suitable for pool-based sampling. ALuMA [6] is a "pool-based" version of QBC and uses a different technique for sampling the version space. It is shown to outperform QBC. Hence, we use ALuMA as a baseline version space algorithm in this work.

Data Exploration Systems. In the data exploration domain, a main objective is to design a database system that guides the user towards discovering relevant records in a large database. One example is Snorkel [13], a data programming framework where an expert user writes several labeling functions representing simple heuristics used for labeling a data point. By leveraging the predictions of several such functions, Snorkel is capable of building an accurate classifier without having the user manually label any data point.

Another line of work is "explore-by-example" systems [1]–[4], which leverage active learning to obtain a small set of user-labeled data points for building an accurate model of the user interest. These systems require minimizing both the user labeling effort and the running time in each iteration to find a new data point for labeling. In particular, recent work [3] is shown to outperform prior work [1], [2] via the following technique:

Dual-Space Model (DSM) [3]: In this work, the user interest pattern P is assumed to form a convex object in the data space. For such a pattern, this work proposes a "dual-space model" (DSM), which builds not only a classifier but also a polytope model of the data space D, offering information including the areas known to be positive and the areas known to be negative. It uses both the polytope model and the classifier to decide the best instance to choose for labeling next. In addition, DSM explores factorization in a limited form: when the user pattern P is a conjunction of sub-patterns on non-overlapping attributes, it factorizes the data space into low-dimensional spaces and runs the dual-space model in each subspace.

III. A FACTORIZED VERSION SPACE ALGORITHM

To improve the efficiency of version space (VS) algorithms on large databases, we aim to augment them with additional insights obtained in the user labeling process. In particular, we observe that when a user labels a data instance, her decision making process often can be broken into a set of smaller "yes" or "no" questions, which can be answered independently, and these answers can be combined to derive the final answer. Revisit the previous example: when a customer decides whether a car model is of interest, she has three questions in mind:

Q1: Is gas mileage good enough?

Q2: Is the vehicle spacious enough?

Q3: Is the color a preferred one?

We do not expect the user to specify her questions precisely as classification models (which requires knowing the exact shape of the decision function and all constants used), but rather to have a high-level intuition of the set of questions and to which attributes each question is related.

Factorization Structure. Formally, we model such an intuition of the set of questions and the relevant attributes as a factorization structure: let us model the decision making process using a complex question Q defined on an attribute set A of size d. Based on the user intuition, Q can be broken into smaller independent questions, Q_1, ..., Q_K, where each Q_k is posed on a subset of attributes A_k = {A_{k1}, ..., A_{kd_k}}. The family of attribute sets, (A_1, ..., A_K), with |A_1 ∪ ... ∪ A_K| ≤ d, may be disjoint or overlapping in attributes, as long as the user believes that decisions for these smaller questions can be made independently. In our work, (A_1, ..., A_K) is referred to as the factorization structure.

Note that the factorization structure needs to be provided by the user as it reflects her understanding of her own decision making process. The independence assumption in the decision process should not be confused with the data correlation issue.

For example, the color and the size of cars can be statistically correlated, e.g., large cars are often black in color. But the user decision does not have to follow the data characteristics; e.g., the user may be interested in large cars that are red. As long as the user believes that her decision for the color and that for the size are independent, the factorization structure ({color}, {size}) applies. On the contrary, if the two attributes are not independent in the decision making process, e.g., the user prefers red if the car is small and black if the car is large, but the year of production is an independent concern, then the factorization structure can be ({color, size}, {year}).
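As a concrete (purely illustrative) representation, a factorization structure can be written down as nothing more than a family of attribute subsets; a minimal sketch in Python, with attribute names taken from the car example above:

```python
# Illustrative only: a factorization structure is a family of (possibly overlapping)
# attribute subsets, one subset per independent sub-question.
structure_independent = [("color",), ("size",)]          # color and size decided independently
structure_coupled     = [("color", "size"), ("year",)]   # color/size coupled, year independent
```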

Decision functions. Given a data instance x, we denote by x^k = proj(x, A_k) the projection of x over attributes A_k. We use Q_k(x^k) → {−, +} to denote the user decision on x^k. Then we assume that the final decision is derived from the answers to these questions by a boolean function F : {−, +}^K → {−, +}:

Q(x) = F(Q_1(x^1), ..., Q_K(x^K))    (1)

In this work, we assume that the decision function F is given by the user. The most common example is the conjunctive form, Q_1(x^1) ∧ ... ∧ Q_K(x^K), meaning that the user requires each small question to be + in order to label the overall instance with +. Given that any boolean expression can be written in conjunctive normal form, Q_1(x^1) ∧ ... ∧ Q_K(x^K) already covers a large class of decision problems, while our work also supports other decision functions that use ∨.

Given the decision function F, we aim to learn the subspatial decision functions, {Q_1(x^1), ..., Q_K(x^K)}, efficiently from a small set of labeled instances. For a labeled instance x, the user provides a collection of subspatial labels (y^1, ..., y^K) ∈ {−, +}^K to enable learning.

Generality. We note the differences of our factorization framework from [3]: First, one of the main assumptions in [3] is that the set {x^k : Q_k(x^k) = +} or the set {x^k : Q_k(x^k) = −} must be a convex object, which is eliminated in this work. Second, the global decision function F must be conjunctive in [3], which is relaxed to any boolean expression in our work. Third, our factorization is applied to version space algorithms, as shown below, while [3] does not consider them at all.

A. Introduction to Factorized Version Space

We now give an intuitive description of factorized version space (while we defer a formal description to Section IV).

Without factorization, our problem is to learn a classifier C (e.g., an SVM classifier) on the attribute set A from a labeled data set, L = {(x_i, y_i)}, where x_i is a data instance containing values of A, and y_i ∈ {−, +}. The version space V includes all possible configurations of C (e.g., all possible weight vectors of the SVM) that are consistent with L.

Given a factorization structure (A_1, ..., A_K), we define K subspaces, where the k-th subspace includes all the data instances projected on A_k. Then our goal is to learn a classifier C_k for each subspace k from its labeled set, L_k = {(x^k_i, y^k_i)}, where y^k_i is the subspatial label for x^k_i. For the classifier C_k, its version space, V_k, includes all possible configurations of C_k that are consistent with L_k. Across all subspaces, we can reconstruct the version space via Ṽ = V_1 × ··· × V_K. For any unlabeled instance x, we can use F(C_1(x^1), ..., C_K(x^K)) to predict its label.
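To make the combination step concrete, the following minimal Python sketch (ours, not the paper's implementation; the classifier objects are assumed to expose a scikit-learn-style predict returning 0/1) applies F(C_1(x^1), ..., C_K(x^K)) to a batch of instances:

```python
import numpy as np

def predict_factorized(X, attribute_groups, classifiers, F=all):
    """Combine per-subspace predictions into F(C_1(x^1), ..., C_K(x^K)).

    X: (n, d) array of instances.
    attribute_groups: list of column-index lists, one per attribute subset A_k.
    classifiers: list of fitted binary classifiers (predict -> {0, 1}), one per subspace.
    F: boolean combiner over the K subspace answers (default: conjunction).
    """
    # Q_k(x^k): evaluate each subspace classifier on the projected instances.
    sub_preds = [clf.predict(X[:, cols]) for cols, clf in zip(attribute_groups, classifiers)]
    # Combine the K answers instance by instance.
    return np.array([F(bool(p[i]) for p in sub_preds) for i in range(len(X))])
```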

At this point, the reader may wonder what benefit factorization provides in the learning process. We use the following example to show that factorization may enable faster reduction of the version space, hence enabling faster convergence to the correct classification model.

Fig. 1. Illustration of factorization on the version space: (a) without factorization, the version space over the four (color, size) car types shrinks from 16 classifiers to 8 after the labeled example (BL, '−'); (b) with factorization, the color and size subspaces start with 4 classifiers each (16 combined) and shrink to 2 each (4 combined) after the labeled example ((B, '−'), (L, '+')).

Example. Figure 1 shows an example in which the user considers the color and the size of cars, where the color can be black (B) or red (R) and the size can be large (L) or small (S). Therefore, there are four types of cars corresponding to the different color and size combinations. Figure 1(a) shows that without factorization, and in the absence of any user-labeled data, the version space contains 16 possible classifiers that correspond to the 16 combinations of the {−, +} labels assigned to the four types of cars. Once we obtain the '−' label for the type BL (color = Black and size = Large), the version space is reduced to the 8 classifiers that assign the '−' label to BL.

Next consider factorization. In the absence of labeled data, Figure 1(b) shows that the subspace for color includes two types of cars, B and R, and its version space includes 4 possible classifiers that correspond to the 4 combinations of the {−, +} labels of these two types of cars. Similarly, the subspace for size also includes 4 classifiers. Combining the color and size, we have 16 possible classifiers. Once BL is labeled, this time with two subspatial labels, ((B, −), (L, +)), each subspace has only two classifiers left, yielding 4 remaining classifiers across the two subspaces. As can be seen, with factorization each labeled instance offers more information and hence can lead to faster reduction of the version space.
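The counts in this example are small enough to verify by brute-force enumeration; the following sketch (ours) reproduces the 16 → 8 reduction without factorization and the 16 → 4 reduction with factorization:

```python
from itertools import product

cars = [("B", "L"), ("B", "S"), ("R", "L"), ("R", "S")]

# Without factorization: a hypothesis assigns +/- to each of the 4 car types.
joint = [dict(zip(cars, labels)) for labels in product("-+", repeat=4)]
print(len(joint))                                          # 16
print(len([h for h in joint if h[("B", "L")] == "-"]))     # 8 after labeling BL as '-'

# With factorization: one hypothesis per subspace (color, size).
color_hs = [dict(zip("BR", labels)) for labels in product("-+", repeat=2)]
size_hs  = [dict(zip("LS", labels)) for labels in product("-+", repeat=2)]
print(len(color_hs) * len(size_hs))                        # 16 before any label
remaining = [(hc, hs) for hc in color_hs for hs in size_hs
             if hc["B"] == "-" and hs["L"] == "+"]
print(len(remaining))                                      # 4 after labeling ((B,'-'), (L,'+'))
```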

B. Overview of A Factorized Version Space Algorithm

Based on the above insight, we propose a new active learning algorithm, called a factorized version space algorithm. It leverages the factorization structure provided by the user to create subspaces, and factorizes the version space accordingly to perform active learning in the subspaces.

Algorithm 1 shows the pseudo-code of our algorithm. It starts by taking a labeled dataset (which can be empty) and creating a memory-resident sample from the underlying large database as an unlabeled pool U (line 1) to enable efficiency for interactive performance. It then proceeds to an iterative procedure (lines 2-11): in each iteration, it may further subsample the unlabeled pool to obtain U′ to expedite learning (line 3). Then the algorithm considers each subspace (lines 4-8), including both the labeled instances, (x^k, y^k) ∈ L_k, and the unlabeled instances, x^k ∈ U′, projected to this subspace.

Algorithm 1 A Factorized Version Space Algorithm

Input: database D, initial labeled set L_0, per-iteration subsample size m, version space sample size M

1: L ← L_0, U ← D
2: while user is still willing do
3:   U′ ← subsample(U, m)
4:   for each subspace k do
5:     U_k ← {x^k, for x ∈ U′}
6:     L_k ← {(x^k, y^k), for (x, y) ∈ L}
7:     {p_k(x)}_{x∈U′} ← positive_cut_probability(U_k, L_k, M)
8:   end for
9:   x ← arg min_{x∈U′} ∏_k (1 − 2(1 − p_k(x)) p_k(x))
10:  y ← get_labels_from_user(x)
11:  L ← L ∪ {(x, y)}, U ← U \ {x}
12: end while
13: C_k ← train_majority_vote_classifier(L_k), k = 1, ..., K
14: return x ↦ F(C_1(x^1), ..., C_K(x^K))

The key step (line 7) is to compute, for each unlabeled instance x, how much its projection x^k can reduce the version space V_k once its label is acquired. This step requires efficient sampling of the version space V_k, which is a main focus of Section V. Once the above computation completes for all subspaces, the algorithm chooses the next unlabeled instance that offers the best reduction of the factorized version space, Ṽ = V_1 × ··· × V_K (line 9). The derivation of this strategy, as well as the proof of its optimality, is detailed in Section IV. The selected instance is then presented to the user for labeling and the unlabeled pool is updated. The algorithm then proceeds to the next iteration. Once the user wishes to stop exploring, we train a majority vote classifier [6] for each subspace k. The majority vote classifier, C, is constructed by first computing a sample of hypotheses from the version space. Then, for any point x, C(x) is computed as the most frequent label across the sample. Given the classifiers (C_1, ..., C_K) for the subspaces, we build a final classifier F(C_1, ..., C_K) (line 14), which can then be used to retrieve all the data instances of interest to the user from the database D.
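The selection step (line 9 of Algorithm 1) reduces to simple arithmetic once the per-subspace cut probabilities p_k(x) are available; a minimal sketch (ours), assuming those estimates have already been computed:

```python
import numpy as np

def select_next(cut_probs):
    """Pick the next instance to label (Algorithm 1, line 9).

    cut_probs: (n, K) array where cut_probs[i, k] estimates p_k(x_i), the probability
               that a hypothesis sampled from V_k labels the projection x_i^k positive.
    Returns the index minimizing prod_k (1 - 2 p_k (1 - p_k)), i.e. the instance whose
    subspatial labels most evenly bisect the factorized version space.
    """
    p = np.asarray(cut_probs, dtype=float)
    scores = np.prod(1.0 - 2.0 * p * (1.0 - p), axis=1)
    return int(np.argmin(scores))

# Example: three candidates, two subspaces; the second candidate is most informative.
print(select_next([[0.9, 0.1], [0.5, 0.5], [0.6, 0.2]]))   # -> 1
```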

IV. THEORETICAL ANALYSIS

A. Bisection Rule over Factorized Version Space

Let X = {x_i}_{i=1}^N be the collection of unlabeled data points, and let y_i ∈ Y represent the unknown label of x_i. The user interest pattern can be modeled as a hypothesis function h : X → Y, and we denote by H the set of all hypotheses. We also assume a known probability distribution π(h), representing our prior knowledge over which hypotheses are more likely to match the user preference.

The version space is the set of all hypotheses consistent with the labeled set L: V = {h ∈ H : h(x) = y, ∀(x, y) ∈ L}.

Our factorized version space algorithm is based on a well-known concept called the Version Space Bisection rule (or Generalized Binary Search). It searches for the point x which most evenly divides the version space across all classes:

arg max_x  1 − ∑_{y∈Y} p_{x,y}²    (2)

where p_{x,y} = π_V(V_{x,y}). Here π_V is the probability distribution π normalized over the version space V and V_{x,y} = {h ∈ V : h(x) = y}. Thus, the greedy strategy searches for the point x for which the sets V_{x,y} have approximately the same probability mass for every possible label y of x.

Now, let us suppose that a factorization structure (A_1, ..., A_K) is given. For each subspace k, the user labels the projection x^k of x over A_k based on a hypothesis from a hypothesis class H_k with prior probability distribution π_k. The user thus provides a binary label {−, +} for each subspace; in other words, for each x a collection of subspatial labels (y^1, ..., y^K) ∈ {−, +}^K is provided by the user.

Definition 1: Factorized hypothesis function and factorized hypothesis space: Let Y_f = {−, +}^K. We define the factorized hypothesis function as the function H : X → Y_f such that

H(x) = (h_1, ..., h_K)(x) = (h_1(x^1), ..., h_K(x^K)).

H belongs to the product space H̃ = H_1 × ... × H_K, which we call the factorized hypothesis space.

We assume that the user labels the subspaces independently and consistently within each subspace.

Definition 2: Factorized version space: Let L be the set of instances that have been labeled over subspaces at any iteration of active learning. Define the factorized version space as Ṽ = {H ∈ H̃ : H(x) = y for all (x, y) ∈ L}. We can see that Ṽ = ∏_k V_k = V_1 × ... × V_K, where V_k = {h ∈ H_k : h(x^k) = y^k for all (x, y) ∈ L} is the version space at subspace k.

Given the assumption of independent labeling among the subspaces, the prior probability distribution over the factorized hypothesis space is π̃ = π_1 × ... × π_K.

Definition 3: Factorized greedy selection strategy: Let p_{x,y} = π̃_Ṽ(Ṽ_{x,y}); the strategy is defined as

arg max_x  1 − ∑_{y∈Y_f} p_{x,y}²    (3)

B. Optimal properties

For the version space bisection rule, it has been shown [10] that the greedy strategy in (2) enjoys a near-optimal performance guarantee: the average number of instances labeled by the greedy algorithm is no larger than

OPT · (1 + ln(1 / min_h π(h)))²,    (4)

where OPT is the minimum expected number of iterations across all active learning algorithms that continue until the hypothesis matching the user interest has been found.

Factorized greedy selection strategy. Applying the bisection rule to (X, Y_f, H̃, π̃), and noting that min_{H∈H̃} π̃(H) = ∏_k min_{h_k∈H_k} π_k(h_k), the following theorem is a direct extension of the above result.

Theorem 1 (Number of iterations with factorization): The factorized greedy strategy in (3) takes at most

OPT_f · (1 + ∑_k ln(1 / min_{h_k} π_k(h_k)))²    (5)

iterations in expectation to identify a hypothesis randomly drawn from π̃, where OPT_f is the minimum number of iterations across all strategies that continue until the true hypothesis over all subspaces has been found.

In the following, we derive a simplified computation of the factorized greedy strategy (3).

Theorem 2: Let p_{x^k,+} = π^k_{V_k}(V^k_{x^k,+}); then the factorized greedy selection strategy (3) is equivalent to

arg max_x  1 − ∏_k (1 − 2 p_{x^k,+}(1 − p_{x^k,+})).    (6)

Proof: First, note that Ṽ_{x,y} = ∏_k V^k_{x^k,y^k}, which implies π̃_Ṽ(Ṽ_{x,y}) = ∏_k π^k_{V_k}(V^k_{x^k,y^k}). Therefore:

∑_{y∈Y_f} p_{x,y}² = ∑_{y^1} ... ∑_{y^K} ∏_k π^k_{V_k}(V^k_{x^k,y^k})²
                   = ∏_k ∑_{y^k} π^k_{V_k}(V^k_{x^k,y^k})²
                   = ∏_k (p_{x^k,+}² + (1 − p_{x^k,+})²)
                   = ∏_k (1 − 2 p_{x^k,+}(1 − p_{x^k,+}))
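Theorem 2 can also be checked numerically: for arbitrary subspace cut probabilities, the direct sum over all 2^K label combinations matches the product form. A small verification sketch (ours):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K = 4
p = rng.uniform(size=K)                       # p[k] plays the role of p_{x^k,+}

# Direct sum over all y in {-,+}^K of p_{x,y}^2, with p_{x,y} = prod_k p_{x^k,y^k}.
direct = sum(np.prod([p[k] if y[k] else 1 - p[k] for k in range(K)]) ** 2
             for y in product([0, 1], repeat=K))

# Product form from Theorem 2.
factored = np.prod(1 - 2 * p * (1 - p))

assert np.isclose(direct, factored)
print(direct, factored)
```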

V. OPTIMIZATIONS

The main difficulty in implementing the greedy selection strategies discussed in the previous section lies in the efficient computation of the cut probability p_{x,+} = π_V(V_{x,+}). For this, we adopt a sampling approximation [6], [8], [11]:

p_{x,+} = P(H(x) = +) ≈ (1/M) ∑_{i=1}^M 1(H_i(x) = +)    (7)

where {H_i}_{i=1}^M is an i.i.d. sample from π_V. Thus, our problem is to develop a sampling algorithm for hypotheses.
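For homogeneous linear hypotheses h_w(x) = sign(w^T x), the estimator in (7) is a simple vote average over sampled weight vectors; a minimal sketch (ours), assuming the hypotheses have already been sampled from the version space:

```python
import numpy as np

def positive_cut_probability(X, W):
    """Monte Carlo estimate of p_{x,+} for each row of X (Eq. 7).

    X: (n, d) unlabeled instances (already kernelized / bias-augmented if needed).
    W: (M, d) weight vectors sampled i.i.d. from the version space.
    Returns an (n,) array with the fraction of sampled hypotheses labeling each x positive.
    """
    votes = X @ W.T > 0          # (n, M) boolean matrix: H_i(x) = +
    return votes.mean(axis=1)
```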

Our sampling procedure closely follows [6]. We first consider the class of homogeneous linear classifiers:

H_lin = {h_w : h_w(x) = sign(w^T x), for ‖w‖ ≤ 1}    (8)

We can then improve the generalization power of this class in two ways. First, we can add a bias b to the linear classifier by simply adding a dummy feature of value 1 to each data point. Second, we can choose a kernel function k(·, ·) and replace each data point x by its kernel representation (k(x, x_1), ..., k(x, x_t)), where {x_1, ..., x_t} is the collection of data points labeled so far.

Now, assume we have a labeled set L = {(x_i, y_i)}, with y_i ∈ {−1, 1}. The version space is the set of all h_w with w restricted to the convex set:

W = {w ∈ R^d : ‖w‖ ≤ 1 ∧ y_i x_i^T w > 0}    (9)

Luckily, sampling from convex sets is a well-known problem in computational geometry. It can be solved by the Hit-and-Run algorithm [14], which creates a random walk inside the convex body that converges to the uniform distribution. A more detailed explanation of how to implement this step can be found in Appendix A.

Rounding. The Hit-and-Run algorithm itself can have a large mixing time, especially in cases where the convex body is very elongated in one direction. To solve this problem, we adopt a rounding procedure [15], [16]. Basically, it consists of finding a linear transformation T for which the convex body T(W) is more evenly elongated across all directions. The Hit-and-Run sampling is run over T(W), with the final sample over W being obtained through the inverse transformation T^{−1}. Computing the rounding transformation is done in two steps:

1) Find (an approximation of) the ellipsoid E of minimum volume containing W.
2) Set T as any linear isomorphism taking E into a ball of radius 1.

More details on how to implement these steps can be found in Appendix B.

Hit-and-Run's starting point. Hit-and-Run starts by computing a point X_0 inside W. Although this could be done by solving a linear programming task, it can take a significant amount of time. Instead, we rely on the rounding procedure: the computed ellipsoid E satisfies γE ⊂ W ⊂ E, where γE is obtained by shrinking all axes of E by some 0 < γ < 1. This property guarantees that the center of E must be inside W, which can then be used as the initial point.

VI. EXPERIMENTS

We implemented all of our techniques in a Java-based prototype, which connects to a PostgreSQL database. In this section, we evaluate these techniques against state-of-the-art active learning algorithms [3], [5]–[7] in terms of accuracy (using F-score) and efficiency (execution time in each iteration).

A. Experimental Setup

We evaluate our techniques using the Sloan Digital Sky Survey dataset (SDSS, 190 million tuples). This dataset contains the "PhotoObjAll" table with 510 attributes² and 190 million sky observations. We used a 1% sample (1.9 million tuples, 4.9 GB) for running active learning algorithms. SDSS also offers a query release, where the SQL queries reflect the data interests of scientists. We extracted 11 queries to build a benchmark, which can be found in Appendix C, and treated them as the ground truth of the positive classes of 11 classifiers to be learned. Our system does not need to know these queries in advance, but can learn them via active learning. These queries reflect different dimensionality (2D-7D) and complexity of the decision boundary (e.g., various combinations of linear, quadratic, and log patterns).

Algorithms: We compare our algorithms to state-of-the-art VS algorithms, Simple Margin (SM) [7], Query By Disagreement (QBD) [5], and ALuMA [6], and to another factorization-aware algorithm, DSM [3]. In our experiments, each active learning algorithm starts with one positive example and one negative example, and runs up to 100 additional labeled examples.

²http://www.sdss3.org/dr8/

Servers: Our experiments were run on four servers, each with 40-core Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz, 128GB memory, OpenJDK 1.8.0 on CentOS 7. Our factorized algorithm used one core for sampling in each subspace.

B. Evaluating Our Techniques using SDSS

Expt 1 (Optimization of Hit-and-Run): We first investigate the effect of rounding on the Hit-and-Run sampling, as proposed in Section V. The resulting algorithm, denoted as "Opt VS", is compared to a baseline without such optimization, which is ALuMA [6]. For all 11 SDSS queries, Opt VS significantly outperforms ALuMA; Figure 2(a) shows the F-scores for Q2 and Q3.

Expt 2 (Factorization): We next study the effect of factorization by comparing Opt VS with its extension to factorization, denoted as "Fact VS". For factorization, each predicate in the target query corresponds to its own factorized subspace. Q1-Q4 are 2D queries and are not factorized. Q5-Q7 are factorized such that each predicate corresponds to one subspace with no overlap of attributes across subspaces, while Q8-Q11 are also factorized but with overlap of attributes across subspaces. For Q5-Q11, Fact VS outperforms Opt VS, often by a wide margin. Figure 2(b) shows the F-scores for Q6 and Q10. In the particular case of Q6, the factorized version reaches >90% F-score after 100 iterations while the non-factorized version is still below 10%.

C. Comparing to Other Methods using SDSS

Expt 3 (A comparative study): In Figures 2(c)-2(e), we compare to three VS algorithms, Simple Margin (SM) [7], Query By Disagreement (QBD) [5], and ALuMA [6], as well as to DSM [3], a factorization-aware algorithm.

We first consider the case without factorization. Our Opt VS algorithm outperforms the three VS algorithms [5]–[7] and DSM [3] most of the time for the 2D queries, Q1-Q4, which do not use factorization. Figure 2(c) shows the results for Q2. During the initial iterations, Opt VS improves much faster than the other algorithms, and it remains the best across all iterations.

We next consider factorization. Our Fact VS almost always outperforms the others, including DSM, which uses factorization under stronger assumptions. Figures 2(d) and 2(e) show the results for Q6 and Q10. In general, Fact VS outperforms DSM, which in turn outperforms all other (non-factorized) algorithms. In the particular case of Figure 2(d), after 100 iterations Fact VS is at >90% accuracy, while DSM is still at merely 40% and all of the other alternatives are at 10% or lower.

Finally, Figure 2(f) shows the running time of Fact VS, DSM, and ALuMA for Q10, our most expensive query. ALuMA is consistently more expensive than Fact VS. The two factorized algorithms take at most a couple of seconds per iteration, thus better suiting the interactive data exploration scenario. Moreover, DSM exhibits a large warm-up time and is slower than Fact VS during the first 60 iterations. Thus, Fact VS may be preferred to DSM and ALuMA given its better accuracy and lower time per iteration in the initial iterations.

Fig. 2. Comparison between our proposed algorithms and the state-of-the-art over the SDSS dataset (x-axis: iteration; y-axis: F-score for (a)-(e), time per iteration in ms for (f)): (a) Opt VS versus ALuMA (Q2, Q3, F-score); (b) factorization effects (Q6, Q10, F-score); (c) Opt VS versus others (Q2, F-score); (d) Fact VS versus others (Q6, F-score); (e) Fact VS versus others (Q10, F-score); (f) Fact VS versus others (Q10, time).

VII. CONCLUSIONS

To overcome the slow convergence of active learning (AL) in large database exploration, we presented a new algorithm that augments version space-based AL algorithms, which have strong theoretical results on convergence but are costly to run, with insights on the factorization structure employed in the user labeling process. The resulting algorithm factorizes the version space to perform active learning in a set of subspaces, with provable results on optimality and optimizations for performance. Evaluation results using real-world datasets show that our algorithm significantly outperforms state-of-the-art version space algorithms, as well as a recent factorization-aware algorithm, for large database exploration.

REFERENCES

[1] K. Dimitriadou, O. Papaemmanouil, and Y. Diao, "Explore-by-example: an automatic query steering framework for interactive data exploration," in SIGMOD Conference, 2014.
[2] K. Dimitriadou, O. Papaemmanouil, and Y. Diao, "AIDE: An active learning-based approach for interactive data exploration," TKDE, 2016.
[3] E. Huang, L. Peng, L. D. Palma, A. Abdelkafi, A. Liu, and Y. Diao, "Optimization for active learning-based interactive database exploration," Proceedings of the VLDB Endowment, 2018.
[4] W. Liu, Y. Diao, and A. Liu, "An analysis of query-agnostic sampling for interactive data exploration," Communications in Statistics – Theory and Methods, 2017.
[5] B. Settles, Active Learning, 2016.
[6] A. Gonen, S. Sabato, and S. Shalev-Shwartz, "Efficient Active Learning of Halfspaces: an Aggressive Approach," JMLR, 2013.
[7] S. Tong and D. Koller, "Support Vector Machine Active Learning with Applications to Text Classification," JMLR, 2001.
[8] K. Trapeznikov, V. Saligrama, and D. Castañón, "Active Boosted Learning (ActBoost)," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[9] S. Dasgupta, "Analysis of a greedy active learning strategy," NIPS, 2005.

[10] D. Golovin and A. Krause, "Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization," in COLT, 2010.
[11] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Query by Committee Made Real," NIPS, 2005.
[12] M. J. Kearns and U. Vazirani, An Introduction to Computational Learning Theory, 1994.
[13] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré, "Data programming: Creating large training sets, quickly," in NIPS, 2016.
[14] L. Lovász, "Hit-and-run mixes fast," Mathematical Programming, 1999.
[15] L. Lovász, An Algorithmic Theory of Numbers, Graphs and Convexity, 1986.
[16] D. De Martino, M. Mori, and V. Parisi, "Uniform sampling of steady states in metabolic networks: Heterogeneous scales and rounding," PLoS ONE, 2015.
[17] G. Marsaglia, "Choosing a Point from the Surface of a Sphere," The Annals of Mathematical Statistics, 1972.
[18] H. S. Haraldsdóttir, B. Cousins, I. Thiele, R. M. Fleming, and S. Vempala, "CHRR: Coordinate hit-and-run with rounding for uniform sampling of constraint-based models," Bioinformatics, 2017.

APPENDIX A

HIT-AND-RUN IMPLEMENTATION

Let W ⊂ R^d be a convex body. The Hit-and-Run algorithm is a randomized algorithm for sampling a point x ∈ W uniformly at random. More precisely, it generates a Markov chain inside W which converges to the uniform distribution; starting at any given point X_0 ∈ W, it iteratively performs two steps:

1) Sample a direction vector D uniformly at random over the unit sphere.
2) Set X_{t+1} as a random point on the line segment {X_t + sD : s ∈ R} ∩ W.

Implementing step 1 can easily be done through the Marsaglia algorithm [17]: simply sample D ∼ N(0, I_d) and set D ← D/‖D‖.

As for step 2, the main difficulty is to find the intersections of the line L = {X_t + sD : s ∈ R} with the boundary of W. Although there are methods for finding these extremes for a general convex body W, we focus on the particular case of a polytope P = {x : Ax ≥ 0} intersected with the unit ball: W = P ∩ B(0, 1). In this case, it is easy to see that for the polytope:

X_t + sD ∈ P ⟺ A(X_t + sD) ≥ 0
             ⟺ sAD ≥ −AX_t
             ⟺ s a_i^T D ≥ −a_i^T X_t, for all i
             ⟺ max_{a_i^T D > 0} (−a_i^T X_t / a_i^T D) ≤ s ≤ min_{a_i^T D < 0} (−a_i^T X_t / a_i^T D)

and, for the unit ball (keeping in mind that ‖D‖ = 1):

X_t + sD ∈ B(0, 1) ⟺ ‖X_t + sD‖² ≤ 1
                   ⟺ s² + 2(X_t^T D)s + ‖X_t‖² − 1 ≤ 0
                   ⟺ −X_t^T D − √Δ ≤ s ≤ −X_t^T D + √Δ

where Δ = (X_t^T D)² − ‖X_t‖² + 1. Note that Δ > 0 since X_t ∈ B(0, 1). Finally, we simply need to sample S uniformly from the intersection of the above two intervals and set X_{t+1} = X_t + SD.
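Putting the two interval computations together, the following is a minimal sketch (ours) of Hit-and-Run over W = P ∩ B(0, 1), without the rounding step of Appendix B:

```python
import numpy as np

def hit_and_run(A, x0, n_steps, seed=None):
    """Random walk in W = {x : Ax >= 0} ∩ B(0, 1), starting from an interior point x0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # Step 1: random direction on the unit sphere (Marsaglia).
        d = rng.standard_normal(x.shape[0])
        d /= np.linalg.norm(d)
        # Step 2a: chord of the polytope, s * (A d)_i >= -(A x)_i for all i.
        ad, ax = A @ d, A @ x
        lo = max((-ax[i] / ad[i] for i in range(len(ad)) if ad[i] > 0), default=-np.inf)
        hi = min((-ax[i] / ad[i] for i in range(len(ad)) if ad[i] < 0), default=np.inf)
        # Step 2b: chord of the unit ball, s^2 + 2 (x.d) s + |x|^2 - 1 <= 0.
        xd = x @ d
        delta = xd ** 2 - x @ x + 1.0
        lo, hi = max(lo, -xd - np.sqrt(delta)), min(hi, -xd + np.sqrt(delta))
        # Step 3: uniform point on the intersected segment.
        x = x + rng.uniform(lo, hi) * d
    return x
```

Repeating this walk (with restarts or thinning, and starting from an interior point such as the ellipsoid center of Section V) yields the M hypotheses used in the estimator (7).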

APPENDIX B

ROUNDING IMPLEMENTATION

The rounding algorithm [15], [16], [18] is a preprocessing method devised to improve the mixing time of the Hit-and-Run Markov chain. Essentially, the mixing time tends to be very high when the convex body W ⊂ R^d is very elongated in some direction. In order to counter this problem, the rounding procedure looks for a linear transformation T : R^d → R^d for which the image T(W) is "rounder", i.e., evenly elongated in all directions. With this, Hit-and-Run can be run over T(W) and the final sample over W can be retrieved via T^{−1}.

First, let us see how the rounding transformation T affects the Hit-and-Run chain generation. Let X_0 ∈ W be the chain's starting point, and define Y_0 = T(X_0) ∈ T(W). The usual Hit-and-Run algorithm over T(W) gives rise to a chain {Y_t}, which is incrementally defined by Y_{t+1} = Y_t + s_{t+1} D_{t+1}. By applying T^{−1} to the previous equation, and setting X_t = T^{−1}(Y_t), we obtain a revised version of the Hit-and-Run update rule:

X_{t+1} = X_t + s_{t+1} T^{−1} D_{t+1}    (10)

Now, all that remains is how to compute T^{−1}. For this, we follow the algorithm described in [15], [16]. In general terms, this algorithm finds an approximation of the minimum volume ellipsoid E containing W. The rounding transformation can then be chosen as any linear transformation taking E into a unit-radius ball. Implementation details can be found in Algorithm 2.

As a last remark, this algorithm assumes that the convex body W possesses a separation oracle; in other words, for any point x ∉ W, we can find a hyperplane H(x) = {y : b_x + w_x^T y = 0} such that W ⊂ H(x)^+ = {y : b_x + w_x^T y ≥ 0} and x ∈ H(x)^− = {y : b_x + w_x^T y ≤ 0}. In the particular case of W = {x : Ax ≥ 0} ∩ B(0, 1), H(x) is given by:

Algorithm 2 Rounding algorithm

Input: convex body W ⊂ R^d, any R ≥ sup_{w∈W} ‖w‖
Output: T^{−1}, the inverse of the rounding transformation

1: z ← 0, P ← R² I_d
2: converged ← false
3: while not converged do
4:   converged ← true
5:   if z ∉ W then
6:     converged ← false
7:     H ← get_separating_hyperplane(W, z)
8:     z, P ← ellipsoid_method_update(z, P, H)
9:   else
10:    Q, D ← eigendecomposition(P)
11:    a_k^± ← z ± (1/(d+1)) √(D_kk) q_k
12:    if any a_k^± ∉ W then
13:      converged ← false
14:      H ← get_separating_hyperplane(W, a_k^±)
15:      z, P ← ellipsoid_method_update(z, P, H)
16:    end if
17:  end if
18: end while
19: return Cholesky(P)

Algorithm 3 Ellipsoid Method update

Input: ellipsoid E(z, P) in R^d, cutting hyperplane H(b, w)
Output: the minimum volume ellipsoid containing E ∩ H^+

1: α ← (b + w^T z) / √(w^T P w)
2: z′ ← z − ((dα − 1)/(d + 1)) · Pw / √(w^T P w)
3: P′ ← (d²(1 − α²)/(d² − 1)) · (P − (2(1 − dα)/((d + 1)(1 − α))) · Pww^TP / (w^T P w))
4: return E(z′, P′)
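For reference, a sketch of this update (ours), written from the standard deep-cut ellipsoid update that the reconstructed formulas above correspond to, under the convention E(z, P) = {x : (x − z)^T P^{-1} (x − z) ≤ 1}:

```python
import numpy as np

def ellipsoid_update(z, P, b, w):
    """Minimum-volume ellipsoid containing E(z, P) ∩ {y : b + w^T y >= 0}.

    Standard deep-cut ellipsoid update; valid for the cut depths arising in
    Algorithm 2 (alpha <= 1/d). Returns the new center and shape matrix.
    """
    d = len(z)
    Pw = P @ w
    wPw = w @ Pw
    alpha = (b + w @ z) / np.sqrt(wPw)          # normalized signed depth of the cut
    if alpha >= 1.0:                            # E already lies inside the half-space
        return z, P
    z_new = z - (d * alpha - 1.0) / (d + 1.0) * Pw / np.sqrt(wPw)
    P_new = (d ** 2 * (1.0 - alpha ** 2) / (d ** 2 - 1.0)) * (
        P - 2.0 * (1.0 - d * alpha) / ((d + 1.0) * (1.0 - alpha)) * np.outer(Pw, Pw) / wPw
    )
    return z_new, P_new
```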

H(x) = {y : −1 + (1/‖x‖) x^T y = 0}, if ‖x‖ > 1;  {y : a_i^T y = 0}, if a_i^T x < 0    (11)

APPENDIX C
SDSS QUERIES

Q1 (2D, selectivity 0.1%): rowc ∈ (662.5, 702.5) AND colc ∈ (991.5, 1053.5)
Q2 (2D, 0.1%): (rowc − 682.5)² + (colc − 1022.5)² < 29²
Q3 (2D, 0.1%): ra ∈ (190, 200) AND dec ∈ (53, 57)
Q4 (2D, 0.1%): rowv² + colv² > 0.5²
Q5 (4D, 0.01%): (rowc − 682.5)² + (colc − 1022.5)² < 90² AND ra ∈ (180, 210) AND dec ∈ (50, 60)
Q6 (6D, 0.01%): (rowc − 682.5)² + (colc − 1022.5)² < 280² AND ra ∈ (150, 240) AND dec ∈ (40, 70) AND rowv² + colv² > 0.2²
Q7 (4D, 7.8%): x1 > (1.35 + 0.25 x2) AND x3 + 2.5 log(2 ∗ 3.1415 ∗ petror50_r²) < 23.3
Q8 (7D, 5.5%): (dered_r − dered_i) < 2 AND cmodelmag_i − extinction_i ∈ [17.5, 19.9] AND (dered_r − dered_i) − (dered_g − dered_r)/8. > 0.55 AND fiber2mag_i < 21.7 AND devrad_i < 20. AND dered_i < 19.86 + 1.60 ∗ ((dered_r − dered_i) − (dered_g − dered_r)/8. − 0.80)
Q9 (5D, 1.5%): ug < 0.4 AND gr < 0.7 AND ri > 0.4 AND iz > 0.4
Q10 (5D, 0.5%): (g <= 22) AND (ug ∈ [−0.27, 0.71]) AND (gr ∈ [−0.24, 0.35]) AND (ri ∈ [−0.27, 0.57]) AND (iz ∈ [−0.35, 0.70])
Q11 (5D, 0.1%): ((ug > 2.0) OR (u > 22.3)) AND (i ∈ [0, 19]) AND (gr > 1.0) AND ((ri < 0.08 + 0.42 ∗ (gr − 0.96)) OR (gr > 2.26)) AND (iz < 0.25)
