
Uplift modeling has received relatively little attention in the literature. The first paper mentioning it explicitly was [10], where decision trees designed specifically for the problem were discussed. More details on the method were later presented in [11].

A trivial approach to uplift modeling uses two probabilistic classifiers, one built on the treatment dataset and the other on the control dataset, whose predicted probabilities are then subtracted. This approach may, however, suffer from a serious drawback: each model focuses on predicting class probabilities within its own group and may ignore the (usually much weaker) differences between the groups. A good example can be found in [11].
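
To make this baseline concrete, here is a minimal sketch of the two-model approach; it assumes scikit-learn-style classifiers and numpy arrays, and all names (two_model_uplift, X_t, y_t, etc.) are of our choosing, not from the cited papers.

```python
# A minimal sketch of the two-model approach. Assumes numpy arrays
# X_t, y_t (treatment) and X_c, y_c (control) with y in {-1, +1}.
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_model_uplift(X_t, y_t, X_c, y_c, X_new):
    """Estimate uplift as the difference between class probabilities
    predicted by two independently trained classifiers."""
    model_t = LogisticRegression().fit(X_t, y_t)  # treatment model
    model_c = LogisticRegression().fit(X_c, y_c)  # control model
    # Column 1 of predict_proba is the probability of the positive class.
    p_t = model_t.predict_proba(X_new)[:, 1]
    p_c = model_c.predict_proba(X_new)[:, 1]
    return p_t - p_c
```

Note that each model is fit to predict class probabilities within its own group only; nothing in the construction encourages the pair to model the (typically small) difference p_t - p_c accurately, which is exactly the drawback described above.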

Therefore, most research in uplift modeling has been concerned with approaches which model the conditional difference in success probabilities directly.

Many such approaches are based on adaptations of decision trees. For example, in [11] uplift trees were proposed which base splits on a statistical test of the difference in success rates after the split, and in [15,16] trees were proposed which base splits on information-theoretic divergences between the treatment and control class probability distributions. Ensembles of decision trees have been described in [2]; a more thorough analysis of various types of ensembles in the uplift setting can be found in [17].
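
As a rough illustration of the information-theoretic splitting criteria of [15,16], the toy sketch below scores a candidate split by the gain in KL divergence between the treatment and control class distributions. The function names are ours, and the actual criteria in [15,16] include additional normalization and penalization terms omitted here.

```python
# Toy uplift split score: gain in KL(P^T || P^C) produced by a binary split.
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence between two Bernoulli distributions with success
    probabilities p and q, clipped away from 0 and 1 for stability."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_split_gain(y_t, y_c, left_t, left_c):
    """y_t, y_c: class values in {-1, +1}; left_t, left_c: boolean masks
    selecting the points the candidate split sends to the left child."""
    rate = lambda y: np.mean(y == 1) if len(y) > 0 else 0.5
    n = len(y_t) + len(y_c)
    gain = -kl(rate(y_t), rate(y_c))  # divergence before the split
    for mt, mc in ((left_t, left_c), (~left_t, ~left_c)):
        weight = (mt.sum() + mc.sum()) / n  # fraction of data in this child
        gain += weight * kl(rate(y_t[mt]), rate(y_c[mc]))
    return gain
```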

Regression-based techniques can be found, for example, in [4], where a class variable transformation is presented which allows uplift modeling problems to be converted into ordinary classification problems. Similar techniques have been discussed in the statistical literature [12,21].
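
For illustration, a sketch of a class variable transformation in the spirit of [4] follows; it assumes a randomized assignment with equal treatment and control proportions, y in {-1, +1}, and a binary group indicator g. All names are of our choosing.

```python
# Class variable transformation: z = 1 for treated successes and control
# failures; under balanced assignment, uplift(x) = 2*P(z = 1 | x) - 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def uplift_via_transformation(X, y, g, X_new):
    """y in {-1, +1}; g = 1 for treatment records, g = 0 for control."""
    z = np.where((g == 1) == (y == 1), 1, 0)  # the transformed target
    model = LogisticRegression().fit(X, z)    # a single ordinary classifier
    return 2.0 * model.predict_proba(X_new)[:, 1] - 1.0
```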

In [22] Uplift Support Vector Machines (USVMs) have been proposed which allow for the explicit identification of cases for which the action is positive, neutral, or negative; the model is described in the next section. Another type of uplift SVM was proposed in [8]; it is based on direct maximization of the area under the uplift curve.

Good overviews of uplift modeling can be found in [11] and in [16]. Procedures for correcting treatment assignment bias will be discussed in Sect. 5.3.

2.1 Uplift Support Vector Machines

Let us first introduce some notation. Vectors are denoted by boldface lowercase letters, e.g. $\mathbf{x}$, $\mathbf{w}$. A dataset is a collection of records $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the $i$th feature vector and $y_i \in \{-1, 1\}$ is the class value of the $i$th record. The outcome $1$ is considered the successful or desired outcome. The superscript $T$ denotes terms related to the treatment group and the superscript $C$ terms related to the control group. For example, the treatment training dataset is $D^T = \{(\mathbf{x}_i^T, y_i^T) : i = 1, \ldots, n^T\}$ and the control training dataset is $D^C = \{(\mathbf{x}_i^C, y_i^C) : i = 1, \ldots, n^C\}$.
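
In code, this notation corresponds to nothing more than a pair of feature matrices with class vectors; the synthetic layout below (with names of our choosing) is what the later sketches assume.

```python
# Illustrative data layout matching the notation above.
import numpy as np

rng = np.random.default_rng(0)
n_T, n_C, d = 100, 120, 5                 # group sizes and dimension
X_T = rng.normal(size=(n_T, d))           # treatment feature vectors x_i^T
y_T = rng.choice([-1, 1], size=n_T)       # treatment class values y_i^T
X_C = rng.normal(size=(n_C, d))           # control feature vectors x_i^C
y_C = rng.choice([-1, 1], size=n_C)       # control class values y_i^C
```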

We now discuss in more detail the Uplift Support Vector Machines presented in [22], which we will use as the starting point for further developments. The machine is based on two separating hyperplanes

$$H_1 : \langle \mathbf{w}, \mathbf{x} \rangle - b_1 = 0, \qquad H_2 : \langle \mathbf{w}, \mathbf{x} \rangle - b_2 = 0.$$

The model predictions are made according to the following formula:

$$M(\mathbf{x}) = \begin{cases} +1 & \text{if } \langle \mathbf{w}, \mathbf{x} \rangle > b_1 \text{ and } \langle \mathbf{w}, \mathbf{x} \rangle > b_2, \\ 0 & \text{if } \langle \mathbf{w}, \mathbf{x} \rangle \le b_1 \text{ and } \langle \mathbf{w}, \mathbf{x} \rangle > b_2, \\ -1 & \text{if } \langle \mathbf{w}, \mathbf{x} \rangle \le b_1 \text{ and } \langle \mathbf{w}, \mathbf{x} \rangle \le b_2, \end{cases} \tag{1}$$

that is, the model classifies the effect of the action on a point $\mathbf{x}$ as positive ($+1$), negative ($-1$), or neutral ($0$). A graphical interpretation of the model is shown in Fig. 1 (taken from [22]). The hyperplane $H_1$ separates positive from neutral predictions, and the hyperplane $H_2$ separates neutral from negative predictions.
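
Prediction rule (1) is straightforward to implement once $\mathbf{w}$, $b_1$ and $b_2$ are known; a sketch (with names of our choosing) follows, relying on the fact that $b_1 \ge b_2$ for a valid model, as discussed below.

```python
# Direct implementation of prediction rule (1), assuming b1 >= b2.
import numpy as np

def usvm_predict(X, w, b1, b2):
    """Return +1 (positive effect), 0 (neutral) or -1 (negative effect)
    for each row of X."""
    s = X @ w                           # scores <w, x>
    pred = np.zeros(len(X), dtype=int)  # neutral by default (b2 < s <= b1)
    pred[s > b1] = 1                    # above both hyperplanes
    pred[s <= b2] = -1                  # below both hyperplanes
    return pred
```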

Fig. 1 The Uplift SVM optimization problem. Example points belonging to the positive class in the treatment and control groups are marked with T+ and C+, respectively; analogous notation is used for points in the negative class. The figure shows the penalties $\xi_{i,1}$, $\xi_{i,2}$ incurred by points with respect to the two hyperplanes $H_1$, $H_2$ of the USVM. The positive side of each hyperplane is indicated by a small arrow at the right end of its line. Red solid arrows denote the penalties incurred by points which lie on the wrong side of a single hyperplane; blue dashed arrows denote the additional penalties for being misclassified by the second hyperplane as well.

Let us now formulate the optimization task which allows for finding the model's parameters $\mathbf{w}$, $b_1$, $b_2$. We will use $D^{T+} = \{(\mathbf{x}_i, y_i) \in D^T : y_i = +1\}$ to denote the data points belonging to the positive class in the treatment group and $D^{T-} = \{(\mathbf{x}_i, y_i) \in D^T : y_i = -1\}$ to denote the points of that group belonging to the negative class. Analogous notation is used for points in the control group.¹

The version presented here is slightly different from that given in [22]: the soft margin penalties are averaged separately over the treatment and control groups, so that both groups have the same impact on the optimized risk. The optimization problem is to find the weights $\mathbf{w}$ and intercepts $b_1$, $b_2$ minimizing the function $R$ defined as

$$\begin{aligned} R(\mathbf{w}, b_1, b_2) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle &+ \frac{1}{n^T}\sum_{D^{T+}} \left( C_1 \xi_{i,1} + C_2 \xi_{i,2} \right) + \frac{1}{n^T}\sum_{D^{T-}} \left( C_2 \xi_{i,1} + C_1 \xi_{i,2} \right) \\ &+ \frac{1}{n^C}\sum_{D^{C+}} \left( C_2 \xi_{i,1} + C_1 \xi_{i,2} \right) + \frac{1}{n^C}\sum_{D^{C-}} \left( C_1 \xi_{i,1} + C_2 \xi_{i,2} \right), \end{aligned} \tag{2}$$

where $\xi_{i,1}, \xi_{i,2} \ge 0$ are the slack variables of the $i$th point with respect to the hyperplanes $H_1$ and $H_2$, subject to the constraints

$$\langle \mathbf{w}, \mathbf{x}_i \rangle - b_1 \ge 1 - \xi_{i,1}, \quad \langle \mathbf{w}, \mathbf{x}_i \rangle - b_2 \ge 1 - \xi_{i,2} \quad \text{for } (\mathbf{x}_i, y_i) \in D^{T+} \cup D^{C-},$$
$$\langle \mathbf{w}, \mathbf{x}_i \rangle - b_1 \le -1 + \xi_{i,1}, \quad \langle \mathbf{w}, \mathbf{x}_i \rangle - b_2 \le -1 + \xi_{i,2} \quad \text{for } (\mathbf{x}_i, y_i) \in D^{T-} \cup D^{C+}.$$

Note that the model has two penalty coefficients, $C_1$ and $C_2$. The properties of the model are given in detail in [22]; here we only review the main results without proofs, which easily carry over to the modified formulation given in this paper. First, the model is valid iff $b_1 \ge b_2$; this is the case when $C_2 \ge C_1$, which puts a constraint on the values of $C_1$ and $C_2$. The role of the coefficient $C_1$ is the same as in classical SVMs. From Fig. 1 it is clear that the coefficient $C_2$ determines the additional penalty for points which are on the wrong side of both hyperplanes (e.g. a treatment point with a negative outcome which is classified as positive). It turns out that the ratio $C_2/C_1$ determines the proportion of neutral predictions: for $C_1 = C_2$ no points are classified as neutral, and for a sufficiently large value of $C_2/C_1$ almost all points are.
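
As an illustration only, the sketch below minimizes the group-averaged risk with a general-purpose optimizer rather than a dedicated QP solver. It relies on our reconstruction of (2) above, with the slack variables replaced by the equivalent hinge terms, and every name in it is of our choosing.

```python
# Unconstrained hinge-loss form of the (reconstructed) USVM risk (2).
import numpy as np
from scipy.optimize import minimize

def usvm_risk(theta, X_T, y_T, X_C, y_C, C1, C2):
    w, b1, b2 = theta[:-2], theta[-2], theta[-1]

    def group_penalty(X, y, flip):
        # flip = +1 for treatment, -1 for control: control positives play
        # the role of treatment negatives and vice versa.
        t = flip * y                               # desired side of the hyperplanes
        s = X @ w
        xi1 = np.maximum(0.0, 1.0 - t * (s - b1))  # slack w.r.t. H1
        xi2 = np.maximum(0.0, 1.0 - t * (s - b2))  # slack w.r.t. H2
        # C1 weights the primary violation, C2 the additional one:
        # for t = +1 the primary hyperplane is H1, for t = -1 it is H2.
        cost = np.where(t == 1, C1 * xi1 + C2 * xi2, C2 * xi1 + C1 * xi2)
        return cost.mean()                         # average over the group

    return (0.5 * w @ w
            + group_penalty(X_T, y_T, +1)
            + group_penalty(X_C, y_C, -1))

# Usage sketch: pack (w, b1, b2) into one parameter vector.
# theta0 = np.zeros(X_T.shape[1] + 2)
# res = minimize(usvm_risk, theta0, args=(X_T, y_T, X_C, y_C, 1.0, 2.0))
# w, b1, b2 = res.x[:-2], res.x[-2], res.x[-1]
```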

In the following sections we will add to (2) a penalty term which forces similar model behavior in the treatment and control groups.

¹ The values of the class variable should not be confused with the model predictions defined in (1). For example, a model prediction of $+1$ means that we expect the class variable to take the value $+1$ if the action is performed ($y^T = +1$) and the value $-1$ if the action is not performed ($y^C = -1$).