RETURNING HIERARCHICAL ROWSETS

Three Steps of Data Mining

Similar to PredictHistogram, a few data mining prediction functions return hierarchical rowsets (nested rowsets). The concept of hierarchical rowsets is defined in OLE DB. It provides a way to represent the results in a compact format. The data type for the column containing a nested table is Chapter.

While retrieving the query result rowsets, you need to verify the column data type. If it is a Chapter type, you need to retrieve its contents in an inner loop.

You can also flatten the query result by indicating the Flattened keyword in the Selectclause as follows:

Select Flattened CustomerId, PredictHistogram(MemberCard) as Histogram...

In this case, you get query result in flat format without columns with Chapter type. Note, the flatten operation could be inefficient if you have multiple nested tables in a rowsets.

In the previous MarketBasketModelexample, we can get the list of prod-ucts a customer may be interested by using the following query:

Select CustomerID, gender, PredictAssociation(Purchases) From MarketBasketModel Prediction Join

Openrowsets(...)

PredictAssociation(Purchase)returns a table column. The columns of the table are the same as the columns in the nested part of the model. The number of rows in the table equals the number of nested keys, in this case, the number of different products. Table 2.5 shows the resulting rowsets:

The Predictfunction is a special prediction function. It supports polymor-phism. Based on the underlying algorithm, the Predict function delegates the request to Predict, PredictTimeSeries, PredictSequence, Predict Association, and so on. For the previous query, we can use Predict (Purchases)instead of PredictAssociation(Purchases).

Sometimes a customer may have already purchased some products. We want to have prediction results for one of the following three cases:

1. The prediction contains the complete list of products that the store offers, with associated predicted quantities.

2. The prediction shows what other products a customer is likely to buy based on the products the customer has already bought. The reported list should not include the product from the input case.

3. The prediction contains just the predicted Quantity value associated with the products from the input case, or perhaps just the likelihood of each product in the input case. No other products should appear in the output table.

To express these three different cases, we can specify, respectively, one of the following options in the Predictfunction:

■■ INCLUSIVE, which represents the behavior in case 1.

■■ EXCLUSIVE(default option), which causes behavior number 2.

■■ INPUT_ONLY, which ensures that the predicted table contains only the rows supplied by the input (behavior number 3).

Table 2.5 Query result of PredictAssociation

CUSOMTERID GENDER PURCHASES

1 Male ProductName Quantity

Cake 1

Coke 1

Table 2.5 (continued)

CUSOMTERID GENDER PURCHASES

Milk 3

Pepsi 4

Wine 1

. . .

Each row in the Purchase table may contain statistics, for example, the sup-port and the probability. This information is stored in the derived columns

$Support and $Probability, respectively. The support is the number of cases similar to the given customer in the training dataset (for example, those who bought a particular product and has the same demographic information). The probability indicates how likely it is for a given customer to buy a particular product and is not concerned with the quantity of the product the customer may buy. Be aware that probabilities for different products don’t sum to 1, because the customer may buy a number of products. To include these statis-tics in the result rowset, we add another parameter for the Predictfunction:

INLCUDE_STATISTICS.

Select CustomerId, gender, Predict(Purchase, Include_Statistics, Input_Only) . . .

The result of the previous query is shown in Table 2.6.

Table 2.6 Query result using Include_Statistics CUSOMTER_ID GENDER PURCHASES

1 Male ProductName Quantity $Support $Probability

Milk 3 300 0.32

Pepsi 2 320 0.34

Wine 1 200 0.21

Sometimes we need more information than just the purchase probability for each product. We want to know, for example, how likely it is that the customer will buy 1, 2 or 3 units of the product. Conceptually, all this information is included in the Predict function. In order to get the probability for each quantity, we can use the PredictHistogram function on the Quantity col-umn and use the nested Select statement to select from the results of Predictfunction. The PredictHistogramfunction returns the histogram of all quantities and their associated probabilities. These probabilities sum to 1.

Select CustomerId, gender,

(Select (ProductName, PredictHistogram(Quantity) As QuantityHistogram From Predict(Purchase, Include_Statistics) ...

The preceding query returns the data shown in Table 2.7.

Table 2.7 Result for Nested Query CUSTOMERID GENDER PURCHASES

1 Male Product QuantityHistogram $Support $Probability

Name

Cake Quantity $Variance $Probability 300 0.32

1 0 0.60

2 0 0.20

3 0 0.20

Coke Quantity $Variance $Probability 320 0.34

1 0 0.70

2 0 0.20

3 0 0.10

Milk . . . 200 0.21

The result rowset from Predict table column could be very large, especially when the query uses options, such as INCLUSIVE, INCLUDE_STATITICS, or uses PredictHistogram on nested columns. To solve the problem, DMX introduces the TopX and BottomX family of functions, which operate on nested tables (including those resulting from PredictHistogram, a nested Selector any other expression returning a table). These functions order the records of the nested table by a specified column’s value and then truncate the sorted list to a specified length.

For example, the following query uses the TopCountfunction to retrieve the top two most likely membership cards that may interest the customer:

Select CustomerId, TopCount(PredictHistogram(MemberCard), $Probability, 2). . .

The following query retrieves the top 10 products the customers is predicted to buy in the largest quantities:

Select CustomerId, TopCount(Predict(Purchase, EXCLUSIVE), quantity, 10)...

The following query retrieves the top five products the customers is most likely to buy:

Select CustomerId, TopCount(Predict(Purchase, EXCLUSIVE), $Probability, 5)...

DMX 2.0 proposes a shortcut for the previous query using TopCounton the Predictfunction with $Probabilityas the select criteria. The following is the equivalent query using the shortcut.

Select CustomerId, Predict(Purchase, 5)...

If you want to return the top n recommendations based on an adjusted prob-ability instead of probprob-ability, you can use the following query:

Select CustomerId, Predict(Purchase, 5, $AdjustedProbability)...

N OT E AdjustedProbabilityis derived from Probability, with adjustment based on some math functions. It usually penalizes most popular states (items). For example, if every customer buys Pepsi, recommending Pepsi to new customer doesn’t add too much value. In this case, you may try use AdjustedProbability, which penalizes common items such as Pepsi.

As the result of TopCountis a table column, we can apply a subelect clause to the result of TopCountfunction (or any prediction function returns table column). The following is an example:

Select CustomerId, Gender, (Select MemberCard, $Probability as Proba From TopCount(PredictHistogram(MemberCard), $Probability, 2) As ProbabilityHistogram). . .

The result rowset of the previous query is shown in Table 2.8.

Table 2.8 Query Result

CUSTOMERID GENDER PROBABILITYHISTOGRAM

1 Male MemberCard Proba

Gold 0.70

Silver 0.10

2 Female MemberCard Proba

Gold 0.25

Silver 0.15

We can also add a Whereclause to pull out certain records from a nested table. For example, if instead of always getting the “best” prediction for a membership card, suppose that a query only wants to get the probability of

having a silver membership card for each customer. We can express this query using the following Selectclause:

Select CustomerId, (Select $Probability From

PredictHistogram(MemberCard) Where MemberCard=’Silver’) As SilverCardProba)...

Similar to the query in the MarketBasketAnlysis model, the following query retrieves only the probability for each customer of purchasing Pepsi.

Select CustomerId, (Select $Probability From PredictHistogram(Purchase, INCLUSIVE, INCLUDE_STATISTICS) Where ProductName=’Pepsi’) As

PepsiPurchaseProba)...

Table 2.9 contains a list of prediction functions that operates on table columns. A brief description is also given in the table.

Table 2.9 Prediction Function on Table Columns

FUNCTION RETURN DESCRIPTION

VALUE

Predict(<table Scare value When the parameter of a column reference>, Predictfunction is a table options, . . .) column, it predicts the

information related to the nested table. Possible options include EXCLUDE_NULL,

INCLUDE_NULL, INCLUSIVE, EXCLUSIVE, INPUT_ONLY, and INCLUDE_STATISTICS.

TopCount(<table expr>, <table expr> Return the first <n-items>rows

<rank expr >, in a decreasing order of <rank

<n-items>) expr >.

TopSum(<table expr>, <table expr> Return the first n rows in a

<rank expr >, decreasing order of <rank

<sum>) expr> such that the sum of the

values is at least <sum>. TopPercent(<table expr>, <table expr> Return the first n rows in a

<rank expr >, decreasing order of <rank

<percent>) expr>such that the sum of the

<rank expr>values is at least the given percentage of the total sum of <rank expr>values.

Table 2.9 (continued)

FUNCTION RETURN DESCRIPTION

VALUE

PredictTimeSeries <table expr> Returns the predicted value of (<table expr>, the next n time slices in a table.

<n1 >, <n2>)

PredictSequence <table expr> Returns the sequence of states (<table expr>, of next n steps in a table.

<n1 >, <n2>)

PredictAssociation <table expr> Returns n most likely associated (<table expr>, <n>) item.

Singleton Queries

Sometimes, the input case for prediction doesn’t exist in a database table. For example, suppose that an insurance company wants to give a quote to a cus-tomer over the phone or that an online retail site wants to recommend prod-ucts based on the list of prodprod-ucts a customer has already chosen in the shopping cart. In these cases, the customer data may not yet be recorded in the database. To make a prediction in these cases, DMX provides the syntax for a singleton prediction query. The singleton query allows sets of constant values to be used in place of the <source data query>for the prediction join query.

The following is an example of singleton prediction query using the Member Card_Prediction model:

Select CustomerId, MemberCard

From MemberCard_Prediction m Prediction Join

(Select ‘Male’ As gender, ‘35’ As age, ‘Engineer’ As Profession,’60000’

As Income, ‘Yes’ As HouseOwner) As customer On m.gender = customer.gender

...

It is also possible to construct a singleton query with a nested table. The fol-lowing is an example of prediction against MarketBasketAnalysis model based on the items in a customer’s shopping cart:

Select CustomerId, Predict(Purchase, 5) From MarketBasketAnalysis m Prediction Join

(Select ‘Male’ As gender, ‘60000’ As Income, ‘Silver’ As MemberCard, (Select ‘Cake’ As ProductName, 1 As Quantity

Union Select ‘Milk’ As Product Name, 2 As Quantity) as Purchase) As customer

On m.gender = customer.gender

And m.Income = customer.Income

And m.MemberCard = customer.MemberCard

And m.Purchase.ProductName = customer.Purchase.ProductName And m.Purchase.Quantity = customer.Purchase.Quantity

Making Predictions Using Content Only

One of the initial steps in training a mining model is to gather marginal statis-tics, for example, counting the number of customers with gold, silver, bronze member cards. DMX allows us to make predictions based on the marginal sta-tistics of model content.

The following queries return the most popular member card: Bronze.

Select Predict(MemberCard) From MemberCard_Prediction

Select MemberCard

From MemberCard_Prediction

You can also query nonpredictable columns. The following query returns the distinct values of Genderin the training dataset.

Select Distinct Gender From MemberCard

The following query returns the most popular three products based on cus-tomer purchases:

Select Predict(Purchase, 3) From MarketBasketAnalysis

Dans le document Data Mining with (Page 80-87)