DMX Queries

Microsoft Decision Trees can be used for three different data mining tasks: classification, regression, and association, which makes it an unusually versatile algorithm. In this section, we build three models using DMX to illustrate these uses.

Classification Model

The first model predicts College Plans based on Gender, IQ, Parents’ Income, and Parental Encouragement. The DMX statement that creates the model is the following:

CREATE MINING MODEL CollegePlan (
    StudentId           LONG KEY,
    Gender              TEXT DISCRETE,
    ParentIncome        LONG DISCRETE,
    IQ                  LONG CONTINUOUS,
    ParentEncouragement TEXT DISCRETE,
    CollegePlans        TEXT DISCRETE PREDICT
)
USING Microsoft_Decision_Trees (Complexity_Penalty = 0.5)

After the model is created, we can process it. Training requires a training dataset, which can reside in any data source for which you have a suitable OLE DB driver. This capability is called in-place mining. The following training statement uses data stored in an Access table:

INSERT INTO CollegePlan
    (StudentId, Gender, IQ, ParentEncouragement, ParentIncome, CollegePlans)
OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    'Data Source=C:\data\CollegePlan.mdb;',
    'SELECT StudentId, Gender, IQ, ParentEncouragement, ParentIncome, CollegePlans FROM CollegePlans')

After training, we can browse the model using a content query:

Select * from CollegePlan.Content

This query returns the model’s content in a tabular format. In the case of the decision tree model, each row in the query result represents a node in the decision tree. Each row contains a column with data type Chapter, which is a nested table storing the statistics (node distribution). You can write your own decision tree viewer based on the results of this query.
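As a rough illustration of the "write your own viewer" idea, the following Python sketch rebuilds an indented tree from rows shaped like the content query result. It assumes the rows expose NODE_UNIQUE_NAME, PARENT_UNIQUE_NAME, NODE_CAPTION, and NODE_SUPPORT columns; the sample rows, captions, and counts are invented for the example and do not come from the CollegePlan model.

```python
from collections import defaultdict

# Hypothetical rows standing in for the result of
# "SELECT * FROM CollegePlan.CONTENT" (captions and counts are invented).
rows = [
    {"NODE_UNIQUE_NAME": "0", "PARENT_UNIQUE_NAME": None,
     "NODE_CAPTION": "All", "NODE_SUPPORT": 1000},
    {"NODE_UNIQUE_NAME": "1", "PARENT_UNIQUE_NAME": "0",
     "NODE_CAPTION": "ParentEncouragement = 'Encouraged'", "NODE_SUPPORT": 600},
    {"NODE_UNIQUE_NAME": "2", "PARENT_UNIQUE_NAME": "0",
     "NODE_CAPTION": "ParentEncouragement = 'Not Encouraged'", "NODE_SUPPORT": 400},
    {"NODE_UNIQUE_NAME": "3", "PARENT_UNIQUE_NAME": "1",
     "NODE_CAPTION": "IQ >= 100", "NODE_SUPPORT": 350},
    {"NODE_UNIQUE_NAME": "4", "PARENT_UNIQUE_NAME": "1",
     "NODE_CAPTION": "IQ < 100", "NODE_SUPPORT": 250},
]


def render(rows):
    """Return one indented text line per node, parents before children."""
    # Group the rows by the unique name of their parent node.
    children = defaultdict(list)
    for row in rows:
        children[row["PARENT_UNIQUE_NAME"]].append(row)

    lines = []

    def walk(parent_name, depth):
        for node in children.get(parent_name, []):
            lines.append("  " * depth
                         + f'{node["NODE_CAPTION"]} ({node["NODE_SUPPORT"]} cases)')
            walk(node["NODE_UNIQUE_NAME"], depth + 1)

    walk(None, 0)  # the root is the row without a parent
    return lines


tree_lines = render(rows)
print("\n".join(tree_lines))
```

A real viewer would also read the nested distribution table of each node; the sketch only covers the parent and child linkage.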

Now, we will apply the model to predict the College Plans for new students through the following query:

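SELECT T1.StudentID, CollegePlan.CollegePlans,
    PredictProbability(CollegePlans) AS Proba
FROM CollegePlan PREDICTION JOIN
    OPENROWSET('Microsoft.Jet.OLEDB.4.0',
        'Data Source=C:\data\CollegePlan.mdb;',
        'SELECT StudentID, Gender, IQ, ParentEncouragement, ParentIncome FROM NewStudents') AS T1
ON CollegePlan.ParentIncome = T1.ParentIncome AND
    CollegePlan.IQ = T1.IQ AND
    CollegePlan.Gender = T1.Gender AND
    CollegePlan.ParentEncouragement = T1.ParentEncouragement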

This query returns three columns: StudentID, CollegePlans, and Proba.

As explained in Chapter 2, a data mining query result may contain nested tables and sometimes even multiple levels of nesting. The following query returns the histogram of College Plans’ predictions in the form of a nested table:

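SELECT T1.StudentID, PredictHistogram(CollegePlans) AS CollegePlans
FROM CollegePlan PREDICTION JOIN
    OPENROWSET('Microsoft.Jet.OLEDB.4.0',
        'Data Source=C:\data\CollegePlan.mdb;',
        'SELECT StudentID, Gender, IQ, ParentEncouragement, ParentIncome FROM NewStudents') AS T1
ON CollegePlan.ParentIncome = T1.ParentIncome AND
    CollegePlan.IQ = T1.IQ AND
    CollegePlan.Gender = T1.Gender AND
    CollegePlan.ParentEncouragement = T1.ParentEncouragement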

The result of the query is displayed in Table 5.1. The second column embeds a nested table. Besides CollegePlans, the nested table contains a set of predefined columns, including $Support, $Probability, $AdjustedProbability, and so on. Each row represents a state of College Plans, and the last row represents the missing state.

Table 5.1 Query Results

STUDENTID  COLLEGEPLANS
1          CollegePlans  $Support  $Probability  $AdjustedProbability  ...
           Yes           1175      0.91665       0.00052
           No            106       0.08301       0.00593
           Missing       0         0.00034       0.00034
2          CollegePlans  $Support  $Probability  $AdjustedProbability  ...
           Yes           327       0.81992       0.00047
           No            71        0.17893       0.01280
           Missing       0         0.00115       0.00115

3 ...
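To make the histogram easier to read, the following Python sketch copies the $Support and $Probability values for students 1 and 2 from Table 5.1 and picks the most probable state for each student. The function name and data layout are assumptions made for the illustration; they are not part of DMX.

```python
# (state, $Support, $Probability) rows copied from Table 5.1.
histograms = {
    1: [("Yes", 1175, 0.91665), ("No", 106, 0.08301), ("Missing", 0, 0.00034)],
    2: [("Yes", 327, 0.81992), ("No", 71, 0.17893), ("Missing", 0, 0.00115)],
}


def most_likely_state(histogram):
    """Return the (state, probability) pair with the highest probability."""
    state, _support, probability = max(histogram, key=lambda row: row[2])
    return state, probability


for student_id, histogram in histograms.items():
    total = sum(probability for _, _, probability in histogram)
    state, probability = most_likely_state(histogram)
    # The state probabilities of each student add up to 1.
    print(student_id, state, probability, round(total, 5))
```

For both students the most probable state is Yes, and the three probabilities in each nested table sum to 1, missing state included.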

Regression Model

Regression predicts continuous variables using regression formulas. A regressor must have a continuous content type. Normally, a regression formula contains one or more regressors. When there is no regressor in the formula, the resulting tree contains a constant in each leaf node.

The following model predicts Parents’ Income using IQ, Gender, Parental Encouragement, and College Plans. IQ is used as a regressor.

CREATE MINING MODEL ParentIncomePrediction (
    StudentId           TEXT KEY,
    Gender              TEXT DISCRETE,
    IQ                  LONG REGRESSOR CONTINUOUS,
    ParentEncouragement LONG CONTINUOUS,
    CollegePlans        TEXT DISCRETE PREDICT,
    ParentIncome        LONG CONTINUOUS PREDICT
)
USING Microsoft_Decision_Trees

The training statement is independent of the algorithm and the column usage settings. It mainly specifies the binding between the input columns and the mining model columns. Apart from the model name, the ParentIncomePrediction training statement is the same as the one for the College Plans model in the previous example.

After training, we can query the content schema rowset of the model. The overall content structure of a regression model is similar to a classification model, with each node representing a tree node. The difference is in the nested distribution table. For the regression model, each row in the distribution table represents a coefficient for the regressor or the intercept.
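The following Python sketch illustrates how a leaf's intercept and regressor coefficients could be combined into a prediction. It assumes a plain linear form (intercept plus coefficient times regressor value), and the numbers are invented; the way the algorithm actually stores and centers its formula in the distribution table may differ.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionLeaf:
    """Hypothetical leaf node: an intercept plus one coefficient per regressor."""
    intercept: float
    coefficients: dict = field(default_factory=dict)  # regressor name -> coefficient

    def predict(self, case):
        # Assumed linear form: intercept + sum(coefficient * regressor value).
        return self.intercept + sum(
            coefficient * case[name]
            for name, coefficient in self.coefficients.items()
        )


# Invented numbers: a leaf that uses IQ as its regressor ...
leaf_with_regressor = RegressionLeaf(intercept=10000.0, coefficients={"IQ": 250.0})
# ... and a leaf without any regressor, which predicts a constant.
constant_leaf = RegressionLeaf(intercept=42000.0)

print(leaf_with_regressor.predict({"IQ": 100}))
print(constant_leaf.predict({"IQ": 100}))
```

The second leaf shows the no-regressor case: with an empty coefficient list, the prediction reduces to the constant stored in the leaf.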

The following query predicts the Parents’ Income for new students. It also returns the standard deviation for each prediction. The smaller the deviation, the more accurate the prediction is.

SELECT T1.StudentID, ParentIncomePrediction.ParentIncome,
    PredictStdev(ParentIncome) AS deviation
FROM ParentIncomePrediction PREDICTION JOIN
    OPENROWSET('Microsoft.Jet.OLEDB.4.0',
        'Data Source=C:\data\CollegePlan.mdb',
        'SELECT StudentID, Gender, IQ, ParentEncouragement, CollegePlans FROM NewStudents') AS T1
ON ParentIncomePrediction.CollegePlans = T1.CollegePlans AND
    ParentIncomePrediction.IQ = T1.IQ AND
    ParentIncomePrediction.Gender = T1.Gender AND
    ParentIncomePrediction.ParentEncouragement = T1.ParentEncouragement

We can also apply the PredictHistogram function to the continuous column. It returns two rows in the nested table, the predicted mean value and the missing state, each with an associated probability. If the probability of the missing state is over 50%, the predicted value for Parents’ Income is missing.

SELECT T1.StudentID, PredictHistogram(ParentIncome) AS Histogram
FROM ParentIncomePrediction
PREDICTION JOIN ...

The previous query returns the results shown in Table 5.2.

Table 5.2 Query Result

STUDENTID  HISTOGRAM
1          Parents’ Income  $Support  $Probability  $AdjustedProbability  $Variance  $Stdev
           33679            4336      0.99977       0                     225949017  15031.6
           Missing          0         0.00023       0.00023               0          0
2          Parents’ Income  $Support  $Probability  $AdjustedProbability  $Variance  $Stdev
           49082            2672      0.99962       0                     225151659  15005.1
           Missing          0         0.00037       0.00037               0          0

...
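The rule for the missing state on a continuous histogram can be sketched in a few lines of Python. The first histogram uses the values 33679, 0.99977, and 0.00023 from Table 5.2; the second histogram is a made-up case in which the missing state dominates. The function name and the use of None for a missing prediction are assumptions for the example.

```python
MISSING = "Missing"


def predicted_value(histogram, threshold=0.5):
    """Return the predicted mean, or None when the missing state is too likely.

    histogram is a list of (value, probability) rows: one row for the
    predicted mean and one row for the missing state.
    """
    missing_probability = sum(p for value, p in histogram if value == MISSING)
    if missing_probability > threshold:
        return None  # the prediction itself is reported as missing
    value, _probability = next(row for row in histogram if row[0] != MISSING)
    return value


# Values copied from the first nested table in Table 5.2.
from_table = [(33679, 0.99977), (MISSING, 0.00023)]
# Hypothetical histogram where the missing state exceeds 50%.
mostly_missing = [(33679, 0.40), (MISSING, 0.60)]

print(predicted_value(from_table))
print(predicted_value(mostly_missing))
```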

Association Model

As explained in the previous section, we can use Microsoft Decision Trees for association tasks. The model builds a set of trees for each predictable attribute and calculates the relationship among these trees.

The association model usually contains a nested table, and the nested key is the attribute to use for the association analysis. It is also possible to have only a case table for an association model, with each column representing one item.

However, when there is a large number of items, it is difficult to store this information in a single table because most databases limit the number of columns a table can contain. The following is an example of an association model built on the movie dataset:

CREATE MINING MODEL MovieAssociation (
    CustomerID    LONG KEY,
    Gender        TEXT DISCRETE,
    MaritalStatus TEXT DISCRETE,
    Movies        TABLE PREDICT (
        MovieName TEXT KEY
    )
)
USING Microsoft_Decision_Trees

This model analyzes the associations among all movies, together with each customer’s gender and marital status. It builds one decision tree per movie, up to a maximum of 255 trees. Each movie is considered an attribute with binary states: existing or missing. Trees may have splits on movie name, gender, and marital status.
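The binary view of the nested table can be pictured with a small Python sketch. The customers and their movie lists are invented for the illustration, and the function name is hypothetical; the point is only that each movie becomes one attribute whose state is either existing or missing for a given customer.

```python
def to_binary_attributes(purchases, catalog):
    """Turn each customer's movie set into one Existing/Missing attribute per movie."""
    return {
        customer_id: {
            movie: "Existing" if movie in movies else "Missing"
            for movie in catalog
        }
        for customer_id, movies in purchases.items()
    }


# Invented sample data: customer id -> movies in that customer's nested table.
purchases = {
    101: {"Terminator", "Star Wars"},
    102: {"Apollo 13"},
}
catalog = ["Apollo 13", "Star Wars", "Terminator"]

attributes = to_binary_attributes(purchases, catalog)
for customer_id, states in attributes.items():
    print(customer_id, states)
```

Each key of the inner dictionary corresponds to one attribute for which a tree could be built.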

Because the model contains a nested table, the training statement involves the Shape provider.

INSERT INTO MovieAssociation (
    CustomerId, Gender, MaritalStatus,
    Movies (SKIP, MovieName)
)
OPENROWSET('MSDataShape',
    'data provider=Microsoft.Jet.OLEDB.4.0; data source=C:\data\moviesurvey.mdb',
    'SHAPE {SELECT CustomerId, Gender, MaritalStatus FROM Customer ORDER BY CustomerID}
     APPEND ({SELECT CustomerId, MovieName FROM Movies ORDER BY CustomerID}
     RELATE CustomerID TO CustomerID) AS Movies')
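Conceptually, the SHAPE ... APPEND ... RELATE clause attaches the matching child rows to each parent row. The Python sketch below mimics that nesting on invented sample rows; it illustrates the idea only and says nothing about how the Shape provider is implemented. The child CustomerId serves only to relate the two rowsets, which is why the training statement binds it to SKIP.

```python
# Invented sample rows standing in for the Customer and Movies tables.
customers = [
    (101, "Male", "Married"),
    (102, "Female", "Single"),
]
movies = [
    (101, "Terminator"),
    (101, "Star Wars"),
    (102, "Apollo 13"),
]


def shape(parents, children):
    """Nest child rows under the parent row that shares the same CustomerID."""
    nested = {}
    for customer_id, gender, marital_status in parents:
        nested[customer_id] = {
            "Gender": gender,
            "MaritalStatus": marital_status,
            "Movies": [],
        }
    for customer_id, movie_name in children:
        # The child key is used only for the relation; it is not stored again.
        if customer_id in nested:
            nested[customer_id]["Movies"].append(movie_name)
    return nested


cases = shape(customers, movies)
for customer_id, case in cases.items():
    print(customer_id, case)
```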

Suppose that there is a married male customer who likes the movie Terminator. The following singleton query returns five other movies that this customer is most likely to find appealing:

SELECT
    CustomerID,
    Predict(MovieAssociation.Movies, 5) AS Recommendation
FROM MovieAssociation
NATURAL PREDICTION JOIN
    (SELECT '101' AS CustomerID,
        'Male' AS Gender,
        'Married' AS MaritalStatus,
        (SELECT 'Terminator' AS MovieName) AS Movies) AS t

The result of the query is shown in Table 5.3, with Recommendation as a nested table containing the five movies to recommend.

Table 5.3 The Five Recommended Movies

CUSTOMERID  RECOMMENDATION
101         A Beautiful Mind
            Lord of the Rings: The Fellowship of the Ring
            Princess Bride, The
            Star Wars
            Apollo 13
            ...
