Using the Data Mining Wizard - Data Mining with

The Data Mining Wizard creates two objects for you: the mining structure that describes the columns and training data you will use for mining, and a mining model, which takes those columns, applies an algorithm, and defines the usage of each column for that algorithm. The wizard wraps the creation of these two objects into one simple set of steps.

The steps of the wizard are to select your algorithm, select the source tables and specify how they are used, select the columns from those tables and specify how they are used, and name the model. At that point, you can process your model and analyze its results without further ado. Analysis Services makes it that simple to get started. The wizard also allows you to create models from multidimensional (that is, OLAP) sources. This topic is covered in Chapter 11, so we will focus only on relational sources for the time being.

Using the wizard is simple because it performs several steps automatically, based on the input you provide. As a data miner, it is important that you understand these steps and how and when decisions that impact your model are made.

On the first page of the wizard, you choose whether you are creating a model from a relational or multidimensional source, as shown in Figure 3.10. Although in the end a model created from one source appears identical to one created from the other, the creation process is slightly different, so there are different wizard paths for each option. Also, a particular mining algorithm may not support creating models from OLAP sources, so this question is asked first.

The next page asks you which algorithm to use to create your initial mining model. The list of algorithms is determined by the capabilities of your target server and may contain more or fewer algorithms than are covered in this book. The reasons for this, and the process by which it occurs, are described in Chapter 13. If you cannot connect to a server at the time the wizard is run, you get the default list of algorithms provided with SQL Server Data Mining, as shown in Figure 3.11. Which algorithm you choose depends on the business problem you are trying to solve. The application of each algorithm is described in its respective chapter.

On the next two pages, you indicate the data you will be mining. You choose the DSV containing the tables, then you specify the actual tables themselves.

When choosing the tables, you have to specify whether each table is the case table or a nested table, as shown in Figure 3.12. As described in Chapter 2, the case table is the table that contains the entities you want to analyze, and a nested table contains additional, usually transactional, information about each case.

TIP Sometimes determining which table is the case table can be a bit confusing. For example, if you want to analyze how products are purchased together, you may naïvely choose products as the case table. However, you are actually analyzing the groups of those products that were purchased by a single customer. In this case, the customer becomes the case, with the transaction table containing the product purchases as a nested table.

When you have only a transaction table, the table can be used as both the case table and the nested table by specifying the transaction ID as the case-level key and the other columns as columns in the nested table.
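To make the dual role of a lone transaction table concrete, the following is a minimal Python sketch rather than anything the wizard itself runs. The rows, the column layout, and the `build_cases` helper are all hypothetical. It groups flat rows by transaction ID (the case-level key) and keeps the remaining columns as a nested table keyed by product.

```python
from collections import defaultdict

# Hypothetical rows from a single transaction table:
# (row_id, transaction_id, product, quantity, price)
rows = [
    (1, "T1", "Milk", 2, 1.50),
    (2, "T1", "Bread", 1, 2.25),
    (3, "T2", "Milk", 1, 1.50),
    (4, "T2", "Eggs", 12, 0.20),
    (5, "T3", "Bread", 2, 2.25),
]


def build_cases(rows):
    """Group flat transaction rows into one case per transaction ID.

    The transaction ID plays the role of the case-level key. The other
    columns of interest form the nested table, keyed by product.
    """
    cases = defaultdict(dict)
    for _row_id, transaction_id, product, quantity, price in rows:
        # The row ID is dropped: it is unique per row and carries no pattern.
        cases[transaction_id][product] = {"quantity": quantity, "price": price}
    return dict(cases)


cases = build_cases(rows)
for transaction_id, nested in cases.items():
    print(transaction_id, sorted(nested))
```

Each printed line corresponds to one case and lists the keys of its nested table, which is the shape the mining algorithms see once the table types are assigned this way.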

Figure 3.10 Select Method screen of the Data Mining Wizard

Figure 3.11 Select Data Mining Technique page of the Data Mining Wizard

Figure 3.12 Specifying table types in the Data Mining Wizard

On the next two pages, you indicate which columns you are using plus how you want the mining algorithms to interpret each one. First, you specify which columns are used in the model, plus whether they are key, input, and/or predictable. Then you specify the data and content types for each of the columns.

You must specify a key for the case table and each nested table in your model, as shown in Figure 3.13. Remember that the key of a nested table in DMX is not the foreign key that relates the nested table to the case table; rather, it is the key in the context in which the table is nested. The wizard enforces this relationship by not presenting the foreign key as a choice and by warning you if a key is not specified. For example, a nested table representing a customer’s shopping cart comes from a table that may have a row ID as a key, plus transaction ID, product name, quantity, and price. The nested table in our model would only have the product, quantity, and price columns, because the row ID isn’t of interest in our model, and the transaction ID is the foreign key to the case table. In this reduced context, you can see that the quantity and price relate to the product, which becomes the key of the nested table. Sequence clustering and time series models have special rules regarding the specification of keys. See Chapters 8 and 10, respectively, for specific details.

TIP One thing to consider when determining the correct column to be a nested key is that data mining finds patterns by examining similarities and differences between cases. If you chose a column as a nested key such that the values in that column would only show up in a single case, the data mining algorithms would find no patterns relative to that column. This logic summarily dismisses the use of transaction IDs or row IDs as nested keys.
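One way to check a candidate nested key before committing to it is to count how many distinct cases each of its values appears in. The sketch below is illustrative only: the sample rows and the helper names `cases_per_key_value` and `shared_fraction` are assumptions, not part of Analysis Services.

```python
from collections import defaultdict

# Hypothetical shopping-cart rows:
# (row_id, transaction_id, product, quantity, price)
rows = [
    (1, "T1", "Milk", 2, 1.50),
    (2, "T1", "Bread", 1, 2.25),
    (3, "T2", "Milk", 1, 1.50),
    (4, "T2", "Eggs", 12, 0.20),
    (5, "T3", "Bread", 2, 2.25),
]


def cases_per_key_value(rows, key_index, case_index=1):
    """For each value of a candidate nested key, count the distinct cases it appears in."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key_index]].add(row[case_index])
    return {value: len(case_ids) for value, case_ids in seen.items()}


def shared_fraction(rows, key_index):
    """Fraction of key values that occur in more than one case."""
    counts = cases_per_key_value(rows, key_index)
    return sum(1 for n in counts.values() if n > 1) / len(counts)


print("row ID :", shared_fraction(rows, key_index=0))
print("product:", shared_fraction(rows, key_index=2))
```

In this sample, no row ID is shared between cases, whereas Milk and Bread each appear in two transactions. A fraction of zero is the signal that the column cannot support patterns across cases.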

Figure 3.13 Indicating column usages in the Data Mining Wizard, showing the specification of nested keys

Which columns you specify as input and which as predictable depends on your business problem, the hypothesis you are trying to test, and the algorithm you chose. In general, specifying a column as Input indicates that the algorithm will use that column to determine the columns marked as Predictable, that is, the outputs. The exact way that each algorithm uses this information varies somewhat, so you should familiarize yourself with the specific semantics detailed in each algorithm chapter. One fact that remains constant among all algorithms is that if you want to be able to select a column from the model in a PREDICTION JOIN statement, the column must be predictable. To predict a nested table, check the box in the Predict column next to its key.
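The usage flags can be pictured as a small record per column. The following Python sketch is a hypothetical model of the check boxes on this page, not an Analysis Services API; the class, the helper, and the column names are invented for illustration. It shows the one rule stated above: only columns flagged as predictable can be selected from the model in a prediction query.

```python
from dataclasses import dataclass


@dataclass
class ColumnUsage:
    """Usage flags for one column, mirroring the Key, Input, and Predict check boxes."""
    name: str
    is_key: bool = False
    is_input: bool = False
    is_predictable: bool = False


def selectable_in_prediction(columns):
    """Names of the columns that could be selected from the model when predicting."""
    return [c.name for c in columns if c.is_predictable]


# Hypothetical case-table columns.
case_columns = [
    ColumnUsage("CustomerId", is_key=True),
    ColumnUsage("Age", is_input=True),
    ColumnUsage("Income", is_input=True, is_predictable=True),  # input and output
    ColumnUsage("BikeBuyer", is_predictable=True),
]

print(selectable_in_prediction(case_columns))
```

Note that a column may carry more than one flag, as Income does here, which matches the "key, input, and/or predictable" choice the wizard offers.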

TIP If you have many columns in your table, it can be difficult to know which to choose as inputs. You can always use all the columns, but this involves additional processing power and, depending on the algorithm, may make your model difficult to interpret.

The Suggest button on the Specify Column Usage page of the wizard performs a quick entropy-based analysis to indicate which columns are likely to provide information toward a selected output, thereby reducing the number of columns in your final model. Note that this feature only considers case-level columns in its analysis and is not a guarantee that the selected columns will impact or that the nonselected columns will not impact your target variable.
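The exact computation behind the Suggest button is not described here, so the following is only a sketch of what an entropy-based ranking can look like, using textbook information gain. The sample data, the column names, and the `rank_inputs` helper are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(rows, column, target):
    """Reduction in the target's entropy after splitting the rows on one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def rank_inputs(rows, candidates, target):
    """Candidate columns sorted from most to least informative about the target."""
    scored = [(column, information_gain(rows, column, target)) for column in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical case-level data.
customers = [
    {"HomeOwner": "Yes", "Gender": "M", "BikeBuyer": "Yes"},
    {"HomeOwner": "Yes", "Gender": "F", "BikeBuyer": "Yes"},
    {"HomeOwner": "No", "Gender": "M", "BikeBuyer": "No"},
    {"HomeOwner": "No", "Gender": "F", "BikeBuyer": "No"},
]

ranking = rank_inputs(customers, ["Gender", "HomeOwner"], "BikeBuyer")
for column, gain in ranking:
    print(f"{column}: {gain:.2f} bits")
```

In this toy data set, HomeOwner fully determines BikeBuyer while Gender tells you nothing, so a ranking of this kind would suggest keeping the former and dropping the latter. As with the wizard's own suggestion, a low score on a small sample is a hint, not a guarantee.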

Next, you are presented with the list of columns you have chosen and their respective data and content types, as shown in Figure 3.14. Indicating the correct content type is crucial to the performance and accuracy of your model. If you had a field such as Income marked as DISCRETE, for instance, the algorithm would assume that each possible income value was a distinct category and would likely spend extra processing power to learn absolutely nothing.

On the flip side, if you had a categorical column where the categories were indicated by integers (for example, 1–Blue, 2–Yellow, 3–Red, 4–Green, and so on) marked as CONTINUOUS, the algorithm would assume that it could compare them and measure distances between points, in this case creating the bizarre logic that Green(4) – Red(3) = Blue(1)! Luckily, the Data Mining Wizard has the ability to automatically detect whether a numeric column is categorical (discrete) or continuous. Clicking the Detect button on this page causes the wizard to sample and analyze the source data and choose an appropriate content type. If a continuous type is determined and your selected algorithm does not support continuous columns, the content type will be specified as DISCRETIZED. You can set discretization parameters in the designer, as described in the next section. Before moving on with the wizard, you should verify that the content types were assigned correctly and modify any that were not.
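The rule that the Detect button applies is not spelled out here, so the sketch below substitutes a simple stand-in heuristic: a numeric column holding only a handful of distinct whole numbers is guessed to be categorical. The threshold `max_categories`, the function name, and the sample values are all assumptions. The fallback to DISCRETIZED for algorithms without continuous support follows the behavior described above.

```python
def guess_content_type(values, max_categories=10, supports_continuous=True):
    """Guess a content type for a sample of numeric column values.

    Assumed heuristic: few distinct whole-number values suggest category codes.
    """
    distinct = set(values)
    all_whole = all(float(v).is_integer() for v in distinct)
    if all_whole and len(distinct) <= max_categories:
        return "DISCRETE"
    # A continuous column falls back to DISCRETIZED when the chosen
    # algorithm cannot handle continuous inputs.
    return "CONTINUOUS" if supports_continuous else "DISCRETIZED"


# Hypothetical samples.
color_codes = [1, 2, 3, 4, 2, 1, 3]  # 1-Blue, 2-Yellow, 3-Red, 4-Green
incomes = [31250.0, 48900.5, 52000.0, 75125.75]

print(guess_content_type(color_codes))
print(guess_content_type(incomes))
print(guess_content_type(incomes, supports_continuous=False))

# Why treating category codes as continuous misleads: arithmetic on the
# codes "works" but the result has no meaning.
colors = {1: "Blue", 2: "Yellow", 3: "Red", 4: "Green"}
print("Green - Red =", colors[4 - 3])
```

A heuristic like this can misfire, for example on a whole-number count column with few distinct values in the sample, which is why the detected types should still be reviewed by hand.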

The final page of the wizard, shown in Figure 3.15, allows you to specify the names of the structure and model and enable the drill-through feature if it is supported by the algorithm. When completed, the wizard creates a mining structure containing a mining model and launches the Data Mining Designer.

Figure 3.14 Specifying content and data types in the Data Mining Wizard

Figure 3.15 Naming objects in the Data Mining Wizard
