Time Constraints tab

From the document Data Mining Using (Page 98-101)

The Time Constraints tab is designed to specify the length of the time sequence in the analysis. The tab applies only to sequence discovery. If you are performing association analysis, then the Time Constraints tab is grayed out and unavailable for viewing.

Transaction Window Length: This option defines the time interval, that is, the time window length, of the time sequence. Select the Specify duration to use radio button to specify the time window of the desired numeric time range of the time sequence. The default is the Maximum duration, where the node automatically determines the maximum time range based on the values of the sequence variable. The node is designed so that any difference between successive observations in the sequence variable that is less than or equal to the value entered in the Transaction Window Length entry field is considered the same time and the same transaction. Conversely, any time difference greater than the window length value does not constitute a sequence and is, therefore, removed from the analysis.
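The grouping half of this rule can be illustrated with a minimal Python sketch. It is not the node's implementation: the function name `assign_transactions` and the sample values are hypothetical, and the sketch simply starts a new transaction at a larger gap rather than modeling how the node removes such differences from the analysis.

```python
def assign_transactions(times, window):
    """Label each sequence value with a transaction number.

    Successive values whose difference is <= window share a transaction;
    a larger gap starts a new one. `times` must be sorted ascending.
    """
    labels = []
    current = 0
    for i, t in enumerate(times):
        # Compare each value with the one immediately before it.
        if i > 0 and t - times[i - 1] > window:
            current += 1
        labels.append(current)
    return labels


# Hypothetical sequence values with a window length of 2.
times = [1, 2, 3, 10, 11, 30]
print(assign_transactions(times, window=2))  # [0, 0, 0, 1, 1, 2]
```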

Consolidate time differences: This option is designed to collapse sequences that occur at different times. For example, consider a customer arriving at a restaurant in four separate visits. At the beginning, the customer goes to the restaurant to buy breakfast in the morning, then the same customer returns in the afternoon to purchase lunch, and, finally, returns in the evening to buy dinner. This same customer returns two days later to buy soda and ice cream. In this example, the sequence variable is defined as visits with four separate entries. Therefore, this option is designed for you to perform sequence discovery on a daily basis by consolidating the multiple visits into a single visit. If you want to perform sequence discovery on a daily basis, then you would need to enter 24 (hours) in the Consolidate time differences entry field. Again, this option would instruct the node to consolidate these multiple visits on the same day into a single visit.
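The restaurant example can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's actual logic: the helper `consolidate_visits`, the dates, and the rule of comparing each visit with the previous one are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def consolidate_visits(visits, hours):
    """Merge visits whose gap from the previous visit is <= `hours`.

    `visits` is a list of (timestamp, items) pairs sorted by timestamp.
    Returns one merged item list per consolidated visit.
    """
    limit = timedelta(hours=hours)
    merged = []
    previous = None
    for when, items in visits:
        if previous is not None and when - previous <= limit:
            merged[-1].extend(items)    # same consolidated visit
        else:
            merged.append(list(items))  # start a new visit
        previous = when
    return merged


# Hypothetical timestamps for the four visits described above.
visits = [
    (datetime(2024, 1, 1, 8, 0), ["breakfast"]),
    (datetime(2024, 1, 1, 13, 0), ["lunch"]),
    (datetime(2024, 1, 1, 19, 0), ["dinner"]),
    (datetime(2024, 1, 3, 15, 0), ["soda", "ice cream"]),
]
print(consolidate_visits(visits, hours=24))
# [['breakfast', 'lunch', 'dinner'], ['soda', 'ice cream']]
```

With a value of 24, the three same-day visits collapse into one entry and the visit two days later stays separate, so the four original entries become two.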

Sort tab

The Sort tab is designed to specify the primary and secondary sort keys for the id variable in the analysis. The output data set will be sorted by the primary sort key within each corresponding secondary key.

The tab displays both an Available list box and a Selected list box. The variables listed in the Available list box are all the variables in the active training data set that have been assigned a variable role of id with a variable status of use. To assign primary and secondary sort keys to the data mining analysis, select variables in the Available list box and click the right arrow control button to move them to the Selected list box. In the first step, you must select the id variable that represents the primary sort key for the data set. The primary sort key will be the first id variable displayed in the Selected list box. All other id variables listed in the Selected list box are assigned as the secondary sort keys. The arrow button located at the bottom of the Selected list box will allow you to change the sort order of the selected variables one at a time. You must set at least one of the id variables to a variable status of use if you have defined several id variables in the Input Data Source node. By default, the node sets all nonselected id variables to a variable attribute status of don't use.

Output tab

The Output tab is designed to browse the output scored data sets from the PROC ASSOC, SEQUENCE, and RULEGEN data mining procedure output listings. Which procedures are executed depends on the type of analysis that is specified, that is, association discovery or sequence discovery. The PROC ASSOC data mining procedure is applied in both types of analysis. However, association discovery applies the PROC RULEGEN data mining procedure in creating the rules, whereas sequence discovery uses the PROC SEQUENCE data mining procedure in creating the rules within each sequence.

The PROC ASSOC procedure determines the various items that are related to each other. The PROC RULEGEN procedure then generates the association rules from those items. The PROC SEQUENCE procedure uses the time stamp variable to construct the sequence rules. The output from these procedures is saved as SAS data sets after you execute the node. The data sets will allow you to observe the various evaluation criteria statistics from the various if-then rules that are listed.

Select the association Properties... button to view the Information tab that displays the administrative information about the data set, such as the name, type, created date, date last modified, columns, rows, and deleted rows. Alternatively, select the Table View tab to browse the output data set that is created from the PROC ASSOC procedure by running the node. The output data set displays a table view of the frequency count by all the possible n-way associations or interactions of the items or events. The scored data set will display separate columns called ITEM1, ITEM2, and so on, based on the number of associations that are specified from the Maximum number of items in an association option within the General tab. The frequency listing for association discovery is displayed in the following diagram.
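To make the idea of an n-way frequency listing concrete, here is a minimal Python sketch that counts, for a handful of customers, how many of them hold each item combination up to a maximum size. The baskets, the helper `count_itemsets`, and the `COUNT` label are hypothetical; this only mimics the shape of the table and is not the output of PROC ASSOC.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(baskets, max_items):
    """Count customers holding each item combination of size 1..max_items."""
    counts = Counter()
    for items in baskets.values():
        unique = sorted(set(items))  # ignore repeat purchases and order
        for size in range(1, max_items + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return counts


# Hypothetical customers and items.
baskets = {
    "c1": ["heineken", "crackers", "olives"],
    "c2": ["heineken", "crackers"],
    "c3": ["crackers", "herring"],
}
counts = count_itemsets(baskets, max_items=2)

# Lay each combination out in ITEM1, ITEM2, ... columns, as in the table view.
for combo, n in counts.most_common(3):
    row = {f"ITEM{i}": item for i, item in enumerate(combo, start=1)}
    print(row, "COUNT =", n)
```

The `max_items` argument plays the role of the Maximum number of items in an association option: raising it adds wider combinations and therefore more ITEM columns.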

Select the rules Properties... button to display the file administrative information and a table view of the rules output data set. The options are similar to the previous association options. From the tab, you may view the file information of the rules data set. However, the file layout of the rules data set is a bit more detailed. Select the Table View tab to view the rules data set that contains the evaluation criterion statistics, such as the frequency count, support, confidence, and lift values, based on the various rules that were specified from the General tab.

The table listing is sorted in descending order by the support probability between each item or event in the association analysis. In the following listing, consider the rule in which customers purchase Heineken beer and then crackers. Of the 1,001 customers purchasing items, approximately 36.56% purchased both Heineken beer and crackers, and 61.00% of the customers who purchased Heineken beer also purchased crackers. The lift value of 1.25 indicates the reliability of the association: the rule is better at predicting the association between both items than randomly guessing that both items are purchased. The table listing will provide you with the list of all combinations of items with a low support and a high confidence probability that should be interpreted with caution. Although it is not shown in the following table listing, there were four separate combinations of items where customers first purchased Bordeaux, then olives, herring, crackers, or Heineken beer, with a low support ranging between 3.2 and 4.3 and a high confidence probability ranging between 43.25 and 59.36.
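The three statistics can be written out with their standard definitions, as in the Python sketch below. The function `rule_stats` and the counts passed to it are hypothetical and are not taken from the listing; only the 61.00 confidence and 1.25 lift in the last lines come from the text above.

```python
def rule_stats(n_total, n_lhs, n_rhs, n_both):
    """Return (support %, confidence %, lift) for the rule LHS => RHS."""
    support = 100.0 * n_both / n_total            # customers buying both items
    confidence = 100.0 * n_both / n_lhs           # of LHS buyers, share also buying RHS
    expected_confidence = 100.0 * n_rhs / n_total # RHS buyers overall
    lift = confidence / expected_confidence
    return support, confidence, lift


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
support, confidence, lift = rule_stats(n_total=1000, n_lhs=600, n_rhs=500, n_both=360)
print(support, confidence, lift)

# Using the figures quoted in the text: confidence / lift gives the share of
# all customers who bought crackers that those two values imply.
implied_cracker_share = 61.00 / 1.25
print(round(implied_cracker_share, 2))  # 48.8
```

A lift above 1 means that knowing the customer bought the first item raises the chance of the second item above its overall rate, which is why the rule is described as doing better than random guessing.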

Conversely, selecting the sequence rule will display the procedure output listing from the PROC SEQUENCE data mining procedure. In the following diagram, the same combination of Heineken beer and crackers is the most popular combination of items purchased in the sequence analysis results. Notice that the support probability is smaller in the following table listing from sequence analysis among customers purchasing crackers and Heineken beer, since the order in which these items are purchased is taken into account.
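The effect of ordering on support can be seen in a small Python sketch. The purchase histories and both helper functions are hypothetical, and each history is simplified to an ordered list of single-item purchases; this is not how PROC SEQUENCE is implemented.

```python
def association_support(histories, a, b):
    """Share of customers who bought both a and b, in any order."""
    hits = sum(1 for h in histories if a in h and b in h)
    return hits / len(histories)


def sequence_support(histories, first, then):
    """Share of customers who bought `first` and later bought `then`."""
    hits = 0
    for h in histories:
        # Look for `then` only after the earliest purchase of `first`.
        if first in h and then in h[h.index(first) + 1:]:
            hits += 1
    return hits / len(histories)


# Hypothetical ordered purchase histories, one list per customer.
histories = [
    ["heineken", "crackers"],
    ["crackers", "heineken"],
    ["heineken", "olives", "crackers"],
    ["olives"],
]
print(association_support(histories, "heineken", "crackers"))  # 0.75
print(sequence_support(histories, "heineken", "crackers"))     # 0.5
```

The second customer counts toward the association but not toward the sequence, because crackers were bought before the beer. Sequence support can therefore never exceed the corresponding association support.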
