Common Predictive Modeling Techniques

ENSEMBLE METHODS

An ensemble model is a combination of two or more models. Ensembles can be combined in several ways. A common approach, and my personal preference, is a voting strategy, where each model votes on the classification of an observation and then, in democratic fashion, the classification with the most votes wins. Another strategy is to average the predictions from the different methods to make a numeric prediction.
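As a minimal sketch of the two combination strategies just described, the following Python snippet uses only the standard library. The function names and the three sets of model predictions are hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label chosen by the most models (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Return the mean of the models' numeric predictions."""
    return mean(values)


# Hypothetical predictions from three models for a single observation.
class_votes = ["churn", "stay", "churn"]
numeric_predictions = [10.0, 12.0, 14.0]

print(majority_vote(class_votes))               # churn
print(average_prediction(numeric_predictions))  # 12.0
```

With an even number of models, a tie-breaking rule is needed; the sketch simply keeps the first label encountered.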

Boosting and bagging are accumulation model techniques that resample the training data to create new models for each sample that is drawn.

Bagging is the most straightforward of ensemble methods. Simply stated, it takes repeated unweighted samples from the data. Each observation can be included in a sample only once (without replacement within a sample), but observations are eligible to be included in all samples (sampling with replacement across samples). With bagging, it is possible that an observation is included in every sample or in none. Under normal conditions, neither of these cases is likely. Because the samples are unweighted, no weight is given to misclassified observations from previous samples. To implement a bagging strategy, you decide how many samples to generate and how large each sample should be. There is a trade-off between these two values. If you take a small percentage of the observations, you will need to take a larger number of samples. I usually take about a 30% sample and take seven samples so that each observation is likely to be selected into a couple of samples. After the samples have been taken and a model fit to each, those models are then accumulated to provide the final model. This final model is generally more stable than the single model using all the data. Bagging is particularly effective when the underlying model is a decision tree. I have not found benefit to bagging regression or neural network models.
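The sampling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the samples are drawn independently, the data set has 1,000 observations, and the function name `draw_bagging_samples` is hypothetical. It also works out how often an observation is expected to be selected with seven 30% samples.

```python
import random


def draw_bagging_samples(n_obs, n_samples=7, fraction=0.30, seed=0):
    """Draw index samples: no repeats within a sample, repeats allowed across samples."""
    rng = random.Random(seed)
    size = round(n_obs * fraction)
    return [rng.sample(range(n_obs), size) for _ in range(n_samples)]


n_obs = 1000  # hypothetical data set size
samples = draw_bagging_samples(n_obs)

# Count how many of the seven samples each observation lands in.
counts = [0] * n_obs
for sample in samples:
    for i in sample:
        counts[i] += 1

average_selections = sum(counts) / n_obs  # 7 * 0.30, i.e. about 2.1 samples
p_in_none = (1 - 0.30) ** 7               # about 0.082
p_in_all = 0.30 ** 7                      # about 0.0002

print(average_selections, round(p_in_none, 3), round(p_in_all, 4))
```

Each sample would then be used to fit one model, and the fitted models combined by voting or averaging.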

Why does bagging work? In the case of the decision tree model, by taking a sample of the data, you will likely find different splits and a different-looking tree for each sample. None of these tree models is as good as the tree fit on all the data, but when the weaker models are combined, their predictive power is often better than the single model built on the full data. This is because excluding some observations allows weaker relationships to be exploited.

Boosting is very similar to bagging, with the major difference that in boosting, reweighting occurs after each sample model is fit to “boost” the performance of the weak model built on the sampled data. The weights of the misclassified observations are increased so that they are more likely to be selected, which gives later models a chance to exploit subtle relationships in the data. Boosting normally yields a better overall model than bagging. The trade-off is that boosting must be done sequentially because of the reweighting, while bagging can be done in parallel, assuming you have sufficient system resources. In my personal experience, I tend to use boosting for classification models and bagging for interval prediction models.
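The reweighting step can be sketched as follows. This is a simplified illustration, not a specific boosting algorithm: the fixed `factor` of 2.0 is a hypothetical choice (algorithms such as AdaBoost derive the adjustment from the model's error rate), and the set of misclassified observations is made up.

```python
import random


def reweight(weights, misclassified, factor=2.0):
    """Multiply the weights of misclassified observations by `factor`, then renormalize."""
    raised = [w * factor if i in misclassified else w for i, w in enumerate(weights)]
    total = sum(raised)
    return [w / total for w in raised]


n_obs = 5
weights = [1 / n_obs] * n_obs  # every observation starts with equal weight
misclassified = {1, 3}         # hypothetical errors made by the first model

weights = reweight(weights, misclassified)
# Observations 1 and 3 now carry twice the weight of the others.

# Draw the next training sample in proportion to the updated weights.
# Note that random.choices samples with replacement.
rng = random.Random(0)
next_sample = rng.choices(range(n_obs), weights=weights, k=n_obs)
print(weights, next_sample)
```

Because each round depends on the errors of the previous model, the loop over rounds cannot be parallelized the way bagging can.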

Random forest is an ensemble tree technique that builds a large number of trees that are weak classifiers and then uses them to vote in some manner to build a stable and strong classifier that is better than the average tree created in the forest. This falls under the axiom that the whole is greater than the sum of its parts. One of the fundamental ideas in random forest is that a subset of the observations and a subset of the variables are taken. This sampling in both dimensions ensures that all variables are considered, not just the dominant few as is usually seen in decision trees. What is the right number of variables to consider in each tree? A good rule of thumb is to use the square root of the number of candidate variables.
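A small Python sketch of the square-root rule of thumb is shown below. Rounding to the nearest whole number is an assumption made here, and the function name and the 100 candidate variable names are hypothetical.

```python
import math
import random


def variables_to_try(n_candidates):
    """Square-root rule of thumb, rounded to a whole number of variables (at least 1)."""
    return max(1, round(math.sqrt(n_candidates)))


candidate_variables = [f"x{i}" for i in range(1, 101)]  # 100 hypothetical inputs
k = variables_to_try(len(candidate_variables))          # 10 variables

# Draw a random subset of variables to consider for one tree.
rng = random.Random(0)
subset = rng.sample(candidate_variables, k)
print(k, subset)
```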

There are a number of tuning parameters to be considered in the training of a random forest model. Here are the high-level dimensions and my recommendations for defaults, given my experience in modeling. Specific problems might require deviations from these recommendations, but I have found success across many industries and problem types.

Number of trees

Variables to try

Number of splits to consider

Number of observations in each tree

CHAPTER 6

Segmentation

Segmentation is a collection of methods used to divide items, usually people or products, into logical, mutually exclusive groups called segments. These segmentation methods include several functions for identifying the best variables to be used in defining the groups, the variables to be used in analyzing these groups, assigning observations to the clusters, analyzing/validating these clusters, and profiling them to see the common characteristics of the groups.

A well-defined segment has the following features:

It is homogeneous with respect to a similarity measure within itself.

It is distinct from other segments with respect to that similarity measure.

Rummikub is a tile-based game for two to four players that was invented by Ephraim Hertzano.¹ Hertzano was born in Romania and developed the game when card games were outlawed under communist rule. The game consists of 104 numbered tiles plus two jokers. The tiles are divided into four unique color groups (usually red, blue, green, and orange) and numbered 1 to 13 so that there are two tiles for each number-color combination. The game begins with each player drawing tiles from the 106 facedown tiles. I do not know the official number you are supposed to draw, but in my family we draw 20.

¹ Rummikub won the Spiel des Jahres (Germany’s Game of the Year) in 1980.

The data analysis for this chapter was generated using SAS Enterprise Miner software, Version 13.1 of the SAS System for Windows. Copyright © 2013 SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.

After the tiles are drawn, each player organizes their tiles without revealing them to the other players. Once players have had a chance to organize their tiles, play begins. Dean house rules require the oldest player to go first. During a player’s first turn, they must create valid sets exclusively from their tiles. A set is either a run of at least three consecutive tiles of the same color or at least three tiles of different colors with the same number. For example, tiles 4, 5, and 6 in blue are a valid run, and three 11 tiles, one each in red, blue, and orange, are a valid set, as shown in Figure 6.1. The official rules require that the initial meld must have a certain value of the summed numbers played. The initial meld threshold was 50 before 1988, then it was dropped to 30; in my family, kids under 5 do not have an initial threshold until they have beaten their father. If a player cannot meet the initial meld threshold, then they draw a tile and the play passes to the next younger player.

Figure 6.1 Example of Valid Sets in Rummikub

After a player has completed the initial meld (hopefully on the first turn), then play continues with players laying down valid sets or runs from their tiles or reorganizing the existing tiles to place their tiles. The objective of the game is to place all of your tiles. The first person to do so is declared the winner. The jokers can be used at any time in place of any numbered tile.

The game of Rummikub is a hands-on experience in segmentation. The tiles can be part of only one group of tiles (either a set or a run) at any particular time, just as segmentations must create mutually exclusive groups. The strategy for partitioning the tiles is different among different players. Each player in my family has a slightly different system for organizing their tiles and how they choose to play their tiles through the game.

In segmentation for business or Rummikub, there is not a single “right” answer, and the best segmentation strategy can change through the course of the game just as segmentation strategies in organizations must adapt to changing market conditions to be successful. I often have three segments to my tiles initially. I have tiles lower than 5 for making sets, tiles higher than 9 for making sets, and tiles between 5 and 9 for making runs. I always hope for an additional segment of jokers, which I keep separate until I can use them to place all of my remaining tiles and win the game. I have these three segments because of the challenges of forming runs. In order to create a run with a black 1 tile, I must have the black 2 and black 3 tiles; at the beginning of the game, I have only about a 3.8% chance (4/106, because there are two tiles for each number-color combination) of getting that combination. In contrast, to get a set of 1 tiles, I have about a 5.7% chance (6/106: there are eight tiles for each number, but a set cannot contain two tiles of the same color, so that leaves two tiles each for the orange 1, red 1, and blue 1). The same probability situation occurs at the upper extreme of the numbers. The middle numbers segment has better odds for runs because there are multiple numbers that can be drawn or used from the existing board to create those runs. In the case of an 8 red, I can use 6 red, 7 red, 9 red, or 10 red to create a run.
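The two ratios quoted above can be checked with a few lines of Python. The sketch reproduces only the simple share-of-pool arithmetic used in the text (helpful tiles divided by all 106 tiles); the variable names are illustrative.

```python
TOTAL_TILES = 106   # 104 numbered tiles plus two jokers

run_completers = 4  # two black 2 tiles and two black 3 tiles
set_completers = 6  # two each of the 1 tiles in the three other colors

run_ratio = run_completers / TOTAL_TILES
set_ratio = set_completers / TOTAL_TILES

print(f"run: {run_ratio:.2%}, set: {set_ratio:.2%}")  # run: 3.77%, set: 5.66%
```

The set ratio is half again as large as the run ratio, which is the reason for holding low and high tiles for sets rather than runs.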


My youngest son, who has just started playing Rummikub, has a simpler strategy with just two segments: jokers and everything else. He is much stronger at putting runs together than sets, so runs dominate his play. (I think this is due to the challenge of not having the same color in a set, which he sometimes forgets and his older siblings are quick to point out.) This highlights another parallel between segmentation in the game and business: For most applications, a simple segmentation strategy is better than no strategy at all. If my youngest son were to have just one segment (all tiles played equally), the joker would be played very early in the game and provide him little advantage, to the point of being essentially wasted.

The segmentation of the jokers from the numbered tiles points out the difference between clustering and segmentation. The terms are often used interchangeably because they are so closely related, but segmentation is a superset of clustering. Clustering is done by an algorithmic method, and it is mechanical. Segmentation often uses clustering techniques and then applies business rules like the treatment of keeping jokers in a special segment until the end of the game.

Many companies build their own segments, but others prefer to buy them to aid primarily in their marketing activities. One commercial set of segments is done by PRIZM (which is currently owned by Nielsen). When I was considering moving from northern Virginia to Cary, North Carolina, I spent hours reading the descriptions of the different segments and imagining the people who lived there. PRIZM uses 66 segments to describe the people across America. Here is an example of three segments to give you an idea of the type of description and characterization given:

19 Home Sweet Home

Widely scattered across the nation’s suburbs, the residents of Home Sweet Home tend to be upper-middle-class married couples living in mid-sized homes with few children. The adults in the segment, mostly between the ages of 25 and 54, have gone to college and hold professional and white-collar jobs. With their upscale incomes and small families, these folks have fashioned comfortable lifestyles, filling their homes with toys, TV sets and pets.

07 Money & Brains

The residents of Money & Brains seem to have it all: high incomes, advanced degrees and sophisticated tastes to match their credentials. Many of these city dwellers, predominantly white with a high concentration of Asian Americans, are married couples with few children who live in fashionable homes on small, manicured lots.

51 Shotguns & Pickups

The segment known as Shotguns & Pickups came by its moniker honestly: It scores near the top of all lifestyles for owning hunting rifles and pickup trucks. These Americans tend to be young, working-class couples with large families (more than half have two or more kids) living in small homes and manufactured housing. Nearly a third of residents live in mobile homes, more than anywhere else in the nation.

These segments are designed to be informative yet concise and allow organizations to match the goods or services with those people most likely to respond positively. For certain specialized applications, each segment may have additional constraints that would need to be satisfied as well.

Since each segment is homogeneous, which means lower variance within the individual segments, it is possible to get better predictive models for each segment than you could achieve by building a predictive model for the entire population. Plainly stated, segmentation will often improve your ability to make correct predictions or classifications during predictive modeling activities, assuming each segment is large enough to build quality predictive models.
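To make the variance point concrete, here is a small illustrative Python sketch. The two segments and their spending values are entirely made up; the only purpose is to show that the variance within each homogeneous segment can be far lower than the variance of the pooled population.

```python
from statistics import pvariance

# Hypothetical spend values for two well-separated segments.
segments = {
    "low_spend": [10, 12, 11, 13],
    "high_spend": [50, 52, 51, 53],
}

# Pool every observation to represent the entire population.
everyone = [value for values in segments.values() for value in values]

overall_variance = pvariance(everyone)
within_variance = {name: pvariance(values) for name, values in segments.items()}

print(overall_variance)  # variance of the pooled population
print(within_variance)   # variance inside each segment
```

A model fit within one of these segments has much less spread to explain than a model fit to the pooled data, provided the segment still contains enough observations.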


Feature films are another example of segmentation. Movies are classified into categories for many reasons. Perhaps the most important is that if they were not, the list of film choices would overwhelm most people and be too difficult to navigate effectively. We segment movies by their genre: horror, sci-fi, action, romantic comedy, and so on. Movies are also segmented by their rating and several other categories. I doubt there is a movie that appeals to the entire population, so production studios and sponsors want to understand the specific demographic for which a film is intended. Fast food chains would never want to distribute action figures from an R-rated horror film with their children’s menu. Likewise, a PG-rated animated film of a classic Grimm fairy tale would not be expected to be popular in the 21- to 25-year-old male demographic.