SELECTING DRIVER VARIABLES THAT PROVIDE EXPLANATORY INFORMATION

From the document Big Data, Mining, and Analytics (pages 129-132)

Data Management and the Model Creation Process of Structured

Identifying driver (independent) variables in the mining process involves selecting variables that can plausibly impact a given target or performance metric. In other words, the analyst must identify those variables that drive the movement or variance of a target/performance metric. The era of big data has added both great value and complexity to this process. As we mentioned previously, big data is not just a greater volume of data, but also the emergence of new data variables that describe or underpin processes. The value lies in the fact that, with these new data variables, decision makers can achieve a greater understanding of what drives metric variances.

TABLE 5.3

Time Elements of Train Ridership

Date      Day        Origin/Destination  Train           Riders
03/15/13  Friday     Summit/Penn         Peak 7:00 a.m.   8,800
03/18/13  Monday     Summit/Penn         Peak 7:00 a.m.  11,350
03/19/13  Tuesday    Summit/Penn         Peak 7:00 a.m.  13,210
03/20/13  Wednesday  Summit/Penn         Peak 7:00 a.m.  13,000
03/21/13  Thursday   Summit/Penn         Peak 7:00 a.m.  13,350
03/22/13  Friday     Summit/Penn         Peak 7:00 a.m.   9,100

Consider the process of determining car insurance premiums. Traditional variables usually incorporated in this process include driver demographic information, driving history, geographic location of driver activity, and automobile type, to name a few. In the big data era, GPS data sources can perhaps augment this process with information on accident rates at particular geographic locations. Insurance providers could potentially better estimate the risk of drivers who routinely travel in such areas. Simply put, the new driver variables of the big data era can provide greater explanatory information for decision makers. Chapter 6 illustrates the evolution of modeling consumer activities with the introduction of such new data sources, including weather and web-related searches on relevant topics, to add explanatory power to the analysis. However, with the emergence of new data variables comes the complexity of identifying these new data sources and merging them with existing data resources. This last concept requires thorough consideration of the data formatting section just described.
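As a rough illustration of merging a new data source with an existing one, the sketch below attaches a location-based accident rate to policyholder records. Everything in it is an assumption made for illustration: the field names, the zone identifiers, the rates, and the helper `add_route_risk` are hypothetical and do not come from any real insurer's data.

```python
# Illustrative sketch only: every identifier and number here is invented.

# Existing data resource: one record per policyholder (traditional variables).
policyholders = [
    {"driver_id": 1, "age": 34, "routine_zones": ["zone_a", "zone_b"]},
    {"driver_id": 2, "age": 51, "routine_zones": ["zone_c"]},
    {"driver_id": 3, "age": 27, "routine_zones": ["zone_d", "zone_x"]},
]

# New data resource: accidents per 1,000 trips by zone (e.g., GPS-derived).
accidents_per_1000_trips = {"zone_a": 8, "zone_b": 24, "zone_c": 5, "zone_d": 11}


def add_route_risk(record, rates):
    """Return a copy of the record with a merged route-risk variable."""
    zones = record["routine_zones"]
    known = [rates[z] for z in zones if z in rates]
    merged = dict(record)
    # Average rate over zones with data; None marks a record with no match.
    merged["avg_accident_rate"] = sum(known) / len(known) if known else None
    # Keep track of zones the new source does not cover.
    merged["unmatched_zones"] = [z for z in zones if z not in rates]
    return merged


merged_records = [add_route_risk(p, accidents_per_1000_trips) for p in policyholders]
for row in merged_records:
    print(row["driver_id"], row["avg_accident_rate"], row["unmatched_zones"])
```

Recording the unmatched zones is one possible design choice: it makes visible where the new source and the existing records fail to line up, which is the formatting and merging complexity the text warns about.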

When selecting driver variables, a number of issues must be investigated, and here again the era of big data adds complexity to the process. Two major technical modeling issues arise when selecting and adding driver variables to an analysis: the existence of strong relationships between driver variables, and the potential of selecting a driver that is a proxy variable for your performance metric. In mining applications, strong relationships between driver variables result in collinearity, which renders the identification of a particular driver variable's impact on a performance metric unstable and of little value to the final analysis. The second issue, the selection of a proxy variable, is equally detrimental to the analysis: a model incorporating a proxy would simply report a near-perfect statistic on explained variance of the performance metric (e.g., R² of 0.99) but offer no real descriptive value to the decision maker as to the reasons for the movement in the performance metric. Table 5.4 illustrates this.

TABLE 5.4

Proxy Variables in ATM Usage

Zip Code  Location       Number of Transactions  Revenue/Fees
06550     Drive-thru     2100                    $4,200
06450     Mall           3400                    $6,800
06771     Retail outlet  6700                    $13,400
06466     In bank        1000                    $2,000
06470     Gas station    1200                    $2,400
06450     Mall           2400                    $4,800
06771     Grocery store  850                     $1,700

A model that seeks to analyze the potential drivers of ATM fees and incorporates the number of out-of-network transactions (number of transactions) as a potential driver has a major flaw. The performance metric of revenue/fees is calculated as a dollar amount charged per out-of-network transaction; it is therefore a direct function of the number of transactions variable. In Table 5.4, for example, every row's fees equal exactly $2 times its number of transactions. The model could be useful in determining the impact on fees generated by an ATM of geographic location (e.g., zip code) and a more detailed location variable (e.g., mall); however, the number of transactions variable will negate the impacts of these former variables given its proxy status for fees.
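The proxy problem can be checked directly against the Table 5.4 values. The following minimal sketch, using only the numbers in the table, computes the fee charged per transaction in each row and the R² of a one-variable least-squares fit of fees on number of transactions. The helper name `r_squared` is our own and is not taken from any mining package.

```python
# Table 5.4 values: (number of transactions, revenue/fees in dollars).
atm_rows = [
    (2100, 4200), (3400, 6800), (6700, 13400), (1000, 2000),
    (1200, 2400), (2400, 4800), (850, 1700),
]


def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For simple linear regression, R^2 is the squared correlation.
    return (sxy * sxy) / (sxx * syy)


transactions = [t for t, _ in atm_rows]
fees = [f for _, f in atm_rows]

# Every row yields the same ratio, so fees are a fixed multiple of transactions.
fee_per_transaction = {f / t for t, f in atm_rows}
print(fee_per_transaction)

# The fit explains essentially all of the variance, yet tells us nothing new.
print(round(r_squared(transactions, fees), 6))
```

A perfect fit here says nothing about why fees move; it only restates how fees are calculated, which is why the variable should be excluded from the driver set.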

Analysts can perform a quick check to help safeguard against including variables in an analysis that are too closely related. Most mining software applications enable analysts to conduct a correlation analysis between candidate driver variables; pairs with extremely high correlation coefficients should be scrutinized more closely as potential causes of unstable model results.
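A quick check of this kind might look like the sketch below, which computes pairwise Pearson correlations and flags pairs above a cutoff. The variable names, the sample values, and the 0.9 threshold are all assumptions chosen for illustration; the text does not prescribe a specific cutoff, and real mining software would supply its own correlation routine.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def flag_correlated_pairs(drivers, threshold=0.9):
    """List driver pairs whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(drivers.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical candidate drivers (invented values, six observations each).
candidate_drivers = {
    "staff_on_duty": [4, 6, 8, 10, 12, 14],
    "open_counters": [2, 3, 4, 5, 6, 8],
    "parking_spaces": [30, 12, 25, 8, 40, 15],
}

flagged_pairs = flag_correlated_pairs(candidate_drivers)
print(flagged_pairs)  # only the staff_on_duty / open_counters pair is flagged
```

A flagged pair is a prompt for scrutiny rather than an automatic exclusion; which of the two variables to keep is a question for the analyst and the process expert.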

In the driver variable selection process, the inclusion of process experts is once again imperative. Analysts need to consult subject matter experts when seeking to identify the most relevant drivers that affect the performance metrics. These experts not only possess inherent insight into the scope of variables that describe targets, but may also know where many of the data variables reside. In fact, the deliberation between the analyst and process expert may result in the creation of new variables.

The healthcare industry is increasing its emphasis on better understanding what drives patient satisfaction. With adequate data resources, data mining methods could easily be incorporated to provide explanatory information on this topic. The term adequate is the key here. Only when process experts and analysts deliberate over the inputs to a model does adequacy become defined. Robust models that depict the factors that impact a patient's satisfaction require the input of subject matter experts in a healthcare facility who know the subtleties of the processes that affect the patient experience. Deliberations between analysts, healthcare service experts, and data professionals are essential to identifying the available and relevant data that make up the model.

• Decide over what time frame you wish to analyze data (over the past month, quarter, year).

• Decide which area of the hospital is to be examined (ER, ICU, LND, cardio, etc.).

• Query data repositories for variables that address the question at hand, such as
    • Attending physician
    • Diagnostic category of patient
    • Staffing of nurses (particular attending nurses and amount staffed)
    • Length of stay for the patient
    • Room style
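The steps above can be sketched as a simple filter over patient-stay records: pick a time frame, pick a hospital area, and pull the candidate driver variables. The records, field names, and the `select_stays` helper below are hypothetical and invented purely for illustration; an actual data repository would have its own schema and query interface.

```python
from datetime import date

# Hypothetical patient-stay records; all fields and values are invented.
stays = [
    {"discharged": date(2013, 1, 14), "area": "ER", "attending": "Physician A",
     "diagnostic_category": "cardiac", "nurses_staffed": 5,
     "length_of_stay_days": 1, "room_style": "shared", "satisfaction": 3},
    {"discharged": date(2013, 2, 3), "area": "ICU", "attending": "Physician B",
     "diagnostic_category": "respiratory", "nurses_staffed": 8,
     "length_of_stay_days": 6, "room_style": "private", "satisfaction": 4},
    {"discharged": date(2013, 2, 20), "area": "ER", "attending": "Physician C",
     "diagnostic_category": "orthopedic", "nurses_staffed": 4,
     "length_of_stay_days": 2, "room_style": "shared", "satisfaction": 2},
    {"discharged": date(2013, 5, 9), "area": "ER", "attending": "Physician A",
     "diagnostic_category": "cardiac", "nurses_staffed": 6,
     "length_of_stay_days": 1, "room_style": "private", "satisfaction": 5},
]

# Candidate driver variables from the list above, plus the target metric.
DRIVER_FIELDS = ["attending", "diagnostic_category", "nurses_staffed",
                 "length_of_stay_days", "room_style"]
TARGET_FIELD = "satisfaction"


def select_stays(records, start, end, area):
    """Keep stays in the time frame and area, with only the model fields."""
    wanted = DRIVER_FIELDS + [TARGET_FIELD]
    return [
        {field: r[field] for field in wanted}
        for r in records
        if start <= r["discharged"] <= end and r["area"] == area
    ]


# Example: ER stays discharged in the first quarter of 2013.
er_q1 = select_stays(stays, date(2013, 1, 1), date(2013, 3, 31), "ER")
for row in er_q1:
    print(row)
```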

Often, when defining the important variables for an analysis, reality reveals a lack of important data resources. At this point, if a key driver is not available, it needs to be generated.
