
Binary Logistic Regression Modeling

From the document Data Mining Using (pages 172-177)

Supervised Learning Methods: Prediction

5.4 Binary Logistic Regression Modeling

Logistic regression is a powerful modeling technique used extensively in data mining applications. It allows more advanced analyses than the chi-square test, which tests for the independence of two categorical variables.

It allows analysts to use binary responses (yes/no, true/false) and both continuous and categorical predictors in modeling. Logistic regression does allow an ordinal variable (e.g., a rank order of the severity of injury from 0 to 4) as the response variable, but only binary logistic regression (BLR) is discussed in this book. BLR allows construction of more complex models than simple linear models, so interactions among the continuous and categorical predictors can also be explored.

3456_Book.book Page 163 Wednesday, November 20, 2002 11:34 AM

Binary logistic regression uses maximum-likelihood estimation (MLE) after converting the binary response into a logit value (the natural log of the odds of the response occurring or not) and estimates the probability of a given event occurring. MLE relies on large-sample asymptotic properties, which means that the reliability of the estimates declines when only a few cases are available for each observed combination of X variables. In BLR, changes in the log odds of the response, not changes in the response itself, are modeled. BLR does not assume a linear relationship between the predictors and the response, and the residuals need not be normal or homoscedastic. The success of the BLR can be evaluated by investigating the classification table, which shows correct and incorrect classifications of the binary response. Goodness-of-fit tests such as the model chi-square are also available as indicators of model appropriateness, as is the Wald statistic for testing the significance of individual parameters.
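As a generic sketch (not code from the book) of the logit transformation described above, the conversion between a probability and its log odds can be written in a few lines of Python:

```python
import math

def logit(p):
    """Natural log of the odds, log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Convert a logit back into a probability."""
    return 1 / (1 + math.exp(-x))

# A probability of 0.8 corresponds to odds of 4 and a logit of log(4) ~ 1.386
x = logit(0.8)
assert abs(inv_logit(x) - 0.8) < 1e-12
```

BLR models the left-hand side, `logit(p)`, as a linear function of the predictors; `inv_logit` recovers the estimated probability of the event.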

The BLR model assumes the following:

Inclusion of all relevant variables in the regression model
No multicollinearity among the continuous predictor variables
Independent error terms
Predictor variables measured without error
No overdispersion

The statistical theory, methods, and computational aspects of BLR are presented in detail elsewhere.19–21

5.4.1 Terminology and Key Concepts

Probability. Probability is the chance of an occurrence of an event.

A probability is the frequency of one category divided by the total; note that the probabilities always sum to 1. The odds are the ratio of the two probabilities of the binary event. Thus, if there is a 25% chance of rain, then the odds of rain are 0.25/0.75 = 1/3.
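The rain example can be checked directly (the numbers are the illustrative ones from the text):

```python
def odds(p):
    """Odds of an event that occurs with probability p."""
    return p / (1 - p)

# 25% chance of rain -> odds of rain = 0.25 / 0.75 = 1/3
assert abs(odds(0.25) - 1 / 3) < 1e-12
```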

Odds ratio.22,23 Odds and probability describe how often something happens relative to its opposite happening. Odds can range from zero to plus infinity, with odds of 1 indicating neutrality, or no difference. Briefly, an odds ratio is the ratio of two odds, and relative risk is the ratio of two probabilities.

The odds ratio is the ratio of two odds and is a comparative measure (effect size) between two levels of a categorical variable or for a unit change in a continuous variable. An odds ratio of 1.0 indicates that the two variables are statistically independent. In the odds ratio of winter to summer, the odds for winter are the numerator and the odds for summer are the denominator, so the ratio describes the change in the odds of rain from summer months to winter months. In this case, the odds ratio is a measure of the strength and direction of the relationship between rain and season. If the 95% confidence interval of the odds ratio includes the value 1.0, the predictor is not considered significant. Odds ratios can be computed for both categorical and continuous data. Note also that odds ratios for negative effects are confined to the interval from 0 to just below 1, whereas for positive effects they range from just above 1 to infinity. With these measures,


one must be very careful when coding the values of the response and the predictor variables. Reversing the coding of the binary response inverts the interpretation. The interpretation of the odds ratio is valid only when a unit change in the predictor variable is meaningful. If a predictor variable is involved in a significant quadratic relationship or interacts with other predictors, then the interpretation of its odds ratio is not valid.
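For a 2 x 2 table of a binary predictor against a binary response, the odds ratio is a one-line computation. The counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table:
              event  no event
    group 1     a        b
    group 2     c        d
    """
    return (a / b) / (c / d)

# hypothetical counts: 30/70 rainy vs. dry days in winter, 10/90 in summer
or_winter_vs_summer = odds_ratio(30, 70, 10, 90)   # (3/7) / (1/9) ~ 3.86
```

A value near 1 would indicate no association between season and rain; here the odds of rain are almost four times higher in winter.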

Logits. Logits are used in the BLR equation to estimate (predict) the log odds that the response equals 1 and contain exactly the same information as odds ratios. They range from minus infinity to plus infinity but, because they are logarithms, usually fall between –5 and +5, even for very rare occurrences. Unlike the odds ratio, the logit is symmetrical around zero and can therefore be compared more easily. A positive logit means that, when that predictor variable increases, the odds that the response equals 1 increase; a negative logit means that, when the predictor variable increases, the odds that the response equals 1 decrease. A logit can be converted easily into an odds ratio simply by applying the exponential function (raising the natural log base e to the b1 power). For instance, if the logit b1 = 2.303, then its odds ratio (e^b1) is about 10, and we may say that when the predictor variable increases by one unit the odds that the response event = 1 increase by a factor of 10, other variables being controlled.
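The factor-of-10 example follows directly from exponentiating the logit coefficient:

```python
import math

b1 = 2.303                  # logit coefficient from the example above
ratio = math.exp(b1)        # odds multiply by ~10 per unit increase in X
assert abs(ratio - 10) < 0.01
```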

Percent increase in odds. Once the logit has been transformed back into an odds ratio, it may be expressed as a percent increase in odds. Let the logit coefficient for “current asset/net sales” be 1.52, where the response variable is bankruptcy. The odds ratio corresponding to a logit of +1.52 is approximately 4.57 (e^1.52). Therefore, we can conclude that for each additional unit increase in “current asset/net sales” the odds of bankruptcy increase by 357% — (4.57 – 1) × 100% — while controlling for the other inputs in the model. Saying that the probability of bankruptcy increases by 357%, however, would be incorrect.
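The percent-increase calculation for the bankruptcy example works out as:

```python
import math

b1 = 1.52                          # logit for "current asset/net sales"
pct = (math.exp(b1) - 1) * 100     # percent increase in the odds of bankruptcy
# pct ~ 357: each unit increase raises the odds of bankruptcy by about 357%
```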

Standardized logit coefficients. Standardized logit coefficients correspond to the b (standardized regression) coefficients of OLS regression and may be used to compare the relative importance of the predictor variables. Odds ratios are preferred for this purpose, however: standardized logit coefficients measure a predictor's relative importance in terms of its effect on the log odds of the response, which is less intuitive than the actual odds of the response measured by odds ratios.


5.4.1.1 Testing the Model Fit24,25

Wald statistic. The Wald statistic is used to test the significance of individual logistic regression coefficients, i.e., the null hypothesis that a particular logit (effect) coefficient is zero. It is the ratio of the unstandardized logit coefficient to its standard error.
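A minimal sketch of the Wald test, using a hypothetical coefficient and standard error (the two-sided normal p value is obtained from the complementary error function):

```python
import math

def wald_test(b, se):
    """Wald z statistic for H0: coefficient = 0, with a two-sided
    p value from the standard normal distribution."""
    z = b / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# hypothetical coefficient and standard error
z, p = wald_test(1.52, 0.60)   # z ~ 2.53, p ~ 0.011: significant at 5%
```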

Log-likelihood ratio tests. Log-likelihood ratio tests are an alternative to the Wald statistic. If the log-likelihood test statistic is significant, the Wald statistic can be ignored. Log-likelihood tests are also useful in model selection: models are run with and without the variable in question, for instance, and the difference in –2 log likelihood (–2LL) between the two models is assessed by the chi-square statistic, with degrees of freedom equal to the difference in the number of parameters between the two models.
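For the common case of dropping a single parameter (1 degree of freedom), the test can be sketched as follows; the –2LL values are hypothetical, and the df = 1 chi-square tail probability is computed with the identity sf(x) = erfc(sqrt(x/2)):

```python
import math

def lr_test_df1(neg2ll_reduced, neg2ll_full):
    """Likelihood-ratio chi-square for dropping one parameter (df = 1)."""
    chi2 = neg2ll_reduced - neg2ll_full
    p = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function, df = 1
    return chi2, p

# hypothetical -2LL values for the models without and with the variable
chi2, p = lr_test_df1(110.4, 104.2)   # chi2 = 6.2, p ~ 0.013
```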

Deviance. Because –2LL has approximately a chi-square distribution, it can be used to assess the significance of a logistic regression, analogous to the use of the sum of squared errors in ordinary least-squares (OLS) regression. The –2LL statistic is the scaled deviance statistic for logistic regression. Deviance measures the error remaining after all the predictors are included in the model; it thus describes the unexplained variation in the response. The deviance of the null model describes the error when only the intercept is included in the model — that is, –2LL for the model that accepts the null hypothesis that all the b coefficients are 0.
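The null-model deviance has a closed form, since the intercept-only model assigns every case the overall event proportion; a small illustration with made-up counts:

```python
import math

def null_deviance(n_events, n_total):
    """-2 log likelihood of the intercept-only (null) model, in which
    every case is assigned the overall event proportion."""
    p = n_events / n_total
    ll = n_events * math.log(p) + (n_total - n_events) * math.log(1 - p)
    return -2 * ll

d0 = null_deviance(30, 100)   # ~122.17 for 30 events in 100 cases
```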

5.4.1.2 Assessing the Model Fit

Hosmer and Lemeshow’s goodness-of-fit test.26,27 This test divides subjects into deciles based on their predicted probabilities and then computes a chi-square statistic from the observed and expected frequencies. A p value is then computed from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model. If the p value of the Hosmer–Lemeshow (H–L) test is 0.05 or less, we reject the null hypothesis that no difference exists between the observed and model-predicted values of the response. If the p value is greater than 0.05, we fail to reject that null hypothesis, implying that the model’s estimates fit the data at an acceptable level. This does not mean that the model necessarily explains much of the variation in the response, only that however much or little it does explain is consistent with the data. As with other tests, the power of the H–L test to detect departures from the null hypothesis improves as the sample size gets larger.
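The grouping-and-comparison step of the H–L test can be sketched as below; this is a simplified illustration, not the exact procedure of any particular package:

```python
def hosmer_lemeshow(probs, outcomes, groups=10):
    """H-L chi-square: sort cases by predicted probability, split them
    into groups (deciles by default), and compare observed with
    expected event counts in each group."""
    pairs = sorted(zip(probs, outcomes))
    n, chi2 = len(pairs), 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        obs = sum(y for _, y in chunk)        # observed events in the group
        exp = sum(p for p, _ in chunk)        # expected events in the group
        pbar = exp / len(chunk)
        if 0 < pbar < 1:
            chi2 += (obs - exp) ** 2 / (len(chunk) * pbar * (1 - pbar))
    return chi2   # refer to a chi-square distribution with groups - 2 df
```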


Brier score.27 This is a unitless measure of predictive accuracy computed from the classification table based on a cutpoint probability of 0.5. It ranges from 0 to 1; the smaller the score, the better the predictive ability of the model. The Brier score is useful in model selection and in assessing model validity on an independent validation dataset.
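In its most common form, the Brier score is the mean squared difference between the predicted probabilities and the observed 0/1 outcomes; a minimal version of that form (data hypothetical):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between the predicted probability and
    the observed 0/1 outcome; smaller is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

brier_score([0.9, 0.8, 0.2], [1, 1, 0])   # small score: good predictions
```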

Adjusted generalized coefficient of determination (R2). This is a model-assessment statistic similar to the R2 of OLS regression, and it can reach a maximum value of 1. The statistic is computed from the ratio between the –2LL statistics of the null model and the full model, adjusted for the sample size.27,28
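One common realization of this statistic is the Cox–Snell generalized R2 together with its rescaled (Nagelkerke) version whose maximum is 1; the text does not give the formula, so the sketch below assumes that form:

```python
import math

def generalized_r2(neg2ll_null, neg2ll_full, n):
    """Cox-Snell generalized R2 and its adjustment (Nagelkerke) that
    rescales the statistic so that its maximum is 1."""
    r2 = 1 - math.exp((neg2ll_full - neg2ll_null) / n)
    r2_max = 1 - math.exp(-neg2ll_null / n)
    return r2, r2 / r2_max

r2, r2_adj = generalized_r2(122.17, 100.0, 100)   # hypothetical -2LL values
```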

c statistic and ROC curve. The c statistic and the receiver operating characteristic (ROC) curve measure the classification power of the logistic equation. The area under the ROC curve varies from 0.5 (the predictions of the model are no better than chance) to 1.0 (the model always assigns higher probabilities to correct cases than to incorrect cases). The c statistic is the percentage of all possible pairs of cases in which the model assigns a higher probability to a correct case than to an incorrect case.29 The ROC curve is a graphical display of the predictive accuracy of the logistic model, constructed by plotting the sensitivity (a measure of accuracy in predicting events) against 1 – specificity (a measure of error in predicting nonevents).29,30 The area under the ROC curve equals the c statistic. For a model with high predictive accuracy, the ROC curve rises quickly and the area under it is large (see Figure 5.36 [top] for an example of a ROC curve). An overlay plot of the percentages of false positives and false negatives vs. the cutpoint probability can reveal the optimum cutpoint probability, at which both false positives and false negatives are minimized (see Figure 5.36 [bottom] for an example of such an overlay plot).
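The pairwise definition of the c statistic translates directly into code; this sketch counts, over all event/non-event pairs, how often the event case receives the higher predicted probability (ties counted as half, a common convention):

```python
def c_statistic(probs, outcomes):
    """Fraction of event/non-event pairs in which the event case gets
    the higher predicted probability (ties count as half)."""
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    nonevents = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = sum(
        1.0 if pe > pn else 0.5 if pe == pn else 0.0
        for pe in events for pn in nonevents
    )
    return concordant / (len(events) * len(nonevents))

c_statistic([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])   # perfect separation: 1.0
```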

5.4.2 Exploratory Analysis Using Diagnostic Plots

Simple logit plots are very useful for exploring the relationship between a binary response and a single continuous predictor in a BLR with one predictor variable, but these plots are not effective at revealing the complex relationships among the predictor variables or data problems in a BLR with many predictors. The partial delta logit plots proposed here, however, are useful for detecting significant predictors, nonlinearity, and multicollinearity. The partial delta logit plot illustrates the effect of a given continuous predictor variable, after adjusting for all


other predictor variables, on the change in the logit estimate when the variable in question is dropped from the BLR. By overlaying the simple logit and partial delta logit plots, many features of the BLR can be revealed. The mechanics of these two logit plots are described below using a two-variable BLR model.

1. Determine a simple logit model for the binary response using the predictor variable X1.
