
4 Process Simulation

The general model described in Sect. 3 is, due to its complexity, not suitable for drawing inferences about important characteristics of the inspection process. We have therefore developed a large computer simulation program that can be used for the Monte Carlo evaluation of many characteristics which are important to practitioners. In the current version the program implements the model described in Sects. 3 and 4 for only four predictors.

Moreover, this implementation is somewhat restricted, as some of its important features are specific to the case when reliability characteristics of produced items have to be predicted in the inspection process.

The simulation program consists of three modules. The first module generates a stream of data points, i.e. the values of the predictors, the value of the unobserved output variable, and the class associated with this value. The second module builds the classification (prediction) algorithms and computes the predicted class of the output variable using the values of the predictors as its input. The third module simulates the inspection process itself, i.e. the building and operation of a control chart.

In the first module of the simulation program the probability distributions of the predictors, defined by the user on the first level of the model, can be chosen from a set of five distributions: uniform, normal, exponential, Weibull, and log-normal. Each of these distributions is represented either as a single two-parameter distribution or as a mixture of two such distributions. The second option enables the user to model more complex distributions, e.g. bimodal ones. For the second level of the model the user can choose the probability distributions of the hidden variables from a set of distributions typically used for the description of reliability data (i.e., defined on the positive part of the real line): exponential, Weibull, and log-normal. The dependence between pairs of predictors, and between predictors and their associated hidden variables, can be described by the following copulas: independent, normal, Clayton, Gumbel, Frank, and Farlie-Gumbel-Morgenstern (FGM).
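As an illustration of the mixture option, the short Python sketch below draws a predictor from a two-component normal mixture; the mixing weight and the component parameters are made-up values, not the settings used in the actual program.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, p=0.4, loc1=10.0, scale1=1.0, loc2=20.0, scale2=2.0):
    """Draw n values of a predictor from a two-component normal mixture.

    The mixing weight p and the component parameters are illustrative only;
    in the simulation program they are chosen by the user.
    """
    pick_first = rng.random(n) < p
    return np.where(pick_first,
                    rng.normal(loc1, scale1, n),
                    rng.normal(loc2, scale2, n))

x = sample_mixture(1000)   # a bimodal predictor sample
```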

A detailed description of these copulas can be found, e.g., in [7]. The strength of this dependence is defined by the value of Kendall's coefficient of association τ. The expected values of the distributions of the hidden variables in this simulation model depend in a linear way on the expected values of their related predictors. Finally, on the third level, the hidden random variables are transformed into the final random variable T. The relation between the hidden variables and T is strongly non-linear and is described by operators of a "min-max" type.
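The sketch below (our own illustration, not code from the simulation program) shows one standard way to couple two variables through a Clayton copula parameterized by Kendall's τ; the marginals and the value τ = 0.8 are chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def clayton_pair(n, tau):
    """Draw n pairs of uniforms (u, v) linked by a Clayton copula with Kendall's tau > 0."""
    theta = 2.0 * tau / (1.0 - tau)        # for the Clayton copula: tau = theta / (theta + 2)
    u = rng.random(n)
    w = rng.random(n)                      # auxiliary uniforms for conditional sampling
    v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v

# Example: a normally distributed variable coupled with an exponential one, tau = 0.8.
u, v = clayton_pair(10_000, tau=0.8)
x1 = stats.norm.ppf(u, loc=10.0, scale=2.0)
x2 = stats.expon.ppf(v, scale=5.0)
print(stats.kendalltau(x1, x2)[0])         # close to the requested 0.8
```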

Several types of classifiers (several linear regression models, two Linear Discriminant Analysis procedures, and one decision tree classifier) have been implemented in the second module of the simulation program. The classifiers are built using samples of size nt of training data consisting of the vectors of the values of the predictors (x1, x2, x3, x4) and the actual value of the assigned class. In this paper we consider only three of them, representing three different general approaches to the classification problem.

The first considered classifier is a simple binary linear regression. We label the considered classes by 0 and 1, respectively, and consider these labels as real numbers, treating them as observations of a real dependent variable in the linear regression model of the following form:

R = w1X1 + w2X2 + w3X3 + w4X4 + w0,    (7)

where R is the predicted class of an item described by the explanatory variables X1, X2, X3, X4, and w1, w2, w3, w4, w0 are the respective coefficients of the regression equation estimated from a training set of nt elements. The value of R estimated from (7) is a real number, so an additional rule is needed for the final classification (e.g. if R < 0.5 an item is classified as belonging to class 0, and to class 1 otherwise). The only advantage of this naive method is its simplicity. In the problem of binary classification logistic regression is definitely a better choice, but classical linear regression is implemented in all spreadsheets, such as, e.g., MS Excel. For this reason we have chosen this classifier as the easiest to implement in practice.
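A minimal sketch of this classifier (assuming the 0/1 coding described above) could look as follows; the threshold of 0.5 matches the rule given in the text.

```python
import numpy as np

def fit_regression_classifier(X, labels):
    """Least-squares fit of Eq. (7): the 0/1 labels are regressed on the four predictors.
    X has shape (nt, 4); a column of ones is appended to estimate the intercept w0."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, labels.astype(float), rcond=None)
    return w                                    # (w1, w2, w3, w4, w0)

def predict_regression(w, X, threshold=0.5):
    """Class 0 when the fitted value R is below the threshold, class 1 otherwise."""
    r = X @ w[:4] + w[4]
    return (r >= threshold).astype(int)
```

In a spreadsheet the same computation reduces to a single regression fit followed by an IF formula applied to the fitted value.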

The next classifier implements the algorithm of Linear Discriminant Analysis (LDA) introduced by Fisher and described in many textbooks on multivariate statistical analysis and data mining (see, e.g., [3]). In this method the statistical data are projected onto a certain hyperplane estimated from the training data. New data points that are closer to the mean value of the projected training data representing class 0 than to the mean value of the projected training data representing class 1 are classified to class 0. Otherwise, they are classified to class 1. The equation of the hyperplane is given by the following formula:

L = y1X1 + y2X2 + y3X3 + y4X4 + y0,    (8)

where L is the value of the transformed data point calculated using the values of the explanatory variables X1, X2, X3, X4, and y1, y2, y3, y4, y0 are the respective coefficients of the LDA equation estimated from a training set of nt elements. If ZL denotes the decision point, a new item is classified to class 0 if L ≤ ZL, and to class 1 otherwise. In our simulation we implemented the classical method of calculating ZL: this point is just the average of the mean values of the transformed training data points belonging to class 0 and class 1, respectively.
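A compact sketch of this procedure, with the intercept y0 folded into the decision point ZL (an implementation choice on our part, not necessarily the one used in the program), is given below.

```python
import numpy as np

def fit_fisher_lda(X, labels):
    """Fisher's two-class LDA for Eq. (8): returns the projection vector y and
    the decision point ZL taken as the midpoint of the projected class means."""
    X0, X1 = X[labels == 0], X[labels == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled within-class scatter matrix
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    y = np.linalg.solve(Sw, m1 - m0)                  # discriminant direction
    zl = 0.5 * ((X0 @ y).mean() + (X1 @ y).mean())    # midpoint decision point ZL
    return y, zl

def predict_lda(y, zl, X):
    """Class 0 when the projected value L does not exceed ZL, class 1 otherwise."""
    return (X @ y > zl).astype(int)
```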

The calculation of the LDA equation (8) is not so simple. However, it can be done using basic versions of many statistical packages such as SPSS, STATISTICA, etc.

Therefore, we have implemented this classifier in order to represent methods available in basic versions of general-purpose professional statistical packages.

The third considered classifier is based on the implementation of one of the most popular data mining classification algorithms, namely the classification decision tree (CDT) algorithm C4.5 introduced by [10] and described in many textbooks on data mining, such as, e.g., [11]. In our simulations we used its version (known as J48) implemented in the WEKA software, available from the University of Waikato, Hamilton, New Zealand, under the GNU license. The decision tree is constructed from "IF..THEN..ELSE" rules deduced from the training data. In this paper, for the description of the CDT classifier, we use the notation of the MS Excel function IF(lt, t, f), where lt is a logical condition (e.g. C < 50), t is the action when lt = true, and f is the action when lt = false. The actions t and f can be defined as combinations of other IF functions or, finally, as the assignments of classes to the considered items.
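For readers less used to this notation, the hypothetical rule IF(X3<=70;1;IF(X4<=16;2;1)) (thresholds invented for illustration, not taken from Table 3) is equivalent to the following ordinary nested conditionals.

```python
def classify(x1, x2, x3, x4):
    """Equivalent of the (hypothetical) rule IF(X3<=70;1;IF(X4<=16;2;1))."""
    if x3 <= 70:
        return 1
    if x4 <= 16:
        return 2
    return 1
```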

The third module of the program is dedicated to the simulation of a production process. The simulated process is described by two mathematical models. The first model describes the process in the so-called Phase I, when it is considered to be under control. The second model represents a process that has deteriorated and is considered to be out of control. Each simulation run begins with the simulation of a sample of nd items which are used for the estimation of the parameters of the inspection procedure (Shewhart control charts or MAV control charts). Separate control charts are designed for the actual and the predicted (using different classifiers) data. Next, n − 1 items (we call them historical data) are simulated using the first model in order to represent the process before its possible transition to the out-of-control state. This additional sample is needed if we want to simulate the MAV control chart.
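The paper does not spell out the chart-design formulas; the sketch below assumes the classical three-sigma limits of a p-chart estimated from the Phase I sample, which is one natural choice.

```python
import numpy as np

def p_chart_limits(phase1_classes, n):
    """Estimate the centre line and control limits of a p-chart from Phase I data.

    phase1_classes : 0/1 values (1 = nonconforming) for the nd Phase I items.
    n              : size of the samples whose fraction nonconforming is plotted.
    """
    p_bar = float(np.mean(phase1_classes))
    sigma = np.sqrt(p_bar * (1.0 - p_bar) / n)
    ucl = p_bar + 3.0 * sigma
    lcl = max(0.0, p_bar - 3.0 * sigma)
    return p_bar, lcl, ucl
```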

After the historical data have been generated we start to simulate consecutive items using either the first model (for the analysis of the process' behavior when it is under control) or the second model (for the analysis of the process' behavior when it is out of control). The first item generated in this way forms, together with the historical data, the first sample whose fraction of nonconforming items (actual and predicted using the different classifiers) is compared with the control lines of the respective control charts. If the value of one (or more) of these fractions falls beyond the control limit of the respective control chart, an alarm signal is generated. Consecutive items are generated until the moment when alarm signals have been observed on all considered control charts. Note that for the MAV chart decisions are taken after the generation of each consecutive item. In contrast, for the Shewhart chart decisions are taken after the generation of consecutive segments of n items.
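A sketch of how such a run-until-signal loop for the MAV chart might look (our own simplified version, one-sided and for a single chart only) is shown below.

```python
from collections import deque

def time_to_signal(generate_class, history, n, ucl, max_items=100_000):
    """Count the items produced until the MAV chart signals.

    generate_class : callable returning 0/1 for a freshly produced item
                     (its actual or predicted class).
    history        : the n - 1 historical classes generated from the in-control model.
    n              : window length of the moving-average (MAV) chart.
    ucl            : upper control limit for the fraction nonconforming.
    """
    window = deque(history, maxlen=n)        # keeps only the n most recent classes
    for t in range(1, max_items + 1):
        window.append(generate_class())
        if sum(window) / n > ucl:
            return t                          # alarm after t inspected items
    return max_items                          # no signal within the horizon
```

For the Shewhart chart the same loop would instead accumulate items into disjoint segments of n items and test each segment's fraction nonconforming.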

In order to evaluate the statistical properties of the simulated inspection processes we have to repeat the simulation process NR times. In the context of 100 % inspection the most important characteristic of the inspection process is its Time to Signal (TS).

By the Time to Signal we understand the number of inspected items between the moment of process deterioration and the alarm signal. The data obtained from NR simulation runs allow us to analyze the probability distribution of TS and to compute estimates of important characteristics such as its average value (ATS), median, standard deviation, and skewness.
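Given the TS values collected from the NR runs, these summary characteristics can be computed directly, e.g. as below (the numbers are placeholders, used only to make the snippet runnable).

```python
import numpy as np
from scipy import stats

# Time-to-Signal values from NR simulation runs (placeholder data).
ts = np.array([12, 7, 31, 9, 15, 22, 5, 44, 18, 11], dtype=float)

ats      = ts.mean()                  # Average Time to Signal (ATS)
med      = np.median(ts)
std_dev  = ts.std(ddof=1)             # sample standard deviation
skewness = stats.skew(ts, bias=False)
print(ats, med, std_dev, skewness)
```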

The simulation program described above can be run using different settings of parameters. In the paper by [5] only one particular setting was used. We have decided to continue this work using the same setting of parameters, which allows us not to repeat some interesting findings that have already been described in [5]. In the model used in the simulation experiments described in this paper the random predictor X1 is distributed according to the normal distribution, X2 has the exponential distribution, X3 is distributed according to the log-normal distribution, and X4 has the Weibull distribution. The dependence between X1 and X2 is described by the Clayton copula with τ = 0.8. The joint distribution of X2 and X3 is described by the normal copula with τ = −0.8 (notice that this is a bivariate "normal" distribution, but with non-normal marginals!), and the joint distribution of X3 and X4 is described by the Frank copula with τ = 0.8. The hidden variable HX1 is described by the log-normal distribution, and its joint probability distribution with X1 is described by the normal copula with τ = −0.8. The joint distribution of HX2 and X2 is described by the Frank copula with τ = 0.9, and the marginal distribution of HX2 is assumed to be exponential. The joint model of HX3 and X3 is similar, but the copula describing the dependence in this case is the Gumbel copula. Finally, the hidden variable HX4 is described by the Weibull distribution, and its joint probability distribution with X4 is described by the Clayton copula with τ = −0.8. The random variable T that describes the life time is defined as T = min[max(HX1, HX2), min(HX3, HX4)]. The parameters of the aforementioned distributions have been found experimentally in such a way that the items belonging to class 1 have values of T smaller than 5, and those belonging to class 0 otherwise. Moreover, the relation between the predictors X1, X2, X3, X4 and their hidden counterparts HX1, HX2, HX3, HX4 is such that a shift in the expected value of each observed variable, measured in terms of its standard deviation, results in a similar shift of the expected value of its hidden counterpart, measured in terms of its own standard deviation.
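The third-level transformation and the class assignment used in this setting can be written compactly; the sketch below follows the formula for T and the threshold of 5 given above.

```python
import numpy as np

def lifetime_and_class(hx1, hx2, hx3, hx4, threshold=5.0):
    """Combine the hidden variables into T = min[max(HX1, HX2), min(HX3, HX4)]
    and assign class 1 when T < threshold (here 5), class 0 otherwise."""
    t = np.minimum(np.maximum(hx1, hx2), np.minimum(hx3, hx4))
    return t, np.where(t < threshold, 1, 0)
```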

Theoretically, classifiers can be built using training data for each production run.

However, the collection of training data is usually very costly (it requires, e.g., making long-lasting destructive tests), and the same classifier can be used for several production runs. In our experiment we have decided to use ten sets of classifiers.

Each of these sets consisted of classifiers designed using the same training data set of nt = 100 elements. Note that in the data mining community this size of training data is considered small, and usually insufficient. However, in production reality this is often the upper limit for the number of elements which can be used for the experimental building of the prediction model.

The coefficients of the regression equation (7) are presented for the ten considered data sets in Table 1. The coefficients indicated by the regression statistical tool as statistically non-significant are printed in this table in italics.

However, we have to remember that the calculation of the significance of regression coefficients assumes that the observed data are distributed according to the normal probability distribution. In our case this is obviously not true, so in our classification experiment we have used the full regression equations.

Even a first look at Table 1 reveals that the estimated regression equation (7) may be completely different depending on the chosen training data set. However, some general pattern is visible: only the explanatory variable X2 (represented by the coefficient w2) is significant for all regression lines. On the other hand, the explanatory variable X3 (represented by the coefficient w3) seems to be of no practical importance in the classification process.

A similar comparison of the decision model parameters in the LDA case is presented in Table 2. In this case we cannot speak of the statistical significance of the parameters of the decision rule. However, the general impression is the same as in the case of the regression algorithm: the particular models look completely different depending on the training data set, yet in all the cases the explanatory variable X3 seems to have no effect (very low values of the coefficient describing this variable) on the classification.


Table 1 Regression model—different sets of training data

Dataset   w1      w2      w3        w4       w0
Set 1     0.133   0.034   0.0001    0.026    3.036
Set 2     0.010   0.039   0.0001    0.033    2.321
Set 3     0.144   0.033   <0.0001   0.0002   1.359
Set 4     0.194   0.034   0.0005    0.019    3.396
Set 5     0.346   0.021   0.0006    0.002    3.763
Set 6     0.073   0.047   <0.0001   0.023    2.058
Set 7     0.142   0.040   0.0004    0.0008   3.019
Set 8     0.109   0.038   <0.0001   0.001    1.611
Set 9     0.346   0.039   0.0002    0.034    4.054
Set 10    0.002   0.018   0.0005    0.042    1.745

Table 2 Linear discriminant analysis—different sets of training data

Dataset   y1      y2      y3        y4      y0       Midpoint
Set 1     0.687   0.174   0.001     0.133   6.338    0.628
Set 2     0.045   0.178   0.001     0.151   2.710    0.014
Set 3     0.646   0.148   <0.0005   0.001   1.663    0.464
Set 4     0.880   0.152   0.002     0.087   7.499    0.585
Set 5     1.500   0.091   0.003     0.008   9.121    0.254
Set 6     0.342   0.219   <0.0005   0.107   1.399    0.706
Set 7     0.703   0.196   0.002     0.037   6.044    0.784
Set 8     0.501   0.173   <0.0005   0.006   0.636    0.629
Set 9     1.458   0.127   0.001     0.143   10.048   0.344
Set 10    0.008   0.087   0.002     0.206   0.272    0.771


Finally, let us consider the different decision rules estimated for the CDT algorithm.

Because of the completely different structure of the decision rules presented in Table 3, we cannot directly compare these rules with the rules described by Eqs. (7)–(8). They also look completely different for different training data sets, but in nearly all cases (except for Set 9) the decisions are predominantly (and in two cases exclusively) based on the value of the explanatory variable X3. One can notice that the weights assigned to the explanatory variables in the CDT algorithm are nearly exactly opposite to the weights assigned in the classification models (7)–(8). In order to explain this striking difference one should take into account that both the linear regression and the LDA procedures are based on the assumptions of linear dependence and normality. Hryniewicz [5] shows an example of training data from which it is clear that in the considered model the dependence between the random variable T and the explanatory variables X3 and X4 is definitely not linear, but also non-monotonic.

Table 3 Decision trees—different sets of training data

Dataset   Decision rule
Set 1     IF(X3<=70,0181;IF(X3<=56,1124;1;IF(X3<=63,2962;2;1));IF(X4<=16,4381;2;IF(X1<=4,3509;2;1)))
Set 2     IF(X3<=56,4865;1;IF(X4<=17,3301;2;IF(X1<=4,0217;2;1)))
Set 3     IF(X3<=73,6148;IF(X3<=57,1355;1;IF(X1<=5,0876;2;IF(X4<=4,497;2;1)));IF(X4<=17,3499;2;1))
Set 4     IF(X3<=70,2191;1;IF(X4<=15,9098;2;1))
Set 5     IF(X3<=73,1584;1;(IF(X4<=17,0516;(IF(X3<=87,8921;(IF(X4<=5,0679;2;1));2));1)))
Set 6     IF(X3<=60,3912;1;IF(X4<=16,3504;2;1))
Set 7     IF(X3<=71,8184;1;2)
Set 8     IF(X3<=71,4456;1;(IF(X3<=983,0929;2;(IF(X4<=18,8213;2;1)))))
Set 9     IF(X2<=14,7339;(IF(X4<=16,7482;(IF(X4<=4,527;1;2));1));1)
Set 10    IF(X3<=60,5044;1;2)

This dependence cannot be captured by the measures of linear correlation used in the linear models (7)–(8). Moreover, this example shows that the explanatory potential of these two variables is apparently much greater than the potential of the variables X1 and X2. We will discuss this problem in the next section of this paper.

In the experiment we used a different coding of the observed classes ((2,1) instead of (0,1)), and this is reflected in the values of the coefficients presented in Tables 1 and 2 and in the description of the rules presented in Table 3. However, this change does not influence the properties of the considered classifiers.

In our simulation experiment we have simulated NR runs of the inspected process. Each of the ten sets of classifiers was used a random number of times NR,j, j = 1, ..., 10, each with the same expectation NR/10. The joint probability distribution of NR,1, ..., NR,10 was multinomial with parameter NR and with all probabilities equal to 1/10.
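Drawing the numbers of runs per classifier set from this multinomial distribution is straightforward, e.g. as below (the value of NR is assumed for illustration, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)

NR = 10_000                                     # total number of simulation runs (assumed value)
counts = rng.multinomial(NR, [0.1] * 10)        # NR,1, ..., NR,10
print(counts, counts.sum())                     # each count has expectation NR / 10
```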

5 Simulation Experiments—Selected Results