
Table 7.3. Comparison of rule extraction results with those of other methods.

Methodology                                      Rule accuracy  Complexity  Type of decision boundary
Modified RX algorithm based on MLP [115]         97.33%         3.4         Hyper-plane
Inputs transformed into discrete ones
artificially, based on IMLP [145][28]            97.33%         2.2         Hyper-rectangular
Based on RBF [214]                               80%            3.4         Hyper-rectangular
Based on RBF [212]                               100%           32.2        Hyper-rectangular
Our algorithm combining GA and RBF               97.33%         2.6         Hyper-rectangular

In summary, this section has described a rule extraction technique that combines a GA with a compact RBF classifier in order to explain and represent the concept of the data in a concise way. First, a compact RBF network is obtained by allowing for large overlaps among the clusters belonging to the same class. Next, the weights between the hidden layer and the output layer are simplified. Then, the interval for each input in the condition part of a rule is determined by a GA. Experimental results show that our rule extraction technique is simple to implement, and concise rules with high accuracy are obtained. In addition, rules extracted by our algorithm have hyper-rectangular decision boundaries, which are desirable due to their explicit perceptibility.
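To make the interval-search step concrete, the sketch below shows one way a GA could tune the premise intervals of a single rule, with each chromosome holding the lower and upper limits for all inputs. This is a minimal illustration under assumed details: the fitness function, population scheme, and the names `ga_tune_rule` and `rule_accuracy` are ours, not the implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def rule_accuracy(bounds, X, y, rule_class):
    # Fitness: fraction of samples inside the hyper-rectangle that
    # carry the rule's class label (an illustrative choice only).
    lower, upper = bounds
    covered = np.all((X >= lower) & (X <= upper), axis=1)
    return 0.0 if not covered.any() else float(np.mean(y[covered] == rule_class))

def ga_tune_rule(center, width, X, y, rule_class,
                 pop_size=30, generations=50, sigma=0.1):
    # Chromosome: a (2, n) array of lower/upper limits for all n inputs,
    # seeded around the kernel's receptive field (center +/- width).
    base = np.stack([center - width, center + width])
    pop = [base + rng.normal(0.0, sigma, base.shape) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [rule_accuracy(c, X, y, rule_class) for c in pop]
        elite = [pop[i] for i in np.argsort(scores)[-(pop_size // 2):]]
        # Keep the best half and refill the population with mutated copies.
        pop = elite + [e + rng.normal(0.0, sigma, e.shape) for e in elite]
    scores = [rule_accuracy(c, X, y, rule_class) for c in pop]
    return pop[int(np.argmax(scores))]  # best (lower, upper) limits found
```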

7.5 Rule Extraction by Gradient Descent

7.5.1 The Method

The objective of tuning the rule premises is to determine the boundaries of rules so that a high rule accuracy is obtained for the test data set. In this section, we describe an algorithm to extract rules from trained RBF neural networks using the gradient descent method, which we proposed earlier [105].

Before starting the tuning process, all of the premises of the rules must be initialized. Assume that the number of attributes is n. The number of rules equals the number of hidden neurons in the trained RBF network, and the number of premises in each rule equals n. The upper limit U_{ji} and the lower limit L_{ji} of the jth premise in the ith rule are initialized according to the trained RBF classifier as:

U_{ji}(0) = µ_{ji} + σ_i,   (7.12)
L_{ji}(0) = µ_{ji} − σ_i,   (7.13)


where µ_{ji} is the jth coordinate of the center of the ith kernel function, and σ_i is the width of the ith kernel function.
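Written out in code, the initialization of Eqs. (7.12) and (7.13) is a one-liner per limit. The sketch below assumes the trained network's centers and widths are available as NumPy arrays; the function name is ours.

```python
import numpy as np

def init_premises(centers, widths):
    # Eqs. (7.12)-(7.13): U_ji(0) = mu_ji + sigma_i, L_ji(0) = mu_ji - sigma_i.
    # centers: (m, n) kernel centers, one row per hidden neuron (rule);
    # widths:  (m,) kernel widths.
    upper = centers + widths[:, None]
    lower = centers - widths[:, None]
    return lower, upper
```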

We introduce the following notation. Suppose that η(t) is the tuning rate at time t. Initially, η(0) = 1/N_I, where N_I is the number of iteration steps for adjusting a premise. N_I is set to 20 in our experiments, i.e., the smallest changing scale in one tuning step is 0.05, which is determined empirically. E is the rule error rate. Denote

Q_{ji}(t) = ∂E/∂W_{ji}(t).

The subsequent ∆U_{ji}(t) and ∆L_{ji}(t) are calculated as follows:

∆W_{ji}(t) =
    −η(t),           if Q_{ji}(t) > 0,
    η(t),            if Q_{ji}(t) < 0,
    ∆W_{ji}(t−1),    if Q_{ji}(t) = 0 for fewer than (1/3)N_I consecutive iterations,
    −∆W_{ji}(t−1),   if Q_{ji}(t) = 0 for (1/3)N_I consecutive iterations,   (7.20)

where W = U, L. When Q_{ji}(t) = 0 consecutively for (1/3)N_I time steps, which means that the current direction of premise adjustment is fruitless, ∆W_{ji}(t) changes its sign, as shown in the fourth line of Eq. (7.20). In this situation, we also let η(t) = 1.1η(t−1), which helps to keep the search from being trapped. Otherwise, η(t) remains unchanged.
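The update scheme can be read as a sign-based search with direction reversal on plateaus. The following sketch shows one tuning step as we read Eq. (7.20); the function name, the argument layout, and the use of integer division for N_I/3 are our assumptions.

```python
def tune_step(W, dW, eta, Q, zero_run, NI=20):
    # One premise-tuning step in the spirit of Eq. (7.20).
    # W: current limit (U or L); dW: previous step; eta: tuning rate eta(t);
    # Q: Q_ji(t); zero_run: number of consecutive steps with Q = 0.
    if Q > 0:                  # error rises as W grows: step downwards
        dW = -eta
    elif Q < 0:                # error falls as W grows: step upwards
        dW = eta
    elif zero_run >= NI // 3:  # flat for NI/3 steps: reverse the direction
        dW = -dW
        eta *= 1.1             # enlarge eta to escape the flat region
    # else: Q = 0 for fewer than NI/3 steps, keep the previous step
    return W + dW, dW, eta
```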

Compared with the technique proposed by McGarry et al. [212][213][214], a higher accuracy with concise rules is obtained with this method. In [212][214], the input intervals in rules are expressed by the following equations:

X_upper = µ_i + σ_i − S,   (7.21)

X_lower = µ_i − σ_i + S.   (7.22)

Here X_upper is the upper limit of the premise of a rule, and X_lower is the lower limit. S is the feature 'steepness', which was found empirically by McGarry et al. to be about 0.6. µ_i is the n-dimensional center location of rule i, and σ_i is the width of the receptive field. We note that the empirical parameter S may vary from data set to data set.
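For comparison, Eqs. (7.21) and (7.22) reduce to a fixed shrinkage of each kernel's receptive field. A minimal helper (the name is hypothetical) makes the role of S explicit:

```python
def mcgarry_interval(mu_i, sigma_i, S=0.6):
    # Eqs. (7.21)-(7.22): a fixed shrinkage of the receptive field by the
    # empirical 'steepness' S (about 0.6 according to McGarry et al.).
    return mu_i - sigma_i + S, mu_i + sigma_i - S  # (X_lower, X_upper)
```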

Two rule-tuning stages are used in our method. In the first tuning stage, the premises of m rules (m is the number of hidden neurons of the trained RBF network) are adjusted using gradient descent to minimize the rule error rate. Some rules do not contribute to the improvement of rule accuracy, for the following reason. Training the RBF neural network separates the input data space into several subspaces. Each subspace is represented by a hidden neuron of the RBF neural network and is a hyper-ellipse, whereas the decision boundary of our rules is hyper-rectangular. Since overlaps exist between clusters of the same class (Fig. 7.8(a)), some hidden neurons may be overlapped completely when hyper-rectangular rules are formed by gradient descent (Fig. 7.8(b)). Rules that are completely overlapped are redundant for representing the data and should be removed from the rule set; this is expected not to reduce the rule accuracy. The number of rules can therefore be fewer than the number of hidden neurons. A sketch of this redundancy check follows.
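Detecting the completely overlapped rules amounts to checking whether one hyper-rectangle is contained in another rule of the same class. The sketch below is our illustration of such a check, not the authors' exact procedure; processing larger boxes first guarantees an enclosed rule is always compared against a rule that is kept.

```python
import numpy as np

def remove_enclosed_rules(lower, upper, labels):
    # Drop any rule whose hyper-rectangle lies completely inside another
    # rule of the same class.
    # lower, upper: (m, n) premise limits; labels: (m,) rule class labels.
    order = np.argsort(-np.prod(upper - lower, axis=1))  # largest box first
    keep = []
    for i in order:
        enclosed = any(
            labels[j] == labels[i]
            and np.all(lower[j] <= lower[i])
            and np.all(upper[i] <= upper[j])
            for j in keep
        )
        if not enclosed:
            keep.append(i)
    return sorted(keep)  # indices of the rules to retain
```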

Based on the results of the first tuning stage, the second tuning stage removes irrelevant and unimportant features by calculating an importance factor for each feature, i.e., the variation of rule accuracy on the validation data when the feature is tuned. We set the importance-factor threshold for removing a feature at 1%: if the rule accuracy on the validation set does not decrease by 1% when a feature is tuned in the first tuning stage, the feature is considered unimportant and is deleted from the data set. Rules with boundaries completely overlapped by other rules are redundant and are removed. A sketch of this pruning step follows.
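In code, the importance factor is the drop in validation accuracy observed when a feature's premises are tuned. The sketch below assumes hypothetical helpers `tune_feature` (re-tunes the premises of one feature) and `accuracy` (rule accuracy on a data set); only the thresholding logic mirrors the text.

```python
def unimportant_features(rule_set, X_val, y_val, tune_feature, accuracy,
                         threshold=0.01):
    # Importance factor of a feature: the decrease in validation accuracy
    # observed when that feature's premises are tuned. `tune_feature` and
    # `accuracy` are assumed helpers, named here only for illustration.
    base = accuracy(rule_set, X_val, y_val)
    dropped = []
    for f in range(X_val.shape[1]):
        tuned = tune_feature(rule_set, f)
        if base - accuracy(tuned, X_val, y_val) < threshold:
            dropped.append(f)  # below the 1% threshold: unimportant feature
    return dropped
```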

7.5.2 Experimental Results

The Thyroid, Breast cancer, and Glass data sets available at the UCI database [223] are used to demonstrate our method.

Table 7.4 shows that when large overlaps among clusters of the same class are permitted, both the number of hidden neurons and the classification error rate are reduced.

Thyroid Data Set

Four rules (Table 7.5) are extracted for the Thyroid data set by the method described in this section. The average number of premises in each rule is three, and the accuracy of the extracted rules is 92% for the test data set. Experimental results show that better rule accuracy is obtained by this rule extraction method compared with the GA-based rule extraction method described in Sect. 7.4. The rules for the Thyroid data set are as follows:


Fig. 7.8. (a) Clusters in an RBF network; (b) hyper-rectangular rule decision boundaries corresponding to the clusters.

Table 7.4. Reduction in the number of hidden units in the RBF network when large overlaps are allowed among clusters of the same class.

Results                                   Thyroid  Breast cancer  Glass
Classification accuracy  Small overlap    94%      97.08%         78.41%
                         Large overlap    95.2%    98.54%         85.09%
Number of hidden units   Small overlap    14.4     31             13
                         Large overlap    8        11             10

Rule 1:

IF attribute 2 is within the interval [11.97, 22.57]

AND attribute 3 is within the interval [2.50, 10]

AND attribute 5 is within the interval [0, 13.62]

THEN the class label is hyper-thyroid.

Rule 2:

IF attribute 2 is within the interval [15.49, 25.3]

AND attribute 3 is within the interval [1.3, 10]

AND attribute 5 is within the interval [0, 13.73]

THEN the class label is hyper-thyroid.

Rule 3:

IF attribute 2 is within the interval [0, 4.62]

AND attribute 3 is within the interval [0, 2.62]

AND attribute 5 is within the interval [0, 17.11]

THEN the class label is hypo-thyroid.

Rule 4:

IF attribute 2 is within the interval [0.26, 7.81]

AND attribute 3 is within the interval [0.0, 2.61]

AND attribute 5 is within the interval [8.78, 55.73]

THEN the class label is hypo-thyroid.

Default rule:

the class label is normal.
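Because the premises are axis-parallel intervals, the extracted rule set translates directly into nested range tests. The sketch below encodes Rules 1 to 4 and the default rule verbatim, assuming `x` maps each attribute number used above to its value; the function name is ours.

```python
def classify_thyroid(x):
    # The four extracted rules plus the default rule, written as interval
    # tests; x maps the attribute number to its value, e.g. {2: ..., 3: ..., 5: ...}.
    if 11.97 <= x[2] <= 22.57 and 2.50 <= x[3] <= 10 and 0 <= x[5] <= 13.62:
        return "hyper-thyroid"  # Rule 1
    if 15.49 <= x[2] <= 25.3 and 1.3 <= x[3] <= 10 and 0 <= x[5] <= 13.73:
        return "hyper-thyroid"  # Rule 2
    if 0 <= x[2] <= 4.62 and 0 <= x[3] <= 2.62 and 0 <= x[5] <= 17.11:
        return "hypo-thyroid"   # Rule 3
    if 0.26 <= x[2] <= 7.81 and 0.0 <= x[3] <= 2.61 and 8.78 <= x[5] <= 55.73:
        return "hypo-thyroid"   # Rule 4
    return "normal"             # Default rule
```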

Table 7.5. Rule accuracy and numbers of rules for the Thyroid data set by the gradient descent method.

Results                                  Thyroid
Rule accuracy    Training accuracy       98.26%
                 Validation accuracy     96%
                 Testing accuracy        92%
Number of premises per rule              3
Number of rules                          4

Glass Data Set

There are nine attributes, six classes, and 214 patterns in the Glass data set.

For a comparison with the results in [126], only attributes 2, 3, 4, 5, 6, 7, and 8 of the Glass data set were used. Six rules (Table 7.6) are extracted for the Glass data set by our method. The average number of premises in each rule is 3.33, and the accuracy of the extracted rules is 86.21%. In [126], two rule extraction results are reported for the same Glass data set: a rule accuracy of 83.88% was obtained based on the C4.5 decision tree, and a rule accuracy of 83.33% was obtained by the GLARE rule extraction method based on the MLP. Hence, experimental results show that better rule accuracy is obtained by our rule extraction method.

Table 7.6. Rule accuracy and numbers of rules for the Glass data set by the gradient descent method.

Results                                  Glass
Rule accuracy    Training accuracy       84.85%
                 Validation accuracy     86.21%
                 Testing accuracy        86.21%
Average number of premises per rule      3.33
Number of rules                          6

Breast Cancer Data Set

Based on our method, we obtain four symbolic rules (Table 7.7) for the Breast cancer data set. The average number of premises in each rule is two. The accuracy of the symbolic rules obtained through our method is 96.35% for the test data set. In comparison, Setiono [286] extracted 2.9 rules on average and obtained 94.04% accuracy for the Breast cancer data set based on a pruned MLP.

Table 7.7. Rule accuracy and numbers of rules for the Breast cancer data set by the gradient descent method.

Results                                  Breast cancer
Rule accuracy    Training accuracy       95.35%
                 Validation accuracy     95.62%
                 Testing accuracy        96.35%
Average number of premises per rule      2
Number of rules                          4

7.5.3 Summary

We have described a novel algorithm for extracting rules from RBF networks based on the gradient descent method. First, a compact RBF network is obtained by allowing for large overlaps among the clusters belonging to the same class. Second, the rules are initialized according to the training result. Third, the premises of each rule are tuned using gradient descent. Unimportant rules, which do not affect the rule accuracy, are removed from the rule set, and unimportant features are deleted from the data set based on the results obtained in the first tuning stage. Fourth, the remaining rules are tuned using gradient descent again. Experimental results show that our rule extraction technique is simple to implement, and concise rules with high accuracy are obtained. In addition, rules extracted by our algorithm have hyper-rectangular decision boundaries, which are desirable due to their explicit perceptibility.

The approach eliminates the need for an error-prone transformation from continuous attributes into discrete ones, as required in MLP-based methods.

7.6 Rule Extraction After Data Dimensionality Reduction
