

3.4 A Modified Fuzzy Neural Network

3.4.5 Input Selection

Real-world classification problems usually involve many attributes, and not all of them are necessary for the classification task. That is, some attributes may be removable without significant deterioration of the classification ability. If only a few attributes are selected, it is easier to design a fuzzy rule-based classification system with high comprehensibility.

In this step, we evaluate the importance of each input variable based on the initial fuzzy model that incorporates all possible input variables and then eliminate those redundant inputs. The objective of this step is to reduce the input dimensionality of the model without significant loss in accuracy. Elimination of redundant input features may even improve the accuracy.

All input variables contribute to changes in the system output. The larger the output change caused by a specified input variable, the more important that input is likely to be. The modified fuzzy neural network provides an easy mechanism to test the importance of each input variable without having to generate new models as done in [161]. The basic idea is to assign a truth value of 1 to the antecedent clauses associated with a particular input variable i in the rules and then compute the fuzzy output, which reflects the absence of input i. We then rank those fuzzy outputs from worst to best; the worst result indicates that the associated input is the most important.

Furthermore, we have to decide how many inputs should be selected. We start by using the most important input as the only input to the FNN, setting the antecedents associated with all other inputs to 1. In the following steps, we add one more input at a time according to the ranking order. The subset with the best output result is selected as the input group; the other inputs and their associated membership functions are deleted.
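As a rough sketch of this ranking and forward-selection procedure, the Python fragment below assumes a hypothetical routine fnn_output(X, active_inputs) that evaluates the trained FNN while treating the antecedent clauses of every input not listed in active_inputs as having truth value 1, together with an error measure on the fuzzy outputs; both names are placeholders rather than part of the model described above.

def rank_inputs(fnn_output, error, X, y, n_inputs):
    # Remove one input at a time (its antecedents are fixed to truth value 1) and
    # record the resulting error; the worst error marks the most important input.
    scores = []
    for i in range(n_inputs):
        active = [j for j in range(n_inputs) if j != i]
        scores.append((error(fnn_output(X, active), y), i))
    return [i for _, i in sorted(scores, reverse=True)]

def select_inputs(fnn_output, error, X, y, n_inputs):
    # Add inputs one at a time in ranked order and keep the best-performing subset.
    best_err, best_subset, subset = float("inf"), [], []
    for i in rank_inputs(fnn_output, error, X, y, n_inputs):
        subset = subset + [i]
        err = error(fnn_output(X, subset), y)
        if err < best_err:
            best_err, best_subset = err, list(subset)
    return best_subset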

In Sect. 3.6.2, we will introduce another feature selection method, the t-statistics-based approach for gene ranking and selection. Besides finding the most discriminative features, that method can also handle the correlated feature selection problem [136][215].

3.4.6 Partition Validation

After supervised learning using the back-propagation algorithm, we may find that some of the fuzzy sets in layer two are almost the same. In other words, some term sets of the corresponding universe of discourse have a high degree of similarity. Term sets with a high degree of similarity can be combined into a single term set, that is, they can share a common term node.

In this step, we calculate fuzzy similarities between different fuzzy sets for each input variable. If the degree of overlapping of membership functions is greater than a threshold, we combine those membership functions. We use the following fuzzy similarity measure [80]:

E(A1, A2) = M(A1 ∩ A2) / M(A1 ∪ A2),   (3.19)

where ∩ and ∪ denote the intersection and union of the two fuzzy sets A1 and A2, respectively, M(·) is the size of a fuzzy set, and 0 ≤ E(A1, A2) ≤ 1. The higher E(A1, A2) is, the more similar the fuzzy sets A1 and A2 are.

From Eq. (3.19), we see that the computation of the similarity of two fuzzy sets requires calculating the size of the intersection and union of two triangular membership functions. For any two fuzzy sets A1 and A2, M(A1 ∪ A2) can be easily derived as follows:

M(A1 ∪ A2) = M(A1) + M(A2) − M(A1 ∩ A2).   (3.20)

Using the above equations, the exact formula for the fuzzy similarity measure of two fuzzy sets with triangular-shaped membership functions, which will be used in our fuzzy neural network, can be derived. We consider the similarity measure in six different cases based on the relative positions of the membership functions. Figure 3.9 shows the six cases under consideration.

Fig. 3.9. Fuzzy similarity measures of two triangular membership functions at different relative positions.

If an input variable ends up with only one membership function, which means that this input is irrelevant, then we delete the input. We can thus eliminate irrelevant inputs and reduce the size of the rule base.
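The exact six-case formulas are not reproduced here; as an illustration of Eqs. (3.19) and (3.20), the sketch below simply approximates the sizes of the intersection and union of two triangular membership functions numerically on a dense grid and merges the two term sets when the similarity exceeds the overlap threshold (0.8 in the experiments of Sect. 3.5). The merging rule used here, averaging the vertices, is only one simple choice and is not taken from the text.

import numpy as np

def tri_mf(x, left, center, right):
    # Triangular membership function with feet at 'left'/'right' and peak 1 at 'center'.
    return np.maximum(np.minimum((x - left) / (center - left),
                                 (right - x) / (right - center)), 0.0)

def fuzzy_similarity(mf1, mf2, lo, hi, n=10001):
    # E(A1, A2) = M(A1 intersect A2) / M(A1 union A2), with M(.) approximated by a Riemann sum.
    x = np.linspace(lo, hi, n)
    dx = (hi - lo) / (n - 1)
    m1, m2 = mf1(x), mf2(x)
    inter = np.minimum(m1, m2).sum() * dx
    union = m1.sum() * dx + m2.sum() * dx - inter      # Eq. (3.20)
    return inter / union if union > 0 else 1.0

a1 = (0.0, 0.40, 0.80)                                 # (left, center, right) of term set A1
a2 = (0.1, 0.45, 0.85)                                 # (left, center, right) of term set A2
if fuzzy_similarity(lambda x: tri_mf(x, *a1), lambda x: tri_mf(x, *a2), 0.0, 1.0) > 0.8:
    merged = tuple((u + v) / 2 for u, v in zip(a1, a2))   # the two term sets share one node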

After this step, if the classification accuracy of the FNN is below the requirement, and the number of rules is less than the specified maximum, we will modify the rule base by following the method introduced in the next section.

3.4.7 Rule Base Modification

In general, when a fuzzy rule-based classification system consists of a large number of specific fuzzy IF–THEN rules, its classification performance on training patterns is very high [157]. Generalization ability on test patterns of such a fuzzy system, however, is not always high due to the overfitting to training patterns. A large number of fuzzy IF–THEN rules also degrade the comprehensibility of fuzzy rule-based classification systems. On the other hand, when a fuzzy rule-based classification system consists of a small number of general fuzzy IF–THEN rules, the comprehensibility is high. Such a compact fuzzy system can avoid the overfitting to training patterns. There are, however, many cases where a large number of fuzzy IF–THEN rules are required for handling complicated non-linear classification problems. That is, when the number of fuzzy IF–THEN rules is too small, fuzzy rule-based classification systems may have poor classification ability on training patterns as well as test patterns. There is a tradeoff between the performance of a fuzzy rule-based classification system and the number of fuzzy IF–THEN rules. That is why we need the rule-base modification procedure.

Firstly, an additional membership function is added for each input at its value at the point of the maximum output error, following Higgins and Goodman [135]. One vertex of the additional membership function is placed at that value and has membership value one; the other two vertices lie at the centers of the two neighboring regions, respectively, and have membership value zero. Since the output of the network is not a binary 0 or 1, but a continuous value in the range from 0 to 1, eliminating first the error whose deviation from the target value is greatest speeds up the convergence of the network substantially.
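A sketch of this step for a single input variable is given below; it assumes the existing membership functions are represented by their sorted center values and that the point of maximum output error has already been located. The fallback to a unit span when the new peak has no neighbour on one side is an assumption of the sketch, not part of the method.

import numpy as np

def new_membership_function(centers, x_worst):
    # Build a triangular MF whose peak (membership 1) sits at the input's value at the
    # point of maximum output error; its feet (membership 0) lie at the centers of the
    # two neighbouring regions.
    centers = np.sort(np.asarray(centers, dtype=float))
    lower = centers[centers < x_worst]
    upper = centers[centers > x_worst]
    left = float(lower[-1]) if lower.size else x_worst - 1.0
    right = float(upper[0]) if upper.size else x_worst + 1.0
    return left, float(x_worst), right

# Example: existing centers of an input's term set, and the input value observed at the
# training pattern with the largest deviation from its target.
print(new_membership_function([0.2, 0.5, 0.9], x_worst=0.7))   # -> (0.5, 0.7, 0.9)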

The rules generated above are then evaluated for accuracy and generality.

We use a weighting parameter between accuracy and simplicity, which is the compatibility grade (CG) of each fuzzy rule. The CG of rule j is calculated by the product operator as

µj(x) = µj1(x1) × µj2(x2) × · · · × µjn(xn),   (3.21)

when the system provides correct classification results.

All rules whose compatibility grade (CG) falls below a predefined threshold are deleted. Elimination of rule nodes is rule by rule, i.e., when a rule node is deleted, its associated input membership nodes and links are deleted as well. By varying the CG threshold we are able to specify the degree of rule base compactness. The size of the rule base can thus be kept minimal. If the classification accuracy of the FNN after the elimination of rule nodes is below the prescribed requirement, we will add another rule as described in the previous section; otherwise we will stop the training process.
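The following sketch illustrates Eq. (3.21) and the CG-threshold pruning. Each rule is represented simply as a list of one membership function per input; aggregating the per-pattern grades by the maximum over correctly classified patterns is an assumption of this sketch, since the text above does not fix the aggregation.

import numpy as np

def compatibility_grade(rule_mfs, x):
    # Eq. (3.21): product of the antecedent membership values of one rule for pattern x.
    return float(np.prod([mf(xi) for mf, xi in zip(rule_mfs, x)]))

def prune_rule_base(rules, correctly_classified, cg_threshold):
    # Delete every rule whose compatibility grade falls below the threshold; in the
    # network, the rule node and its private membership nodes and links go with it.
    kept = []
    for rule_mfs in rules:
        cg = max(compatibility_grade(rule_mfs, x) for x in correctly_classified)
        if cg >= cg_threshold:
            kept.append(rule_mfs)
    return kept

Raising cg_threshold yields a more compact rule base, while lowering it retains more rules, mirroring the accuracy-versus-compactness trade-off discussed above.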

3.5 Experimental Evaluation Using Synthesized Data Sets

To demonstrate the effectiveness of the modified fuzzy neural network [93], we use 10 classification problems of different complexity defined in [2] on a synthetic database. Experimental results are compared with the decision tree construction algorithms C4.5 and C4.5rules [251] and a pruned feedforward crisp neural network (NeuroRule) [200].

3.5.1 Descriptions of the Synthesized Data Sets

There are nine attributes involved in the experiments (see Table 3.1 for detailed descriptions). The attributes elevel, car, and zipcode are categorical, and all others are non-categorical.

Table 3.1. Descriptions of attributes for the 10 synthesized problems defined in [2].

Attribute    Description            Value
salary       Salary                 Uniformly distributed from 20K to 150K
commission   Commission             If salary ≥ 75K then commission = 0; else uniformly distributed from 10K to 75K
age          Age                    Uniformly distributed from 20 to 80
elevel       Education level        Uniformly chosen from 0 to 4
car          Make of the car        Uniformly chosen from 1 to 20
zipcode      Zip code of the town   Uniformly chosen from 9 available zipcodes
hvalue       Value of the house     Uniformly distributed from n × 50K to n × 150K, where n ∈ {0, ..., 9} depends on zipcode
hyears       Years house owned      Uniformly distributed from 1 to 30
loan         Total loan amount      Uniformly distributed from 0 to 500K

The first classification problem has predicates on the values of only one attribute. The second and third problems have predicates with two attributes, and the fourth to sixth problems have predicates with three attributes. Problems 7 to 9 are linear functions and problem 10 is a non-linear function of attribute values.

Attribute values were randomly generated according to the uniform distribution as in [2]. For each experiment we generated 1000 training and 1000 test data tuples. The target output was 1 if the tuple belongs to class A, and 0 otherwise. We used the random data generator with the same perturbation factor p = 5% as in [2] to model fuzzy boundaries between classes. The number of data points available for class A (class data distribution) in each problem is as follows: problem 1 = 1321 (66.1% of the total number of data points in this problem), problem 2 = 800 (40.0%), problem 3 = 1111 (55.5%), problem 4 = 626 (31.3%), problem 5 = 532 (26.6%), problem 6 = 560 (28.0%), problem 7 = 980 (49.0%), problem 8 = 1962 (98.1%), problem 9 = 982 (49.1%), and problem 10 = 1739 (87.0%).

The detailed descriptions of the 10 testing functions are as shown below, where M ? N : Q is equivalent to the sequential condition, i.e., if M holds then N applies, otherwise Q applies. Functions 1–6 are predicates on the attribute values; for example, one of the three-attribute predicates defines Class A by clauses such as

… ∨ ((75K ≤ (salary + commission) ≤ 125K)) ∨ ((age ≥ 40) ∧ (25K ≤ (salary + commission) ≤ 75K))

Function 7
disposable = 0.67 × (salary + commission) − 0.2 × loan − 20K
Class A: disposable > 0

Function 8
disposable = 0.67 × (salary + commission) − 5000 × elevel − 20K
Class A: disposable > 0

Function 9
disposable = 0.67 × (salary + commission) − 5000 × elevel − 0.2 × loan − 10K
Class A: disposable > 0

Function 10
hyears < 20 ⇒ equity = 0
hyears ≥ 20 ⇒ equity = 0.1 × hvalue × (hyears − 20)
disposable = 0.67 × (salary + commission) − 5000 × elevel − 0.2 × equity − 10K
Class A: disposable > 0
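As an illustration of the data generation and of one of the testing functions, the sketch below draws the attributes needed by Function 7 according to Table 3.1 and labels each tuple. The 5% perturbation factor of [2] is omitted, and the ≥ in the commission rule follows the reconstruction of Table 3.1 above.

import numpy as np

rng = np.random.default_rng(0)

def generate_attributes(n):
    # Attribute distributions from Table 3.1 (only those used by Function 7).
    salary = rng.uniform(20_000, 150_000, n)
    commission = np.where(salary >= 75_000, 0.0, rng.uniform(10_000, 75_000, n))
    loan = rng.uniform(0, 500_000, n)
    return salary, commission, loan

def function7(salary, commission, loan):
    # Class A iff disposable = 0.67*(salary + commission) - 0.2*loan - 20K > 0.
    disposable = 0.67 * (salary + commission) - 0.2 * loan - 20_000
    return (disposable > 0).astype(int)   # target: 1 for class A, 0 otherwise

salary, commission, loan = generate_attributes(1000)   # e.g. 1000 training tuples
y = function7(salary, commission, loan)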

3.5.2 Other Methods for Comparisons

For comparisons with the modified fuzzy neural network, we applied decision tree construction algorithms C4.5 and C4.5rules [251] on the same data sets.

C4.5 and C4.5rules release 8 were used with pruned trees and default parameters. We also compared our results with those of a pruned feedforward crisp neural network (NeuroRule) [200] for the classification problems from [2].

In the next two subsections, we briefly review the principles of C4.5, C4.5rules, and NeuroRule. Experimental results will follow in the subsequent sections.

The Basics of C4.5 and C4.5rules

C4.5 is commonly regarded as a state-of-the-art method for inducing decision trees [78], and C4.5rules transforms C4.5 decision trees into decision rules and manipulates these rules. The heart of the popular and robust C4.5 program is a decision tree inducer. It performs a depth-first, general-to-specific search for hypotheses by recursively partitioning the data set at each node of the tree.

C4.5 attempts to build a decision tree with a measure of the information gain ratio of each feature and branching on the attribute which returns the maximum information gain ratio. At any point during the search, a chosen attribute is considered to have the highest discriminating ability between the different concepts whose description is being generated. This bias constrains the search space by generating partial hypotheses using a subset of the dimensionality of the problem space. This is a depth-first search in which no alternative strategies are maintained and in which no back-tracking is allowed.

Therefore, the final decision tree built, though simple, is not guaranteed to be the simplest possible tree.

C4.5 uses a pruning mechanism to stop tree construction if an attribute is deemed to be irrelevant and should not be branched upon. A χ2-test for statistical dependency between the attribute and the class label is carried out to test for this irrelevancy. The induced decision tree is finally converted to a set of rules with some pruning and the generation of a default rule.
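As a rough illustration of the gain-ratio criterion that drives this branching, the sketch below handles a categorical attribute only; continuous attributes additionally require a threshold search, and this is not Quinlan's implementation.

import numpy as np

def entropy(labels):
    # Shannon entropy of the class labels, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, labels):
    # Information gain of splitting on 'feature', normalized by the split information.
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    split_info = float(-(weights * np.log2(weights)).sum())
    gain = entropy(labels) - conditional
    return gain / split_info if split_info > 0 else 0.0

# C4.5 branches on the attribute with the maximum gain ratio, e.g. (X, y as numpy arrays):
# best = max(range(X.shape[1]), key=lambda j: gain_ratio(X[:, j], y))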

The C4.5rules program contains three basic steps to produce rules from a decision tree:

1. Traverse a decision tree to obtain a number of conjunctive rules. Each path from the root to a leaf in the tree corresponds to a conjunctive rule with the leaf as its conclusion.

2. Manipulate each condition in each conjunctive rule to see if it can be dropped or can be merged into a similar condition in another rule without more misclassification than expected on the original training examples.

3. If some conjunctive rules are the same after step 2, then keep only one of them.

The final decision rules thus produced are expected to be simpler than the original decision trees but not necessarily more accurate.

The Basics of NeuroRule

NeuroRule is a novel approach proposed by Lu et al. [200] that exploits the merits of connectionist methods, such as low classification error rates and robustness to noise. It is also able to obtain rules which are more concise than the rules usually obtained by decision trees.

NeuroRule consists of three steps:

1. Network training: a three-layer feedforward neural network is trained in this step. To facilitate the rule extraction in the third step below, continuous attributes are discretized by dividing their ranges into subintervals. The number of nodes in the input layer corresponds to the dimensionality of the subintervals of the input attributes. The number of nodes in the output layer equals the number of classes to be classified. The network starts with an oversized hidden layer and works towards a small number of hidden nodes as well as the fewest possible input nodes. The training phase aims to find the best set of weights that classify input tuples with a sufficient level of accuracy. An initial set of weights is chosen randomly from the interval [−1, 1]. Weight updating is carried out by a quasi-Newton algorithm. The training phase is terminated when the norm of the gradient of the error function falls below a prespecified value. A penalty function is used to prevent the weights from becoming too large and to discourage having too many weights with very small values.

2. Network pruning: the pruning phase aims to remove redundant links and nodes without increasing the classification error of the network.

The pruning is achieved by removing nodes and links whose weights are below a prespecified threshold. The smaller number of nodes and links left in the network after pruning makes it possible to extract concise and comprehensible rules that describe the classification function.

3. Rule extraction: this phase extracts classification rules from the pruned network. The rule extraction algorithm first discretizes the activation values of hidden nodes into a manageable number of discrete values without sacrificing the classification accuracy of the network. The algorithm consists of four basic steps:

(a) A clustering algorithm is applied to find clusters of activation values of the hidden nodes.

(b) The discretized activation values of the hidden nodes are enumerated and the network outputs are computed. Rules that describe the network outputs in terms of the discretized hidden unit activation values are generated.

(c) For each hidden node, the input values that lead to it are enumerated and a set of rules is generated to describe the hidden units' discretized values in terms of the inputs.

(d) The two sets of rules obtained in the previous two steps are merged to obtain rules that relate the inputs to the outputs.

The resulting set of rules is expected to be smaller than that produced by decision trees, though generally with more conditions per rule.
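A toy version of step (a), discretizing one hidden node's activation values by a simple greedy one-dimensional clustering, is sketched below; the clustering algorithm actually used in [200] differs, and the distance parameter delta is an assumption of the sketch.

import numpy as np

def discretize_activations(values, delta=0.1):
    # Greedy 1-D clustering: a sorted activation value joins the current cluster if it
    # lies within delta of that cluster's running mean, otherwise it opens a new cluster.
    clusters = []                                   # each cluster holds [sum, count]
    for v in np.sort(np.asarray(values, dtype=float)):
        if clusters and abs(v - clusters[-1][0] / clusters[-1][1]) <= delta:
            clusters[-1][0] += v
            clusters[-1][1] += 1
        else:
            clusters.append([v, 1])
    return [s / c for s, c in clusters]             # cluster means become the discrete levels

print(discretize_activations([0.02, 0.05, 0.48, 0.51, 0.95]))   # -> roughly [0.035, 0.495, 0.95]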

3.5.3 Experimental Results

For each experimental run, the errors for all the groups are summed to obtain the classification error. Table 3.2 shows the overall accuracy on each test data set and the number of rules for all three approaches for the 10 testing problems.

Table 3.2. The overall accuracy on each test data set for the problems from [2], with a compact rule base. The FNN uses the following parameters: learning rate η = 0.1, maximum number of rules = 20, degree of overlapping of membership functions = 0.8, and maximum number of iterations k = 200.

Fnc.   Accuracy (%)            No. of rules
       NR     C4.5   FNN       NR    C4.5   FNN
1      99.9   100    100       2     3      2
2      98.1   95.0   93.6      7     10     5
3      98.2   100    97.6      7     10     7
4      95.5   97.3   94.3      13    18     8
5      97.2   97.6   97.1      24    15     7
6      90.8   94.3   95.0      13    17     8
7      90.5   93.5   95.7      7     15     13
8      N/A    98.6   99.3      N/A   9      2
9      91.0   92.6   94.1      9     16     15
10     N/A    92.6   96.2      N/A   10     8

Compared to NeuroRule, the modified FNN produces rule bases of less complexity for all problems, except problems 7 and 9. The FNN gives better accuracy for problems 1, 6, 7, and 9, but lower accuracy for the rest. Compared to C4.5rules, the FNN gives less complex rules for all problems. The FNN gives higher accuracy than C4.5rules on problems 6 to 10, the same for problem 1, and lower for the rest. The results for problems 8 and 10 were not reported for the NeuroRule approach [200] (indicated by 'N/A' in Table 3.2).

To have a balanced comparison of all three approaches, we use a weighted accuracy (WA), defined as the accuracy A divided by the number of rule conditions C, since our main aim is to achieve maximum compactness of the rule base while maintaining high accuracy. The need for a compact rule base is twofold: it allows data mining algorithms to scale to large-dimensional problems, and it provides the user with easily interpretable rules.
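For example, a problem solved with accuracy A = 100% by rules containing a total of C = 3 conditions gives WA = 100/3 ≈ 33.3, which is consistent with the FNN entries for problem 1 in Tables 3.2 and 3.3.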

Table 3.3 shows that the FNN performs better than NeuroRule, in terms of WA, on all problems except problem 7. Compared with C4.5rules, the FNN shows better results for all problems except problem 10.

Table 3.3. Weighted accuracy on each test data set for NeuroRule (NR), C4.5rules release 8 (C4.5), and the FNN for the problems from [2].

Function   Weighted accuracy (WA)
           NR     C4.5   FNN
1          20.0   25.0   33.3
2          3.2    4.1    4.8
3          4.7    5.0    6.5
4          1.7    2.2    2.8
5          0.9    2.3    3.1
6          1.5    1.9    2.4
7          4.1    2.8    3.7
8          N/A    5.8    7.1
9          2.9    2.3    3.1
10         N/A    4.2    3.2

In Table 3.2, we attempted to obtain the most compact rule base for the FNN to allow for easy analysis of the rules for very large databases, and subsequently easy decision making in real-world situations.

If accuracy is more important than compactness of the rule base, it is possible to use the FNN with a stricter accuracy requirement, i.e., a higher accuracy threshold when pruning the rule base, thereby producing more accurate results at the expense of a higher rule base complexity; the classification results based on this requirement are shown in Table 3.4.

Actually, the result in Table 3.4 is a special case where we simply want to achieve higher accuracy. More rules are produced compared with NeuroRule and C4.5rules. For problem 1, compared with the result in Table 3.2, more rules are involved to achieve 100% accuracy, which means that redundant rules are produced as the threshold for rule base compactness is decreased. This is only acceptable when accuracy is more important than compactness of the rule base, as indicated above. In most cases, the final decision regarding complexity versus accuracy of the rule base is application specific.

Table 3.4. The overall accuracy on each test data set with a higher accuracy requirement for the problems from [2]. The FNN uses the following parameters: learning rate η = 0.1, maximum number of rules = 200, degree of overlapping of membership functions = 0.8, and maximum number of iterations k = 200.

Fnc.   Accuracy (%)            No. of rules
       NR     C4.5   FNN       NR    C4.5   FNN
1      99.9   100    100       2     3      11
2      98.1   95.0   98.5      7     10     25
3      98.2   100    99.3      7     10     36
4      95.5   97.3   98.1      13    18     120
5      97.2   97.6   97.2      24    15     81
6      90.8   94.3   96.5      13    17     50
7      90.5   93.5   97.7      7     15     52
8      N/A    98.6   99.1      N/A   9      15
9      91.0   92.6   95.2      9     16     65
10     N/A    92.6   97.7      N/A   10     50

3.5.4 Discussion

Ten different experiments were conducted to test the proposed fuzzy neural network on various data mining problems. The results have shown that our FNN is able to achieve an accuracy comparable to both feedforward crisp neural networks and decision trees, with more compact rules compared to
