
5.9 General Observations

Several supervised machine learning approaches have been introduced using illustrative case studies. These include neural and fuzzy neural networks, fuzzy set covering (FSC) and fuzzy-SDG. The focus of the discussion has been on issues that need to be

Chapter 5 Supervised Learning for Operational Support 117

addressed in practical applications, including variable selection and feature extraction for FFNN inputs, model validation and confidence bounds, accuracy and generality, feature extraction from dynamic transient signals, and the ability to provide causal knowledge. Some other issues have not been addressed, or not fully addressed, in this context.

Backpropagation learning is not a recursive approach: when new data is to be used to update the knowledge, it must be mixed with the previous data and the model retrained. Recursive models are particularly useful for on-line systems because they can continuously update their knowledge.

When applied to fault diagnosis, the supervised learning approaches described above are suitable for mapping the set of symptoms to the set of so-called essential faults [151]. Faults can be divided into two categories: essential faults and root faults. For example, the symptom "the trend of liquid level increases dramatically at the bottom of the column" can be explained by the following essential faults: (1) high feed flowrate; (2) low bottom withdrawal flowrate; (3) low flash zone temperature; (4) low flash zone pressure; (5) a change in feed properties. These essential faults are indications of root faults. Looked at in this way, low bottom withdrawal flowrate is an essential fault which can be the result of the root faults of a pump failure or a valve failing to open. Fault diagnosis is therefore a two-step procedure: mapping from the set of symptoms to the set of essential faults, and explanation of the essential faults in terms of root faults.
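The two-step structure described above can be sketched as nested lookup tables. The fault names below are illustrative placeholders drawn from the column example, not identifiers taken from the text:

```python
# Step 1 (the learned mapping): symptoms -> candidate essential faults.
# Names are hypothetical, for illustration only.
SYMPTOM_TO_ESSENTIAL = {
    "bottom_level_rises_sharply": [
        "high_feed_flowrate",
        "low_bottom_withdrawal_flowrate",
        "low_flash_zone_temperature",
        "low_flash_zone_pressure",
        "feed_property_change",
    ],
}

# Step 2 (knowledge-based explanation): essential faults -> root faults.
ESSENTIAL_TO_ROOT = {
    "low_bottom_withdrawal_flowrate": ["pump_failure", "valve_fails_to_open"],
}

def diagnose(symptom):
    """Return (essential fault, possible root faults) pairs for a symptom."""
    results = []
    for essential in SYMPTOM_TO_ESSENTIAL.get(symptom, []):
        results.append((essential, ESSENTIAL_TO_ROOT.get(essential, [])))
    return results

for essential, roots in diagnose("bottom_level_rises_sharply"):
    print(essential, "->", roots or "terminal")
```

In practice the first table would be realised by a trained classifier and the second by an expert system or signed digraph, as discussed below.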

The supervised approaches introduced are suitable for the first step. For the second step, it is more appropriate to use expert systems or graphical models such as the signed digraph. Similar strategies have been developed by Becraft and Lee [177] and Wang et al. [151].

Since supervised learning approaches require training data, which is difficult to obtain for the purpose of fault identification and diagnosis, they are less attractive than the multivariate analysis approaches (Chapter 3) and unsupervised approaches (Chapter 6). However, their good performance in correlating non-linear inputs and outputs makes them very attractive for developing software sensors or software analysers (Chapter 9).

The extrapolation problem of feedforward neural networks has been addressed through input dimension reduction, proper selection of training and test data, model validation and confidence bound evaluation, as well as by using a single-layer perceptron, FSC or fuzzy-SDG where feasible. There have also been efforts to incorporate partial domain knowledge into neural network learning.

6 Unsupervised Learning for Operational State Identification

6.1 Supervised vs. Unsupervised Learning

This chapter describes some representative unsupervised machine learning approaches for process operational state identification. Whether a machine learning approach is regarded as supervised or unsupervised depends on the way it makes use of prior knowledge of the data. Data encountered can be broadly divided into the following four categories:

(1) Part of the database is known, i.e., the number and descriptions of classes as well as the assignments of individual data patterns are known. The task is to assign unknown data patterns to the established classes.

(2) Both the number and descriptions of classes are known, but the assignment of individual data patterns is not known. The task is then to assign all data patterns to the known classes.

(3) The number of classes is known but the descriptions and the assignments of individual data patterns are not known. The problem is to develop a description for each class and assign all data patterns to them.

(4) Neither the number nor the descriptions of classes are known, and it is necessary to determine the number and descriptions of classes as well as the assignments of the data patterns.

The methods introduced in Chapter 5 are powerful tools for dealing with the first type of data, where the objective is to assign new data patterns to previously established classes. Clearly this is not appropriate for the last three types of data,

X. Z. Wang, Data Mining and Knowledge Discovery for Process Monitoring and Control

© Springer-Verlag London Limited 1999


since training data is not available. In these cases unsupervised learning approaches are needed; the goal is to group data into clusters such that intraclass similarity is high and interclass similarity is low. In other words, supervised approaches learn from the known to predict the unknown, while unsupervised approaches learn from the unknown in order to predict the unknown. Supervised learning can generally give more accurate predictions, but cannot extrapolate: when new data falls outside the range of the training data, predictions will not generally be reliable.

For process operational state identification and diagnosis, supervised learning needs both symptoms and faults as training data. The routine data collected by computer control systems therefore cannot be used directly for training, and faults are unlikely to be deliberately introduced into an industrial process in order to generate training data.

Grouping of data patterns using unsupervised learning is often based on a similarity or distance measure, which is then compared with a threshold value. The degree of autonomy depends on whether the threshold value is given by the user or determined automatically by the system. In this chapter three approaches are studied: adaptive resonance theory (ART2), a modified version of it named ARTnet, and Bayesian automatic classification (AutoClass). ART2 and ARTnet, though they require a pre-defined threshold value, are able to deal with both the third and fourth types of data. AutoClass is a completely automatic clustering approach that requires neither a pre-defined threshold value nor the number and descriptions of classes, and so is able to deal with the fourth type of data.

6.2 Adaptive Resonance Theory

The adaptive resonance theory (ART) was developed by Grossberg [47, 48] as a clustering-based, autonomous learning model. ART has been instantiated in a series of separate neural network models referred to as ART1, ART2 and ART3 [43, 49, 50, 51]. ART1 is designed for clustering binary vectors and ART2 for continuous-valued vectors.

A general architecture of the ART neural network is shown in Figure 6.1. The network consists of three groups of neurons: the input processing layer (F1), the cluster layer (F2), and a reset mechanism that determines the degree of similarity of patterns placed in the same cluster. The input layer can be considered as consisting of two parts, the input and an interface; some processing may occur in both (especially in ART2). The interface combines signals from the input with the weight vector of the cluster unit that has been selected as a candidate for learning. The input and interface parts are designated F1(a) and F1(b) in Figure 6.1.

Figure 6.1 The general architecture of ART (F2 cluster layer; F1(a) input layer and F1(b) interface layer).

An input data pattern presented to the network undergoes some specialised processing and is compared to each of the existing cluster prototypes in the cluster layer (F2). The "winner" in the cluster layer is the prototype most similar to the input. Whether or not a cluster unit is allowed to learn an input data pattern depends on how similar its top-down weight vector is to the input vector. The decision is made by the reset unit, based on signals it receives from the input F1(a) and interface F1(b) parts of the F1 layer. If the similarity between the "winner" and the input exceeds a predetermined value (the vigilance parameter ρ, in ART terms), learning is enabled and the "winner" is modified slightly to reflect the input data pattern more closely. If the similarity between the "winner" and the input is less than required by the vigilance parameter, a new cluster unit is initialised in the top layer and learning is enabled to create a prototype similar to the input. Thus both the number and descriptions of classes are continuously updated during learning. This differs from the Kohonen network, where the descriptions of classes (the neurons of a Kohonen network) are also continuously updated during learning, but the number of classes must be determined before learning starts.
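The winner/vigilance decision can be sketched in a few lines. The sketch below assumes an ART1-style match ratio for binary inputs (the ART2 similarity computation is considerably more elaborate); the function name is illustrative:

```python
import numpy as np

def vigilance_test(input_vec, top_down_weights, rho):
    """Return True if the winner may learn: |input AND top-down| / |input| >= rho."""
    match = np.logical_and(input_vec, top_down_weights).sum()
    return match / input_vec.sum() >= rho

x = np.array([1, 1, 0, 1, 0])
winner_t = np.array([1, 1, 0, 0, 0])   # top-down weights of the winning F2 unit

print(vigilance_test(x, winner_t, rho=0.5))  # match ratio 2/3 passes -> True
print(vigilance_test(x, winner_t, rho=0.8))  # fails the vigilance test -> False
```

When the test fails, the reset mechanism inhibits the current winner and a new cluster unit is committed, which is how the number of classes grows during learning.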

To control the similarity of patterns placed in the same cluster, there are two sets of connections (each with its own weights) between each unit in the interface part of


the input layer and the cluster units. The F1(b) layer is connected to the F2 layer by bottom-up weights; the bottom-up weight on the connection from the ith F1 unit to the jth F2 unit is designated b_ij. The F2 layer is connected to the F1(b) layer by top-down weights; the top-down weight on the connection from the jth F2 unit to the ith F1 unit is designated t_ji.

Figure 6.2 Typical ART2 architecture.

ART2 is designed to process continuous-valued data patterns. A typical ART2 architecture [43] is illustrated in Figure 6.2. It has a very sophisticated input data processing layer (F1) consisting of six types of units (the W, X, U, V, P and Q units). There are n units of each type, where n is the dimension of an input pattern; only one unit of each type is shown in Figure 6.2. A supplemental unit between the W and X units receives signals from all of the W units, computes the norm of the vector w, and sends this (inhibitory) signal to each of the X units. Each X unit also receives an excitation signal from the corresponding W unit. Details of this part of the net are shown in Figure 6.3. A similar supplemental unit performs the same role between the P and Q units, and another between the V and U units. Each X unit is connected to the corresponding V unit, and each Q unit is also connected to the corresponding V unit.

Figure 6.3 Details of connections from W to X units, showing the supplemental unit N to perform normalisation.

The symbols on the connection paths between the various units of the F1 layer in Figure 6.2 indicate the transformation that occurs to the signal as it passes from one type of unit to the next; they do not indicate multiplication by the given quantity.

However, the connections between units P_i (of the F1 layer) and Y_j (of the F2 layer) do show the weights that multiply the signal transmitted over those paths. The activation of the winning F2 unit is d, where 0 < d < 1. The arrow symbol indicates normalisation; i.e., the vector q of activations of the Q units is just the vector p of activations of the P units, normalised to approximately unit length.

The U units perform the role of the input phase of the F1 layer. The P units play the role of the interface of the F1 layer. Units X_i and Q_i apply an activation function to their net input; this function suppresses any components of the vectors of activations at those levels that fall below the user-selected threshold θ. The connection paths from U to W and from Q to V have fixed weights a and b, respectively.
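The F1-layer processing described above can be sketched as a single update cycle, following the standard ART2 equations in the literature; the parameter values below are illustrative defaults, not values taken from the text (e is a small constant preventing division by zero):

```python
import numpy as np

def f(x, theta):
    """Noise-suppressing activation: pass components >= theta, zero the rest."""
    return np.where(x >= theta, x, 0.0)

def f1_update(s, v, tJ, a=10.0, b=10.0, d=0.9, theta=0.1, e=1e-7):
    """One F1-layer cycle for input s, previous V activation v,
    and top-down weight vector tJ of the winning F2 unit."""
    u = v / (e + np.linalg.norm(v))
    w = s + a * u                      # input plus fixed-weight feedback from U
    p = u + d * tJ                     # interface: add the top-down expectation
    x = w / (e + np.linalg.norm(w))    # supplemental unit N normalises W -> X
    q = p / (e + np.linalg.norm(p))    # normalisation of P -> Q
    v = f(x, theta) + b * f(q, theta)  # suppress sub-threshold components
    return u, w, p, x, q, v

# First cycle for a small illustrative input, with V and tJ initially zero.
u, w, p, x, q, v = f1_update(np.array([0.2, 0.7, 0.1]), np.zeros(3), np.zeros(3))
print(np.round(x, 3))
```

On the first cycle the X activations are simply the normalised input; on later cycles the top-down term d·tJ pulls the pattern toward the winning prototype.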

The most complete description of the ART2 network can be found in [43]. Although written for ART1, the discussions by Caudill [181] and Wasserman [182] provide a useful but less technical description of many of the principles employed in ART2.

ART2 has shown great potential for the analysis of process operational data and the identification of operational states, due to a number of properties [183, 184, 195].

First, it is an unsupervised machine learning system that does not require training data; it is well known that training data are difficult to find for the purpose of process fault identification and diagnosis. Second, the approach is recursive: in ART2 terms it is plastic, able to acquire new knowledge, while remaining stable in the sense that existing knowledge is not corrupted.


This property is apparently very useful for on-line monitoring where information is received continuously.

6.3 A Framework for Integrating Wavelet Feature Extraction and ART2

Wang et al. [185] and Chen et al. [82] developed an integrated framework named ARTnet which combines wavelet-based feature extraction from dynamic transient signals with adaptive resonance theory. In ARTnet the data pre-processing part uses wavelets for feature extraction. In order to introduce ARTnet it is helpful to first examine the mechanism ART2 uses for noise removal.

ART2 has a data pre-processing unit which is very complicated, but its mechanism for removing noise uses a simple activation function A(x),

A(x) = \begin{cases} x, & x \ge \theta \\ 0, & x < \theta \end{cases} \qquad (6.1)

where \theta is a threshold value. If an input signal is less than \theta, it is considered to be a noise component and set to zero. This has proved to be inappropriate for removing the noise components contained in process dynamic transient signals, which are often of high frequency and of appreciable magnitude.
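A short numerical sketch of the activation function of Eq. (6.1) illustrates the limitation: a high-frequency noise component whose amplitude exceeds the threshold passes straight through the filter. The signal and noise below are illustrative, not from the text:

```python
import numpy as np

def A(x, theta):
    """Eq. (6.1): zero any component below the threshold theta."""
    return np.where(x >= theta, x, 0.0)

t = np.linspace(0, 1, 200)
signal = np.exp(-3 * t)                     # a smooth transient
noise = 0.3 * np.sin(2 * np.pi * 40 * t)    # high-frequency, moderate amplitude
noisy = signal + noise

filtered = A(noisy, theta=0.05)
# Noise components larger than theta survive the amplitude filter untouched:
print(np.max(np.abs(filtered - signal)))
```

The residual error remains on the order of the noise amplitude, which is why ARTnet replaces this amplitude thresholding with wavelet-based feature extraction.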

6.3.1 The Integrated Framework ARTnet

A mechanism was proposed by Pao [186] to replace the data pre-processing part of ART2 with more efficient noise removal and dimension reduction methods. This has been followed in this study by using wavelets to replace the data pre-processing unit of ART2. The integrated framework is called ARTnet to distinguish it from ART2. The conceptual architecture of ARTnet is shown in Figure 6.4.

In this new architecture, wavelets are used to pre-process the dynamic trend signals, and the extracted features are used as inputs to the kernel of ARTnet for clustering. A pattern feature vector (x_1, x_2, ..., x_N) is fed to the input layer of the ARTnet kernel and weighted by the bottom-up weights b_ij. As discussed in Chapter 3, the extrema of wavelet multiscale analysis are regarded as the features of the dynamic transient signals.

The weighted input vector is then compared with the existing clusters in the top layer by calculating the distance between the input and each existing cluster prototype. The existing cluster prototype with the smallest distance to the input is called the winner. Whether or not the winning cluster prototype is allowed to learn from an input data pattern depends on how similar the input is to the cluster. If the similarity measure exceeds a predetermined value, called the vigilance parameter, learning is enabled and the description, or knowledge, of the winning cluster is updated to take account of this input. If the similarity measure is less than the required vigilance parameter, a new cluster unit is created which reflects the input. Clearly this is an unsupervised and recursive learning process.

Figure 6.4 The conceptual architecture of ARTnet: dynamic trend signals pass through wavelet feature extraction, the extracted features form the input to the ARTnet kernel, and the top layer holds the cluster prototypes, whose descriptions are updated during learning.

6.3.2 The Similarity Measure in ARTnet

It is apparent that the learning process is concerned with how similar two vectors are. There are several ways to measure the distance between two observations, such as the Hamming or Euclidean distance. For continuous data, the Euclidean distance is the most commonly used [187]. Formally, the Euclidean distance between two vectors x and y is defined as the root of the summed squared differences,


d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} \qquad (6.2)

Suppose there are K existing cluster prototypes. The kth cluster prototype consists of a number of data patterns and is also described by a vector, denoted z^{(k)}, which takes account of all the data patterns belonging to it. Clearly, if there is only one data pattern in the cluster, z^{(k)} is equal to that data pattern. When a new input data pattern x is received, the distance between x and z^{(k)} is calculated according to the expression,

d(x, z^{(k)}) = \sqrt{\sum_i (x_i - z_i^{(k)})^2} \qquad (6.3)

The distance between x and each of the existing cluster prototypes is calculated, and the cluster prototype with the smallest distance is the winner. If the distance measure for the winner is smaller than a pre-set distance threshold p, then the input x is assigned to the winning cluster and the description of the cluster is updated,

z_i^{(k)} = z_i^{(k)} + \frac{1}{N_F} x_i b_{ij}, \quad i = 1, \dots, N_F, \; j = 1, \dots, K \qquad (6.4)

where z_i^{(k)} refers to the ith attribute of the vector z for cluster k, b_ij is the weight between the ith attribute of the input and the jth existing cluster prototype, and N_F is the number of features.
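The ARTnet kernel loop of Eqs. (6.2)-(6.4) can be sketched compactly. The sketch below assumes unit bottom-up weights (b_ij = 1) for simplicity; the function name and test data are illustrative:

```python
import numpy as np

def artnet_cluster(patterns, p):
    """Cluster feature vectors with distance threshold p (the vigilance)."""
    prototypes = []               # one vector z(k) per cluster
    assignments = []              # cluster index assigned to each pattern
    for x in patterns:
        if prototypes:
            d = [np.linalg.norm(x - z) for z in prototypes]   # Eq. (6.3)
            k = int(np.argmin(d))                             # the winner
        if not prototypes or d[k] >= p:
            prototypes.append(x.astype(float).copy())         # create new cluster
            assignments.append(len(prototypes) - 1)
        else:
            nf = len(x)
            prototypes[k] = prototypes[k] + x / nf            # Eq. (6.4), b_ij = 1
            assignments.append(k)
    return prototypes, assignments

data = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
protos, labels = artnet_cluster(data, p=1.0)
print(labels)   # nearby patterns share a cluster; the distant one starts its own
```

Because cluster creation is driven entirely by the threshold p, the number of clusters found is a direct function of p, which is exactly the behaviour explored in Table 6.1 below.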

6.4 Application of ARTnet to the FCC Process

The FCC process and the data used are described in Appendix B. To demonstrate the procedure, 64 data patterns are used; the discussion is limited to this number to keep it manageable and to assist in the presentation of the results.

The data sets include the following faults or disturbances:

• fresh feed flow rate increases or decreases

• preheat temperature for the mixed feed increases or decreases

• recycle slurry flow rate increases or decreases

• opening of the hand valve V20 increases or decreases

• air flow rate increases or decreases

• the opening of the fully open valve 401-ST decreases

• cooling water pump fails

• compressor fails

• double faults occur

The sixty-four data patterns were obtained from a customised dynamic training simulator, to which random noise was added using a zero-mean noise generator (MATLAB®). In the following discussion, the term "data patterns" refers to these sixty-four data patterns and "identified patterns" to the patterns estimated by ARTnet.

Figure 6.5(a) shows a reactor temperature transient when the fresh feed flowrate increases by 70%. Figure 6.5(b) is the same transient with random noise added. The corresponding four scales from multiresolution analysis of this transient are shown in Figure 6.6, together with the corresponding extrema on the right-hand side of Figure 6.6.

Figure 6.5 A signal from the simulator (a) and the signal with random noise (b).


Figure 6.6 Multiresolution analysis (left) and extrema (right).

Figure 6.7 The extrema after noise removal (a) and after piece-wise analysis (b).

As stated in Chapter 3, the extrema most influenced by noise fluctuations are those (1) whose amplitude decreases on average as the decomposition scale increases, and (2) which do not propagate to large scales. Using these criteria, the noise extrema are removed.

The extrema representation after the noise extrema are removed is a sparse vector, so a piece-wise technique is employed to reduce the dimensionality of the signal.

The extrema after removing noise and carrying out piece-wise processing are shown in Figure 6.7.
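One hedged reading of the two noise criteria stated above treats each extremum as a list of amplitudes tracked across decomposition scales. The representation, helper name and example values below are assumptions for illustration, not taken from the text:

```python
import numpy as np

def is_noise_extremum(amps_by_scale, n_scales):
    """Flag an extremum whose amplitude decays with scale (criterion 1)
    or which does not propagate to the largest scale (criterion 2)."""
    amps = np.abs(np.asarray(amps_by_scale, dtype=float))
    dies_out = len(amps) < n_scales        # never reaches the largest scale
    # Negative least-squares slope across scales means decaying amplitude.
    decays = len(amps) > 1 and np.polyfit(np.arange(len(amps)), amps, 1)[0] < 0
    return dies_out or decays

# A true signal feature grows with scale and persists; noise decays or vanishes.
print(is_noise_extremum([0.2, 0.5, 0.9, 1.4], n_scales=4))   # kept -> False
print(is_noise_extremum([0.8, 0.3], n_scales=4))             # removed -> True
```

Extrema flagged in this way are dropped, and the surviving sparse vector is then condensed by the piece-wise technique before being passed to the ARTnet kernel.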

Figure 6.8 shows the result after noise removal and compares it with the multiresolution analysis of the original noise-free signal. The extrema are the same in position but differ slightly in value. The noise removal algorithm is therefore suitable in this case, and the 4th-scale extrema are selected as the input to ARTnet for pattern identification.

It is important that a suitable threshold is used for pattern recognition when applying ARTnet. For a threshold p = 0.8, all 64 data patterns are identified as individual patterns. A more suitable threshold is obtained by analysing the clustering results for increasing threshold values, as shown in Table 6.1.

Table 6.1 ARTnet clustering results using different distance thresholds.a

Threshold p   Number of patterns identified   Identified grouping of data samples
0.8           64
1.0           63    [56 57]
2.0           60    [5 7] [25 26] [27 28] [56 57]
3.0           57    [5 7] [19 20 23 24] [25 26] [27 28] [56 57]
4.0           54    [5 6 7 8] [19 20 21 23 24] [25 26] [27 28] [56 57]
4.5           49    [3 4 5 6 7 8 9] [19 20 21 22 23 24] [25 26] [27 28] [35 36] [56 57]
5.0           48    [3 4 5 6 7 8 9] [19 20 21 22 23 24 29] [25 26] [27 28] [35 36] [56 57]
6.0           47    [3 4 5 6 7 8 9] [19 20 21 22 23 24 29 61] [25 26] [27 28] [35 36] [56 57]

a [56 57] means that data patterns 56 and 57 are identified in the same cluster.


With the threshold p increased to 1.0, data patterns 56 and 57, which represent cases where the opening of valve 401-ST is decreased from 100% by 80% and 90% respectively, are grouped together. When p is 2.0, the further groupings are [5, 7], representing the
