

Preprocessing Tools for Nonlinear Datasets


The number of possible subsets of the N variables describing a dataset is 2^N, and among these subsets, the best one is the one that maximizes some cost function F(·). The search of the attribute space for the best pool of characteristics is carried out with a search function that measures the discrimination capacity of each subset. This evaluation is carried out on the possible subsets D′, using each variable subset to form training samples (D′[tr]) and testing samples (D′[ts]) for an inductor I_{D′[tr], A, F, Z}(·), with a fixed induction algorithm A, configuration parameters F, and initialization parameters Z.

Excluding the exhaustive search strategy on the global set of characteristics, which is not applicable to a dataset with a high number of variables, the techniques that can be used are a blind search (e.g., depth first) or a heuristic search (hill climbing, best first); evolutionary search techniques have also been proposed in the literature (Kudo and Sklansky 2000; Siedlecki and Sklansky 1989). Genetic algorithms have been shown to be very effective global search strategies when dealing with nonlinear and large problems.

Feature selection techniques can be developed using two different general approaches, depending on whether the selection of the variables is carried out dependently or independently of the learning algorithm used to build the inductor. The filter approach attempts to select the best attribute subset by evaluating its relevance based on the data alone. The "wrapper" approach, instead, selects the best attribute subset by considering as relevant those attributes that allow the induction algorithm to achieve a more accurate performance (John et al. 1994).
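As an illustration of the wrapper idea, a minimal sketch follows. It is not the authors' implementation: the inductor here is a simple nearest-centroid classifier standing in for a trained ANN, and the dataset, function names, and split are all assumptions made for the example. The relevance of a variable subset is measured by the blind-test accuracy an inductor reaches when trained only on those variables.

```python
import random

def train_test_split(n_rows, r=0.5, seed=0):
    """Split row indices into training and testing parts (fraction r to training)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = int(r * n_rows)
    return idx[:cut], idx[cut:]

def subset_accuracy(rows, labels, subset, tr, ts):
    """Wrapper-style relevance of `subset`: accuracy of a nearest-centroid
    inductor trained on rows `tr` and blind-tested on rows `ts`,
    using only the variables listed in `subset`."""
    cents = {}
    for i in tr:  # accumulate per-class sums over the selected variables
        vec = [rows[i][j] for j in subset]
        s, n = cents.get(labels[i], ([0.0] * len(subset), 0))
        cents[labels[i]] = ([a + b for a, b in zip(s, vec)], n + 1)
    cents = {k: [v / n for v in s] for k, (s, n) in cents.items()}
    hits = 0
    for i in ts:  # predict the class of the nearest centroid
        vec = [rows[i][j] for j in subset]
        pred = min(cents, key=lambda k: sum((a - b) ** 2
                                            for a, b in zip(cents[k], vec)))
        hits += pred == labels[i]
    return hits / len(ts)

# Toy data: variable 0 carries the class signal, variable 1 is pure noise.
rng = random.Random(1)
rows = [[lab + rng.gauss(0, 0.3), rng.gauss(0, 1)]
        for lab in (0, 1) for _ in range(50)]
labels = [0] * 50 + [1] * 50
tr, ts = train_test_split(len(rows))
print(subset_accuracy(rows, labels, [0], tr, ts))  # informative variable: high
print(subset_accuracy(rows, labels, [1], tr, ts))  # noise variable: near chance
```

The inductor's own performance, not a data-only statistic, ranks the subsets, which is exactly what distinguishes the wrapper approach from the filter approach.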

Input selection (IS) operates as a specific evolutionary wrapper system that responds to the need to reduce the dimensionality of the data by extracting the minimum number of variables necessary to control the “peaking” phenomenon and, at the same time, conserving the most information available.

8.2 Artificial Organisms: The Models

We now introduce a new concept called the artificial organism (AO). We define AO as a group of dynamic systems (ANN, evolutionary algorithms, etc.) that use the same sensors and effectors of a process, working in synergy without explicit supervision. T&T, T&Tr, and IS systems satisfy these characteristics and are therefore considered models of AO.

Fig. 8.1 The T&T algorithm: each individual-network of the population distributes the complete dataset D into two subsets, d[tr] (training) and d[ts] (testing)

8.2.1 The Training and Testing Algorithm

The "training and testing" algorithm (T&T) is based on a population of n artificial neural networks (ANNs) managed by an evolutionary system. In its simplest form, this algorithm reproduces several distribution models of the complete dataset D (one for every ANN of the population) in two subsets (d[tr], the training set, and d[ts], the testing set). During the learning process, each ANN, according to its own data distribution model, is trained on the subsample d[tr] and blind validated on the subsample d[ts] (see Fig. 8.1).

The performance score reached by each ANN in the testing phase represents its "fitness" value (i.e., the individual's probability of evolution). The genome (the full information content) of each "network individual" thus codifies a data distribution model with an associated validation strategy. The n data distribution models are combined according to their fitness criteria using an evolutionary algorithm. The selection of "network individuals" based on fitness drives the evolution of the population, that is, the progressive improvement of the performance of each network, until optimal performance is reached, which corresponds to the best division of the global dataset into subsets.

The evolutionary algorithm mastering this process, named “genetic doping algorithm” (GenD for short), was created at Semeion Research Center (Buscema 2004). GenD has similar characteristics to a genetic algorithm, but (1) the criteria of evolution and the mathematics of the crossover are completely new and different from classical models; (2) a species-health-aware evolutionary law and genetic operators are used; and (3) the individuals are organized into a structure (Buscema 2004).

In T&T systems, the solution space of GenD is constituted by all the possible partitions of records between the training and testing sets.

Given a dataset D of N records, the number of possible subsets d composed of K records is given by the binomial coefficient

C(N, K) = N! / (K! (N − K)!)

However, the possible useful partitions of records between training and testing sets are limited to those with K ≈ rN, so that

C(N, rN) < 2^N

with r typically having a value between 0.4 and 0.5.
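The gap between the useful partitions and the full solution space can be checked numerically. The concrete N below is an arbitrary illustration, not a value from the text:

```python
from math import comb

N = 20                                   # records in a (toy) dataset
total = 2 ** N                           # every possible train/test assignment
# count only the partitions whose training fraction r lies in the 0.4-0.5 range
useful = sum(comb(N, K) for K in range(int(0.4 * N), int(0.5 * N) + 1))
print(useful, total)                     # useful partitions are a fraction of 2**N
```

Even for a tiny dataset the restricted count is well below 2^N, and the ratio shrinks further as N grows, which is why an evolutionary search over this space is practical.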

The evolutionary algorithm codes those partitions according to a two-symbol alphabet:

Φ_T&T = {λ_tr, λ_ts}

where λ_tr represents a record belonging to the training set d[tr] and λ_ts represents a record belonging to the testing set d[ts].

Therefore, a pair of training and testing sets (D[tr], D[ts]) represents a possible solution in the solution space, given by the vector

x = (D[tr], D[ts]) = [x1, x2, …, xN] ∈ Φ_T&T^N, with xi ∈ Φ_T&T
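A toy rendering of this genome in code (the symbol names and the five-record example are illustrative, not from the text):

```python
# One symbol per record of D, as in the alphabet above: 'tr' or 'ts'.
genome = ['tr', 'ts', 'tr', 'tr', 'ts']

def decode(genome):
    """Return the index sets D[tr] and D[ts] coded by a genome."""
    d_tr = [i for i, g in enumerate(genome) if g == 'tr']
    d_ts = [i for i, g in enumerate(genome) if g == 'ts']
    return d_tr, d_ts

d_tr, d_ts = decode(genome)
print(d_tr, d_ts)  # -> [0, 2, 3] [1, 4]
```

Each genome is thus a complete, unambiguous description of one train/test partition, which is what makes crossover and mutation over genomes equivalent to a search over partitions.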

The elaboration of T&T is articulated in two phases:

1. Preliminary phase: the parameters of the fitness function that will be used on the global dataset are evaluated. During this phase, an inductor I_{D[tr], A, F, Z}(·) is configured, which consists of an artificial neural network with a standard back-propagation algorithm (A). For this inductor, the optimal configuration to reach convergence is established at the end of different training trials on the global dataset D; in this way, the configuration that best suits the available dataset is determined: the number of layers and hidden units and some possible generalizations of the standard learning law. The parameters thus determined, which define the configuration (F) and the initialization (Z) of the population's individual networks, will then stay fixed in the following computational phase.

Basically, during this preliminary phase, there is a fine-tuning of the ANN that defines the fitness values of the population's individuals during evolution. Additionally, a number of epochs E0, necessary to give an adequate evaluation of the fitness of the individuals, is fixed. The selection of the individuals is carried out on the basis of the fitness value, defined according to a cost function deemed useful for the optimal interpolation of the set.

2. Computational phase: the system extracts from the global dataset the best training and testing sets. During this phase, each individual network of the population runs according to the established configuration and the fixed initialization parameters. From the evolution of the population, managed by the GenD algorithm, the best distribution of the global dataset D into two subsets is generated, starting from the initial population of possible solutions x = (D[tr], D[ts]). For each GenD epoch, each individual of the population is trained on its training set D[tr] for E0 epochs and is tested on the corresponding testing set D[ts]. For the evolutionary system, the following options are fixed: only one tribe and the two global genetic operators of crossover and mutation. This allows the algorithm to converge on the desired evolution in minimum time.
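The computational phase can be sketched as follows. This is only an analogy: a plain genetic algorithm stands in for GenD (whose evolutionary law and operators differ, as noted above), a one-dimensional nearest-centroid classifier stands in for the back-propagation ANN, and the toy data and all names are assumptions for illustration.

```python
import random

rng = random.Random(0)

# Toy 1-D dataset: two classes centred at 0 and 1 (illustrative only).
data = [([rng.gauss(c, 0.5)], c) for c in (0, 1) for _ in range(30)]

def fitness(genome):
    """Train a nearest-centroid stand-in inductor on the records coded 'tr'
    and return its blind accuracy on the records coded 'ts'."""
    tr = [i for i, g in enumerate(genome) if g == 'tr']
    ts = [i for i, g in enumerate(genome) if g == 'ts']
    if not tr or not ts:
        return 0.0
    cent = {}
    for c in (0, 1):
        vals = [data[i][0][0] for i in tr if data[i][1] == c]
        if not vals:                  # a class missing from the training half
            return 0.0
        cent[c] = sum(vals) / len(vals)
    hits = sum(min(cent, key=lambda k: abs(cent[k] - data[i][0][0])) == data[i][1]
               for i in ts)
    return hits / len(ts)

def evolve(pop_size=20, epochs=30, p_mut=0.02):
    """Truncation selection + one-point crossover + flip mutation."""
    n = len(data)
    pop = [[rng.choice(['tr', 'ts']) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(epochs):
        pop.sort(key=fitness, reverse=True)
        parents, children = pop[:pop_size // 2], []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            children.append([rng.choice(['tr', 'ts']) if rng.random() < p_mut else g
                             for g in a[:cut] + b[cut:]])
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(round(fitness(best), 2))
```

Each generation trains every individual's inductor on its own d[tr] and scores it blind on its own d[ts], so selection pressure acts directly on the quality of the partition, mirroring the T&T loop described above.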

8.2.2 T&Tr (Training and Testing Reverse) Algorithm

The T&T algorithm can be enhanced by introducing a "reverse" procedure in order to obtain a more reliable measure of performance on the global dataset when the representativeness of the dataset is not completely satisfactory.

In the T&Tr evolutionary system, every individual of the population is composed of a pair of ANNs. Each pair represents a distribution model of the global dataset D in two subsets: d[tr] (training set) and d[ts] (testing set). For each pair, the first ANN is trained on the subsample d[tr], and it is validated blind on the subsample d[ts]. For the second ANN, completely independent from the first, the subset d[ts] is used as a training set, and the subset d[tr] is used as a testing set (see Fig. 8.2).

The average value of the performance reached by the two ANNs during the testing phase is the fitness of the individual. T&Tr optimizes the procedure that splits the global set into training and testing subsets such that

f1(d[tr]) ≅ f2(d[ts]) ≅ f0(d[global])

where f1(d[tr]) and f2(d[ts]) are the probability density functions of the training and testing subsets, respectively, and f0(d[global]) is the probability density function of the global dataset.

The goal of such optimization is to achieve the best performance with a single ANN trained on one of these subsets and tested on the other. In T&Tr, performance overestimation is avoided by training the inductor on the testing set D[ts] for E0 epochs and testing it on the corresponding training set D[tr] (the "reverse" procedure), that is, by exchanging the subsets in the training and testing phases of the pair's second ANN. For each individual in the population, we obtain a different model of data distribution which, at every generation, is combined by the GenD evolutionary algorithm according to the fitness criterion. In this way, the best distribution of the overall dataset into training and testing subsets is reached after a finite number of generations.

Fig. 8.2 The T&Tr algorithm: every individual of the population is composed of a pair of ANNs, each representing a distribution model of the global dataset D in two subsets, d[tr] (training) and d[ts] (testing). The first network of each pair is trained on subset d[tr], and it is blind validated on the subset d[ts]; the second network, instead, is trained on subset d[ts] and tested on the subset d[tr]
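The pair fitness can be sketched as follows. The names are assumptions, and a one-dimensional nearest-centroid classifier again stands in for each ANN of the pair; what the sketch preserves is the "reverse" idea, averaging the direct and the exchanged-subsets testing performance.

```python
import random

rng = random.Random(3)
# Toy 1-D dataset: 25 records of class 0 followed by 25 of class 1.
data = [([rng.gauss(c, 0.5)], c) for c in (0, 1) for _ in range(25)]

def centroid_accuracy(train_idx, test_idx):
    """Accuracy on test_idx of a nearest-centroid inductor fit on train_idx."""
    cent = {}
    for c in (0, 1):
        vals = [data[i][0][0] for i in train_idx if data[i][1] == c]
        cent[c] = sum(vals) / len(vals)
    hits = sum(min(cent, key=lambda k: abs(cent[k] - data[i][0][0])) == data[i][1]
               for i in test_idx)
    return hits / len(test_idx)

def t_and_tr_fitness(genome):
    """Fitness of a T&Tr individual: mean of direct and reverse test accuracy."""
    d_tr = [i for i, g in enumerate(genome) if g == 'tr']
    d_ts = [i for i, g in enumerate(genome) if g == 'ts']
    direct = centroid_accuracy(d_tr, d_ts)    # first ANN of the pair
    reverse = centroid_accuracy(d_ts, d_tr)   # second ANN, subsets exchanged
    return (direct + reverse) / 2

# Example split: alternate records between d[tr] and d[ts].
genome = ['tr' if i % 2 == 0 else 'ts' for i in range(len(data))]
print(round(t_and_tr_fitness(genome), 2))
```

Because a split can only score well if each half supports good generalization to the other, this fitness penalizes partitions whose two subsets have mismatched distributions, pushing the evolution toward f1 ≅ f2 ≅ f0.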

8.2.3 Input Selection

Input selection (IS) is an adaptive system based on the evolutionary algorithm GenD that evaluates the relevance of the different variables of the available dataset in an intelligent way; it can therefore be considered a feature selection technique.

For a pair of training and testing subsets evaluated by the inductor in a classification/prediction problem, IS determines which variables are relevant for the problem at hand; the inductor is trained on this pool of variables, and the variation in its performance is used as feedback. It is reasonable to assume that, if the selection of the input variables influences the performance of the inductor, the goodness of the results obtained in the classification/prediction problem depends mainly on the relevance of the selected variables.

From a formal point of view, IS is an artificial organism based on the GenD algorithm and consists of a population of ANNs, in which each ANN carries out a selection of independent variables from the available database.

In a specific problem domain, the solution space is determined by the 2^M possible combinations of the M variables that describe the data; excluding the empty combination, the acceptable solution space contains 2^M − 1 elements.

Given the following two-symbol alphabet

Φ_IS = {λ_rel, λ_irrel}

in which λ_rel represents membership of a variable in the set V_rel of relevant variables and λ_irrel represents membership of a variable in the set V_irrel of irrelevant variables, the vector

x = (V_rel, V_irrel) = [x1, x2, …, xM] ∈ Φ_IS^M, with xi ∈ Φ_IS

represents a single possible solution, given by a pair of sets of relevant and irrelevant variables.

The elaboration of IS, as for T&T, is developed in two phases:

1. Preliminary phase: during this phase, an inductor I_{D[tr], A, F, Z}(·) is configured to evaluate the parameters of the fitness function. This inductor is a standard back-propagation ANN. The configuration of the parameters and the initialization of the ANNs are carried out with particular care to avoid the overfitting problems that can surface when the database is characterized by a high number of variables describing a low quantity of data. The number of epochs E0 necessary to train the inductor is determined through preliminary experimental tests.

2. Computational phase: the inductor runs according to the established configuration and the fixed initialization parameters to extract the most relevant variables of the training and testing subsets. Each individual network of the population is trained on the training set D′[tr] and tested on the testing set D′[ts]. The evolution of the population is based on the GenD algorithm and leads to the selection of the best combination of input variables, that is, the combination that produces the best performance (maximum accuracy) of the inductor I_{D′[tr], A, F, Z}(·) in the testing phase with the least number of input variables:
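The IS loop above can be sketched in the same spirit as the earlier T&T sketch, with the genome now ranging over variables rather than records. A plain genetic algorithm again stands in for GenD, a nearest-centroid classifier for the back-propagation inductor, and the toy data (two informative variables plus pure noise) is an assumption for illustration.

```python
import random

rng = random.Random(0)
M = 6  # total variables; by construction only the first two are informative

def make_row(c):
    """Two class-dependent variables followed by M-2 noise variables."""
    return ([c + rng.gauss(0, 0.4), c + rng.gauss(0, 0.4)]
            + [rng.gauss(0, 1) for _ in range(M - 2)])

data = [(make_row(c), c) for c in (0, 1) for _ in range(40)]
tr = list(range(0, 80, 2))   # fixed training/testing split for the inductor
ts = list(range(1, 80, 2))

def accuracy(mask):
    """Fitness of an IS individual: testing accuracy of a nearest-centroid
    inductor restricted to the variables flagged 'rel' in the mask."""
    sel = [j for j, g in enumerate(mask) if g == 'rel']
    if not sel:
        return 0.0
    cent = {}
    for c in (0, 1):
        rows = [[data[i][0][j] for j in sel] for i in tr if data[i][1] == c]
        cent[c] = [sum(col) / len(col) for col in zip(*rows)]
    hits = 0
    for i in ts:
        vec = [data[i][0][j] for j in sel]
        pred = min(cent, key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(cent[c], vec)))
        hits += pred == data[i][1]
    return hits / len(ts)

def evolve(pop_size=16, epochs=25, p_mut=0.05):
    """Truncation selection + one-point crossover + flip mutation over masks."""
    pop = [[rng.choice(['rel', 'irrel']) for _ in range(M)]
           for _ in range(pop_size)]
    for _ in range(epochs):
        pop.sort(key=accuracy, reverse=True)
        parents, children = pop[:pop_size // 2], []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, M)
            children.append([rng.choice(['rel', 'irrel'])
                             if rng.random() < p_mut else g
                             for g in a[:cut] + b[cut:]])
        pop = parents + children
    return max(pop, key=accuracy)

best = evolve()
print(best, round(accuracy(best), 2))
```

Because the fitness rewards accuracy computed only over the flagged variables, masks that retain the informative variables and drop the noise tend to dominate the population, which is the wrapper behaviour that IS implements with GenD and ANN inductors.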