4.4.3 Phase 1: Temporal processing

During the temporal processing phase, ELSA transforms the input time series to each of the three response networks in order to extract useful temporal properties. Figure 4.6 lists four steps in this procedure: (a) wavelet transform, (b) interaction transform, (c) liquid state machine, and (d) eigenliquid factorization. Note that these four steps do not include the preprocessing of the data that would normally take place during or after raw data collection. It is highly recommended to preprocess and scale the raw time series before submitting them to an analysis in ELSA. Typically this involves centering and scaling each time series so that it has mean 0 and standard deviation 1. For physiological and motor data it also involves standard filtering operations such as bandpass filtering and detrending.

The set of appraisal time series should be excluded from preprocessing, however, and must remain a binary (or ternary) design matrix representing time-varying manipulations.
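To make the recommended preprocessing concrete, the following is a minimal Python sketch, assuming the continuous channels are available as NumPy arrays and that SciPy is used for filtering; the function name, cutoff frequencies, and sampling rate are purely illustrative and not part of ELSA. Note how the appraisal series are left untouched as a binary design matrix.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, detrend

def preprocess_channel(x, fs, low_hz=0.05, high_hz=5.0):
    """Illustrative preprocessing of one continuous channel:
    zero-phase bandpass filter, detrend, then scale to mean 0 / SD 1."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x)            # zero-phase bandpass filtering
    x = detrend(x)                     # remove any remaining linear trend
    return (x - x.mean()) / x.std()    # centre and scale

fs = 50.0                              # assumed sampling rate in Hz
skin_conductance = preprocess_channel(np.random.randn(1000), fs)

# Appraisal series, by contrast, remain a binary design matrix coding the
# time-varying manipulations and are not filtered or scaled.
appraisal = np.zeros((1000, 2), dtype=int)
appraisal[200:400, 0] = 1              # manipulation A active in this window
appraisal[500:700, 1] = 1              # manipulation B active in this window
```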

Wavelet transform. The first temporal processing step of ELSA consists of a wavelet transform, namely the maximal overlap discrete wavelet transform (MODWT; Fowler, 2005; Percival & Walden, 2000). An important problem in signal processing is detecting change in a time series (e.g., due to a timed manipulation). A wavelet transform facilitates this by decomposing a time series at different temporal resolutions, which can separate, for example, noise components (with a high frequency) from effect components (with a lower frequency). For a selected range of resolution levels i to j, the MODWT applies a wavelet filter to produce a set of wavelet functions and a set of scaling functions. The algorithm is iterative: at each resolution level, a new wavelet function is computed from the scaling function of the preceding level. The scaling functions thus represent a kind of residual, the part of the raw signal that has not been accounted for in the wavelet decomposition of the preceding levels. In practice, this means we only ever need to retain the scaling function of the highest level, j, since all other information is encoded by the wavelet functions of the preceding levels.

Figure 4.8. Example of a maximal overlap discrete wavelet transform at 6 levels. Left panel: raw input signal. Right panel: wavelet-decomposed signals.

Figure 4.8 shows an example of a MODWT over 6 levels. The wavelet decomposition (right side) illustrates how different resolution levels encode different frequencies of the raw signal (left side) and where they occur. The first wavelet level most clearly isolates the noise frequency of the raw signal, while the fifth wavelet level most clearly isolates the peak frequency. Beyond level five the resolution is already too wide to convey useful information. Finally, as discussed, only one scaling function (right side, bottom series) is retained, which codes for the residual information not captured by the wavelet levels. In ELSA, the appraisal time series are excluded from wavelet transformation, as these represent categorical experimental manipulations. Time series belonging to the other components can be wavelet transformed. This step is optional but usually improves the predictive fit of the model substantially.

Applying a wavelet transform requires a choice of which range of wavelet levels to retain, which can be regarded as a tuning parameter of an ELSA model.
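As an illustration of this step, the sketch below uses the stationary (undecimated) wavelet transform from the PyWavelets package, which is closely related to the MODWT and serves here only as a stand-in; the wavelet family ("db4"), the toy signal, and the variable names are assumptions for the example, not choices prescribed by ELSA.

```python
import numpy as np
import pywt  # PyWavelets

# Toy signal: a slow oscillation plus high-frequency noise, with a length
# that is a multiple of 2**level, as required by pywt.swt.
n, level = 1024, 6
t = np.arange(n)
signal = np.sin(2 * np.pi * t / 128) + 0.3 * np.random.randn(n)

# Undecimated transform over 6 levels; coefficients are returned from the
# coarsest level down to level 1: [(cA6, cD6), ..., (cA1, cD1)].
coeffs = pywt.swt(signal, wavelet="db4", level=level)

# The detail series cD1..cD6 play the role of the wavelet functions; only
# the level-6 approximation (the "scaling function") is kept, since it
# carries the residual not captured by the details.
scaling_fn = coeffs[0][0]                           # cA6
wavelet_fns = [cD for (_, cD) in reversed(coeffs)]  # cD1 (finest) .. cD6

features = np.column_stack(wavelet_fns + [scaling_fn])  # (1024, 7) design block
```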

Interaction transform. Following the wavelet transform, the new input series can be expanded with interactions up to a specified order k. ELSA allows input time series to interact with each other, both within and across components, subject to one restriction: wavelet and scaling functions belonging to the same variable cannot interact (e.g., in Figure 4.8, the wavelet functions of that signal are not allowed to interact among themselves). Such interactions would not make sense conceptually, because the wavelet functions are supposed to be statistically independent. Moreover, omitting these interactions limits the total number of interaction terms in the model.
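A schematic sketch of this expansion for pairwise interactions (k = 2) is given below; the helper function and variable names are hypothetical and only illustrate the restriction that columns derived from the same variable do not interact.

```python
import numpy as np
from itertools import combinations

def interaction_transform(X, var_ids, order=2):
    """Pairwise (order-2) interaction expansion of the columns of X.

    X       : (T, p) array of (possibly wavelet-transformed) input series.
    var_ids : length-p list naming the original variable each column belongs
              to; columns derived from the same variable (e.g., its wavelet
              and scaling functions) are not allowed to interact.
    """
    assert order == 2, "this sketch only covers pairwise interactions"
    cols, names = [], []
    for i, j in combinations(range(X.shape[1]), 2):
        if var_ids[i] == var_ids[j]:
            continue  # skip within-variable wavelet/scaling pairs
        cols.append(X[:, i] * X[:, j])      # elementwise product over time
        names.append(f"{var_ids[i]}[{i}]*{var_ids[j]}[{j}]")
    return np.column_stack(cols), names

# Example: two variables, each decomposed into two columns. Only cross-
# variable products (hr x emg) are formed; hr x hr and emg x emg are excluded.
T = 200
X = np.random.randn(T, 4)
Xint, names = interaction_transform(X, ["hr", "hr", "emg", "emg"])
```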


Adding interactions to an emotion model is theoretically justified for two reasons. First, factorial manipulation designs normally imply factorial analyses (e.g., 2 × 2 ANOVA), which require adding interaction terms to the model matrix. Second, many appraisal theories assume that appraisal criteria interact to differentiate emotional responding (e.g., Lazarus, 2001; Ortony et al., 1988; Roseman, 2001; Scherer, 2001). Some of these interaction hypotheses have been supported in recent studies that used a machine learning approach (Meuleman & Scherer, 2013; Meuleman et al., submitted), but in general both the use of factorial manipulation designs and the testing of interactions remain exceptions in appraisal research (Meuleman et al., submitted).

Liquid state machine. In the first two temporal processing steps, ELSA applies a wavelet transform and an interaction transform to the input time series. The third and most important temporal processing step consists of the liquid state machine (LSM), which is the principal processing unit of ELSA and the module that allows the learning of variable delays, nonlinear effects, and transient attractor states. The LSM, also known as an echo state network, was developed independently by Jaeger (2001) and Maass, Natschläger & Markram (2002), and belongs to a domain known as reservoir computing. A liquid state machine decomposes an input time series inside a massive, randomly connected recurrent neural network, and then uses this decomposition for further modelling rather than the original inputs. The large recurrent network is called a reservoir or liquid. In this liquid, the original time series are transformed into nonlinear memory states. Because a recurrent network has feedback connections, it constitutes a dynamic system. Thus, this module implements the dynamic systems view of the CPM.

Figure 4.9. The liquid state metaphor. Objects dropped into a basin of water leave a transient imprint (ripples). The patterning of these ripples can be studied to infer aspects of the original objects, as well as their history.

The liquid state machine owes its name to a water metaphor that best illustrates how it operates. Figure 4.9 depicts a basin of water into which different types of objects are dropped. Dropping these objects causes disturbances, or ripples, on the surface that fade out with time. Each ripple can be considered a transient imprint of the object that was dropped, and it encodes not only information about the original object but also about its history. If we aimed a camera at the basin’s surface we could record all ripples, frame by frame. Studying these frames would provide information on the time since an object was dropped (via the extent of the ripple), the shape of the object (via the shape of the ripple), and so on.

A basin of water is simply a medium. It has no intelligent knowledge of stones or anvils, but it can still encode useful information about dropped objects by recording their fading imprint. Now imagine a large collection of processing units that are interconnected with feedback loops (i.e., a recurrent network). Activating one of the units will cause it to pass its activation to all connected units and, since the network has feedback loops, this activation can be sustained for a certain number of time steps until it fades to 0. An example is depicted in Figure 4.10, showing a recurrent network of nine units with random connections all set to 0.5. The initial state (state 0) consists of a maximal activation of the top three units and no activation for the remaining six units. At each step t, the activation of a unit is equal to the weighted sum of the units from which it receives input. In the absence of external input to the network, one can see how the initial activation of state 0 spreads and fades throughout this network, until the network converges to a complete null state. Like a ripple in water, the recurrent network maintained a fading imprint of the initial state. For this reason such networks are called “liquids” in the reservoir computing literature. The fading imprint of past activation is referred to as fading memory.

Figure 4.10. Fading memory in a small recurrent neural network. Following initial activation at time 0, activity cycles through the network in a series of updates. Due to the nature of the connectivity, activity in this network slowly fades to 0. In state 5, there is virtually no activity left.
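The fading-memory behaviour illustrated in Figure 4.10 can be reproduced in a few lines of Python; the sketch below does not copy the exact connectivity of the figure but builds a comparable small random network whose weights are rescaled to a spectral radius below 1, which guarantees that activity fades rather than grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 9

# Random sparse connectivity with surviving weights of 0.5, loosely mirroring
# the toy network of Figure 4.10; rescaling to a spectral radius below 1
# ensures fading rather than exploding activity.
W = 0.5 * (rng.random((n_units, n_units)) < 0.3)
rho = max(abs(np.linalg.eigvals(W)))
if rho > 0:
    W *= 0.9 / rho

# State 0: maximal activation of three units, no activation elsewhere.
x = np.zeros(n_units)
x[:3] = 1.0

# Each update sets every unit to the weighted sum of its inputs; with no
# external input, the imprint of state 0 spreads and then fades toward zero.
for t in range(1, 6):
    x = W @ x
    print(f"state {t}: total activity = {np.abs(x).sum():.4f}")
```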


In a liquid state machine (LSM), the liquid is in a certain state at every time step t. This state corresponds to a combination of (a) current input activation, (b) past input activation up until a certain delay, and (c) internal liquid activation due to feedback loops. As a result, a single state of the liquid at time t contains almost all relevant past and present input information, without the need to store data in an explicit register. The LSM exploits this property by training output units to recognize the liquid states, rather than the raw input values which only contain information on their own current state. In other words, instead of predicting, for instance, time series y from time series x, we predict y from x’s liquid state transformation.
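As a minimal illustration of this idea, and assuming a matrix of liquid states has already been computed (see the reservoir sketch further below), a standard reservoir computing readout such as ridge regression can be trained on the liquid states instead of the raw inputs; the ridge readout and the placeholder data here are generic choices for the example, not necessarily the estimator used by ELSA in the later phases.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder liquid states (T time steps x L liquid units) and a placeholder
# response series; in ELSA these would come from the liquid driven by the
# actual input time series.
T, L = 500, 100
liquid_states = np.tanh(np.random.randn(T, L))
y = np.random.randn(T)

# The readout is trained on the liquid states rather than the raw inputs:
# y is predicted from x's liquid state transformation.
readout = Ridge(alpha=1.0).fit(liquid_states, y)
y_hat = readout.predict(liquid_states)
```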

The liquid of an LSM possesses two additional properties that are essential to its computational power. The first property is that each liquid unit applies a nonlinear transformation to its raw input sum, that is, a hyperbolic tangent transformation. This means that activation in the liquid units is not just a memory of the original input time series but a nonlinear memory of that time series. This greatly increases the flexibility of the LSM to learn difficult time series associations. The second property concerns attractor learning. The LSM has only one permanent attractor state, which is the null state. When no external input arrives in the liquid, the activity of its units will gradually revert back to zero (Figure 4.10). All other attractor states in the system are transient (Jaeger, 2001). Simulation data show that the LSM is particularly adept at learning transient attractors, including multiple attractors simultaneously and chaotic attractors (Jaeger, 2001). These properties make the LSM highly suitable for operationalizing an emotion theory like the CPM. As discussed in Section 4.2.3, the CPM assumes that emotion episodes are transient attractors of a dynamic system of component subsystems, counteracted by a permanent and non-emotional baseline attractor (Figure 4.4).

Constructing a liquid network that has the properties of fading memory and a permanent attractor at the null state is a surprisingly simple task. A detailed description of the procedure can be found in the supplementary material (see also Lukoševičius, 2012, for a discussion). Here we note that a liquid network has three important tuning parameters that may affect performance on the eventual prediction task. These parameters are (a) a liquid size of L units, (b) a scaling parameter b for the input-to-liquid connections, and (c) a leaking rate parameter α that affects the length of fading memory in the LSM. The liquid size L is often set to a high number (e.g., 500 or 1000; Lukoševičius, 2012), whereas suitable values for b and α tend to be more data-dependent and may require extensive tuning.

The α parameter performs a type of exponential smoothing on the internal liquid activation that allows long-lasting influence of past states. This is especially useful for phenomena with a long delay. High values result in shorter fading memory (i.e., information leaks faster), whereas lower values result in longer fading memory. The internal weights of the liquid, which form an L × L matrix, do not need to be estimated from data. They are randomly set to fixed values, subject to certain constraints (see supplementary material). Although this may seem odd, recall that the liquid merely has to act as a medium for briefly storing the input data. Like the basin of water in Figure 4.9, there is no need to tune it for the eventual prediction task itself.
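The following sketch shows one standard way to build and run such a leaky liquid, following the common leaky echo state update discussed by Lukoševičius (2012); the default values for L, b, and α, the spectral radius target, and the function names are illustrative assumptions, since ELSA's exact construction is described in the supplementary material.

```python
import numpy as np

def make_liquid(n_inputs, L=500, b=1.0, alpha=0.3, rho=0.9, seed=1):
    """Build fixed random weights for a leaky liquid/reservoir.

    L     : liquid size (number of units)
    b     : scaling of the input-to-liquid connections
    alpha : leaking rate (higher = shorter fading memory)
    rho   : target spectral radius of the internal weights (< 1)
    """
    rng = np.random.default_rng(seed)
    W_in = b * rng.uniform(-1, 1, size=(L, n_inputs))
    W = rng.uniform(-1, 1, size=(L, L))
    W *= rho / max(abs(np.linalg.eigvals(W)))   # enforce fading memory
    return W_in, W, alpha

def run_liquid(U, W_in, W, alpha):
    """Drive the liquid with input series U (T x n_inputs) and collect the
    liquid states (T x L): a leaky, tanh-nonlinear memory of U."""
    T, L = U.shape[0], W.shape[0]
    X = np.zeros((T, L))
    x = np.zeros(L)
    for t in range(T):
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ U[t] + W @ x)
        X[t] = x
    return X

# Example: two input series, 300 time steps, a 200-unit liquid.
U = np.random.randn(300, 2)
W_in, W, alpha = make_liquid(n_inputs=2, L=200, b=0.5, alpha=0.3)
liquid_states = run_liquid(U, W_in, W, alpha)   # (300, 200)
```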


Eigenliquid factorization. The fourth and final temporal processing step consists of an eigenliquid factorization. The basic LSM computes liquid states from input time series, and uses the new liquid time series to predict a response time series. However, the raw liquid time series can number into the thousands. Much of this information can likely be cut or compressed, while still retaining the parts that are relevant for prediction. Principal component analysis (PCA) is a classic method for achieving this aim. ELSA performs PCA on the liquid and retains a specified number of factors M < L. These factors are then rotated using promax rotation with power m = 3. Rotation has the effect of “clarifying” the structure of the loadings by shrinking irrelevant loadings for each factor and magnifying the relevant ones (Jolliffe, 2002). This procedure further diminishes the influence of liquid units that contribute only noise to ELSA, as their loadings will be shrunk by the rotation.
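A sketch of this factorization step is given below, using scikit-learn for the PCA and, as an assumption, the third-party factor_analyzer package for the promax rotation; the placeholder liquid states, the chosen number of factors M, and the simple scoring at the end are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from factor_analyzer import Rotator   # assumed third-party package for promax

# Placeholder liquid states (T x L); in ELSA these come from the LSM step.
T, L, M = 300, 200, 20
liquid_states = np.tanh(np.random.randn(T, L))

# Retain M < L principal components of the liquid (the "eigenliquid").
pca = PCA(n_components=M)
eigenliquid = pca.fit_transform(liquid_states)   # (T, M) component scores
loadings = pca.components_.T                     # (L, M) loadings

# Promax rotation (power m = 3) shrinks irrelevant loadings and magnifies
# relevant ones; the Rotator class is an assumption about the
# factor_analyzer package, not part of ELSA itself.
rotated_loadings = Rotator(method="promax", power=3).fit_transform(loadings)

# Rotated factor time series can then be obtained from these loadings, e.g.
# by projecting the centred liquid states onto them (the exact scoring used
# by ELSA is described in the supplementary material).
rotated_scores = (liquid_states - liquid_states.mean(axis=0)) @ rotated_loadings
```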

Although classic PCA remains a straightforward and powerful technique for dimension reduction, different datasets may present specialized challenges, which has led to the development of a number of extended and related techniques. ELSA supports additional options for eigenliquid factorization, which are (a) spherical PCA for robust estimation of principal components, (b) kernel PCA for nonlinear principal component analysis, and (c) independent component analysis (ICA) for signal separation. Each of these methods has different strengths and weaknesses, but the final choice of factorization can be regarded as another tuning parameter of an ELSA model. Spherical PCA (Locantore et al., 1999) can be applied when the data are suspected to contain multivariate outliers or influential cases. Kernel PCA (Schölkopf, Smola & Müller, 1998) estimates nonlinear principal components on a kernel transformation of the inner product matrix of the data. ICA originated in signal processing theory and is able to perform blind source separation, that is, the unmixing of independent sources in a mixed signal. The method is well suited to time series data. An in-depth discussion of the eigenliquid factorization step and a comparison of methods is included in the supplementary material.
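The alternative factorizations can likewise be sketched with standard libraries; kernel PCA and ICA are available in scikit-learn, whereas spherical PCA has no standard implementation there and is approximated below by projecting the (median-centred) liquid states onto the unit sphere before an ordinary PCA, which is one reading of Locantore et al. (1999). All parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, FastICA

M = 20
liquid_states = np.tanh(np.random.randn(300, 200))   # placeholder liquid states

# (b) Kernel PCA: nonlinear principal components of the liquid
#     (the RBF kernel is chosen here purely for illustration).
kpca_scores = KernelPCA(n_components=M, kernel="rbf").fit_transform(liquid_states)

# (c) ICA: blind source separation of the liquid into independent signals.
ica_scores = FastICA(n_components=M, max_iter=1000).fit_transform(liquid_states)

# (a) Spherical PCA: approximated by projecting median-centred rows onto the
#     unit sphere before an ordinary PCA, which downweights multivariate
#     outliers (a crude stand-in for the Locantore et al. estimator).
centred = liquid_states - np.median(liquid_states, axis=0)
sphered = centred / (np.linalg.norm(centred, axis=1, keepdims=True) + 1e-12)
spca_scores = PCA(n_components=M).fit_transform(sphered)
```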