Advanced process data analytics

Advanced Process Data Analytics

by

Weike Sun

B.S., Tsinghua University (2015)

Submitted to the Department of Chemical Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author . . . .

Department of Chemical Engineering

May 1, 2020

Certified by. . . .

Richard D. Braatz

Edwin R. Gilliland Professor in Chemical Engineering

Thesis Supervisor

Accepted by . . . .

Patrick S. Doyle

Chairman, Department Committee on Graduate Theses

Advanced Process Data Analytics

by

Weike Sun

Submitted to the Department of Chemical Engineering on May 1, 2020, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

Process data analytics is the application of statistics and related mathematical tools to data in order to understand, develop, and improve manufacturing processes. There have been growing opportunities in process data analytics because of advances in machine learning and technologies for data collection and storage. However, challenges are encountered because of the complexities of manufacturing processes, which often require advanced analytical methods. In this thesis, two areas of application are considered. One is the construction of predictive models that are useful for process design, optimization, and control. The other area of application is process monitoring to improve process efficiency and safety.

In the first area of study, a robust and automated approach for method selection and model construction is developed for predictive modeling. Two common challenges when building data-driven process models are addressed: the high diversity in data quality and how to select from a wide variety of methods. The proposed approach combines best practices with data interrogation to facilitate consistent application and continuous improvement of tools and decision making.

The second area of study focuses on process monitoring for complex manufacturing systems, which includes fault detection, identification, and classification. Four sets of algorithms are developed to address limitations of traditional monitoring methods. The first set provides the optimal strategy for Gaussian linear processes, including deep understanding of the process monitoring structure and optimal fault detection based on a probabilistic formulation. The second set aims at building a self-learning fault detection system for changing normal operating conditions. The third set is developed based on information-theoretic learning to address limitations of second-order statistical learning for both fault detection and classification. The fourth set tackles the problem of nonlinear and dynamic process monitoring.

The proposed methodologies and algorithms are tested on several case studies where the value of advanced process data analytics is demonstrated.

Thesis Supervisor: Richard D. Braatz

Acknowledgments

It has been an extraordinary privilege to have been given the opportunity to pursue my doctoral degree at MIT and work on research that I find both interesting and consequential. I have had the opportunity to work with incredible mentors and collaborators without whom my Ph.D. would have been a very different experience.

I would like to first thank my advisor, Richard D. Braatz, for giving me the research opportunity and being an exceptional mentor throughout the years. His generosity with his time, energy, and intellect in advising me was boundless. He gave me critical feedback on my research and asked questions that I would have never thought to consider. He also provided me with candid advice on my career development, and the list of things that I learned from him is too long to articulate here. In addition, I would like to express my gratitude to my thesis committee members, George Stephanopoulos and James Swan, for valuable conversations and contributions to my research. I would also like to thank ExxonMobil and the U.S. Food and Drug Administration for funding. I am appreciative of my fellow students and all of my lab mates for the help and company. In particular, I would like to thank Zhe Yuan, Zhenshu Wang, and Ruihao Zhu for helping me with my coursework as well as being great friends outside of class; Kristen Severson and Wit Chaiwatanodom for providing help with my research; Matthias VA for being an excellent collaborator during my teaching experience; Ana Nikolakopoulou and Amos Lu for the help with the cluster; and all of my teammates at practice school for the company, conversation, and hard work.

I was fortunate to have internship experiences in several companies across different industries. I am indebted to my colleagues and mentors for the support, guidance, and encouragement they gave me during my time at Takeda, General Mills, and ExxonMobil. I am particularly appreciative of Peng Xu and Antonio Paiva at ExxonMobil for helping me sharpen both my technical and communication skills. I also thank Arash Fathi for the encouragement and for making my day-to-day work enjoyable.

Finally, I would like to thank my family and friends for their tremendous amount of support over the past five years.

List of Figures

2-1 Illustration of a predictive modeling problem.

2-2 Plot of the contours of the non-regularized error function (blue) along with the constraint region (red) for (a) LASSO $q = 1$, (b) ridge regression $q = 2$, (c) elastic net, for a prediction model with two predictors. The figure is based on [1].

2-3 Illustration of the PLS matrix decomposition. The loading matrices are calculated to maximize the covariance between the score matrices.

2-4 Illustration of the SVR loss function. Only points outside the shaded region contribute to the cost function. The left figure shows a linear SVR model, and the right figure shows a nonlinear SVR model with kernel function.

2-5 Illustration of a single regression tree. Each internal node tests a particular predictor $x_i$ and each leaf assigns an output value.

3-1 Schematic of the ALVEN algorithm. The three steps in the ALVEN algorithm are (1) nonlinear basis expansion, (2) feature pre-screening via univariate test, and (3) sparse regression via the EN.

3-2 3D printer data visualization via single-variable histograms and scatter plots: LH is the layer height, WT is the wall thickness, ID is the infill density, NT is the nozzle temperature, BT is the bed temperature, PS is the print speed, M is the material, FS is the fan speed, and TS is the tension strength.

3-3 Testing error distributions via nested cross-validation for the 3D printer dataset by three different nonlinear modeling methods.

3-4 Model prediction results for ALVEN for a 3D printer.

3-5 Residual analysis for the ALVEN model of the 3D printer.

3-6 Model coefficients for the final retained terms in the ALVEN model for a 3D printer.

3-7 CSTR data visualization as single-variable histograms and scatter plots: Q is the coolant flow rate and CA is the concentration of A.

3-8 Training MSE for the CSTR data over 10 prediction steps.

3-9 Testing MSE for the CSTR data over 10 prediction steps.

3-10 CSTR 1-step ahead prediction result for the testing data by DALVEN-full (AIC).

3-11 Model coefficients for the final retained terms in the DALVEN-full (AIC) model of the CSTR.

4-1 Systematic data interrogation procedure for predictive modeling.

4-2 The data analytics triangle for predictive modeling with a single response variable. The modeling techniques are mapped to three major model regression characteristics. ALVEN, algebraic learning via elastic net; CVA, canonical variate analysis; DALVEN, dynamic ALVEN; MOESP, multivariable output error state space; PLS, partial least squares; RF, random forest; RNN, recurrent neural network; RR, ridge regression; SVR, support vector regression. References to the methods are given in the text.

4-3 Testing performance of four types of models. The curve shows the median of testing MSEs over 3000 repetitions for each noise level: The unbiased model is quartic. The biased model is quadratic. The “cv” refers to selecting the order from 1 to 10 in a polynomial based on 3-fold cross-validation. The “cv limited order” refers to selecting the order from 2 and 4 in a polynomial based on 3-fold cross-validation.

4-4 Testing prediction error variation for the four types of models. Each curve is the median of the standard deviation of testing MSEs over 3000 repetitions for each noise level. Plot (b) is a zoomed version of plot (a) to resolve the performance difference between the unbiased model, biased model, and cross-validation with limited order.

4-5 Median MSE for the four methods for two cases in Type I: Case 2 and Case 5 with high noise level.

4-6 Median MSE for the four methods with different noise levels of Case 1 in Type I.

4-7 Median MSE for the four methods for Case 8 in Type II.

4-8 Median MSE for the four methods for Case 1 and its sparse version Case 5 in Type I.

4-9 RNN with feedback and one state layer for predictive modeling.

4-10 Multi-step (3-step in this example) prediction by RNN with feedback correction and one state layer.

4-11 Illustration of $k$-fold cross-validation ($k = 5$).

4-12 Illustration of Monte Carlo-based cross-validation with 5 repetitions.

4-13 Illustration of leave-one-group-out cross-validation.

4-14 Illustration of nested cross-validation, using $k$-fold cross-validation as an example. The conventional cross-validation will just select one held-out dataset as a test dataset, whereas nested cross-validation repeats the procedure using the outer loop.

4-15 Illustration of the one-standard-error rule. The conventional cross-validation will select 5 as the number of latent variables while the one-standard-error rule will pick 3 as the final latent variables.

4-16 Illustration of time-series cross-validation with 3 folds.

4-17 Fiber data visualization: single-variable histograms and scatter plots.

4-18 Nonlinearity test results for the fiber data.

4-19 Comparison of raw fiber data with predictions from the OLS model.

4-20 Residual analysis for the OLS model of the fiber data.

4-21 Example ATR-FTIR spectra for KDP solution: (left) raw spectra, (right) identified blue regions with spectral artifacts that should be removed in pre-processing.

4-22 Absorbance at wavenumber 1074.17 cm$^{-1}$ of KDP solutions at different temperatures (label corresponds to KDP concentration). This frequency corresponds to the peak absorbance, which makes it easier to evaluate trends due to having the highest signal-to-noise ratio.

4-23 Sparse regression results for spectral data for (left) SPLS and (right) EN for the original dataset: the blue line is the average spectra over 116 samples while the red line shows the final model parameters for the absorbances.

4-24 PLS model prediction results by (left) PLS with standard cross-validation strategy and (right) SPA.

4-25 Scatterplot of CCPP data using 200 training data points. The breadth of scatter in the predictor variables indicates that the data reasonably cover the input space.

4-26 Nonlinear test results for CCPP data. One of the values is much greater than 1.

4-27 Multicollinearity test results for CCPP data.

4-28 CCPP model predictions vs. testing data for the SVR model using 200 training data points.

4-29 CCPP residual analysis for testing data.

4-30 Water evaporator data for the three input variables (FF, VF, and CF) and one output variable (DM).

4-31 Water evaporator data scatter plot.

4-32 Nonlinearity test for water evaporator data.

4-33 Multicollinearity test for water evaporator data.

4-34 Comparison of model predictions and raw testing data for a static linear model of a water evaporator.

4-35 Residual analysis for the testing data for a static linear model of a water evaporator.

4-36 Autocorrelation analysis for the residuals for the testing data for a static linear model of a water evaporator.

4-37 Dynamic nonlinearity tests for the water evaporator data.

4-38 Testing prediction MSE for water evaporator data for seven data analytics methods for 0 to 10 prediction steps.

4-39 Comparison of predicted values with testing data for an SSARX model for a water evaporator.

4-40 Water evaporator SSARX model residual analysis for testing data.

4-41 Water evaporator SSARX model residual dynamic analysis for testing data.

4-42 Type I case testing MSE distributions for RR, PLS, EN, SPLS, and OLS with all predictors/only useful predictors.

4-43 Type I case: testing MSE medians with 95% confidence interval for RR, PLS, EN, and SPLS.

4-44 Type II case: testing MSE distributions for RR/PLS/EN/SPLS and RR/PLS with only useful predictors.

4-45 Type II case: testing MSE medians with 95% confidence interval for RR, PLS, EN, and SPLS.

4-46 Case 1 data for linear dynamic model comparison.

4-47 $k$-step ahead prediction MSE results for Case 1 testing set by different methods with 400 training samples. The label ‘auto’ indicates the option for automatic algorithm selection in N4SID in Matlab, which selects the weights in the subspace algorithm from MOESP, CVA, and SSARX.

4-48 $k$-step ahead prediction MSE results for Case 1 testing set by different methods with 300 training samples.

4-49 Case 2 data for linear dynamic model comparison.

4-50 $k$-step ahead prediction MSE results for Case 2 testing set by different methods with 400 training samples.

4-51 $k$-step ahead prediction MSE results for Case 2 testing set by different methods with 300 training samples.

4-52 Case 3 data for linear dynamic model comparison.

4-53 $k$-step ahead prediction MSE results for Case 3 testing set by different methods.

5-1 Process monitoring loop [2].

5-2 Illustration of PCA decomposition. The original data matrix $X$ is decomposed into the principal subspace and the residual space.

5-3 Illustration of fault detection indices for the specific case of three variables with one principal component. The projection space is a one-dimensional space depicted as the dashed line in the figure. The $T^2$ statistic measures the variation of the data projection onto that line, and defines a specific region by two planes in the original space. Any data points that fall between those two planes are considered as being normal by the $T^2$ statistic. The control limit for the distance of the data points to the projection space is specified by the $Q$ statistic. In this example, $\sqrt{Q}$ is the shortest distance between any point to the dashed line, which in this case denotes a cylinder of the in-control region. More generally, the shape of the NOC defined by the combination of the $T^2$ and $Q$ statistics is the intersection of the interior of a hyperellipsoid in one subspace and the interior of a hyperellipsoid in the complementary subspace.

5-4 A process flowsheet for the TEP with the second control structure in [3].

6-1 Different strategies for fault detection and classification without unknown classes.

6-2 Different layouts for classes in Example 1.

6-3 Region distributions for layout 1 in Example 1. Boundaries between different regions are shown in the lower part of the corresponding figures.

6-4 Region distributions for layout 2 in Example 1 with different significance levels. Boundaries between different regions are shown in the lower part of the corresponding figures.

6-5 Different layouts for classes in Example 2. For layout 1: $\mu_0 = [0, 0]^\top$, $\mu_1 = [3, 3]^\top$, $\mu_2 = [12, 6]^\top$; for layout 2: $\mu_0 = [0, 0]^\top$, $\mu_1 = [-3, -3]^\top$, $\mu_2 = [12, 6]^\top$.

6-6 Illustration of fault detection and classification boundaries in Example 2 layout 1.

6-7 Region distributions for layout 1 in Example 2 with different significance levels.

6-8 Region distributions for layout 2 in Example 2 with different significance levels.

6-9 Different layouts for classes in Example 3. Three classes have different means: $\mu_0 = [0, 0]^\top$, $\mu_1 = [-2, 1]^\top$, $\mu_2 = [6, 6]^\top$.

6-10 Region distributions for Example 3 with different significance levels.

6-11 Nine different strategies for situations with data from unknown faults.

6-12 Illustration of possible active boundaries for Example 1 with nine different strategies.

6-13 Region distributions by nine different strategies for Example 1, layout 1. Case $i$-$j$ represents strategy $i$ with method $j$ for outlier detection.

6-14 Region distributions by nine different strategies for Example 1, layout 2. Case $i$-$j$ represents strategy $i$ with method $j$ for outlier detection.

6-15 Region distributions by nine different strategies for Example 2. Case $i$-$j$ represents strategy $i$ with method $j$ for outlier detection.

6-16 Region distributions by nine different strategies for Example 3. Case $i$-$j$ represents strategy $i$ with method $j$ for outlier detection.

7-1 Monitoring boundaries by (a) PCA and (b) PPCA for a two-dimensional space.

8-1 Flowchart of the PQDM fault detection.

8-2 PQDM-based monitoring results for a step change in the reactor cooling water inlet temperature.

8-3 CVA-based monitoring results for a step change in the reactor cooling water inlet temperature.

8-4 PQDM-based monitoring results for a random variation in the reactor cooling water inlet temperature.

8-5 CVA-based monitoring results for a random variation in the reactor

8-6 Adaptive fault detection flowchart.

8-7 Fault detection by PCA-based methods for NOC testing data.

8-8 Fault detection by PLS-based methods for NOC testing data.

9-1 Fault detection problem.

9-2 Shannon’s information theory problem [4,5].

9-3 Information-theoretic formulation of fault detection.

9-4 Squared residuals for fault detection methods with optimal transformation for a single simulation: left with PCA and right with CVA.

9-5 Squared residuals for fault detection methods with random transformation for a single simulation: left with PCA and right with CVA.

9-6 The FDR and MI for PCA and CVA averaged over 1000 simulations.

10-1 The monitoring statistics for Fault 2.

10-2 The monitoring statistics for Fault 10.

10-3 FDRs for 20 predefined faults by MEA and PCA.

11-1 FDRs at different orders for Fault 9.

11-2 Fault detection statistics for Fault 9 at $k = 18$ (left is MIA and right is CVA).

11-3 FDRs at different orders for Fault 3.

11-4 Fault detection statistics for Fault 3 at $k = 17$ (left is MIA and right is CVA).

11-5 FDRs for 20 predefined faults by MIA and CVA ($k = 18$).

12-1 The linear transformation of Faults 3, 4, and 11 onto the first two loading vectors by (a) PCA, (b) DPLS, (c) FDA, and (d) MJDA.

12-2 The classification results for Faults 3, 4, and 11 obtained by (a) PCA, (b) DPLS, (c) FDA, and (d) MJDA. The left axis labels the fault that occurs starting at time $t = 0$, and the legend at the right gives the color of the fault indicated by the fault classification method.

12-3 Training data from Fault 1 (blue) and 2 (red). The first loading vectors by PCA, FDA, and MJDA are also shown on the figure.

13-1 Three general ways of building nonlinear model: (a) original nonlinear relationship; (b) nonlinear feature transformation; (c) nonlinear approximation; (d) feature transformation and nonlinear approximation.

13-2 A typical structure of a feedforward NN.

13-3 The ANN structure.

13-4 Principal curve and NN based NPCA structure.

13-5 ITN structure.

13-6 GP-based NPCA algorithm (note: the figure is based on [6]).

13-7 Two-dimensional manifold embedded in a three-dimensional space: (a) linear subspace; (b) and (c) nonlinear submanifold (image adapted from [7]).

13-8 Example of the quadratic test unable to detect nonlinearity: (a) sinusoidal data, (b) $p$-values from quadratic tests, and (c) maximal correlations.

13-9 Example of the maximal correlation analysis unable to detect nonlinearity: (a) nonlinear manifold by the bilinear term, (b) linear correlations, (c) $p$-values from quadratic tests, and (d) maximal correlations.

13-10 NOC data and faulty data of the simple linear example.

13-11 Linearity assessment for the simple linear example: (a) linear correlation matrix of the NOC data; (b) scatter plot of the NOC data.

13-12 Nonlinearity assessment results for the simple linear example: (a) $p$-values from quadratic tests; (b) maximal correlations.

13-13 Parallel analysis results of (a) PCA; (b) KPCA-RBF; (c) KPCA-poly; and (d) ANN for the simple linear example.

13-14 Monitoring charts of Fault 1 for the simple linear example.

13-15 Monitoring charts of Fault 2 for the simple linear example.

13-16 NOC data and faulty data of the simple nonlinear example.

14-1 A simple RNN structure with one recurrent layer and showing the unfolding in time of the sequence of its forward computation. The RNN includes the input variable $x_t$, state variable $s_t$ and outputs $\hat{y}_t$. The state variable $s_t$ is calculated based on the previous state $s_{t-1}$ and the current input $x_t$. The RNN output $\hat{y}_t$ is then calculated based on the current state. In this way, the input sequence $x_t$ is mapped to output sequence $\hat{y}_t$ with each $\hat{y}_t$ depending on all previous inputs. The model parameters $\omega = \{W_s, U_s, W_s, b_s, b_y\}$ are shared at each time step.

14-2 Illustration of the variational dropout technique (right) compared to standard dropout technique (left) for a simple RNN. Each graph shows units unfolded over time, with the lower level for inputs, middle level for state units, and upper level for output units. Vertical arrows represent the connections from inputs to outputs while horizontal arrows represent recurrent connections. The arrows with dashed grey lines represent the standard connection without dropout. Colored lines represent dropout connections with different colors for different dropout masks. (Left) In the standard dropout technique, no dropout is applied for the recurrent layers, while other connections have different dropout masks at different time steps. (Right) For the variational dropout approach proposed in [8], dropout is applied to the input, recurrent, and output layers with the same dropout mask at different time steps. Variational dropout is applied during both training and testing.

14-3 General procedure for process monitoring system development (left) versus procedure for developing BRNN-based FDI system (right). The general framework to establish a monitoring system begins with a model to characterize NOC behavior, such as using the BRNN model to learn the NOC pattern from the training data. Then, the method to measure the deviation of a particular observation to the NOC region is chosen. In our case, the process observations are compared to the BRNN posterior predictive distributions. Finally, the decision will involve determining whether the acquired observation is from the NOC or not (i.e., compare deviations of observations for fault detection and assess which variables significantly deviate from the NOC for identification).

14-4 Depiction of BRNN model using variational dropout (left) for FDI. The BRNN model uses the current observation and state to predict the next system observation. The BRNN model is unrolled in two dimensions (right): the time of the computation involved in its forward computation and the stochastic repetition by variational dropout. At each time step, stochastic variational dropout is applied $N$ times and the corresponding MC prediction samples $\{\hat{x}_t^{(i)}\}_{i=1,\ldots,N}$ are used to approximate the posterior predictive distribution for that time step. For the next time step, the same procedure is repeated and MC samples $\{\hat{x}_{t+1}^{(i)}\}_{i=1,\ldots,N}$ are collected and used to approximate the distribution.

14-5 Flowchart of the BRNN-based FDI methodology. The offline training stage (left) and the online monitoring stage (right) are shown in the figure. The procedure starts with offline training, and then the offline-trained model is used during online monitoring. The choice of statistics for detection and identification is made at design time.

14-6 BRNN model outputs for TEP NOC training data. The plot shows all 52 variables in TEP. The dark blue lines are the TEP measurements and the light blue lines correspond to the BRNN model predictive distribution outputs for the NOC data. For measurements under the NOC, the dark blue lines should lie within the predictive distribution.

14-7 BRNN model outputs for TEP NOC validation data.

14-8 BRNN model outputs for TEP Fault 3.

14-9 Fault identification plot of the BRNN-$D^l$ statistic for Fault 3. The $\{D^l\}_{l=1,\ldots,52}$ values for the 960 timesteps are color coded in the identification plot. Variables with dark blues have high values of $D^l$, meaning that the variable has positively deviated from the NOC region. Conversely, variables with dark red have low values of $D^l$ and have negatively deviated from the NOC region. A light color means the variable is not significantly affected by the disturbance. As expected, no variable significantly deviates from the NOC.

14-10 Contribution plots for Fault 3 from (a) r-PCA, (b) f-PCA, (c) r-DPCA, and (d) f-DPCA. The plot shows the contribution factor, with the darkness of the blue color indicating the amount of deviation of the variable from the NOC region.

14-11 BRNN model outputs for TEP Fault 5.

14-12 Fault identification plot by BRNN-$D^l$ for Fault 5. The switch between dark blue and red colors shows that the system is undergoing large fluctuation.

14-13 Contribution plots for Fault 5 from (a) r-PCA, (b) f-PCA, (c) r-DPCA, and (d) f-DPCA.

14-14 BRNN model outputs for TEP Fault 1.

14-15 Fault identification plot by BRNN-$D^l$ for Fault 1. The root cause for this uncontrollable fault can be assessed by looking at the variables that are persistently off the NOC region.

14-16 Contribution plots for Fault 1 from (a) r-PCA, (b) f-PCA, (c) r-DPCA, and (d) f-DPCA.

14-17 Sorted fault identification plot according to the detected deviation occurrence time of BRNN-$D^l$ for Fault 6.

14-18 Fault 6 propagation path at the 180th data point (1 hour after the fault occurs). Colored nodes indicate that the corresponding variable has been detected as deviating significantly from the NOC.

14-19 FDI by the BRNN model on the Fault 1 testing data. The red box indicates the period with the foaming event recorded by the operator.

14-20 FDI by (D)PCA methods for Fault 1: (a) PCA, (b) f-PCA, (c) r-DPCA, and (d) f-DPCA. The red box indicates the period with the foaming event recorded by the operator.

14-21 FDI by BRNN for Fault 2. The red box indicates the period with the foaming issue as recorded by the operator.

14-22 FDI by (D)PCA methods for Fault 2: (a) PCA, (b) f-PCA, (c) r-DPCA, and (d) f-DPCA. The red box indicates the period with the foaming issue as recorded by the operator.

List of Tables

3.1 Model fitting results for 3D printer data using nested cross-validation.

4.1 Median MSE of testing data by multicollinearity methods for Type I cases.

4.2 Median MSE of testing data by multicollinearity methods for Type II cases.

4.3 Model prediction results for fiber data.

4.4 Mean squared errors for different static regression methods for CCPP data with different numbers of training samples.

6.1 Monitoring results for layout 1 in Example 1 for three strategies [%].

6.2 Monitoring results for layout 2 in Example 1 for three strategies [%].

6.3 Monitoring results for layout 1 in Example 2 by three strategies [%].

6.4 Monitoring results for layout 2 in Example 2 by three strategies [%].

6.5 Monitoring results for Example 3 from the three strategies [%].

6.6 Probability that the assigned region belongs to the corresponding class [%] for Example 1, layout 1.

6.7 Probability that the assigned region belongs to the corresponding class [%] for Example 1, layout 2.

7.1 FDRs by PCA and PPCA for the TEP [%].

7.2 FDRs by PCA with 50/51 PCs [%].

8.1 Monitoring statistics and detection thresholds of PQDM.

8.2 Adaptive fault detection FAR results [%].

8.3 Adaptive fault detection FDR results [%].

9.1 Fault detection performances.

12.1 MCRs for Faults 3, 4, and 11 by different methods.

12.2 MCRs by PCA, FDA, and MJDA for general distributions.

13.1 MDRs [%] for the simple linear example with parallel analysis/intrinsic dimensionality for order selection.

13.2 MDRs [%] of the simple linear example with different numbers of PCs.

13.3 MDRs [%] of simple nonlinear example with parallel analysis/intrinsic dimensionality for order selection.

13.4 MDRs [%] for simple nonlinear example with different numbers of PCs.

13.5 MDRs [%] of TEP by different methods.

13.6 MDRs [%] of ANN with different numbers of hidden neurons for the TEP.

13.7 MDRs [%] of KPCA-RBF with different kernel width for the TEP.

13.8 MDRs [%] of LLE with different numbers of $k$NN for the TEP.

13.9 Process faults of the buffer creation process.

13.10 MDRs [%] for the buffer creation process by different methods.

13.11 MDRs [%] of ANN with different numbers of hidden neurons for the buffer creation process.

13.12 MDRs [%] of KPCA-RBF with different kernel width for the buffer creation process.

13.13 MDRs [%] of KPCA-poly with different orders for the buffer creation process.

13.14 MDRs [%] of LLE with different numbers of $k$NN for the buffer creation process.

14.1 TEP fault detection percentage results. The FAR is shown for the NOC


Chapter 1 Introduction

1.1 Thesis goals

The unifying aim of this thesis is to develop advanced data analytics tools for manufacturing processes. Data analytics is playing an increasingly important role in next-generation manufacturing processes for improving process productivity, reliability, and control [9–11]. Often in real manufacturing processes, the approach of using mathematical models based on first principles is unavailable or too time-consuming to apply due to the complexity of the systems. On the other hand, when sufficient data are available, data analytics methods can be applied at relatively low cost at many different scales, from individual sensors to unit operations to entire manufacturing plants. Opportunities for process data analytics are further enabled by advances in sensor technology, wireless networks, and computational power, and by the availability of low-cost data storage.

Despite major advances in machine learning and data science, the uptake of advanced data analytics methods in real manufacturing processes has been relatively slow. One major reason is the complexity of real manufacturing processes. Real manufacturing systems often involve nonlinear multivariable interactions between the system variables. For fast-sampling processes, system dynamics are important for model construction. A real manufacturing process can also have time-varying operating conditions, under which the performance of any offline-trained model degrades over time. The second reason is the diversity in data quantity and quality. While advances in sensor technology have created much more data for some manufacturing processes, it is very costly to acquire large amounts of data for other processes, and the data quality can be low because of sensor drift, bias, noise, and unmeasured disturbances. The relatively limited use of advanced data analytics in real manufacturing processes and the wide variety of available methods call for a systematic review and guidelines on how to select the best-in-class method for a given application, how to correctly construct the model, and how to interpret the results.

This thesis explores ways to address the above challenges with an emphasis on two application areas: predictive modeling and process monitoring. Besides the development of advanced algorithms, this thesis also addresses the importance of using a systematic framework for data analysis and points out common misconceptions in the literature and industry. The next section describes the objectives of this thesis in more detail.

1.2 Thesis organization

Chapters 2–4 consider process data analytics methods for application to predictive modeling.

• Chapter 2 describes basic concepts of regression and provides an introduction to popular regression techniques.

• Chapter 3 introduces a new technique for interpretable nonlinear model construction. The proposed approach combines automated feature generation for chemical and biological processes and a two-step sparsity-promoting regression procedure, which produces an interpretable nonlinear model. The proposed technique can be extended to dynamic nonlinear model construction. The interpretability sheds light on the physical properties of the process and, more importantly, allows the model to be easily integrated with other applications, such as control and optimization design. The proposed methods are tested in two case studies, which show robust and accurate performance relative to state-of-the-art black-box nonlinear models.

• Chapter 4 presents a systematic approach for automated and robust method selection and model construction for manufacturing processes, which empowers the users to focus on goals rather than on methods and automatically transforms manufacturing data into intelligence. The focus of the framework is on predictive modeling, and the general framework can be applied to other applications. For automated method selection, a bottom-up approach is taken to select the best-in-class method based on the characteristics of the data and available domain knowledge. For automated model construction, the appropriate cross-validation strategy is selected based on data attributes and rigorously implemented to validate model performance. The proposed framework is demonstrated on several case studies with different data characteristics, including nonlinearity, multicollinearity, and dynamics.

Chapters 5–14 consider process data analytics for application to process monitoring. Each chapter addresses specific key challenges encountered in real manufacturing processes.

• Chapter 5 introduces basic concepts and methods for process monitoring, including fault detection, identification, and classification. Also, the benchmark dataset from the Tennessee Eastman process (TEP) is introduced, which is commonly used to test and compare process monitoring algorithms.

• Chapter 6 discusses the impact of the fault detection and classification strategy on the final performance. For the case where data are all from known classes, three strategies for fault detection and classification are proposed, and comparisons are drawn based on case studies. For the case of data from known and unknown classes, three approaches are proposed, and their combination with the different strategies is investigated based on case studies. This work shows the importance of analyzing the process monitoring strategy.


• Chapter 7 proposes a probabilistic framework for optimal fault detection of Gaussian linear processes based on probabilistic principal component analysis (PPCA) with a novel monitoring index. The proposed probabilistic approach is fundamentally different from the widely applied principal component analysis (PCA) method. Moreover, the proposed PPCA framework enables the selection of the optimal number of principal components (PCs) based only on normal operating data. Detailed comparisons between PPCA and PCA are provided, which show the advantages of PPCA in terms of model accuracy, robustness, and generalization capability. The equivalency between the weighted combination of PCA-based monitoring indices and the PPCA-based monitoring index is also discovered. The proposed PPCA-based fault detection method is applied to the TEP and compared with PCA.

• Chapter 8 discusses the development of a self-learning fault detection system for time-varying normal operating conditions (NOCs). A self-learning system is proposed to inform when the monitoring model should be updated. The proposed system is developed based on concurrent canonical variate analysis (CCVA) and has three sets of monitoring indices for both quality and dynamic monitoring, which are capable of accurately distinguishing normal variation from a real fault. The online model update is only necessary when the system has a normal variation. The proposed system is tested on the TEP. In addition, a review and comparison study of various commonly used adaptive algorithms is provided in the appendix, illustrating the importance of online updates of fault detection models for processes with varying normal conditions.

• Chapters 9–13 present a unified information-theoretic learning framework for advanced fault detection and classification algorithm developments, which overcomes the limitation associated with traditional second-order statistical learning methods. The information-theoretic learning framework explores information beyond the second-order moment, which is important for non-Gaussian processes, as discussed in Chapter 9. Linear subspace learning methods are developed based on the information maximization principle. Rényi's information definition combined with kernel density estimators is applied to cope with the mathematical and computational difficulties that arise when using Shannon's definition. The developed algorithms can be viewed as generalizations of traditional statistical learning methods: (1) maximized entropy analysis as a generalization of PCA in Chapter 10, (2) mutual information analysis (MIA) for quality variable monitoring as a generalization of canonical variate analysis (CVA) in Chapter 11, and (3) maximized Jensen-Shannon divergence analysis (MJDA) for fault classification in Chapter 12. The proposed algorithms are tested on the TEP and compared with traditional statistical learning methods.

• Chapter 13 provides a systematic review and comparative study of nonlinear PCA (NPCA)-based fault detection methods. Different nonlinear extensions of PCA have been proposed in order to address the challenges with nonlinear sub-manifold learning. However, there has been no systematic review of the various NPCA methods, especially for fault detection purposes. A thorough discussion on assumptions, algorithms, fault detection schemes, strengths, and limitations of various NPCA algorithms is provided. Moreover, a systematic methodology for deciding when to use a nonlinear fault detection method is proposed based on data interrogation. The most widely studied NPCA methods for fault detection – including kernel PCA (KPCA), locally linear embedding (LLE), and neural network PCA (NN-NPCA) – are compared with PCA in several linear and nonlinear case studies that collectively illustrate significant limitations and strengths of various NPCA methods. In addition to pointing out some common misconceptions in the literature, the case studies are used to provide general guidelines on when to use NPCA, how to use NPCA effectively, and how to interpret the results for fault detection of nonlinear systems.

• Chapter 14 explores an application of Bayesian recurrent neural networks (BRNNs) for fault detection and identification (FDI). A newly developed deep learning approach using BRNNs with variational dropout is adopted, which directly tackles three key challenges in modeling real process data: nonlinearity, dynamics, and uncertainties. The proposed BRNN-based method provides robust fault detection of chemical processes, direct fault identification with higher accuracy and specificity, and visualized fault propagation analysis. The performance of this method is compared with that of (dynamic) PCA on the TEP and on an industrial dataset from chemical manufacturing.

Chapter 15 concludes the thesis, discusses ongoing challenges, and provides additional ideas of future research needs for process data analytics.


Part I


Chapter 2 Introduction to Predictive Modeling

The objective of predictive modeling is to build mathematical models that estimate the relationships between the output (response) variable and the inputs (predictors) based on the training data. The purpose of the constructed model is to predict new or future observations, as depicted in Figure 2-1. The model can also improve understanding of the process that generated the data, which is especially useful for manufacturing processes. For example, such models can indicate which variables are key to product quality and can identify rate-limiting steps in the manufacturing process.

Figure 2-1: Illustration of a predictive modeling problem.

The most basic predictive model has the form

$$y = f(x) + \varepsilon \tag{2.1}$$

where $y$ represents a univariate response variable, $x \in \mathbb{R}^{m_x}$ is a vector of predictors,

Predictive modeling also includes temporal forecasting where observations of predictors (and outputs, if needed) up to time 𝑡 are used to forecast the future values for the output at time 𝑡 + 𝑘, 𝑘 > 0.

Two steps are involved with predictive modeling: the model form 𝑓(·) needs to be selected, and then an appropriate learning algorithm needs to be picked to estimate the model. There are many types of regression models. The simplest linear static model is

$$y = w_0 + w_1 x_1 + \cdots + w_{m_x} x_{m_x} \tag{2.2}$$

where $w_0, \ldots, w_{m_x}$ are the model parameters. The parameter $w_0$ represents the intercept of the regression model, which allows for any fixed offset. It is often convenient to define an additional dummy variable $x_0 = 1$ so that

$$y = w^\top x + \varepsilon \tag{2.3}$$

where $w = [w_0, \ldots, w_{m_x}]^\top \in \mathbb{R}^{m_x+1}$ is a vector of model parameters (also known as weights).

The most basic way of estimating the model parameters in Equation 2.3 is via ordinary least squares (OLS), which minimizes the mean squared error (MSE) of the predictions,

$$\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} (y_i - w^\top x_i)^2 \tag{2.4}$$

where $N$ is the total number of training data points, and $i$ is the index for a specific observation.

Given the training predictor matrix $X \in \mathbb{R}^{N \times m_x}$ and the training output vector $y \in \mathbb{R}^{N}$, OLS has a unique analytical solution

$$\hat{w}_{OLS} = (X^\top X)^{-1} X^\top y \tag{2.5}$$

provided that the matrix $X^\top X$ is invertible.
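The closed-form OLS solution can be illustrated with a few lines of NumPy. This is a minimal sketch on hypothetical, noise-free synthetic data; the function name `fit_ols` and the data are assumptions made for illustration, and the normal equations are solved directly rather than by forming the inverse explicitly.

```python
import numpy as np

def fit_ols(X, y):
    """OLS weights (X^T X)^{-1} X^T y for a design matrix X (N x p) and outputs y (N,)."""
    # Solve the normal equations (X^T X) w = X^T y instead of inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # dummy variable x0 = 1 for the intercept
y = 1.0 + 2.0 * X_raw[:, 0] - 3.0 * X_raw[:, 1]

w_ols = fit_ols(X, y)  # ordered as [intercept, w1, w2]
```

Because the synthetic data contain no noise and the columns of the design matrix are linearly independent, the estimate recovers the generating weights up to floating-point error.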

The Gauss-Markov theorem states that OLS produces the best linear unbiased estimator under certain conditions. Those conditions include: (1) the underlying relationship between the predictor variables 𝑥 and the output variable 𝑦 is linear, (2) the design matrix 𝑋 has full rank (identification condition), and (3) the errors are uncorrelated, homoscedastic, and have zero mean.

Alternatives are required when one or more of these assumptions are violated. Typical violations include: (1) a nonlinear relationship between the output and predictors, (2) strong multicollinearity between predictors, (3) serial correlations in the model errors, which imply unexplained dynamics in the response, and (4) heteroscedasticity of the model errors. Moreover, in cases with a large number of predictors, OLS performs poorly in terms of model robustness and interpretation.

In the aforementioned situations, other predictive modeling techniques give more robust and accurate prediction results. For example, latent variable models and regularized regression techniques are suitable for data with multicollinearity. For nonlinear problems, support vector regression (SVR) and random forests (RFs) are powerful methods to approximate a nonlinear structure. For dynamical systems, system identification techniques, e.g., state-space models, can capture the dynamic information of the process. Introductions to these popular regression techniques are provided in the following sections.

2.1 Linear Regression with Penalty

Regularized linear regression is formulated as a linear regression plus a penalty for model complexity,

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_q^q \tag{2.6}$$

where 𝜆 is a positive penalty coefficient that quantifies the relative tradeoff between the complexity of a model and its training error.

The penalty takes different forms for different values of $q$. If $q = 2$, the formulation is ridge regression (RR) [12]. RR is useful for data with multicollinearity and has a simple analytical solution,

$$\hat{w}_{RR} = (X^\top X + \lambda I)^{-1} X^\top y \tag{2.7}$$

This solution is unique and the matrix inverse is well-defined for any $\lambda > 0$.
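To illustrate why the ridge inverse exists for any $\lambda > 0$ even under perfect multicollinearity, the following sketch assumes the standard closed form $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$ for $q = 2$ in Equation 2.6. The function name `fit_ridge` and the two-column collinear dataset are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge weights (X^T X + lam * I)^{-1} X^T y for lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two identical predictors: X has rank 1, so X^T X is singular and OLS has no unique solution.
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
X = np.column_stack([x1, x1])
y = 2.0 * x1

rank_X = np.linalg.matrix_rank(X)
w_small = fit_ridge(X, y, lam=1e-3)   # weak penalty: weight shared equally by the two columns
w_large = fit_ridge(X, y, lam=100.0)  # strong penalty: weights shrunk toward zero
```

With identical columns, the ridge solution splits the effect evenly between the two predictors, and increasing $\lambda$ shrinks the norm of the weights without setting any of them exactly to zero.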

RR shrinks the norm of the model parameters, which gives a biased estimation in order to increase prediction accuracy through a bias-variance tradeoff. However, the shrinkage does not necessarily result in the removal of variables. Other values of $q$ can give sparser models than RR. If $q = 0$, the penalty norm is called the $l_0$ pseudonorm and is the number of nonzero regression coefficients. This regression problem is also called best subset selection. The motivation for this formulation is to directly select the most useful features and provide an interpretable model.

The optimization that defines $l_0$-regularization is NP-hard, inherently combinatorial, and computationally expensive to solve. The tightest convex relaxation of the $l_0$ pseudonorm penalty is the $l_1$ norm penalty. Sparse regression with the $l_1$ norm penalty is called the least absolute shrinkage and selection operator (LASSO) [13], which also promotes sparsity. LASSO tends to select slightly more variables than best subset selection, but is computationally efficient and has been successful in many applications. However, for datasets with more variables than samples ($m_x > N$) or with highly correlated variables, LASSO has limited performance. The elastic net (EN) has been proposed to resolve the limitations associated with LASSO [14]. EN performs feature selection and continuous shrinkage simultaneously, which enables variable selection with only limited data and the selection of groups of correlated variables. EN is formulated as

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \left( \alpha \|w\|_1 + \frac{1-\alpha}{2} \|w\|_2^2 \right) \tag{2.8}$$

where $\alpha$ is a scalar between 0 and 1 that specifies the tradeoff between the $l_1$ and $l_2$ penalties.

The combination of $l_1$ and $l_2$ penalties imposes both sparsity and grouping effects, which provides stable and automated feature selection with good prediction accuracy. The LARS-EN algorithm was proposed [14] to solve the EN problem efficiently. It is based on the LARS algorithm for LASSO and has computational advantages over other optimization techniques for feature selection.
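The effect of the two penalties can be sketched numerically. The example below does not implement LARS-EN. It substitutes plain cyclic coordinate descent, which is a different and simpler solver, applied to the objective $\|y - Xw\|_2^2 + \lambda(\alpha\|w\|_1 + \frac{1-\alpha}{2}\|w\|_2^2)$ without an intercept. The function names and the orthogonal synthetic design are assumptions chosen so that the result can be checked by hand.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_elastic_net(X, y, lam, alpha, n_iter=100):
    """Cyclic coordinate descent for the elastic net objective (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)  # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with the contribution of feature j removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            # Exact minimizer of the one-dimensional subproblem in w_j
            w[j] = soft_threshold(rho, lam * alpha / 2.0) / (col_sq[j] + lam * (1.0 - alpha) / 2.0)
    return w

# Hypothetical design with orthogonal columns, each with x_j^T x_j = 100
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(100, 5)))
X = 10.0 * Q
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0])

w_lasso = fit_elastic_net(X, y, lam=20.0, alpha=1.0)  # pure l1 penalty
w_en = fit_elastic_net(X, y, lam=20.0, alpha=0.5)     # mixed l1 and l2 penalty
```

For this orthogonal design each coefficient has the closed form $S(x_j^\top y, \lambda\alpha/2) / (x_j^\top x_j + \lambda(1-\alpha)/2)$, so the irrelevant coefficients are set exactly to zero while the active ones are shrunk. Correlated predictors, where the grouping effect matters, would require a less contrived dataset.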

(49)

Figure 2-2: Plot of the contours of the non-regularized error function (blue) along with the constraint region (red) for (a) LASSO 𝑞 = 1, (b) ridge regression 𝑞 = 2, (c) elastic net, for a prediction model with two predictors. The figure is based on [1].

The optimizations that define ridge regression, LASSO, and elastic net can be reformulated as equivalent optimizations in which the penalty terms move from the objective function to become constraints. Those formulations can be used to interpret the three methods geometrically [1]. For example, it is clear from inspection of Figure 2-2 why LASSO generates sparse solutions and why ridge regression does not, and how elastic net compromises between the other methods.

2.2 Partial Least Squares

Partial least squares (PLS) is a dimensionality reduction technique that maximizes the covariance between the predictor matrix $X$ and the output matrix $Y \in \mathbb{R}^{N \times m_y}$ for each component in the lower dimensional space. PLS can extract latent variables for a single output variable ($m_y = 1$) or multiple output variables. When only a single output variable is considered, PLS is also referred to as PLS1. There are different algorithms to solve PLS, all of which apply an iterative process to extract the latent variables in the lower dimensional space.

In each iteration, PLS computes loading and score vectors by successively extracting factors from both the predictor matrix and the output matrix such that the covariance between the extracted factors is maximized,

$$\max_{l_i, q_i} \; l_i^\top X_i^\top Y_i q_i \quad \text{s.t.} \quad l_i^\top l_i = q_i^\top q_i = 1 \tag{2.9}$$

where $l_i$ and $q_i$ are the projection vectors for the matrices $X$ and $Y$, respectively.

PLS decomposes the $X$ and $Y$ matrices into the score matrices $T \in \mathbb{R}^{N \times k}$ and $U \in \mathbb{R}^{N \times k}$ and the residual matrices $E \in \mathbb{R}^{N \times m_x}$ and $F \in \mathbb{R}^{N \times m_y}$ by projecting along the loading matrices $P \in \mathbb{R}^{m_x \times k}$ and $Q \in \mathbb{R}^{m_y \times k}$, where $k$ is the dimension of the lower dimensional space (illustrated in Figure 2-3),

$$X = T P^\top + E, \qquad Y = U Q^\top + F \tag{2.10}$$

The most widely applied algorithms are nonlinear iterative partial least squares (NIPALS) and SIMPLS [15]. Reviews of the various PLS algorithms are available [16,17].

Figure 2-3: Illustration of the PLS matrix decomposition. The loading matrices are calculated to maximize the covariance between the score matrices.

After the latent variables $t$ are obtained, the regression model is established using OLS estimation. In the case with a single output variable, the OLS optimization is

$$\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} (y_i - w^\top t_i)^2 \tag{2.11}$$

By regressing on the latent variables, which are linear combinations of the original predictors, PLS is a powerful tool when there is multicollinearity in the predictors.
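A compact sketch of the single-output case (PLS1) may help make the score, loading, and deflation steps concrete. It follows a NIPALS-style deflation and assumes mean-centered data. The function name `fit_pls1`, the variable names, and the synthetic data are hypothetical, and the code is not taken from the references cited above.

```python
import numpy as np

def fit_pls1(X, y, k):
    """PLS1 with k components for mean-centered X (N x m) and y (N,).

    Returns coefficients b with y approximated by X @ b, and the score matrix T (N x k).
    """
    Xk, yk = X.copy(), y.copy()
    W, P, T, q = [], [], [], []
    for _ in range(k):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)      # unit-norm direction maximizing covariance with y
        t = Xk @ w                  # score vector
        tt = t @ t
        p = Xk.T @ t / tt           # loading of X on the score
        c = yk @ t / tt             # OLS regression of y on the score
        Xk = Xk - np.outer(t, p)    # deflate X
        yk = yk - c * t             # deflate y
        W.append(w); P.append(p); T.append(t); q.append(c)
    W, P, T, q = np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # coefficients in terms of the original predictors
    return b, T

# Hypothetical data with three predictors and a small amount of noise
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)
X, y = X - X.mean(axis=0), y - y.mean()

b_one, T_one = fit_pls1(X, y, k=1)     # one latent variable
b_full, T_full = fit_pls1(X, y, k=3)   # as many latent variables as predictors
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

When the number of components equals the number of predictors and $X$ has full column rank, the PLS1 coefficients generically coincide with the OLS coefficients. Using fewer components trades some training accuracy for robustness to multicollinearity.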


2.3 Support Vector Regression

Support vector regression (SVR) [18] is developed based on the same principles as the support vector machine for classification, with only minor differences. The optimization for a linear SVR with soft margin loss is formulated as

$$\begin{aligned} \min_{w} \quad & \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \\ \text{s.t.} \quad & y_i - w^\top x_i \le \epsilon + \xi_i \\ & w^\top x_i - y_i \le \epsilon + \xi_i^* \\ & \xi_i, \xi_i^* \ge 0 \end{aligned} \tag{2.12}$$

where the constant $C > 0$ specifies the tradeoff between the training accuracy and the flatness of the final function form, $\epsilon$ is the amount up to which deviations are tolerated, and $\xi, \xi^*$ are the slack variables that allow for deviations from the $\epsilon$ band (see Figure 2-4 for a graphical representation of SVR).

Figure 2-4: Illustration of the SVR loss function. Only points outside the shaded region contribute to the cost function. The left figure shows a linear SVR model, and the right figure shows a nonlinear SVR model with kernel function.

The objective function is often solved via optimization of the dual formulation of Equation 2.12. The dual formulation also allows nonlinear extensions of linear SVR by replacing the dot product ⟨·, ·⟩ with different nonlinear kernel functions 𝑘(·, ·). Details on the formulation of nonlinear SVR are available elsewhere [19].
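For a fixed weight vector, the smallest slack variables that satisfy the constraints of Equation 2.12 are the parts of the residuals that fall outside the $\epsilon$ band. The sketch below evaluates the primal objective this way for a hypothetical one-dimensional dataset. It only evaluates the loss and is not an SVR solver. The function name and the numbers are assumptions for illustration.

```python
import numpy as np

def svr_primal_objective(w, X, y, C, eps):
    """Primal SVR objective with the slack variables at their smallest feasible values."""
    resid = y - X @ w
    xi = np.maximum(resid - eps, 0.0)         # slack for points above the eps band
    xi_star = np.maximum(-resid - eps, 0.0)   # slack for points below the eps band
    objective = 0.5 * (w @ w) + C * np.sum(xi + xi_star)
    return objective, xi, xi_star

# Hypothetical data: the first two points lie inside the band around w^T x with w = 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.05, 1.0, 2.5, 2.0])
w = np.array([1.0])

obj, xi, xi_star = svr_primal_objective(w, X, y, C=10.0, eps=0.1)
```

Here the residuals are 0.05, 0, 0.5, and -1.0, so only the last two points incur slack (0.4 above the band and 0.9 below it), and the objective is 0.5 + 10 × 1.3 = 13.5. Points inside the band contribute nothing, which is the behavior depicted in Figure 2-4.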


2.4 Random Forest Regression

A regression tree is a decision tree involving two major steps. The first step divides the predictor variable space into distinct and non-overlapping regions. In the second step, for each observation that belongs to a single region, the prediction is the mean of outputs of the training data in that particular region (as illustrated in Figure 2-5).

Figure 2-5: Illustration of a single regression tree. Each internal node tests a particular predictor 𝑥𝑖 and each leaf assigns an output value.

A regression tree is trained to minimize the sum of squared errors of the prediction by searching over possible splits,

$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \tag{2.13}$$

where $\hat{y}_{R_j} = \frac{1}{n_c} \sum_{i \in R_j} y_i$ is the prediction for leaf $j$ and $n_c$ is the number of training observations in region $R_j$. Considering every possible partition of the feature space is computationally infeasible. The greedy approach known as recursive binary splitting is used to train the decision tree. To avoid overfitting, other hyperparameters can be added to the training procedure, for example, the maximum depth of the tree. Those hyperparameters provide tradeoffs between training accuracy and generalization capability. For more details about the decision tree and training procedure, see [20].
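One step of recursive binary splitting can be sketched as an exhaustive search over features and thresholds for the split that minimizes the sum of squared errors. The function name `best_split`, the midpoint rule for candidate thresholds, and the toy dataset are assumptions. A full tree would apply the same search recursively to each resulting region.

```python
import numpy as np

def best_split(X, y):
    """Find the single binary split (feature, threshold) minimizing the sum of squared errors."""
    best = (None, None, np.inf)  # (feature index, threshold, SSE)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds are midpoints between consecutive distinct values
        for thr in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            # Each region predicts the mean of its training outputs
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best[2]:
                best = (j, thr, sse)
    return best

# Hypothetical data: the output steps from 1 to 4 when the first predictor exceeds 3.5
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]])
y = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])

feature, threshold, sse = best_split(X, y)
```

The search selects the first predictor with a threshold of 3.5, which separates the two output levels perfectly and gives zero squared error. No threshold on the second predictor achieves this.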

The random forest (RF) [21] is an ensemble learning method that combines the predictions from multiple regression trees to make more accurate and robust predictions than a single tree. RF is a bagging technique, so the trees can be trained in parallel. During the training of each tree, a sub-sample of the training data is drawn with replacement, and a regression tree is built on that sub-sample. In order to reduce structural similarities among the trees, the number of features that can be searched at each split is specified as a parameter to the algorithm. RFs reduce the variance of the model as compared to a single decision tree.

2.5 Linear state-space model

A discrete-time linear state-space model is of the form

$$\begin{aligned} s_{t+1} &= A s_t + B x_t + e_t \\ y_t &= C s_t + D x_t + v_t \end{aligned} \tag{2.14}$$

where $y_t \in \mathbb{R}^{m_y}$, $s_t \in \mathbb{R}^{m_s}$, $x_t \in \mathbb{R}^{m_x}$, $e_t \in \mathbb{R}^{m_s}$, and $v_t \in \mathbb{R}^{m_y}$ represent the system output, state, input, state noise, and output measurement noise at time instant $t$, respectively; and $A$, $B$, $C$, and $D$ are system matrices with appropriate dimensions.

The state-space model allows dynamic modeling of the system, which is necessary for many industrial processes with sampling times faster than the process dynamics. In particular, dynamic models are nearly always needed when the model is used for feedback control design, and are sometimes needed when the model is used for process monitoring.

The state vector $s_t$ includes all “memory” of the system at time $t$, and the future states depend only on the current state $s_t$ and on any inputs $x_t$ at time $t$ and beyond. The state-space matrices can be estimated via subspace identification algorithms, which are discussed in Chapter 4.
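The role of the state as the memory of the system can be seen by simulating the model forward in time. The sketch below assumes known system matrices and Gaussian noise with user-chosen standard deviations. The function name, its arguments, and the first-order example system are hypothetical, and no identification of the matrices is attempted.

```python
import numpy as np

def simulate_state_space(A, B, C, D, x, s0, state_std=0.0, meas_std=0.0, rng=None):
    """Simulate a discrete-time linear state-space model for inputs x of shape (T, m_x)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s0, dtype=float)
    outputs = []
    for x_t in x:
        e_t = state_std * rng.standard_normal(A.shape[0])  # state noise
        v_t = meas_std * rng.standard_normal(C.shape[0])   # measurement noise
        outputs.append(C @ s + D @ x_t + v_t)              # output equation
        s = A @ s + B @ x_t + e_t                          # state update
    return np.array(outputs)

# Hypothetical first-order system driven by a unit step input, without noise
A = np.array([[0.5]])
B = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])
x = np.ones((6, 1))

y = simulate_state_space(A, B, C, D, x, s0=[0.0])
```

Although the input is constant, the output rises gradually (0, 1, 1.5, 1.75, ...) toward its steady-state value of 2 because each output depends on past inputs through the state. A static model of the form in Equation 2.3 cannot reproduce this behavior.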

2.6 Summary of the Contributions in Part I

Two specific challenges of predictive modeling are addressed in the remaining chapters of Part I.

First, the interpretability of the model is important in many applications, especially for manufacturing processes. While several techniques are available to construct a linear interpretable model, as discussed in Section 2.1, an interpretable nonlinear dynamic model for manufacturing processes with relatively few training samples is still missing. This observation motivated the proposed interpretable nonlinear and dynamic modeling technique in Chapter 3.

Second, there is a wide variety of methods, and a substantial level of expertise is needed to select the best-in-class method for a given manufacturing application. An automated and systematic procedure for method selection and model construction is proposed in Chapter 4 so that the user can focus on the modeling objective rather than attempting to learn all of the available data analytics methods and how best to apply them.


Chapter 3 ALVEN: Algebraic Learning Via Elastic Net for Static and Dynamic Nonlinear Model Prediction

3.1 Introduction

Data-driven modeling can leverage advances in machine learning and modern instrumentation and can be applied at different scales, from sensors to individual unit operations to entire manufacturing plants. Data-driven models have the potential to help manufacturers improve product quality in several ways. They can be used to make predictions, such as using system inputs to predict the final product quality variables. They can also be used for controller design, that is, to compute adjustments to the critical process parameters that move the critical quality attributes towards desirable values.

Despite major advances in machine learning and data science, there has been limited diffusion of these advanced data analytics methods to real manufacturing processes. One major reason is that real manufacturing processes often involve nonlinear dynamic interactions between the manipulated variables and the output variables but do not have the quality and quantity of data needed by many of the data analytics

