
Stochastic parameterization identification using ensemble Kalman filtering combined with maximum likelihood methods

By MANUEL PULIDO¹,²∗, PIERRE TANDEO³, MARC BOCQUET⁴, ALBERTO CARRASSI⁵ and MAGDALENA LUCINI⁶

¹Department of Physics, FaCENA, Universidad Nacional del Nordeste, Corrientes, Argentina; ²IMIT, IFAECI, CNRS-CONICET, Buenos Aires, Argentina; ³IMT Atlantique, Lab-STICC, UBL, Brest, France; ⁴CEREA, Joint Laboratory École des Ponts ParisTech and EDF R&D, Université Paris-Est, Champs-sur-Marne, France; ⁵Nansen Environmental and Remote Sensing Center, Bergen, Norway; ⁶Department of Mathematics, FaCENA, Universidad Nacional del Nordeste and CONICET, Corrientes, Argentina

Tellus A: Dynamic Meteorology and Oceanography, 2018, 70(1), 1442099, https://doi.org/10.1080/16000870.2018.1442099

(Manuscript received 21 September 2017; in final form 9 February 2018)

ABSTRACT

For modelling geophysical systems, large-scale processes are described through a set of coarse-grained dynamical equations while small-scale processes are represented via parameterizations. This work proposes a method for identifying the best possible stochastic parameterization from noisy data. State-of-the-art sequential estimation methods such as Kalman and particle filters do not achieve this goal successfully because both suffer from the collapse of the posterior distribution of the parameters. To overcome this intrinsic limitation, we propose two statistical learning methods. They are based on the combination of the ensemble Kalman filter (EnKF) with either the expectation–maximization (EM) algorithm or the Newton–Raphson (NR) method, used to maximize a likelihood associated with the parameters to be estimated. The EM and NR are applied primarily in the statistics and machine learning communities and are brought here into the context of data assimilation for the geosciences. The methods are derived using a Bayesian approach for a hidden Markov model, and they are applied to infer deterministic and stochastic physical parameters from noisy observations in coarse-grained dynamical models. Numerical experiments are conducted using the Lorenz-96 dynamical system with one and two scales as a proof of concept. The imperfect coarse-grained model is modelled through a one-scale Lorenz-96 system in which a stochastic parameterization is incorporated to represent the small-scale dynamics. The algorithms are able to identify the optimal stochastic parameterization with good accuracy under moderate observational noise. The proposed EnKF-EM and EnKF-NR are promising efficient statistical learning methods for developing stochastic parameterizations in high-dimensional geophysical models.

Keywords: parameter estimation, model error estimation, stochastic parameterization, expectation–maximization algorithm

1. Introduction

The statistical combination of observations of a dynamical model with a priori information of physical laws allows the estimation of the full state of the model even when it is only partially observed. This is the main aim of data assimilation (Kalnay, 2002).

One common challenge of evolving multi-scale systems in applications ranging from meteorology, oceanography, hydrology and space physics to biochemistry and biological systems is the presence of parameters that do not rely on known physical constants, so that their values are unknown and unconstrained.

∗Corresponding author. e-mail: pulido@unne.edu.ar

Data assimilation techniques can also be formulated to estimate these model parameters from observations (Jazwinski et al., 1970; Wikle and Berliner, 2007).

There are several multi-scale physical systems which are modelled through coarse-grained equations; the most paradigmatic cases are climate models (Stensrud, 2009), large-eddy simulations of turbulent flows (Mason and Thomson, 1992) and electron fluxes in the radiation belts (Kondrashov et al., 2011). These imperfect models need to include subgrid-scale effects through physical parameterizations (Nicolis, 2004). In recent years, stochastic physical parameterizations have been incorporated in weather forecast and climate models (Palmer, 2001; Christensen et al., 2015; Shutts, 2015). They are called stochastic parameterizations because they represent stochastically a process that is not explicitly resolved in the model, even when the unresolved process may not itself be stochastic. The forecast skill of ensemble forecast systems has been shown to improve with these stochastic parameterizations (Palmer, 2001; Christensen et al., 2015; Shutts, 2015). Deterministic integrations with models that include these parameterizations have also been shown to improve climate features (see e.g. Lott et al., 2012). In general, stochastic parameterizations are expected to improve coarse-grained models of multi-scale physical systems (Katsoulakis et al., 2003; Majda and Gershgorin, 2011). However, the functional form of the schemes and their parameters, which represent small-scale effects, are unknown and must be inferred from observations. The development of automatic statistical learning techniques to identify an optimal stochastic parameterization and estimate its parameters is, therefore, highly desirable.

One standard methodology to estimate physical model parameters from observations in data assimilation techniques, such as the traditional Kalman filter, is to augment the state space with the parameters (Jazwinski et al., 1970). This methodology has also been implemented in the ensemble-based Kalman filter (see e.g. Anderson, 2001). The parameters are constrained through their correlations with the observed variables. However, three challenges are posed for parameter estimation in EnKFs.

Firstly, parameter probability distributions are in general non-Gaussian, even though Kalman-based filters rely on the Gaussian assumption. Secondly, the estimation of global parameters is theoretically incompatible with the use of domain localization (Bellsky et al., 2014), which is very often employed to implement the EnKF in high-dimensional systems. Thirdly, the parameters are usually assumed to be governed by persistence, so that their impact on the augmented error covariance matrix diminishes with time (Ruiz et al., 2013a).

The collapse of the parameter posterior distribution found in both ensemble Kalman filters (Delsole and Yang, 2010; Ruiz et al., 2013a, 2013b; Santitissadeekorn and Jones, 2015) and particle filters (West and Liu, 2001) is a major contention point when one is interested in estimating stochastic parameters of non-linear dynamical models. Hereafter, we refer to as stochastic parameters those that define the covariance of a Gaussian stochastic process (Delsole and Yang, 2010). In other words, the sequential filters are, in principle, able to estimate deterministic physical parameters, the mean of the parameter posterior distribution, through the augmented state-space procedure, but they are unable to estimate stochastic parameters of the model, because of the collapse of the corresponding posterior distribution. Using the Kalman filter with the augmentation method, Delsole and Yang (2010) proved analytically the collapse of the parameter covariance in a first-order autoregressive model. They proposed a generalized maximum likelihood estimation using an approximate sequential method to estimate stochastic parameters. Carrassi and Vannitsem (2011) derived the evolution of the augmented error covariance in the extended Kalman filter using a quadratic-in-time approximation that mitigates the collapse of the parameter error covariance. Santitissadeekorn and Jones (2015) proposed a particle filter blended with an ensemble Kalman filter and used a random walk model for the parameters. This technique was able to estimate stochastic parameters in the first-order autoregressive model, but a tunable parameter in the random walk model needs to be introduced.

The expectation–maximization (EM) algorithm (Dempster et al., 1977; Bishop, 2006) is a widely used method to maximize the likelihood function in a broad spectrum of applications. One of the advantages of the EM algorithm is that its implementation is rather straightforward. Wu (1983) showed that if the likelihood is smooth and unimodal, the EM algorithm converges to the unique maximum likelihood estimate. Accelerations of the EM algorithm have been proposed for its use in machine learning (Neal and Hinton, 1999). Recently, it was used in an application with a highly non-linear observation operator (Tandeo et al., 2015). The EM algorithm was able to estimate subgrid-scale parameters with good accuracy while standard ensemble Kalman filter techniques failed. It has also been applied to the Lorenz-63 system to estimate the model error covariance (Dreano et al., 2017).

In this work, we combine the ensemble Kalman filter (Evensen, 1994; Evensen, 2003) with maximum likelihood estimators for stochastic parameterization identification. Two maximum likelihood estimators are evaluated: the EM algorithm (Dempster et al., 1977; Bishop, 2006) and the Newton–Raphson algorithm (Cappé et al., 2005). The derivation of the techniques is explained in detail and in simple terms, so that readers who are not from those communities can understand the basis of the methodologies, how they can be combined, and hopefully foresee potential applications in other geophysical systems. The statistical learning techniques are suitable to infer the functional form and the parameter values of stochastic parameterizations in chaotic spatio-temporal dynamical systems. They are evaluated here on a two-scale spatially extended chaotic dynamical system (Lorenz, 1996) to estimate deterministic physical parameters, together with additive and multiplicative stochastic parameters. Pulido et al. (2016) evaluated methods based on the EnKF alone to estimate subgrid-scale parameters in a two-scale system: they showed that an offline estimation method is able to recover the functional form of the subgrid-scale parameterization, but none of the methods was able to estimate the stochastic component of the subgrid-scale effects. In the present work, the results show that the NR and EM techniques are able to uncover the functional form of the subgrid-scale parameterization while successfully determining the stochastic parameters of the representation of subgrid-scale effects.

This work is organized as follows. Section 2 briefly introduces the EM algorithm and derives the marginal likelihood of the data using a Bayesian perspective. The implementation of the EM and NR likelihood maximization algorithms in the context of data assimilation using the ensemble Kalman filter is also discussed.

Section 3 describes the experiments, which are based on the one- and two-scale Lorenz-96 systems. The former includes simple deterministic and stochastic parameterizations to represent the effects of the smaller scale and to mimic the two-scale Lorenz-96 system. Section 4 focuses on the results: Section 4.1 discusses the experiments for the estimation of model noise. Section 4.2 shows the results of the estimation of deterministic and stochastic parameters in a perfect-model scenario. Section 4.3 shows the estimation experiments for an imperfect model. The conclusions are drawn in Section 5.

2. Methodology

2.1. Hidden Markov model

A hidden Markov model is defined by a stochastic non-linear dynamical model $\mathcal{M}$ that evolves in time the hidden variables $\mathbf{x}_{k-1} \in \mathbb{R}^N$ according to

$$\mathbf{x}_k = \mathcal{M}(\mathbf{x}_{k-1}) + \boldsymbol{\eta}_k, \qquad (1)$$

where $k$ stands for the time index. The dynamical model $\mathcal{M}$ depends on a set of deterministic and stochastic physical parameters. We assume an additive random model error, $\boldsymbol{\eta}_k$, with covariance matrix $\mathbf{Q}_k = \mathbb{E}[\boldsymbol{\eta}_k \boldsymbol{\eta}_k^{\mathrm{T}}]$. The notation $\mathbb{E}(\cdot)$ stands for the expectation operator, $\mathbb{E}[f(\mathbf{x})] \doteq \int f(\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}$, with $p$ being the probability density function of the underlying process $X$.

The observations at time $k$, $\mathbf{y}_k \in \mathbb{R}^M$, are related to the hidden variables through the observational operator $\mathcal{H}$,

$$\mathbf{y}_k = \mathcal{H}(\mathbf{x}_k) + \boldsymbol{\epsilon}_k, \qquad (2)$$

where $\boldsymbol{\epsilon}_k$ is an additive random observation error with observation error covariance matrix $\mathbf{R}_k = \mathbb{E}[\boldsymbol{\epsilon}_k \boldsymbol{\epsilon}_k^{\mathrm{T}}]$.

Our estimation problem: given a set of observation vectors distributed in time, $\{\mathbf{y}_k,\ k = 1, \ldots, K\}$, a non-linear stochastic dynamical model $\mathcal{M}$ and a non-linear observation operator $\mathcal{H}$, we want to estimate the initial prior distribution $p(\mathbf{x}_0)$, the observation error covariance $\mathbf{R}_k$, the model error covariance $\mathbf{Q}_k$, and the deterministic and stochastic physical parameters of $\mathcal{M}$.
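To make this state-space setting concrete, the short Python sketch below simulates the hidden Markov model (1)-(2) under the Gaussian additive-error assumptions stated above. The function names (`simulate_hmm`, `model_M`, `obs_H`) and the interface are illustrative choices of this sketch, not notation from the original formulation.

```python
import numpy as np

def simulate_hmm(model_M, obs_H, x0, Q, R, K, rng=None):
    """Simulate x_k = M(x_{k-1}) + eta_k and y_k = H(x_k) + eps_k (Eqs. 1-2)
    with Gaussian additive errors eta_k ~ N(0, Q) and eps_k ~ N(0, R)."""
    rng = np.random.default_rng() if rng is None else rng
    N, M_dim = x0.size, obs_H(x0).size
    xs = np.zeros((K + 1, N))
    ys = np.zeros((K, M_dim))
    xs[0] = x0
    for k in range(1, K + 1):
        eta = rng.multivariate_normal(np.zeros(N), Q)      # model error
        xs[k] = model_M(xs[k - 1]) + eta                   # Eq. (1)
        eps = rng.multivariate_normal(np.zeros(M_dim), R)  # observation error
        ys[k - 1] = obs_H(xs[k]) + eps                     # Eq. (2)
    return xs, ys
```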

In the EM literature, the term 'parameters' is used for all the parameters of the likelihood function, including the moments of the statistical distributions. Here, the parameters of the likelihood function are referred to more specifically as likelihood parameters. The likelihood parameters may include the deterministic and stochastic physical parameters, the observation error and model error covariances, and the first two moments of the initial prior distribution.

The estimation method we derived is based on maximum likelihood estimation. Given a set of independent and identically distributed (iid) observations from a probability density function represented by $p(\mathbf{y}_{1:K}|\boldsymbol{\theta})$, we seek to maximize the likelihood function $L(\mathbf{y}_{1:K}; \boldsymbol{\theta})$ as a function of $\boldsymbol{\theta}$. We denote $\{\mathbf{y}_1, \ldots, \mathbf{y}_K\}$ by $\mathbf{y}_{1:K}$ and the set of likelihood parameters to be estimated by $\boldsymbol{\theta}$: the deterministic and stochastic physical parameters of the dynamical model $\mathcal{M}$ as well as the observation error covariances $\mathbf{R}_k$, the model error covariances $\mathbf{Q}_k$, and the mean $\bar{\mathbf{x}}_0$ and covariance $\mathbf{P}_0$ of the initial prior distribution $p(\mathbf{x}_0)$. In practical applications, the statistical moments $\mathbf{R}_k$, $\mathbf{Q}_k$ and $\mathbf{P}_0$ are usually poorly constrained. It may thus be convenient to estimate them jointly with the physical parameters. The dynamical model is assumed to be non-linear and to include stochastic processes represented by some of the physical parameters.

The estimation technique used in this work is a batch method: a set of observations taken along a time interval is used to estimate the model state trajectory that is closest to them, considering measurement and model errors with a least-square criterion to be established below. The simultaneous use of observations distributed in time is essential to capture the interplay of the several likelihood parameters included in the estimation problem. The required minimal length $K$ of the observation window is evaluated in the numerical experiments. The estimation technique may be applied in successive $K$-windows. For stochastic parameterizations in which the parameters are sensitive to processes of different time scales, a batch method may also be required to capture the sensitivity to slow processes.

2.2. Expectation-maximization algorithm

The EM algorithm maximizes the log-likelihood of the observations as a function of the likelihood parameters $\boldsymbol{\theta}$ in the presence of a hidden state $\mathbf{x}_{0:K}$,

$$l(\boldsymbol{\theta}) = \ln L(\mathbf{y}_{1:K}; \boldsymbol{\theta}) = \ln \int p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_{0:K}. \qquad (3)$$

An analytical form for the log-likelihood function (3) can be obtained only in a few ideal cases. Furthermore, the numerical evaluation of (3) may involve a high-dimensional integration of the complete likelihood (the integrand of (3)). Given an initial guess of the likelihood parameters $\boldsymbol{\theta}$, the EM algorithm maximizes the log-likelihood of the observations as a function of the likelihood parameters in successive iterations, without the need to evaluate the complete likelihood.

2.2.1. The principles. Let us introduce in the integral (3) an arbitrary probability density function of the hidden state, $q(\mathbf{x}_{0:K})$,

$$l(\boldsymbol{\theta}) = \ln \int q(\mathbf{x}_{0:K})\, \frac{p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})}{q(\mathbf{x}_{0:K})}\, \mathrm{d}\mathbf{x}_{0:K}. \qquad (4)$$

We assume that the support of $q(\mathbf{x}_{0:K})$ contains that of $p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})$. The density $q(\mathbf{x}_{0:K})$ may be thought of, in particular, as a function of a set of fixed likelihood parameters $\boldsymbol{\theta}'$, i.e. $q(\mathbf{x}_{0:K}; \boldsymbol{\theta}')$. Using Jensen's inequality, a lower bound for the log-likelihood is obtained,

$$l(\boldsymbol{\theta}) \geq \int q(\mathbf{x}_{0:K})\, \ln \frac{p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})}{q(\mathbf{x}_{0:K})}\, \mathrm{d}\mathbf{x}_{0:K} \doteq \mathcal{Q}(q, \boldsymbol{\theta}). \qquad (5)$$

If we choose $q(\mathbf{x}_{0:K}) = p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta})$, the equality in (5) is satisfied; therefore $p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta})$ is the choice of $q$ that maximises $\mathcal{Q}(q, \boldsymbol{\theta})$, for which $\mathcal{Q}$ reaches its upper bound $l(\boldsymbol{\theta})$. The intermediate function $\mathcal{Q}(q, \boldsymbol{\theta})$ may be interpreted physically as the free energy of the system, where $\mathbf{x}$ plays the role of the physical states and the energy is the joint density (Neal and Hinton, 1999).

Rewriting the joint density in (5) as a function of the conditional density, $p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta}) = p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta})\, L(\mathbf{y}_{1:K}; \boldsymbol{\theta})$, the intermediate function may be related to the Kullback–Leibler divergence,

$$\mathcal{Q}(q, \boldsymbol{\theta}) = -D_{\mathrm{KL}}\big(q \,\|\, p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta})\big) + l(\boldsymbol{\theta}), \qquad (6)$$

where $D_{\mathrm{KL}}(q \,\|\, p) \doteq \int q \ln(q/p)\, \mathrm{d}\mathbf{x}$ is a non-negative functional and $D_{\mathrm{KL}}(q \,\|\, p) = 0$ iff $q = p$. From (6), using the properties of the Kullback–Leibler divergence, it is clear that the upper bound of $\mathcal{Q}$ is obtained for $q = p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta})$.

From (5), we see that if we maximize $\mathcal{Q}(q, \boldsymbol{\theta})$ over $\boldsymbol{\theta}$, we find a lower bound for $l(\boldsymbol{\theta})$. The idea of the EM algorithm is to first find the probability density function $q$ that maximizes $\mathcal{Q}$, i.e. the conditional probability of the hidden state given the observations, and then to determine the parameters $\boldsymbol{\theta}$ that maximize $\mathcal{Q}$. Hence, the EM algorithm encompasses the following steps:

Expectation: Determine the distribution $q$ that maximizes $\mathcal{Q}$. This function is easily shown to be $q = p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta})$ (see (5); Neal and Hinton, 1999), the conditional probability of the hidden state given the observations. In practice, it is obtained by evaluating the conditional probability at the current parameter estimate $\boldsymbol{\theta}'$.

Maximization: Determine the likelihood parameters $\boldsymbol{\theta}$ that maximize $\mathcal{Q}(q, \boldsymbol{\theta})$. The new estimate of the likelihood parameters is denoted by $\boldsymbol{\theta}$, while the (fixed) previous estimate is denoted by $\boldsymbol{\theta}'$; the expectation step is a function of these old likelihood parameters $\boldsymbol{\theta}'$. The part of the function $\mathcal{Q}$ to maximize is given by

$$\int p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta}')\, \ln p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_{0:K} \doteq \mathbb{E}\big[\ln p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})\,\big|\,\mathbf{y}_{1:K}\big], \qquad (7)$$

where we use the notation $\mathbb{E}(f(\mathbf{x})|\mathbf{y}) \doteq \int f(\mathbf{x})\, p(\mathbf{x}|\mathbf{y})\, \mathrm{d}\mathbf{x}$ (Jazwinski et al., 1970). While the function that we want to maximize is the log-likelihood, the intermediate function (7) to maximize in the EM algorithm is the expectation of the joint distribution conditioned on the observations.

2.2.2. Expectation-maximization for a hidden Markov model.

The joint distribution of a hidden Markov model, using the definition of the conditional probability distribution, reads

$$p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}) = p(\mathbf{y}_{1:K}|\mathbf{x}_{0:K})\, p(\mathbf{x}_{0:K}). \qquad (8)$$

The model state probability density function can be expressed as a product of the transition densities from $t_{k-1}$ to $t_k$, using the definition of the conditional probability distribution and the Markov property,

$$p(\mathbf{x}_{0:K}) = p(\mathbf{x}_0) \prod_{k=1}^{K} p(\mathbf{x}_k|\mathbf{x}_{k-1}). \qquad (9)$$

The observations are mutually independent and conditioned on the current state (see (2)), so that

$$p(\mathbf{y}_{1:K}|\mathbf{x}_{0:K}) = \prod_{k=1}^{K} p(\mathbf{y}_k|\mathbf{x}_k). \qquad (10)$$

Then, replacing (9) and (10) in (8) yields

$$p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}) = p(\mathbf{x}_0) \prod_{k=1}^{K} p(\mathbf{x}_k|\mathbf{x}_{k-1})\, p(\mathbf{y}_k|\mathbf{x}_k). \qquad (11)$$

If we now assume a Gaussian hidden Markov model, and that the covariances $\mathbf{R}_k$ and $\mathbf{Q}_k$ are constant in time, the logarithm of the joint distribution (11) is then given by

$$\ln p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}) = -\frac{M+N}{2} \ln(2\pi) - \frac{1}{2} \ln|\mathbf{P}_0| - \frac{1}{2}(\mathbf{x}_0 - \bar{\mathbf{x}}_0)^{\mathrm{T}} \mathbf{P}_0^{-1} (\mathbf{x}_0 - \bar{\mathbf{x}}_0) - \frac{K}{2} \ln|\mathbf{Q}| - \frac{1}{2} \sum_{k=1}^{K} \big(\mathbf{x}_k - \mathcal{M}(\mathbf{x}_{k-1})\big)^{\mathrm{T}} \mathbf{Q}^{-1} \big(\mathbf{x}_k - \mathcal{M}(\mathbf{x}_{k-1})\big) - \frac{K}{2} \ln|\mathbf{R}| - \frac{1}{2} \sum_{k=1}^{K} \big(\mathbf{y}_k - \mathcal{H}(\mathbf{x}_k)\big)^{\mathrm{T}} \mathbf{R}^{-1} \big(\mathbf{y}_k - \mathcal{H}(\mathbf{x}_k)\big). \qquad (12)$$

The Markov hypothesis implies that model error is not correlated in time; otherwise, we would have cross terms in the model error summation of (12). The assumption of a Gaussian hidden Markov model is central to derive a closed analytical form for the likelihood parameters that maximize the intermediate function.

However, the dynamical model and the observation operator may have non-linear dependencies, so that the Gaussian assumption does not strictly hold. We therefore consider an iterative method in which each step is an approximation; in general, the method will converge through successive approximations. For severe non-linear dependencies in the dynamical model, the existence of a single maximum of the log-likelihood is not guaranteed. In that case, the EM algorithm may converge to a local maximum. As suggested by Wu (1983), one way to avoid the EM algorithm being trapped in a local maximum of the likelihood function is to apply the algorithm with different starting parameters; the EM run with the highest likelihood, and the corresponding estimated parameters, is then chosen. In practice, the stochastic nature of the likelihood function may help the EM algorithm avoid getting stuck in a local maximum (as in stochastic optimization).

We consider (12) as a function of the likelihood parameters $\boldsymbol{\theta}$ in this Gaussian state-space model. In this way, given the known values of the observations, the log-likelihood function in (3) is a function of the likelihood parameters, namely $\bar{\mathbf{x}}_0$, $\mathbf{P}_0$, $\mathbf{Q}$, $\mathbf{R}$, and the physical parameters of $\mathcal{M}$.

In this Gaussian state-space model, the maximum of the intermediate function in the EM algorithm, (7), may be determined analytically from

$$\mathbf{0} = \nabla_{\boldsymbol{\theta}}\, \mathbb{E}\big[\ln p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})\,\big|\,\mathbf{y}_{1:K}\big] = \int p(\mathbf{x}_{0:K}|\mathbf{y}_{1:K}; \boldsymbol{\theta}')\, \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_{0:K} = \mathbb{E}\big[\nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_{0:K}, \mathbf{y}_{1:K}; \boldsymbol{\theta})\,\big|\,\mathbf{y}_{1:K}\big]. \qquad (13)$$

Note that $\boldsymbol{\theta}'$ is held fixed in (13). We only need to find the critical values of the likelihood parameters $\mathbf{Q}$ and $\mathbf{R}$. The physical parameters are appended to the state, so that their model error is included in $\mathbf{Q}$. The $\bar{\mathbf{x}}_0$ and $\mathbf{P}_0$ refer to the initial time, so they are obtained as an output of the smoother, which gives a Gaussian approximation of $p(\mathbf{x}_k|\mathbf{y}_{1:K})$ with $k = 0, \ldots, K$. The smoother equations are given in Appendix 1.

Differentiating (12) with respect to $\mathbf{Q}$ and $\mathbf{R}$ and applying the expectation conditioned on the observations, we can determine the root of the condition (13), which gives the maximum of the intermediate function. The value of the model error covariance, solution of (13), is

$$\mathbf{Q} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\Big[\big(\mathbf{x}_k - \mathcal{M}(\mathbf{x}_{k-1})\big)\big(\mathbf{x}_k - \mathcal{M}(\mathbf{x}_{k-1})\big)^{\mathrm{T}} \,\Big|\, \mathbf{y}_{1:K}\Big]. \qquad (14)$$

In the case of the observation error covariance, the solution is

$$\mathbf{R} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\Big[\big(\mathbf{y}_k - \mathcal{H}(\mathbf{x}_k)\big)\big(\mathbf{y}_k - \mathcal{H}(\mathbf{x}_k)\big)^{\mathrm{T}} \,\Big|\, \mathbf{y}_{1:K}\Big]. \qquad (15)$$

Therefore, we can summarize the EM algorithm for a hidden Markov model as:

Expectation: The required set of expectations given the observations must be evaluated at $\boldsymbol{\theta}_i$, $i$ being the iteration number; specifically, $\mathbb{E}[\mathbf{x}_k|\mathbf{y}_{1:K}]$, $\mathbb{E}[\mathbf{x}_k \mathbf{x}_k^{\mathrm{T}}|\mathbf{y}_{1:K}]$, etc. The outputs of a classical smoother are indeed $\mathbb{E}[\mathbf{x}_k|\mathbf{y}_{1:K}]$ and $\mathbb{E}\big[(\mathbf{x}_k - \mathbb{E}[\mathbf{x}_k|\mathbf{y}_{1:K}])(\mathbf{x}_k - \mathbb{E}[\mathbf{x}_k|\mathbf{y}_{1:K}])^{\mathrm{T}}\,\big|\,\mathbf{y}_{1:K}\big]$, which fully characterize $p(\mathbf{x}_k|\mathbf{y}_{1:K})$ in the Gaussian case. Hence, this expectation step involves the application of a forward filter and a backward smoother.

Maximization: Since we assume Gaussian distributions, the optimal value of $\boldsymbol{\theta}_{i+1}$ can be determined analytically; in our model this amounts to $\mathbf{Q}$ and $\mathbf{R}$, as derived in (14) and (15). These equations are evaluated using the expectations determined in the Expectation step.

The basic steps of this EM algorithm are depicted in Fig. 1a. In this work, we use an ensemble-based Gaussian filter, the ensemble transform Kalman filter (Hunt et al., 2007), and the Rauch–Tung–Striebel (RTS) smoother (Cosme et al., 2012; Raanes, 2016). A short description of these methods is given in the Appendix. The empirical expectations are determined using the smoothed ensemble member states at $t_k$, $\mathbf{x}^s_m(t_k)$. For instance,

$$\mathbb{E}\big[\mathbf{x}_k \mathbf{x}_k^{\mathrm{T}}\,\big|\,\mathbf{y}_{1:K}\big] = \frac{1}{N_e} \sum_{m=1}^{N_e} \mathbf{x}^s_m(t_k)\, \mathbf{x}^s_m(t_k)^{\mathrm{T}}, \qquad (16)$$

where $N_e$ is the number of ensemble members. Then, using these empirical expectations, $\mathbf{Q}$ and/or $\mathbf{R}$ are computed from (14) and/or (15).
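To illustrate how the expectation and maximization steps fit together in practice, the following sketch outlines one EM iteration built on ensemble routines. Here `enkf_forward` and `rts_smoother` are placeholders for the ETKF and the ensemble RTS smoother (assumed to return smoothed ensembles), and `model_M` propagates a state over one assimilation cycle; only the update of Q via the empirical form of (14) is shown, the update of R via (15) being analogous.

```python
import numpy as np

def em_iteration(model_M, enkf_forward, rts_smoother, ys, Q, R, x0_mean, P0, Ne=50):
    """One EM iteration: E-step via filter + smoother, M-step via Eq. (14).
    The smoother is assumed to return smoothed ensembles Xs of shape (K+1, Ne, N)."""
    # E-step: forward ensemble filter followed by a backward smoother,
    # both run with the current parameter estimates (Q, R, x0_mean, P0).
    filt = enkf_forward(model_M, ys, Q, R, x0_mean, P0, Ne)
    Xs = rts_smoother(filt)                      # smoothed ensemble states
    K, N = ys.shape[0], Xs.shape[2]
    # M-step: empirical version of Eq. (14) using smoothed members,
    # Q = (1/K) sum_k E[(x_k - M(x_{k-1}))(x_k - M(x_{k-1}))^T | y_{1:K}]
    Q_new = np.zeros((N, N))
    for k in range(1, K + 1):
        innov = np.stack([Xs[k, m] - model_M(Xs[k - 1, m]) for m in range(Ne)])
        Q_new += innov.T @ innov / Ne            # ensemble average of outer products
    Q_new /= K
    return Q_new, Xs
```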

The EM algorithm applied to a linear Gaussian state-space model using the Kalman filter was first proposed by Shumway and Stoffer (1982). Its approximation using an ensemble of draws (Monte Carlo EM) was proposed in Wei and Tanner (1990).

It was later generalized with the extended Kalman filter and Gaussian kernels by Ghahramani and Roweis (1999). The use of the EnKF and the ensemble Kalman smoother permits the extension of the EM algorithm to non-linear high-dimensional dynamical models and non-linear observation operators.

2.3. Maximum likelihood estimation via Newton–Raphson

The EM algorithm is highly versatile and can be readily implemented. However, it requires the optimal value in the maximization step to be computed analytically, which limits the range of its applications. If physical deterministic parameters of a non-linear model need to be estimated, an analytical expression for the optimal likelihood parameter values may not be available. Another approach to find an estimate of the likelihood parameters consists in maximizing an approximation of the likelihood function $l(\boldsymbol{\theta})$, (3), with respect to the parameters. This maximization may be conducted using standard optimization methods (Cappé et al., 2005).

Following Carrassi et al. (2017), the observation probability density function can be decomposed into the product

$$p(\mathbf{y}_{1:K}; \boldsymbol{\theta}) = \prod_{k=1}^{K} p(\mathbf{y}_k|\mathbf{y}_{1:k-1}; \boldsymbol{\theta}), \qquad (17)$$

with the convention $\mathbf{y}_{1:0} = \emptyset$. In the case of a sequential application of the NR maximization in successive $K$-windows, the a priori probability distribution $p(\mathbf{x}_0)$ can be taken from the previous estimation. For such a case, we leave implicit the conditioning in (17) on all the past observations, $p(\mathbf{y}_{1:K}; \boldsymbol{\theta}) = p(\mathbf{y}_{1:K}|\mathbf{y}_{:0}; \boldsymbol{\theta})$ with $\mathbf{y}_{:0} = \{\mathbf{y}_0, \mathbf{y}_{-1}, \mathbf{y}_{-2}, \ldots\}$, which is called contextual evidence in Carrassi et al. (2017). The times of the evidencing window, $1\!:\!K$, required for the estimation are the only ones that are kept explicit in (17).

Replacing (17) in (3) yields

$$l(\boldsymbol{\theta}) = \sum_{k=1}^{K} \ln p(\mathbf{y}_k|\mathbf{y}_{1:k-1}; \boldsymbol{\theta}) = \sum_{k=1}^{K} \ln \int p(\mathbf{y}_k|\mathbf{x}_k)\, p(\mathbf{x}_k|\mathbf{y}_{1:k-1}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_k. \qquad (18)$$

If we assume Gaussian distributions and linear dynamical and observational models, the integrand in (18) is exactly the analysis distribution given by a Kalman filter (Carrassi et al., 2017). The likelihood of the observations conditioned on the state at each time is then given by

$$p(\mathbf{y}_k|\mathbf{x}_k) = \big[(2\pi)^{M/2} |\mathbf{R}|^{1/2}\big]^{-1} \exp\Big(-\tfrac{1}{2}\big(\mathbf{y}_k - \mathcal{H}(\mathbf{x}_k)\big)^{\mathrm{T}} \mathbf{R}^{-1} \big(\mathbf{y}_k - \mathcal{H}(\mathbf{x}_k)\big)\Big), \qquad (19)$$

and the prior forecast distribution is

$$p(\mathbf{x}_k|\mathbf{y}_{1:k-1}; \boldsymbol{\theta}) = \big[(2\pi)^{N/2} |\mathbf{P}^f_k|^{1/2}\big]^{-1} \exp\Big(-\tfrac{1}{2}\big(\mathbf{x}_k - \mathbf{x}^f_k\big)^{\mathrm{T}} \big(\mathbf{P}^f_k\big)^{-1} \big(\mathbf{x}_k - \mathbf{x}^f_k\big)\Big), \qquad (20)$$

where $\mathbf{x}^f_k = \mathcal{M}(\mathbf{x}^a_{k-1}) + \boldsymbol{\eta}_k$ is the forecast with $\boldsymbol{\eta}_k \sim N(\mathbf{0}, \mathbf{Q}_k)$, $\mathbf{x}^a_{k-1}$ is the analysis state (the filter mean state estimate) at time $k-1$, and $\mathbf{P}^f_k$ is the forecast covariance matrix of the filter.

The resulting approximation of the observation likelihood function, obtained by replacing (19) and (20) in (18), is

$$l(\boldsymbol{\theta}) \approx -\frac{1}{2} \sum_{k=1}^{K} \Big[\big(\mathbf{y}_k - \mathbf{H}\mathbf{x}^f_k\big)^{\mathrm{T}} \big(\mathbf{H}\mathbf{P}^f_k\mathbf{H}^{\mathrm{T}} + \mathbf{R}\big)^{-1} \big(\mathbf{y}_k - \mathbf{H}\mathbf{x}^f_k\big) + \ln\big|\mathbf{H}\mathbf{P}^f_k\mathbf{H}^{\mathrm{T}} + \mathbf{R}\big|\Big] + C, \qquad (21)$$

where $C$ stands for the constants independent of $\boldsymbol{\theta}$ and the observational operator is assumed linear, $\mathcal{H} = \mathbf{H}$. Equation (21) is exact for linear models, $\mathcal{M} = \mathbf{M}$, but only an approximation for non-linear ones. As in the EM algorithm, we expect the likelihood in this iterative method to converge through successive approximations.

The evaluation of the model evidence (21) does not require the smoother. The forecasts $\mathbf{x}^f_k$ in (21) are started from the analyses, i.e. the filter state estimates. In this case, the initial likelihood parameters $\bar{\mathbf{x}}_0$ and $\mathbf{P}_0$ need to be good approximations (e.g. an estimation from the previous evidencing window) or they need to be estimated jointly with the other potentially unknown likelihood parameters (the physical parameters, $\mathbf{R}$ and $\mathbf{Q}$). Note that (21) does not depend explicitly on $\mathbf{Q}$ because the forecasts $\mathbf{x}^f_k$ already include the model error. The steps of the NR method are sketched in Fig. 1b.

For all the cases in which we can find an analytical expression for the maximization step of the EM algorithm, we can also derive a gradient of the likelihood function (Cappé et al., 2005). However, we have implemented the NR maximization with a so-called derivative-free optimization method, i.e. a method that does not require the likelihood gradient (described in the next section), and we apply it both when the EM maximization step can be derived analytically and when it cannot.
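A minimal sketch of the NR route is given below: the innovation-based log-likelihood (21) is accumulated from the filter forecast statistics and handed to a derivative-free optimizer. Since newuoa is not available in SciPy, Powell's derivative-free method is used here as a stand-in; `run_enkf_forecasts`, which returns the forecast means and covariances for a given parameter vector, is an assumed helper.

```python
import numpy as np
from scipy.optimize import minimize

def log_likelihood(theta, run_enkf_forecasts, ys, H, R):
    """Approximate l(theta) of Eq. (21) from the EnKF forecast statistics."""
    xf_means, Pf_covs = run_enkf_forecasts(theta)   # shapes (K, N) and (K, N, N)
    ll = 0.0
    for k in range(ys.shape[0]):
        d = ys[k] - H @ xf_means[k]                  # innovation
        S = H @ Pf_covs[k] @ H.T + R                 # innovation covariance
        _, logdet = np.linalg.slogdet(S)
        ll -= 0.5 * (d @ np.linalg.solve(S, d) + logdet)
    return ll

def nr_maximization(theta0, run_enkf_forecasts, ys, H, R):
    """Derivative-free maximization of the log-likelihood (Powell's method
    used here as a stand-in for newuoa)."""
    res = minimize(lambda th: -log_likelihood(th, run_enkf_forecasts, ys, H, R),
                   theta0, method="Powell")
    return res.x
```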

3. Design of the numerical experiments

A first set of numerical experiments consists of twin experiments in which we first generate a set of noisy observations using the model with known parameters. Then, the maximum likelihood estimators are computed using the same model with the synthetic observations. Since we know the true parameters, we can evaluate the error in the estimation and the performance of the proposed algorithms. A second set of experiments applies the method for model identification. The (imperfect) model represents the multi-scale system through a set of coarse-grained dynamical equations and an unknown stochastic physical parameterization. The model-identification experiments are imperfect-model experiments in which we seek to determine the stochastic physical parameterization of the small-scale variables from observations. In particular, the 'nature' or true model is the two-scale Lorenz-96 model and it is used to generate the synthetic observations, while the imperfect model is the one-scale Lorenz-96 model forced by a physical parameterization which has to be identified. This parameterization should represent the effects of small-scale variables on the large-scale variables. In this way, the coarse-grained one-scale model with a physical parameterization with tunable deterministic and stochastic parameters is adjusted to account for the (noisy) observed data. We evaluate whether the EM algorithm and the NR method are able to determine the set of optimal parameters, assuming they exist.

Fig. 1. (a) Flowchart of the EM algorithm. (b) Flowchart of the NR method. Each column of the matrix $\mathbf{X}_k \equiv \mathbf{x}_{1:N_e}(t_k)$ is an ensemble member state at time $k$. Subscript $(i)$ means the $i$-th iteration. A final application of the filter may be required to obtain the updated analysis state at iteration $i+1$. The function llik is the log-likelihood calculation from (21). The newuoa function in the optimization step refers to the 'new' unconstrained optimization algorithm (Powell, 2006).

The synthetic observations are taken from the known nature integration by (see (2))

$$\mathbf{y}_k = \mathbf{H}\mathbf{x}_k + \boldsymbol{\epsilon}_k, \qquad (22)$$

with $\mathbf{H} = \mathbf{I}$, i.e. the full state is observed. Furthermore, we assume uncorrelated observation errors, $\mathbf{R}_k = \mathbb{E}[\boldsymbol{\epsilon}_k \boldsymbol{\epsilon}_k^{\mathrm{T}}] = \alpha_R \mathbf{I}$.
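A minimal sketch of this observation-generation step, assuming the nature-run states at the observation times are stored in an array `x_nature`:

```python
import numpy as np

def synthetic_observations(x_nature, alpha_R, rng=None):
    """y_k = x_k + eps_k with eps_k ~ N(0, alpha_R * I), i.e. Eq. (22) with H = I."""
    rng = np.random.default_rng() if rng is None else rng
    return x_nature + np.sqrt(alpha_R) * rng.standard_normal(x_nature.shape)
```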

3.1. Twin experiments

In the twin experiments, we use the one-scale Lorenz-96 system and a physical parameterization that represents subgrid-scale effects. The nature integration is conducted with this model and a set of 'true' physical parameter values. These parameters characterize both deterministic and stochastic processes. By virtue of the perfect-model assumption, the model used in the estimation experiments is exactly the same as the one used in the nature integration, except that the physical parameter values are assumed to be unknown. Although for simplicity we call this a 'twin experiment', it could be thought of as a model selection experiment with parametric model error in which we know the 'perfect functional form of the dynamical equations' but the model parameters are completely unknown and need to be selected from noisy observations.

The equations of the one-scale Lorenz-96 model are

$$\frac{\mathrm{d}X_n}{\mathrm{d}t} + X_{n-1}(X_{n-2} - X_{n+1}) + X_n = G_n(X_n, a_0, \ldots, a_J), \qquad (23)$$

where $n = 1, \ldots, N$. The domain is assumed periodic: $X_{-1} \equiv X_{N-1}$, $X_0 \equiv X_N$, and $X_{N+1} \equiv X_1$.

We have included in the one-scale Lorenz-96 model a physical parameterization which is taken to be

$$G_n(X_n, a_0, \ldots, a_2) = \sum_{j=0}^{2} \big(a_j + \eta_j(t)\big)\,(X_n)^j, \qquad (24)$$

where a noise term, $\eta_j(t)$, of the form

$$\eta_j(t) = \eta_j(t - \Delta t) + \sigma_j\, \nu_j(t), \qquad (25)$$

has been added to each deterministic parameter. Equation (25) represents a random walk whose standard deviation is given by the stochastic parameters $\sigma_j$, and $\nu_j(t)$ is a realization of a Gaussian distribution with zero mean and unit variance. The standard deviation in the Runge–Kutta scheme is taken proportional to the square root of the time step $\Delta t$ (Hansen and Penland, 2006). The parameterization (24) is assumed to represent subgrid-scale effects, i.e. effects produced by the small-scale variables on the large-scale variables (Wilks, 2005).
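A possible Python implementation of the forced one-scale Lorenz-96 model (23) with the parameterization (24)-(25) is sketched below. Holding the random-walk noise fixed within each RK4 step and scaling its increments by the square root of the time step are implementation choices of this sketch, consistent with the description above.

```python
import numpy as np

def l96_onescale_tendency(X, a, eta):
    """dX_n/dt = X_{n-1}(X_{n+1} - X_{n-2}) - X_n + G_n, with the quadratic
    parameterization G_n = sum_j (a_j + eta_j) * X_n**j of Eq. (24)."""
    G = sum((a[j] + eta[j]) * X**j for j in range(len(a)))
    return np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X + G

def l96_onescale_step(X, a, eta, sigma, dt, rng):
    """One RK4 step of Eq. (23); eta follows the random walk of Eq. (25),
    with increments scaled by sqrt(dt) (Hansen and Penland, 2006)."""
    eta = eta + sigma * np.sqrt(dt) * rng.standard_normal(len(a))
    k1 = l96_onescale_tendency(X, a, eta)
    k2 = l96_onescale_tendency(X + 0.5 * dt * k1, a, eta)
    k3 = l96_onescale_tendency(X + 0.5 * dt * k2, a, eta)
    k4 = l96_onescale_tendency(X + dt * k3, a, eta)
    return X + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0, eta
```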

3.2. Model-identification experiments

In the model-identification experiments, the nature integration is conducted with the two-scale Lorenz-96 model (Lorenz, 1996). The state of this integration is taken as the true state evolution. The equations of the two-scale Lorenz-96 model, the 'true' model, are given by $N$ equations for the large-scale variables $X_n$,

$$\frac{\mathrm{d}X_n}{\mathrm{d}t} + X_{n-1}(X_{n-2} - X_{n+1}) + X_n = F - \frac{h c}{b} \sum_{j = N_S/N\,(n-1)+1}^{n\,N_S/N} Y_j, \qquad (26)$$

with $n = 1, \ldots, N$, and $N_S$ equations for the small-scale variables $Y_m$,

$$\frac{\mathrm{d}Y_m}{\mathrm{d}t} + c\,b\, Y_{m+1}(Y_{m+2} - Y_{m-1}) + c\, Y_m = \frac{h c}{b}\, X_{\mathrm{int}[(m-1)/(N_S/N)]+1}, \qquad (27)$$

where $m = 1, \ldots, N_S$. The two sets of equations, (26) and (27), are assumed to be defined on a periodic domain: $X_{-1} \equiv X_{N-1}$, $X_0 \equiv X_N$, $X_{N+1} \equiv X_1$, and $Y_0 \equiv Y_{N_S}$, $Y_{N_S+1} \equiv Y_1$, $Y_{N_S+2} \equiv Y_2$.

The imperfect model used in the model-identification experiments is the one-scale Lorenz-96 model (23) with a parameterization (24) meant to represent small-scale effects (the right-hand side of (26)).
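For reference, the coupled tendencies (26)-(27) of the two-scale 'nature' model can be coded as follows; storing the $N_S/N$ small-scale variables attached to each $X_n$ contiguously is an assumption of this sketch.

```python
import numpy as np

def l96_twoscale_tendency(X, Y, F=18.0, h=1.0, b=10.0, c=10.0):
    """Tendencies of the two-scale Lorenz-96 model, Eqs. (26)-(27)."""
    N, NS = X.size, Y.size
    J = NS // N                                   # small-scale variables per X_n
    # Large scale, Eq. (26): coupling is -(h c / b) * sum of the attached Y_j.
    coupling = Y.reshape(N, J).sum(axis=1)
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X
          + F - (h * c / b) * coupling)
    # Small scale, Eq. (27): each Y_m is forced by its parent X variable.
    X_parent = np.repeat(X, J)
    dY = (c * b * np.roll(Y, -1) * (np.roll(Y, 1) - np.roll(Y, -2))
          - c * Y + (h * c / b) * X_parent)
    return dX, dY
```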

3.3. Numerical experiment details

As in previous works (see e.g. Wilks, 2005; Pulido et al., 2016), we set $N = 8$ and $N_S = 256$ for the large- and small-scale variables, respectively. The constants are set to the standard values $b = 10$, $c = 10$ and $h = 1$. The external forcing for the model-identification experiments is taken to be $F = 18$. The ordinary differential equations (26)-(27) are solved by a fourth-order Runge–Kutta algorithm, with the time step set to $\mathrm{d}t = 0.001$.

For the model-identification experiments, we aim to mimic the dynamics of the large-scale equations of the two-scale Lorenz-96 system with the one-scale Lorenz-96 system (23) forced by a physical parameterization (24). In other words, our nature is the two-scale model, while our imperfect coarse-grained model is the forced one-scale model. For this reason, we take 8 variables for the one-scale Lorenz-96 model in the twin experiments (the number of large-scale variables in the model-identification experiments). Equations (23) are also solved by a fourth-order Runge–Kutta algorithm. The time step in all the experiments is also set to $\mathrm{d}t = 0.001$.

Fig. 2. Log-likelihood function as a function of (a) model noise for three true observational noise values, $\alpha_R^t = 0.1, 0.5, 1.0$; and (b) as a function of model noise ($\alpha_Q$) and observational noise ($\alpha_R$) for a case with $\alpha_Q^t = 1.0$ and $\alpha_R^t = 0.5$. Darker red shading represents larger log-likelihood.

Fig. 3. Convergence of the NR maximization as a function of the iteration of the outer loop (inner loops are composed of $2N_C + 1$ function evaluations, where $N_C$ is the control space dimension) for different evidencing window lengths ($K = 100, 500, 1000$). (a) Log-likelihood function. (b) Frobenius norm of the model noise estimation error.

The EnKF implementation we use is the ensemble transform Kalman filter (Hunt et al., 2007) without localization. A short description of the ensemble transform Kalman filter is given in the Appendix. The time interval between observations (cycle) is 0.05 (an elapsed time of 0.2 represents about 1 day in the real atmosphere, considering the error growth rates; Lorenz, 1996). The number of ensemble members is set to $N_e = 50$. The number of assimilation cycles (observation times) is $K = 500$. This is the 'evidencing window' (Carrassi et al., 2017) in which we seek the optimal likelihood parameters. The measurement error variance is set to $\alpha_R = 0.5$ except where otherwise stated. We do not use any inflation factor, since the model error covariance matrix is estimated.

The optimization method used in the NR maximization is 'newuoa' (Powell, 2006). This is an unconstrained minimization algorithm which does not require derivatives. It is suitable for control spaces of about a few hundred dimensions. This derivative-free method could eventually permit the extension of the NR maximization method to cases in which the state evolution (1) incorporates a non-additive model error.

Fig. 4. Convergence of the EM algorithm as a function of the iteration for different observation time lengths (evidencing window). An experiment with $N_e = 500$ ensemble members and $K = 500$ is also shown. (a) Log-likelihood function. (b) Frobenius norm of the model noise estimation error.

Fig. 5. Estimated model noise as a function of the iteration in the EM algorithm. (a) Mean diagonal model noise (true value is 1.0). (b) Mean off-diagonal absolute model noise value (true value is 0.0).

4. Results

4.1. Twin experiments: estimation of model noise parameters

The nature integration is obtained from the one-scale Lorenz-96 model (23) with a constant forcing of $a_0 = 17$ and without higher orders in the parameterization; in other words, a one-scale Lorenz-96 model with an external forcing of $F = 17$. Information quantifiers show that for an external forcing of $F = 17$, the Lorenz-96 model is in a chaotic regime with maximal statistical complexity (Pulido and Rosso, 2017). The true model is represented by (1) with model noise following a normal density, $\boldsymbol{\eta}_k \sim N(\mathbf{0}, \mathbf{Q}^t)$. The true model noise covariance is defined by $\mathbf{Q}^t = \alpha_Q^t \mathbf{I}$ with $\alpha_Q^t = 1.0$ (true parameter values are denoted by a $t$ superscript). The observations are taken from the nature integration and perturbed using (22).

A first experiment examines the log-likelihood (21) as a function of $\alpha_Q$ for different true measurement errors, $\alpha_R^t = 0.1, 0.5, 1.0$ (Fig. 2a). A relatively smooth function is found with a well-defined maximum. The function is better conditioned for the experiments with smaller observational noise, $\alpha_R$. Figure 2b shows the log-likelihood as a function of $\alpha_Q$ and $\alpha_R$. The darkest shading is around $(\alpha_Q, \alpha_R) \approx (1.0, 0.5)$. However, note that because of the asymmetric shape of the log-likelihood function (Fig. 2a), the darker red region is shifted toward higher $\alpha_Q$ and $\alpha_R$ values. The up-left to bottom-right orientation of the likelihood pattern in the $(\alpha_Q, \alpha_R)$ plane reveals a correlation between them: the larger $\alpha_Q$, the smaller $\alpha_R$ at the local maximum of the likelihood.

Fig. 6. (a) Estimated mean deterministic parameters, $a_i$, as a function of the EM iterations for the twin parameter experiment. (b) Estimated stochastic parameters, $\sigma_i$.

Fig. 7. (a) Estimated deterministic parameters as a function of the EM iterations for the model-identification experiment. Twenty experiments with random initial deterministic and stochastic parameters are shown. (b) Estimated stochastic parameters. (c) Log-likelihood function.

We conducted a second experiment using the same observations, but the estimation of the model noise covariance matrix is performed with the NR method. The control space has $8 \times 8 = 64$ dimensions, i.e. the full $\mathbf{Q}$ model error covariance matrix is estimated (note that $N = 8$ is the model state dimension). Figure 3a depicts the convergence of the log-likelihood function in three experiments with evidencing windows $K = 100$, 500 and 1000. The Frobenius norm of the error in the estimated model noise covariance matrix, i.e. $\|\mathbf{Q} - \mathbf{Q}^t\|_F = \big[\sum_{ij} (Q_{ij} - Q^t_{ij})^2\big]^{1/2}$, is shown in Fig. 3b. As the number of cycles used in a single batch process increases, the estimation error diminishes.

The convergence of the EM algorithm applied to the estimation of the model noise covariance matrix only ($8 \times 8 = 64$ dimensions) is shown in Fig. 4. This work is focused on the estimation of model parameters, so the observation error covariance matrix is assumed to be known; the method would also allow it to be estimated jointly through (15), but this is beyond the main aim of this work. The experiment is similar to the previous one, using the EM instead of the NR method. In 10 iterations, the EM algorithm achieves a reasonable estimation, which is not further improved for a larger number of iterations. The obtained log-likelihood value is rather similar to that of the NR method. The noise in the log-likelihood function diminishes with longer evidencing windows. The amplitude of the log-likelihood function noise for $K = 100$ is about 3%. These fluctuations are caused by sampling noise; note that the number of likelihood parameters is 64 and the evidencing window is $K = 100$ in this case. For larger $K$, the log-likelihood noise is below 1%. As mentioned above, a certain amount of noise may be beneficial for the convergence of the algorithm.

Fig. 8. (a) Log-likelihood as a function of the $\sigma_0$ parameter at the $\sigma_1$ and $\sigma_2$ optimal values for the NR estimation (green curve) and with the optimal values for the EM estimation (blue curve) for the imperfect-model experiment. (b) Analysis RMSE as a function of the $\sigma_0$ parameter.

Comparing the standard $N_e = 50$ experiments with $N_e = 500$ in Fig. 4a, the noise also diminishes by increasing the number of ensemble members. Increasing the number of members does not appear to impact the estimation of off-diagonal values, but it does affect the diagonal stochastic parameter values (Fig. 5a and b). The error in the estimates is about 7% in both diagonal and off-diagonal terms of the model noise covariance matrix for $K = 100$, and lower than 2% for the $K = 1000$ cycles case (Fig. 5).

4.2. Twin experiments: estimation of deterministic and stochastic parameters

A second set of twin experiments evaluates the estimation of deterministic and stochastic parameters from a physical parameterization. The model used to generate the synthetic observations is (23) with the physical parameterization (24). The length of the assimilation cycle is set to its standard value, 0.05. The deterministic parameters of the nature integration are fixed to $a_0^t = 17.0$, $a_1^t = -1.15$ and $a_2^t = 0.04$, and the model error variance in each parameter is set to $\sigma_0^t = 0.5$, $\sigma_1^t = 0.05$ and $\sigma_2^t = 0.002$, respectively. The true parameters are governed by the stochastic process (25). This set of deterministic parameters is a representative physical quadratic polynomial parameterization, which closely resembles the dynamical regime of a two-scale Lorenz-96 model with $F = 18$ (Pulido and Rosso, 2017). The observational noise is set to $\alpha_R = 0.5$. An augmented state space of 11 dimensions is used, composed by appending to the 8 model variables the 3 physical parameters $(a_0, a_1, a_2)$. The evolution of the augmented state is represented by (1) for the state vector component and by a random walk for the parameters. The EM algorithm is then used to estimate the additive augmented-state model error covariance $\mathbf{Q}$, which is an $11 \times 11$ matrix. Therefore, the smoother recursion gives an estimate of both the state variables and the deterministic parameters. The recursion formula for the model error covariance matrix (and the parameter covariance submatrix) is given by (14).
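The augmented-state procedure can be sketched as follows: each ensemble member carries its own copy of the three parameters, the state part is propagated by the forecast model over one cycle while the parameters evolve by persistence, and the parameter spread is maintained by the parameter block of the estimated model error covariance. The helper `forecast_model(x, a)`, which would integrate (23)-(24) over one assimilation cycle for given parameter values, is assumed.

```python
import numpy as np

def forecast_augmented(Z, forecast_model, n_state=8, n_par=3):
    """Propagate an augmented ensemble Z of shape (Ne, n_state + n_par):
    the state part is advanced by forecast_model(x, a), while the parameter
    part evolves by persistence (its spread is maintained by the parameter
    block of the estimated model error covariance Q)."""
    Z_new = Z.copy()
    for m in range(Z.shape[0]):
        x, a = Z[m, :n_state], Z[m, n_state:]
        Z_new[m, :n_state] = forecast_model(x, a)   # state forecast over one cycle
        Z_new[m, n_state:] = a                      # parameters: persistence
    return Z_new
```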

Figure 6a shows the estimation of the mean deterministic parameters as a function of the EM iterations. The estimation of the deterministic parameters is rather accurate; $a_2$ has a small true value and presents the lowest sensitivity. The estimation of the stochastic parameters by the EM algorithm converges rather precisely to the true stochastic parameters (Fig. 6b). The convergence requires about 80 iterations. The estimated model error for the state variables is of the order of $5 \times 10^{-2}$. This represents the additive inflation needed by the filter for an optimal convergence, and it establishes a lower threshold for the estimation of additive stochastic parameters.

A similar experiment was conducted with the NR maximization for the same synthetic observations. A scaling of $S_\sigma = (1, 10, 100)$ was included in the optimization to improve the conditioning of the problem. A good convergence was obtained with the optimization algorithm. The estimated optimal parameter values are $\sigma_0 = 0.38$, $\sigma_1 = 0.060$ and $\sigma_2 = 0.0025$, for which the log-likelihood is $l = -491$. The estimation is reasonable, with a relative error of about 25%.

4.3. Model-identification experiment: estimation of the deterministic and stochastic parameters

As a proof-of-concept model-identification experiment, we now use synthetic observations with an additive observational noise of $\alpha_R = 0.5$ taken from the nature integration of the two-scale Lorenz-96 model with $F = 18$. On the other hand, the one-scale Lorenz-96 model is used in the ensemble Kalman filter with a physical parameterization that includes the quadratic polynomial function (24) and the stochastic process (25). The deterministic parameters are estimated through an augmented state space, while the stochastic parameters are optimized via the algorithm for the maximization of the log-likelihood function.

Fig. 9. (a) Scatterplot of the true small-scale effects in the two-scale Lorenz-96 model as a function of a large-scale variable (coloured dots) and scatterplot of the deterministic parameterization with optimal parameters (white dots). (b) Scatterplot from the stochastic parameterization with optimal parameters obtained with the EM algorithm and (c) with the NR method. (d) Scatterplot given by a constrained random walk with optimal EM parameters.

The model error covariance estimation is constrained in these experiments to the three stochastic parameters alone. Figure 7a shows the estimated deterministic parameters as a function of the EM iterations. Twenty experiments with different initial deterministic parameters and initial stochastic parameter values were conducted. The deterministic parameter estimation does not manifest a significant sensitivity to the stochastic parameter values. The mean estimated values are $a_0 = 17.3$, $a_1 = -1.25$ and $a_2 = 0.0046$. Note that the deterministic parameter values estimated with information quantifiers in Pulido and Rosso (2017) for the two-scale Lorenz-96 model with $F = 18$ are $(a_0, a_1, a_2) = (17.27, -1.15, 0.037)$. Figure 7b depicts the convergence of the stochastic parameters. The means of the optimal stochastic parameter values are $\sigma_0 = 0.60$, $\sigma_1 = 0.094$ and $\sigma_2 = 0.0096$, with the log-likelihood value being 98.8 (single realization). The convergence of the log-likelihood is shown in Fig. 7c.

The NR maximization is applied to the same set of synthetic observations. The mean estimated deterministic and stochastic parameters are $(a_0, a_1, a_2) = (17.2, -1.24, 0.0047)$ and $(\sigma_0, \sigma_1, \sigma_2) = (0.59, 0.053, 0.0064)$ from 20 optimizations. As in the EM experiment, only the three stochastic parameters were estimated as likelihood parameters. Preliminary experiments with the full augmented model error covariance gave smaller estimated $\sigma_0$ values and non-negligible model error variance (not shown). The log-likelihood function (Fig. 8a) and the analysis root-mean-square error (RMSE, Fig. 8b) are shown as a function of the $\sigma_0$ parameter.
