Reward prediction error and phase connectivity during declarative learning 1

Abstract

Recent investigations into the role of reward prediction errors (RPEs) have demonstrated that both non-declarative (or procedural) and declarative learning are modulated by RPEs. In particular, earlier studies from our lab found that large, positive RPEs boost declarative learning.

However, it has so far remained unclear how RPEs exert this effect at the neural level. One plausible hypothesis is that RPEs induce theta (4–8 Hz) phase synchronization, thereby enabling long-term potentiation, facilitating synaptic plasticity, and binding the brain regions relevant to the task at hand. As a first test of this hypothesis, we investigated whether RPEs modulate theta phase connectivity between the frontal and motor cortex in a declarative learning task. Participants learned the word categories of 60 Swahili words while EEG was recorded. At the start of performance feedback processing, the data revealed an oscillatory (delta) power signature consistent with unsigned RPE (URPE; a surprise, or different-than-expected signal). Slightly later during performance feedback processing, we observed oscillatory (theta, delta, and high-alpha) power signatures for signed RPE (SRPE; a better-than-expected signal). Interestingly, the current results suggest increased phase synchronization in the delta frequency range predicted by URPE. Finally, in line with our hypothesis, one cluster in the theta frequency range showed increased phase connectivity between the frontal and right motor cortex during performance feedback processing. Activity within this cluster reflected a URPE pattern, suggesting an important role of surprise in declarative learning, mediated by phase connectivity.

1 Based on: Ergo, K., Verbeke, P., De Loof, E., & Verguts, T. (in preparation). Reward prediction error and phase connectivity during declarative learning.


Introduction

Reward prediction errors (RPEs; i.e., deviations between the predicted (or expected) and received reward) play a fundamental role in learning (Rescorla & Wagner, 1972; Steinberg et al., 2013). Historically, RPEs were primarily studied in procedural memory (e.g., Ohlsson, 1996), the memory for skills and habits that are gradually learned. Interestingly, a recent body of work has drawn important parallels between procedural and declarative memory, the memory for facts, events, and concepts that are rapidly acquired (Bastin & Van der Linden, 2003; Eichenbaum, 2004; Shohamy & Adcock, 2010; Squire, 2004, 2009; Ullman, 2004).

Specifically, several independent research groups established that RPEs also drive declarative learning (for a review, see Ergo, De Loof, & Verguts, 2020). Declarative learning relies heavily on two brain regions: the hippocampus and the medial prefrontal cortex (mPFC) (Preston & Eichenbaum, 2013). Evidence suggests that these key regions are connected both anatomically (Jay & Witter, 1991; Thierry et al., 2000) and functionally (Siapas et al., 2005).

Broadly speaking, studies investigating RPE-based declarative learning can be categorized into two main approaches. In the reward-prediction approach (Bunzeck et al., 2010; Davidow et al., 2016; De Loof et al., 2018; Jang et al., 2019; Rouhani et al., 2018; Wimmer et al., 2014), participants acquire declarative information while concurrently estimating a reward distribution, which is then used to calculate an RPE. The degree of difficulty in estimating this distribution varies among experiments. Another way to investigate the role of RPEs in declarative learning is the multiple-repetition approach (Butterfield & Metcalfe, 2001; Metcalfe et al., 2012; Pine et al., 2018). Here, participants are exposed to general information questions that are repeated throughout the experiment. Participants calculate an RPE based on their confidence in previous presentations of the questions. Notably, studies using the reward-prediction approach usually find signed RPE (SRPE; signifying that the outcome is better or worse than expected) effects, with enhanced memory performance for large positive RPEs. By contrast, the multiple-repetition approach typically yields unsigned RPE (URPE, or surprise; indicating that the outcome is different than expected) effects, with increased memory performance for large absolute-valued RPEs.

Although several recent publications have appeared documenting RPE-based declarative learning, how RPEs drive declarative learning on the neural level has rarely been studied before. Therefore, in the current chapter, we focus on its neural underpinnings.

Neurally, RPEs are coded by dopaminergic neurons located in the ventral tegmental area (VTA) (Cohen et al., 2012) and substantia nigra (SN) (Zaghloul et al., 2009), and are projected to several brain regions, including the medial frontal cortex (MFC) (Nieuwenhuis et al., 2004). The firing pattern of these dopaminergic neurons changes in response to RPEs: Neurons show increased activity when the outcome is better than expected (i.e., positive prediction error), decreased activity when the outcome is worse than expected (i.e., negative prediction error), and remain at baseline level when the outcome matches the expectation (i.e., no prediction error) (Schultz et al., 1997). RPEs thus trigger the release of neuromodulators within the brain. One possible neuromodulatory influence of dopamine happens via the modulation of neural oscillations.

For brain regions to functionally cooperate, to optimally communicate, and for learning to take place, it is thought that the brain’s neural activity (or oscillations) must be synchronized (Fries, 2005, 2009; Varela et al., 2001; Verguts, 2017). More concretely, for information to transfer efficiently between brain regions, both the sending and receiving neurons have to be excitable at the same time. In declarative learning, oscillations in the theta (4–8 Hz) frequency band seem to be of particular importance (Herweg et al., 2019; Hsieh & Ranganath, 2014; Klimesch & Doppelmayr, 1996). More specifically, it is hypothesized that (theta) phase synchronization between task-relevant brain areas facilitates memory encoding


2008). More precisely, it is hypothesized that this neurochemical process allows (declarative) episodes to be efficiently glued together into a single (declarative) memory (Berens & Horner, 2017). Our previous EEG study (Ergo et al., 2019) provided preliminary evidence for this idea. During reward feedback, EEG oscillatory signatures were found first in the theta band (4–8 Hz), reflecting a URPE signal, and then in the high-beta (20–30 Hz) and high-alpha (10–17 Hz) frequency bands, reflecting an SRPE signal.

This raises the question of how RPEs modulate (theta) phase synchronization. One mechanism proposed by Fries (2005) is that during reward feedback, theta phase resetting happens in the medial frontal cortex (MFC). Phase resetting is the realignment of the phases of ongoing neural oscillations. This phase reset is broadcast to relevant processing areas (e.g., motor cortex), which in turn facilitates synchronization (or binding) between these brain areas and allows learning to take place. In other words, one possibility is that RPEs are projected to MFC (Nieuwenhuis et al., 2004) via dopaminergic neuromodulatory signaling, where they induce a theta phase reset followed by an increase in theta phase synchronization between relevant brain areas (i.e., the theta phases of two brain areas become aligned so that theta oscillations in both areas ‘go up and down’ together) (Canavier, 2015). Indeed, studies showing theta phase synchronization in (reward prediction) error-based learning provide support for this hypothesis (Cavanagh et al., 2010; Cohen et al., 2007). Additionally, the computational model of Verbeke and Verguts (2019) further supports the idea that RPE-theta interactions drive learning. Specifically, in their model negative RPEs induce increased theta power and thus behavioral change.

The role of theta phase synchronization has already been well-established in procedural learning (e.g., Sauseng et al., 2007). In the current study, we aimed to investigate whether RPE-based declarative learning is also driven by theta phase synchronization. For this reason, we measured electroencephalography (EEG) signals and investigated whether RPEs modulate phase connectivity between the frontal and motor cortex, and how such phase connectivity, in turn, drives declarative learning. Based on our previous findings (Ergo et al., 2019), we hypothesized that: (1) behaviorally, recognition accuracy should increase with SRPE, and (2) neurally, theta (4–8 Hz) power and theta phase synchronization between the frontal and motor cortex should increase with URPE (i.e., large absolute-valued RPEs).

By subtracting right-hand button presses from left-hand button presses, we expected to find increased phase synchronization between the frontal and right motor cortex. Based on the literature, we predicted that (theta) phase synchronization should enable declarative learning and that the degree of synchronization should scale with RPE. More concretely, increased phase synchronization for large (absolute-valued) RPEs should be correlated with enhanced recognition accuracy. To test this, we used an adapted version of the variable-choice paradigm in which participants learned 60 word-category associations. Crucially, we introduced four word categories that remained the same across trials. Two of these categories appeared left-of-center on the screen and were thus tied to left-hand button presses, while the other two categories appeared right-of-center and were thus coupled with right-hand button presses. By doing so, we ensured that RPE conditions were experimentally crossed with left- and right-hand button presses. This manipulation allowed us to examine oscillatory power and phase connectivity between the frontal and (left and right) motor cortex during declarative learning.

Methods

Participants

The sample size was estimated based on a previous EEG study that used the variable-choice paradigm (Ergo et al., 2019). For the current experiment, a total of 39 participants were tested. All participants were recruited through the online recruitment platform provided by Ghent University. During the analysis, three participants were discarded due to noisy EEG signals. All analyses were run on the remaining 36 participants (30 female), all of whom were healthy, Dutch-speaking, and right-handed, with normal or corrected-to-normal vision. For this experiment,


participant with the best performance in the recognition test was additionally rewarded with a gift voucher worth €20.

Material

A total of 64 words (60 Swahili words and 4 Dutch word categories) (see Appendix) were used. The experiment was run on a Dell Optiplex 9010 mini-tower running E-Prime software (Schneider et al., 2012).

Procedure

Familiarization Task

To familiarize participants with the stimuli used in the experiment and to control for the novelty of the foreign Swahili words, a familiarization task was included at the start of the experiment. All Swahili words (N = 60) were presented in random order for 2 seconds each. Participants were instructed to read the words in silence.

Acquisition Task

Participants had to learn the word category of 60 Swahili words. On each trial, one Swahili word was presented at the top of the screen together with four word categories (animal, job, color, or plant) below, of which only one was the correct word category (Figure 1a). The presentation of the word categories was counterbalanced across participants. After 4 seconds, either one, two, or four word categories were framed. These frames indicated the eligible word categories for the Swahili word. In the one-option condition, only one word category was framed and participants could choose with certainty (i.e., 100% probability of being rewarded). In the two-option condition, two word categories were framed and participants had a 50% probability of choosing the correct option (and thus of obtaining reward). In the four-option condition, all four word categories were framed and this probability dropped to 25%. Each word category was assigned one response key (‘d’, ‘f’, ‘j’ or ‘k’) and participants had to respond with the middle and index finger of the left and right hand. A crucial manipulation in the paradigm was that, for each participant and on each trial, the same two word categories were paired with either a left-hand (‘d’ and ‘f’) or right-hand (‘j’ and ‘k’) response. This allowed us to study lateralized connectivity effects of RPE. There was no time limit to respond. After participants made their choice, a reward anticipation phase of 3 seconds was inserted to prevent motor artifacts from contaminating the EEG signal. This was followed by performance feedback for 3 seconds. Next, the to-be-learned Swahili word-category pair was presented for 5 seconds. The Swahili word, an equals sign, and its word category appeared on the screen. Participants were instructed to use this time to encode the pair. If the chosen word category was rewarded, a green frame was presented around the Swahili word and the chosen word category. Alternatively, if the chosen word category was unrewarded, a red frame appeared around the Swahili word and one of the other possible word category options. Each trial ended with a reward update for 2.5 seconds, indicating how much money participants had earned up to and including the last completed trial. Participants won €0.71 on rewarded trials, while no money was added on unrewarded trials. In Figure 1a, the two-option condition with unrewarded choice is illustrated.

Design. RPEs were calculated by a priori determining the number of options (one, two, or four options) as well as the reward (reward/no reward) on each trial. By doing so, an RPE for each cell of the design could be computed (Figure 1c). Note that by predetermining performance feedback on each trial, participants did not necessarily learn the actual word categories of the Swahili words. For example, if a trial belonged to the rewarded condition, participants received positive feedback irrespective of their choice. Participants were unaware of this manipulation during the experiment but were debriefed afterward. SRPEs were obtained by subtracting reward probability from reward outcome. For rewarded trials, reward outcome is equal to one, whereas reward outcome is equal to zero for unrewarded trials. Reward probability is one divided by the number of eligible options. URPEs are


were sequentially presented onto the screen. Participants pressed ‘f’ for digits smaller than 5 and ‘j’ for digits larger than 5.

Recognition Test

Participants were presented with the 60 Swahili word-category pairs. On each trial, the Swahili word appeared on top of the screen together with the same four word categories from the acquisition task (Figure 1b). The spatial position of the word categories remained the same within participants. This time no word categories were framed. No time constraint was imposed. Participants made their choice by pressing one of the four designated response buttons (‘d’, ‘f’, ‘j’, ‘k’). After indicating their choice, we probed how certain they were of their answer: ‘very uncertain’, ‘rather uncertain’, ‘rather certain’ or ‘very certain’ (measured on a scale from 1 ‘very uncertain’ to 4 ‘very certain’). At the end of each trial, they were given feedback about whether they chose the correct word category for the Swahili word.

Figure 1

Experimental Paradigm (a-b) and Design (c)

Note. Overview of the experiment (a-b) and experimental design (c). (a) In the acquisition task, participants chose between one, two, or four word categories. An acquisition trial is illustrated assuming that this is a trial from the “2 options, rewarded choice” cell in the experimental design, and the participant made a left-hand button response choosing ‘animal’


unsigned RPE (SRPE and URPE). SRPEs were calculated by subtracting the probability of reward from obtained reward; URPE is the absolute value of SRPE.
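The SRPE and URPE definitions given in the text (reward outcome minus reward probability, and its absolute value) can be illustrated with a minimal Python sketch. The function names `signed_rpe`, `unsigned_rpe`, and `design_cells` are hypothetical and chosen for illustration only; the sketch assumes that the one-option condition is always rewarded, as described in the acquisition task.

```python
from itertools import product


def signed_rpe(n_options: int, rewarded: bool) -> float:
    """SRPE = reward outcome (1 or 0) minus reward probability (1 / n_options)."""
    reward_probability = 1.0 / n_options
    reward_outcome = 1.0 if rewarded else 0.0
    return reward_outcome - reward_probability


def unsigned_rpe(n_options: int, rewarded: bool) -> float:
    """URPE = absolute value of the SRPE."""
    return abs(signed_rpe(n_options, rewarded))


def design_cells() -> dict:
    """Map each (n_options, rewarded) design cell to its (SRPE, URPE) pair."""
    cells = {}
    for n_options, rewarded in product((1, 2, 4), (True, False)):
        if n_options == 1 and not rewarded:
            # Assumption from the task description: one-option trials
            # are rewarded with 100% probability, so this cell is empty.
            continue
        cells[(n_options, rewarded)] = (
            signed_rpe(n_options, rewarded),
            unsigned_rpe(n_options, rewarded),
        )
    return cells


if __name__ == "__main__":
    for (n_options, rewarded), (srpe, urpe) in design_cells().items():
        print(f"options={n_options} rewarded={rewarded} SRPE={srpe:+.2f} URPE={urpe:.2f}")
```

Under these definitions, a rewarded four-option trial yields the largest positive SRPE (0.75), whereas an unrewarded four-option trial yields a small negative SRPE (−0.25) and therefore also a small URPE.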

Data Analysis

All behavioral data were analyzed using the linear mixed effects framework in R software (R Core Team, 2014). For continuous dependent variables (e.g., certainty ratings in the recognition test) linear mixed effects models were used, while for categorical dependent variables (e.g., recognition accuracy) generalized linear mixed effects models were applied. A random intercept for participant was included in each model, and all predictors were mean-centered. Note that SRPEs were treated as a continuous predictor, allowing the inclusion of all 60 trials per participant to estimate its regression coefficient, except for invalid trials (i.e., trials on which a non-framed word category was chosen during the acquisition task). We report the χ² statistics from the ANOVA Type III tests. The best-fitting models were selected based on Akaike’s information criterion (AIC) (with k the number of independent variables and L the maximized likelihood of the model):

$$\mathrm{AIC} = 2k - 2\ln(L)$$

Lower AIC values indicate a better model, balancing model complexity (k) and empirical fit (L).
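As an illustration of this model-selection rule, the following Python sketch computes the AIC from a model's number of parameters and its log-likelihood ln(L), and picks the candidate with the lowest value. The function names and the numerical values in the example are hypothetical; they are not taken from the analyses reported here, which were run in R.

```python
def aic(k: int, log_likelihood: float) -> float:
    """AIC = 2 * k - 2 * ln(L), where log_likelihood is ln(L)."""
    return 2 * k - 2 * log_likelihood


def best_model(candidates: dict) -> str:
    """Return the name of the candidate with the lowest AIC.

    candidates maps a model name to a (k, log_likelihood) tuple.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    candidates = {
        "srpe_only": (3, -120.0),
        "srpe_plus_urpe": (4, -119.8),
    }
    for name, (k, log_lik) in candidates.items():
        print(name, aic(k, log_lik))
    print("selected:", best_model(candidates))
```

In this made-up example the extra predictor improves the log-likelihood by only 0.2, which does not offset the penalty of 2 per additional parameter, so the simpler model is selected.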

EEG

EEG Preprocessing

Continuous EEG activity was collected using a BioSemi ActiveTwo system (BioSemi, Amsterdam, Netherlands) at a sampling rate of 1024 Hz. Sixty-four Ag-AgCl electrodes were arranged according to the standard international 10-20 EEG system (Jasper, 1958), with a posterior CMS-DRL electrode pair. Two reference electrodes were positioned at the left and right mastoids.

Vertical eye movements and eye blinks were registered with a pair of electrodes above and below the left eye, while horizontal eye movements were measured with two electrodes at the outer canthi of both eyes.

Data preprocessing was done in MATLAB (MATLAB R2014b, The MathWorks, Inc., Natick, Massachusetts, United States) using in-house EEGLAB preprocessing scripts (Delorme & Makeig, 2004). First, the data were re-referenced offline to the average of the mastoid electrodes, and data stretches with excessive noise were manually removed after visual inspection. Next, the data were band-pass filtered between 0.5 Hz and 48 Hz. Independent component analysis (ICA) was run on the filtered data to remove eye movement artifacts. Two participants required the interpolation of one electrode due to noisy EEG signals. Afterward, the ICA weights were applied to the unfiltered but manually cleaned EEG data. Next, the data were epoched time-locked to the onset of the performance feedback in the acquisition task. Baseline activation was extracted between -600 ms and 0 ms relative to performance feedback onset and subtracted from all epochs. Finally, the data were downsampled to 512 Hz before time-frequency analysis.

EEG Time-Frequency Analysis

For the EEG time-frequency analysis, the procedure was based on the code provided by Cohen (2014). Time-frequency power and phase information were extracted by applying complex Morlet wavelets for frequencies between 2 and 48 Hz defined in 25 logarithmically spaced steps. For each frequency, we used between 3 and 8 cycles, which were also defined in 25 logarithmically spaced steps. Time-frequency decomposition was done separately for each participant, electrode, and trial. Next, a baseline correction was performed. Baseline correction was done for each participant, frequency, and electrode separately, based on the average baseline activity across all 60 trials. Baseline power was computed from -600 to 0 ms relative to performance feedback onset. Finally, the baseline-corrected data underwent a decibel conversion. In analogy to the behavioral analysis, trials were removed when an invalid (i.e.,


are represented in zero-phase differences only (Bruña et al., 2018; Nolte et al., 2004). Here, we computed the iPLV between electrode FCz and all other scalp electrodes for every time point and RPE condition in the performance feedback-locked data. The following equation was used:

This equation computes the average phase angle difference over trials (n trials, indexed by t).
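For illustration, a commonly used definition of the imaginary part of the phase-locking value between a seed electrode and a target electrode is $\mathrm{iPLV} = \left| \frac{1}{n} \sum_{t=1}^{n} \operatorname{Im}\left( e^{i(\varphi_{\mathrm{seed},t} - \varphi_{\mathrm{target},t})} \right) \right|$. This is a sketch of that general definition, not necessarily the exact notation used here; the symbol φ, the subscripts, and the function name `iplv` in the NumPy code below are assumptions introduced for the example.

```python
import numpy as np


def iplv(phase_seed, phase_target):
    """Imaginary part of the phase-locking value, computed across trials.

    Both inputs are phase angles in radians with shape (n_trials, n_times),
    e.g. for one frequency at the seed electrode and at one target electrode.
    Returns an array of shape (n_times,) with values between 0 and 1.
    """
    phase_seed = np.asarray(phase_seed, dtype=float)
    phase_target = np.asarray(phase_target, dtype=float)
    phase_diff = phase_seed - phase_target
    # Keep only the imaginary part of the unit vectors before averaging,
    # so that zero-lag (0 or pi) phase differences contribute nothing.
    return np.abs(np.mean(np.imag(np.exp(1j * phase_diff)), axis=0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seed = rng.uniform(-np.pi, np.pi, size=(60, 5))
    print(iplv(seed, seed))              # zero lag: all values are 0
    print(iplv(seed, seed - np.pi / 2))  # constant quarter-cycle lag: close to 1
```

Because only the imaginary part enters the average, identical phases at the two electrodes give a value of zero, whereas a consistent non-zero phase lag across trials gives a value close to one.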

Non-Parametric Clustering Analysis

Power. Here, we tested whether oscillatory power in several frequency bands reflected the occurrence of SRPEs. To account for multiple comparisons, we applied a non-parametric clustering procedure (Maris & Oostenveld, 2007) tailored to our paradigm. First, we computed for each participant correlations between SRPE and the power estimates at each (electrode, frequency, time) point. Next, the 0.5% highest positive and lowest negative correlations were entered into the non-parametric clustering analysis with 1000 iterations. From these, we clustered adjacent neighbors in the electrode, frequency, and time domains. Correlations were considered neighbors when electrodes were located within 4 cm of each other, and when they were within one frequency step and within one time step. For each of these clusters, we then calculated a cluster-level statistic by multiplying the number of items (i.e., (electrode, frequency, time) points) in the cluster by the largest correlation statistic of that cluster. Next, to determine the significance of the observed positive and negative clusters, we performed a non-parametric permutation test on the observed cluster-level statistics. During this procedure, the SRPE labels were randomly permuted for each participant before computing the random correlations between SRPE and the power estimates. A significance threshold of 5% was imposed on the subsequent non-parametric permutation test with 1000 iterations. Finally, clusters that survived this permutation test were taken into the subsequent analyses (see also Ergo et al., 2019).

iPLV. In line with Cavanagh et al. (2010), for the phase connectivity analyses, we considered FCz as a seed electrode and all other scalp electrodes as receiver electrodes. The iPLV was computed for each time point in the performance feedback-locked data. Note that the iPLV is computed across trials, which means that we have one iPLV value for each RPE condition for each participant. We again applied a non-parametric clustering procedure (the same procedure as described above for power, with the exception that condition labels are shuffled instead of
