• Aucun résultat trouvé

Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

N/A
N/A
Protected

Academic year: 2022

Partager "Combining predictive coding and neural oscillations enables online syllable recognition in natural speech"

Copied!
7
0
0

Texte intégral

(1)

Article

Reference

Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

HOVSEPYAN, Sevada, OLASAGASTI RODRIGUEZ, Miren Itsaso, GIRAUD MAMESSIER, Anne-Lise

Abstract

On-line comprehension of natural speech requires segmenting the acoustic stream into discrete linguistic elements. This process is argued to rely on theta-gamma oscillation coupling, which can parse syllables and encode them in decipherable neural activity. Speech comprehension also strongly depends on contextual cues that help predicting speech structure and content. To explore the effects of theta-gamma coupling on bottom-up/top-down dynamics during on-line syllable identification, we designed a computational model (Precoss—predictive coding and oscillations for speech) that can recognise syllable sequences in continuous speech. The model uses predictions from internal spectro-temporal representations of syllables and theta oscillations to signal syllable onsets and duration.

Syllable recognition is best when theta-gamma coupling is used to temporally align spectro-temporal predictions with the acoustic input. This neurocomputational modelling work demonstrates that the notions of predictive coding and neural oscillations can be brought together to account for on-line dynamic sensory processing.

HOVSEPYAN, Sevada, OLASAGASTI RODRIGUEZ, Miren Itsaso, GIRAUD MAMESSIER, Anne-Lise. Combining predictive coding and neural oscillations enables online syllable recognition in natural speech.

Nature Communications

, 2020, vol. 11, no. 1, p. 3117

DOI : 10.1038/s41467-020-16956-5 PMID : 32561726

Available at:

http://archive-ouverte.unige.ch/unige:143429

Disclaimer: layout of this document may differ from the published version.

1 / 1

(2)

Supplementary information

Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

Hovsepyan et al.

(3)

Supplementary Figures

Supplementary Figure 1. Extraction of spectrotemporal patterns. The figure illustrates how the spectrotemporal pattern (STf ) of each syllable was calculated. We divide each syllable into 8 bins of equal duration. As we have six frequency channels and eight gamma units per syllable, we created 6x8 matrices where each entry corresponds to the average amplitude of the associated frequency channel over the duration of each of the 8 temporal bins. For example, the entry ST27 (f=2 and =7), the dashed entry on the figure, represents the average activation of the second frequency band within the seventh gamma cycle.

(4)

Supplementary Figure 2: Dynamics of syllable and gamma units and performance evaluation. Panels a and b show the dynamics of syllable and gamma units for an example sentence; “Brakes shrieked behind us”. For the syllable units, each colour corresponds to a syllable (except the bright red line that is reserved for a silent unit - a unit that represents

“silence” in the input). For the gamma units, each colour corresponds to a specific gamma unit in the sequence (blue corresponding to the first unit and teal corresponding to the last (8-th) unit). Panels on the right side of the plot illustrate order and duration of each syllable in the input sentence (panel d) and the model’s output (panel c) based on the dynamics of the syllable units (panel a). We select the syllable unit that has the highest average activation between two consecutive gamma-1 unit activations (highlighted blue unit in panel b) as those represent when the model started a new gamma sequence for a new syllable. Therefore, we quantify how long the selected syllable units correspond to the syllable in the input (red arrowed lines in panel c). For this particular sentence (with a duration of 975 ms) the duration of the overlap was 582ms; hence model performance was 59.7% (100 * 582/975).

(5)

Supplementary Figure 3. Schematics of the model. The figure shows all the variables in the model. Black boxes include the hidden states of each level, whereas grey boxes correspond to the information that each level passes to the level below. The top-level sends information about the slow amplitude modulation A(1) and the dynamics of the syllable (1)

and gamma units (1). The output of the first level (A(0) for the slow amplitude modulation and

f(0) for the frequency channels) is then compared with the input signal (S(t) and Ff(t) respectively).

(6)

Supplementary Tables

Supplementary Table 1.

Model variant

A B C D E F

A 0.765 8.41e-11 1.16e-11 9.93e-35 7.26e-35

B 3.35e-9 9.8e-12 2.68e-35 1.14e-35

C 0.1733 6.19e-31 5.62e-31

D 6.84e-28 4.59e-28

E 0.8177

F

p-values for pairwise comparisons. For pairwise comparisons (overall N=15), the Wilcoxon signed-rank test was used. All differences, except A vs. B, C vs. D and E vs. F, are highly significant with p < 1e-7/N (Bonferroni correction).

(7)

Supplementary Table 2.

Model variant

A B C D E F

A 2520 1607 5721 34834 36035

B -912 3201 32314 33516

C 4114 33227 34428

D 29113 30314

E 1201

F

Bayesian Information Criterion. Each row shows how much the BIC value of the corresponding model variant is bigger/smaller than the BIC value of other variants.

Références

Documents relatifs

It has been shown that the appropriate multidimensional quaternion representation of acoustic features, alongside with the Hamilton product, help QCNN and QRNN to well-learn

Keywords : Automatic speech recognition (ASR) Universal background model (GMM-UBM) G729 VoIP Resynthesized speech T-norm scoring. Centre de Recherche en Technologies Industrielles

In the first experiment (see Section 4) the speaker identification and verification performance degradation due to the utilization of the three GSM speech coders was assessed.. In

The optimal Neural Net Work is that of non-ordered data base, it consists of one layer, five nodes and 150 steps, because it gives a satisfying error rate and works better than that

The present study investigates the duration of syllables with relation to position within phrases and the pattern of segment omissions within syllables in a text read by 12 French

Phonemes are first grouped into relevant classes and rules are then divided in two categories: general rules, that give the boundary between 2 vowels depending on the number

To reduce the complexity of LP analysis and design an efficient speech coding in fixed-point arithmetic, different algorithms are implemented with Number Theoretic Transforms (N T T

The results of testing for assessing intelligibility using experts and the proposed approach for healthy speakers with and without the use of tongue in pronunciation and patients