Emergence of attention in a neural model of visually grounded speech

Partager "Emergence of attention in a neural model of visually grounded speech"

N/A

N/A

Protected

Année scolaire: 2021

Info

Protected

Academic year: 2021

Partager "Emergence of attention in a neural model of visually grounded speech"

Copied!

2

0

0

2

0

0

Chargement.... (Voir le texte intégral maintenant)

Télécharger maintenant ( 2 Page )

Texte intégral

(1)

HAL Id: hal-01970514

https://hal.archives-ouvertes.fr/hal-01970514

Submitted on 5 Feb 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



William Havard, Jean-Pierre Chevrot, Laurent Besacier

To cite this version:

William Havard, Jean-Pierre Chevrot, Laurent Besacier. Emergence of attention in a neural model of visually grounded speech. Learning Language in Humans and in Machines 2018 conference, Jul 2018, Paris, France. ⟨hal-01970514⟩


William N. Havard¹², Jean-Pierre Chevrot¹, Laurent Besacier²

¹ LIDILEM, Univ. Grenoble Alpes, 38000 Grenoble, France

² LIG, Univ. Grenoble Alpes, CNRS, Grenoble INP, 38000 Grenoble, France

1. Introduction

• Context provides children with all necessary information to build a coherent mental representation of the world. While acquiring their native language, children learn to map portions of this mental representation to all or part of the acoustic realisations they perceive in surrounding speech: this process is known as lexical acquisition.

• The goal of this research is to study how a visually grounded neural network segments a continuous flow of speech, in the hope that it sheds light on how children perform the same process. To do so, we analyse how the representation learnt by the network changes over the course of training, focusing specifically on its attention mechanisms.

• An attention mechanism is a component that computes a vector representing the focus of the system on different parts of the speech input.
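As a rough illustration of this idea, the sketch below scores each time step of an encoded utterance, turns the scores into weights with a softmax, and returns the weighted sum of the hidden states. The scoring vector `w`, the function names and the toy values are assumptions made for illustration; the poster does not specify the exact form of attention used in the model.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(states, w):
    """Weight each time step of `states` and return (weights, summary vector).

    states: list of T hidden vectors (one per speech frame), each of size D
    w:      hypothetical scoring vector of size D (learnt in a real model)
    """
    # One scalar score per time step: dot product of the state with w.
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    # Weighted sum of the states, dimension by dimension.
    summary = [sum(a * h[d] for a, h in zip(weights, states)) for d in range(dim)]
    return weights, summary

# Toy utterance: 4 frames, 2-dimensional hidden states (made-up numbers).
states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [0.0, 0.3]]
weights, summary = attend(states, w=[1.0, 1.0])
```

The `weights` list is the quantity that the later sections inspect: one value per frame, with larger values marking the frames the system focuses on.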

2. Architecture

             r@1     r@5     r@10    Avg. rank
Base model   0.108   0.305   0.443   13
Our model    0.059   0.193   0.301   25
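For readers unfamiliar with these metrics, r@k is usually read as recall at k: the share of queries whose correct item appears among the top k retrieved candidates. The sketch below computes it from a list of ranks. The rank values are invented, and treating "Avg. rank" as a plain arithmetic mean is an assumption, since the poster does not define it.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct item is ranked within the top k.

    ranks: 1-based rank of the correct item for each query
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Plain arithmetic mean of the ranks (assumed meaning of 'Avg. rank')."""
    return sum(ranks) / len(ranks)

# Made-up ranks for five queries, for illustration only.
ranks = [1, 3, 7, 12, 40]
r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
avg = mean_rank(ranks)
```

Higher recall and a lower rank both indicate better retrieval.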

• Based on Chrupała et al. (2017)

• Two attention mechanisms (inspired by Mattys et al., 2005)

• Model regularly saved during training

3. Attention weights

Attention weights for GRU1 and GRU5 evolve differently over time, suggesting that they encode speech in different ways.

Peaks are first randomly positioned and gradually shift to the parts of the speech signal that are useful for describing the image, before reaching their final position.

[Figure: attention weights at the 0th epoch (random weights), the 1st epoch (26k training instances) and the 15th epoch (8.4M training instances)]

4. Peak detection and statistics

• Peak detection for each level of attention

• Statistics on the number and height of peaks, and on the speech segments located underneath them

• Attention is unevenly distributed between GRU1 and GRU5 over time:

– The network gradually reduces the number of peaks for GRU5 while making them higher. It focuses on one very specific part of the speech signal: the main concept of the image.

– Conversely, the number of peaks for GRU1 increases while the peaks become smaller. GRU1 attention is distributed over the whole speech signal and is thus less specific.
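A minimal sketch of this kind of analysis is given below, assuming that a peak is a strict local maximum of the attention vector above a chosen height threshold. The poster does not say which peak detection method was used, so the definition, the `min_height` parameter and the toy vectors are all assumptions.

```python
def find_peaks(weights, min_height=0.0):
    """Return indices of strict local maxima that reach at least `min_height`."""
    peaks = []
    for i in range(1, len(weights) - 1):
        is_local_max = weights[i] > weights[i - 1] and weights[i] > weights[i + 1]
        if is_local_max and weights[i] >= min_height:
            peaks.append(i)
    return peaks

def peak_stats(weights, min_height=0.0):
    """Number of peaks and their mean height for one attention vector."""
    peaks = find_peaks(weights, min_height)
    heights = [weights[i] for i in peaks]
    mean_height = sum(heights) / len(heights) if heights else 0.0
    return len(peaks), mean_height

# Made-up attention vectors: one spread out, one concentrated.
spread = [0.05, 0.15, 0.05, 0.20, 0.05, 0.15, 0.05, 0.20, 0.10]
focused = [0.02, 0.03, 0.02, 0.02, 0.80, 0.05, 0.02, 0.02, 0.02]

n_spread, h_spread = peak_stats(spread)
n_focused, h_focused = peak_stats(focused, min_height=0.1)
```

In this toy setting the spread-out vector yields many low peaks and the concentrated one a single high peak, which mirrors the contrast described between GRU1 and GRU5.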

5. Word level

Top 10 occurrences

Ref. frequencies, 1-gram: a, <S0>, </S0>, on, of, the, in, with, and, is

Ref. frequencies, 2-gram: <S0> a, on a, in a, of a, with a, a man, in the, on the, next to, <S0> two

GRU1, 1-gram: a, of, cat, dog, is, white, tennis, in, man, with

GRU1, 2-gram: a cat, a man, a dog, a tennis, a giraffe, on a, <S0> a, a white, a table, a plate

GRU5, 1-gram: with, toilet, </S0>, <unk>, skateboard, in, is, sign, and, giraffe

GRU5, 2-gram: a skateboard, a toilet, a giraffe, living room, a baseball, stop sign, baseball player, a kitchen, fire hydrant, a laptop

• Word bigrams for GRU1 and GRU5 very different from reference frequencies

• GRU5 focuses on the end of the captions (</S0>)
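The reference frequencies can be pictured as plain n-gram counts over the captions, with `<S0>` and `</S0>` marking the start and end of each caption. The sketch below shows one way to obtain such counts; the captions are invented and the helper names are hypothetical. Counts for GRU1 and GRU5 would follow the same pattern, restricted to the words found under attention peaks.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(captions, n, k=10):
    """The k most frequent n-grams over captions wrapped in boundary markers."""
    counts = Counter()
    for caption in captions:
        tokens = ["<S0>"] + caption.split() + ["</S0>"]
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Invented captions, for illustration only.
captions = ["a man on a skateboard", "a cat on a table"]
top_unigrams = top_ngrams(captions, 1)
top_bigrams = top_ngrams(captions, 2)
```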

6. Phone level

Top 10 occurrences ("[" and "]" mark the beginning and end of a word)

GRU1, 1-gram: æ, t], eɪ, n], ŋ], t, ɛ, ɪ, n, aɪ

GRU1, 2-gram: ɪ ŋ], [k æ, [ʌ v], ɔ ɹ, aɪ t], [d ɔ, u m], [dʒ ɝ, l eɪ, l ɑ

GRU5, 1-gram: n], t], ɹ, ʌ, k], d], s], ɝ], z], sp

GRU5, 2-gram: ɔ ɹ, eɪ n], ɪ ŋ], ʌ n], u m], aɪ n], t ɝ], ʌ l], k ʌ, ʌ s]

• GRU5 peaks mainly located above phones ending a word: cue for word segmentation?
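One way to check such an observation is to look up which phone lies under each peak in a time-aligned transcription and count how often that phone ends a word. The sketch below assumes an alignment given as (start, end, phone, word_final) tuples in seconds; this format, the function names and the example values are hypothetical and are not taken from the poster.

```python
def phone_under_peak(peak_time, alignment):
    """Return (phone, word_final) for the interval containing `peak_time`, or None.

    alignment: list of (start, end, phone, word_final) tuples, times in seconds
    """
    for start, end, phone, word_final in alignment:
        if start <= peak_time < end:
            return phone, word_final
    return None

def word_final_ratio(peak_times, alignment):
    """Share of peaks that fall on a word-final phone (peaks outside speech are ignored)."""
    hits = [phone_under_peak(t, alignment) for t in peak_times]
    hits = [h for h in hits if h is not None]
    if not hits:
        return 0.0
    return sum(1 for _, final in hits if final) / len(hits)

# Hypothetical alignment for the words "a cat": ʌ | k æ t
alignment = [
    (0.00, 0.10, "ʌ", True),
    (0.10, 0.18, "k", False),
    (0.18, 0.30, "æ", False),
    (0.30, 0.40, "t", True),
]
peak_times = [0.35, 0.20, 0.05]
ratio = word_final_ratio(peak_times, alignment)
```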

7. Conclusion

• GRU5 highlights speech segments that match word bigrams and isolated words corresponding to the main concept of the image.

• Behaviour of the first attention mechanism (GRU1) is still unclear, and it is hard to tell so far which linguistic units are targeted by this attention mechanism. However, removing the attention layer on GRU1 yields worse results, suggesting it does bring useful information to the network.

• Further linguistic units could be considered, such as parts of speech (POS) and syllables.

• Diachronic study: how does the distribution of the words under the peaks evolve over time?

References

[1] Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. Representations of language in a model of visually grounded speech signal. arXiv:1702.01991 [cs], February 2017.

[2] Sven L. Mattys, Laurence White, and James F. Melhorn. Integration of Multiple Speech Segmentation Cues: A Hierarchical Framework. Journal of Experimental Psychology: General, 134(4):477–500, 2005.

This work was supported by grants from NeuroCoG IDEX UGA in the framework of the "Investissements d’avenir" program (ANR-15-IDEX-02)
