
HAL Id: hal-00734464

https://hal.inria.fr/hal-00734464

Submitted on 22 Sep 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis

Ingmar Steiner, Korin Richmond, Slim Ouni

To cite this version:

Ingmar Steiner, Korin Richmond, Slim Ouni. Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis. 3rd International Symposium on Facial Analysis and Animation - FAA 2012, Sep 2012, Vienna, Austria. ⟨hal-00734464⟩


Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis

Ingmar Steiner, University College Dublin & Trinity College Dublin (ingmar.steiner@ucd.ie)

Korin Richmond, University of Edinburgh (korin@cstr.ed.ac.uk)

Slim Ouni, Université de Lorraine, LORIA, UMR 7503 (slim.ouni@loria.fr)

1 Introduction

The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.

2 Articulatory animation

To complement a purely data-driven AV synthesizer employing bimodal unit-selection [Musti et al. 2011], we have implemented a framework for articulatory animation [Steiner and Ouni 2012] using motion capture of the hidden articulators obtained through electromagnetic articulography (EMA) [Hoole and Zierdt 2010]. One component of this framework compiles an animated 3D model of the tongue and teeth as an asset usable by downstream components or an external 3D graphics engine. This is achieved by rigging static meshes with a pseudo-skeletal armature, which is in turn driven by the EMA data through inverse kinematics (IK). Subjectively, we find the resulting animation to be both plausible and convincing. However, this has not yet been formally evaluated, and so the motivation for the present paper is to conduct an objective analysis.
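The framework code itself is not reproduced in this abstract. Purely as an illustration of how EMA samples can drive an armature through IK, the following minimal numpy sketch treats each coil position as an IK target for a small bone chain and solves it with cyclic coordinate descent (CCD); the chain layout, solver choice, and all names are illustrative assumptions, not the framework's actual implementation.

    import numpy as np

    def ccd_ik(joints, target, iterations=10):
        """Cyclic coordinate descent: rotate each bone so that the chain tip
        approaches the target position (here, one EMA coil sample).
        Rotations preserve the bone lengths of the pseudo-skeletal chain."""
        joints = [np.asarray(j, dtype=float) for j in joints]
        for _ in range(iterations):
            for i in reversed(range(len(joints) - 1)):
                to_tip = joints[-1] - joints[i]
                to_tgt = np.asarray(target, dtype=float) - joints[i]
                axis = np.cross(to_tip, to_tgt)
                norm = np.linalg.norm(axis)
                if norm < 1e-9:
                    continue
                axis /= norm
                cos_a = np.clip(np.dot(to_tip, to_tgt) /
                                (np.linalg.norm(to_tip) * np.linalg.norm(to_tgt)),
                                -1.0, 1.0)
                angle = np.arccos(cos_a)
                # Rodrigues' rotation matrix about 'axis' by 'angle'
                K = np.array([[0, -axis[2], axis[1]],
                              [axis[2], 0, -axis[0]],
                              [-axis[1], axis[0], 0]])
                R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
                # rotate every joint downstream of joint i rigidly about joint i
                for j in range(i + 1, len(joints)):
                    joints[j] = joints[i] + R @ (joints[j] - joints[i])
        return joints

    # Hypothetical two-bone "tongue tip" chain (root to tip, coordinates in mm)
    chain = [np.zeros(3), np.array([15.0, 0.0, 0.0]), np.array([30.0, 0.0, 0.0])]
    # Synthetic stand-in for an EMA tongue-tip trajectory (one 3D sample per frame)
    trajectory = [np.array([26.0, 3.0 * t, 2.0]) for t in range(5)]
    for coil_position in trajectory:
        # reusing the previous solution as the starting pose keeps the motion smooth
        chain = ccd_ik(chain, coil_position)
        # ...the solved joint positions would then deform the skinned tongue mesh...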

3 Multimodal speech production data

The mngu0 articulatory corpus¹ contains a large set of 3D EMA data [Richmond et al. 2011] from a male speaker of British English, as well as volumetric magnetic resonance imaging (MRI) scans of that speaker’s vocal tract during sustained speech production [Steiner et al. 2012]. Using the articulatory animation framework, static meshes of dental cast scans and the tongue (extracted from the MRI subset of the mngu0 corpus) can be animated using motion capture data from the EMA subset, providing a means to evaluate the synthesized animation on the generated model (Figure 1).

¹ Freely available for research purposes from http://mngu0.org/

Figure 1: Animated articulatory model in bind pose, with and without maxilla; EMA coils rendered as spheres.

4 Evaluation

In order to analyze the degree to which the animated articulators match the shape and movements captured by the natural speech production data, several approaches are described.

• The positions and orientations of the IK targets are dumped to data files in a format compatible with that of the 3D articulograph. This allows visualization and direct comparison of the animation with the original EMA data, using external analysis software.

• The distances of the EMA-controlled IK targets to the surfaces of the animated articulators should ideally remain close to zero during deformation. Likewise, there should be collision with a reconstructed palate surface, but no penetration (see the sketch after this list).

• A tongue mesh extracted from a volumetric MRI scan in the mngu0 data, when deformed to a pose corresponding to a given phoneme, should assume a shape closely resembling the vocal tract configuration in the corresponding volumetric scan.
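As a rough sketch of the second check, assuming the deformed articulator mesh is available as a per-frame vertex array and the EMA-controlled IK targets as per-frame coil positions, the distance of each coil to the nearest mesh vertex (a simple vertex-based stand-in for true point-to-surface distance) could be computed as follows; the function and variable names are illustrative, not part of the framework.

    import numpy as np
    from scipy.spatial import cKDTree

    def coil_to_surface_error(mesh_vertices_per_frame, coil_positions_per_frame):
        """Per frame, distance of each EMA-controlled IK target to the nearest
        vertex of the deformed articulator mesh (approximates point-to-surface
        distance; a finer check would use point-to-triangle distances)."""
        errors = []
        for vertices, coils in zip(mesh_vertices_per_frame, coil_positions_per_frame):
            distances, _ = cKDTree(np.asarray(vertices)).query(np.asarray(coils))
            errors.append(distances)
        return np.asarray(errors)   # shape: (frames, coils), same units as the meshes

    # Synthetic example: two frames, a three-vertex "mesh", two coils sitting on it
    frames = [np.array([[0.0, 0, 0], [10, 0, 0], [0, 10, 0]])] * 2
    coils = [np.array([[0.1, 0, 0], [9.9, 0, 0]])] * 2
    print(coil_to_surface_error(frames, coils).max())   # should stay close to zero

The palate check could reuse the same nearest-point queries together with a signed test along the palate's surface normals, so that contact and penetration can be told apart.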

These evaluation approaches are implemented as unit and integration tests in the corresponding phases of the model compiler’s build lifecycle, automatically producing appropriate reports by which the naturalness of the articulatory animation may be assessed.
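The model compiler's actual build tooling is not shown here. Purely to illustrate how such a check can be phrased as an automated test, a pytest-style sketch with an assumed tolerance and synthetic data standing in for the compiled model might look like this:

    import numpy as np
    from scipy.spatial import cKDTree

    # Illustrative tolerance (mm); not a value given in the paper.
    MAX_COIL_SURFACE_DISTANCE = 2.0

    def nearest_vertex_distance(vertices, points):
        """Distance from each point to the nearest mesh vertex."""
        distances, _ = cKDTree(np.asarray(vertices)).query(np.asarray(points))
        return distances

    def test_coils_stay_on_tongue_surface():
        # Synthetic stand-ins for one animation frame; a real test would load
        # the compiled model and the corresponding EMA trajectories instead.
        vertices = np.array([[0.0, 0, 0], [10, 0, 0], [0, 10, 0]])
        coils = vertices[:2] + 0.1
        assert nearest_vertex_distance(vertices, coils).max() < MAX_COIL_SURFACE_DISTANCE

A failing assertion of this kind would flag a compiled model whose animated surfaces drift away from the measured coil positions.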

References

HOOLE, P., AND ZIERDT, A. 2010. Five-dimensional articulography. In Speech Motor Control: New developments in basic and applied research, B. Maassen and P. van Lieshout, Eds. Oxford University Press, 331–349.

MUSTI, U., COLOTTE, V., TOUTIOS, A., AND OUNI, S. 2011. Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer. In Proc. 10th International Conference on Auditory-Visual Speech Processing (AVSP), 49–55.

RICHMOND, K., HOOLE, P., AND KING, S. 2011. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech, 1505–1508.

STEINER, I., AND OUNI, S. 2012. Artimate: an articulatory animation framework for audiovisual speech synthesis. In Proc. ISCA Workshop on Innovation and Applications in Speech Technology.

STEINER, I., RICHMOND, K., MARSHALL, I., AND GRAY, C. D. 2012. The magnetic resonance imaging subset of the mngu0 articulatory corpus. Journal of the Acoustical Society of America 131, 2 (Feb.), 106–111.
