• Aucun résultat trouvé

ASR performance prediction on unseen broadcast programs using convolutional neural networks

N/A
N/A
Protected

Academic year: 2021

Partager "ASR performance prediction on unseen broadcast programs using convolutional neural networks"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-02088829

https://hal.archives-ouvertes.fr/hal-02088829

Submitted on 3 Apr 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

ASR performance prediction on unseen broadcast programs using convolutional neural networks

Zied Elloumi, Laurent Besacier, Olivier Galibert, Juliette Kahn, Benjamin Lecouteux

To cite this version:

Zied Elloumi, Laurent Besacier, Olivier Galibert, Juliette Kahn, Benjamin Lecouteux. ASR perfor- mance prediction on unseen broadcast programs using convolutional neural networks. ICASSP, 2018, Alberta, Canada. �hal-02088829�

(2)

ASR Performance Prediction on Unseen Broadcast Programs using Convolutional Neural Networks

Zied Elloumi, Laurent Besacier, Olivier Galibert, Juliette Kahn, Benjamin Lecouteux

Contact :

[email protected]

 We used an existing tool named TranscRater for baseline regression approach

 We adapted the TranscRater tool from English to French that requires engineered features to predict the WER performance

 Analysis of predicted WERs

 We presented an evaluation framework for evaluating ASR performance prediction on unseen broadcast programs

 CNNs were very efcient encoding both textual (ASR transcript) and signal to predict WER.

Future work : Analyze the learnt representations of our performance prediction system

 We built our own French ASR system based on the KALDI toolkit

 We used a French data from diferent broadcast collections : ESTER, ETAPE, REPERE, Quaero

 We proposed a new End-to-End performance prediction system based on CNNs.

 Several approaches to encode speech signal

Input : textual, speech signal or the both textual+ speech signal

Output : two proposed methods to predict a continous value ( CNN Softmax and CNN ReLU )

 Distribution of speech turns according to their WER

(a) real (b) Regression

Pred

(c) CNN

Pred

Regression Baseline Regression Baseline

Conclusion Conclusion

Test prediction WER=31.20

59h27 04h42

04h17 100h51

Train prediction WER=22.29

Train Acoustic

Evaluation framework

Evaluation framework Our proposed approach Our proposed approach Abstract

Abstract

2018 IEEE International Conference on Acoustics, Speech and Signal Processing 15–20 April 2018 • Calgary, Alberta, Canada

Automatic Transcription

Feature Extraction SIG LEX LM POS

PP System

 Propose an heterogenous French corpus dedicated to performance prediction task

 Compare two prediction approaches : regression (engineered features) based vs a new strategy based on convolutional neural networks (learnt features).

 The joint use of textual and signal features did not work for the regression baseline while the combination of inputs for CNNs leads to the best WER prediction performance.

 CNN prediction predicts the WER distribution on a collection of speech recordings

Results & Analysis

Results & Analysis

Références

Documents relatifs

This proposed approach is tested on a simulated environment, using a real propagation model based on measurements, and compared with the classi- cal matrix completion approach, based

Recently, deep learning models such as deep belief network (DBN) [6], recurrent neural network (RNN) [7], and long short-term memory (LSTM) [8] have been used for traffic

 pushing forward to weakly supervised learning the problem Medical Image Segmentation of Thoracic Organs at Risk in CT-images..

Modern signal processing techniques previously used in the field of speech recognition have also made significant progress in the field of automatic detection of abnormal voice..

Figure 6: Accuracy stability on normal dataset (for class ”mountain bike”), showing a gradual increase of accuracy with respect to decreasing neurons’ criticality in case of

In general, the spectrogram of bird sound audio is regarded as the input and the model would treat the bird-call identification task as an image classification problem.. This

We present a three-layer convolutional neural network for the classification of two binary target variables ’Social’ and ’Agency’ in the HappyDB corpus exploiting lexical density of

SSN 1 The CLNN model used the given 33 layers of spatial feature maps to predict the taxonomic ranks of species.. Every layer of each tiff image was first center-cropped to a size