Automatic speech recognition


Automatic speech recognition predicts speech intelligibility and comprehension for listeners with simulated age-related hearing loss

Lionel Fontan, Isabelle Ferrané, Jérôme Farinas, Julien Pinquier, Julien Tardieu, Cynthia Magnen, Pascal Gaillard, Xavier Aumont, and Christian Füllgrabe. Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated by an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Method: Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and one comprehension test (responding to oral commands by moving virtual objects) were administered. Several language …
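The study's central question, whether ASR accuracy can stand in for human intelligibility scores, reduces to measuring how well the two track each other across severity levels. A minimal sketch of that comparison; both arrays are illustrative placeholders, not the paper's data:

```python
import numpy as np

# Hypothetical per-condition scores: fraction of words correctly
# repeated by listeners vs. fraction correctly recognized by the ASR
# system, one value per simulated ARHL severity level.
human_scores = np.array([0.95, 0.88, 0.74, 0.55, 0.31])
asr_scores   = np.array([0.97, 0.85, 0.70, 0.52, 0.35])

# Pearson correlation: a high value would indicate that ASR accuracy
# tracks human intelligibility across severity levels.
r = np.corrcoef(human_scores, asr_scores)[0, 1]
print(f"Pearson r = {r:.3f}")
```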

Development of Automatic Speech Recognition Techniques for Elderly Home Support: Applications and Challenges

A rising number of recent projects in the smart-home domain include Automatic Speech Recognition (ASR) in their design [4][5][26][15][16][9][22], and some of them take into account the challenge of Distant Speech Recognition [34][23]. These conditions are more challenging because of ambient noise, reverberation, distortion, and the influence of the acoustic environment. However, one of the main challenges to overcome for successful integration of VUIs is the adaptation of the system to the elderly. From an anatomical point of view, studies have shown age-related degeneration with atrophy of the vocal cords, calcification of the laryngeal cartilages, and changes in the muscles of the larynx [32][24]. Thus, the ageing voice is characterized by specific features such as imprecise production of consonants, tremors, and slower articulation [29]. Some authors [1][37] have reported that classical ASR systems exhibit poor performance on elderly voices. These few studies are valuable for their comparison of ageing versus non-ageing voices on ASR performance, but their fields were quite far from our …

Automatic Speech Recognition for African Languages with Vowel Length Contrast

This paper is organized as follows. In Section 2, we summarize the work done in automatic speech recognition for under-resourced languages and mention some studies dedicated to vowel length contrast modeling in ASR. In Section 3, we describe the data we collected and used for our experiments. Then, Section 4 illustrates the vowel length contrast in Hausa using a large-scale machine-assisted analysis. In Section 5, we compare our ASR systems for Hausa and Wolof with and without a vowel duration model. We also present a simple way to combine length-contrasted and non-length-contrasted CD-DNN-HMM models for ASR. Finally, Section 6 concludes the paper and gives a few perspectives.
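The excerpt announces "a simple way to combine length-contrasted and non-length-contrasted CD-DNN-HMM models" without detailing it here. One common baseline, shown purely as an assumption (not the paper's method), is frame-level linear interpolation of the two models' state posteriors:

```python
import numpy as np

def combine_posteriors(post_lc, post_nlc, lam=0.5):
    """Linearly interpolate frame-level state posteriors from a
    length-contrasted (post_lc) and a non-length-contrasted (post_nlc)
    model. Shapes: (num_frames, num_states). Assumes both models share
    a senone inventory; in practice a state mapping would be needed."""
    combined = lam * post_lc + (1.0 - lam) * post_nlc
    # Renormalize per frame so each row remains a distribution.
    return combined / combined.sum(axis=1, keepdims=True)
```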

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

1 Introduction
While cascade speech-to-text translation (ST) systems operate in two steps, source-language automatic speech recognition (ASR) and source-to-target text machine translation (MT), recent works have attempted to build end-to-end ST without using the source-language transcription during decoding (Bérard et al., 2016; Weiss et al., 2017; Bérard et al., 2018). After two years of extensions to these pioneering works, the latest results of the IWSLT 2020 shared task on offline speech translation (Ansari et al., 2020) demonstrate that end-to-end models are now on par with (if not better than) their cascade counterparts. Such a finding motivates even more strongly the work on multilingual (one-to-many, many-to-one, many-to-many) ST (Gangi et al., 2019; Inaguma et al., 2019; Wang et al., 2020a), for which end-to-end models are well adapted by design. Moreover, these two approaches sit at opposite extremes: the cascade offers a very loose integration of ASR and MT (even if lattices or word confusion networks were used between ASR and MT before end-to-end models appeared), while most end-to-end approaches simply ignore the ASR subtask and try to translate directly from source speech to target text. We believe that these are two edge design choices and that a tighter coupling of ASR and MT is desirable for future end-to-end ST applications, in which the display of transcripts alongside translations can be beneficial to the users (Sperber et al., 2020).
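The architecture implied by the title, one speech encoder feeding two decoders (one producing the transcript, one the translation), can be sketched as below. This is a minimal shared-encoder baseline under assumed hyperparameters, not the paper's exact model, which additionally lets the two decoders attend to each other:

```python
import torch
import torch.nn as nn

class DualDecoderTransformer(nn.Module):
    """Sketch of joint ASR + ST: a shared speech encoder, an ASR decoder
    over the source vocabulary, and an ST decoder over the target
    vocabulary. All sizes are illustrative placeholders."""

    def __init__(self, d_model=256, nhead=4, num_layers=4,
                 src_vocab=5000, tgt_vocab=5000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.st_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.asr_embed = nn.Embedding(src_vocab, d_model)
        self.st_embed = nn.Embedding(tgt_vocab, d_model)
        self.asr_out = nn.Linear(d_model, src_vocab)
        self.st_out = nn.Linear(d_model, tgt_vocab)

    def forward(self, speech, asr_prev, st_prev):
        # speech: (B, T, d_model) acoustic features; causal masks on the
        # decoder inputs are omitted for brevity.
        memory = self.encoder(speech)
        asr_h = self.asr_decoder(self.asr_embed(asr_prev), memory)
        st_h = self.st_decoder(self.st_embed(st_prev), memory)
        return self.asr_out(asr_h), self.st_out(st_h)
```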

Dynamic adjustment of language models for automatic speech recognition using word similarity

Index Terms—ASR, language modeling, OOV, word embeddings, lexicon extension
1. INTRODUCTION
Automatic speech recognition (ASR) systems are often trained on large but static text corpora and with a fixed vocabulary. For a system whose goal is to recognize speech about current events, this poses a problem, since new words are continually introduced as events occur. A particular issue is proper nouns (PNs): the names of newly important people or locations may not be in the vocabulary of the system, but recognizing them can be paramount to understanding the topic. It is not possible to simply find a large enough corpus to cover all of the important words, as novel names will always be introduced into a language. Therefore, a competent ASR system dealing with current events should accommodate adding new words to its vocabulary dynamically. Updating the n-gram language model (LM) of a deep …
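The excerpt stops before describing the method, but the index terms (word embeddings, lexicon extension) suggest the general shape of such approaches: give a new word LM statistics borrowed from semantically similar in-vocabulary words. A hedged sketch of that idea (the function and the k-nearest-neighbour scheme are assumptions, not the paper's algorithm):

```python
import numpy as np

def estimate_oov_unigram(oov_vec, vocab_vecs, unigram, k=5):
    """Assign a unigram probability to an out-of-vocabulary word by
    averaging the unigram probabilities of its k nearest in-vocabulary
    neighbours in word-embedding space (cosine similarity).
    vocab_vecs: dict word -> embedding; unigram: dict word -> prob."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    neighbours = sorted(vocab_vecs,
                        key=lambda w: cos(oov_vec, vocab_vecs[w]),
                        reverse=True)[:k]
    return sum(unigram[w] for w in neighbours) / len(neighbours)
```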

Parallel Recognizer Algorithm for Automatic Speech Recognition

NRC Publications Archive record: https://nrc-publications.canada.ca/eng/view/object/?id=860b4263-410a-4a17-b71b-3df920903733


A Soft Computing Approach for On-Line Automatic Speech Recognition in Highly Non-Stationary Acoustic Environments

When we consider human-computer communication, as in ASR and machine dialog systems, it is essential to monitor the audio streams for source-speaker, background-acoustic-environment, and channel changes, since these represent significant challenges to maintaining the performance of an ASR system. Nevertheless, it is hard to design such a human-like, environment-aware, intelligent speech recognizer that explores the nature of the noise [12]. Very few algorithms in the current literature have been shown to monitor and track acoustic environments properly and to analyze their noise on-line, so as to adapt the acoustic models of an ASR system to changing conditions.
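The first building block such a system needs is an on-line estimate of the noise itself. A minimal sketch of one standard approach, recursive smoothing of the noise power spectrum during non-speech frames; this is a generic technique, not the paper's soft-computing method:

```python
import numpy as np

def update_noise_estimate(noise_psd, frame_psd, is_speech, alpha=0.95):
    """Recursively smooth the noise power spectral density (PSD):
    update only during non-speech frames, so the estimate tracks a
    changing acoustic environment without absorbing speech energy.
    noise_psd, frame_psd: (num_bins,) arrays; alpha: smoothing factor."""
    if is_speech:
        return noise_psd
    return alpha * noise_psd + (1.0 - alpha) * frame_psd
```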

Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof

6. Conclusion
This paper presented the data collected and the ASR systems developed for four sub-Saharan African languages (Swahili, Hausa, Amharic, and Wolof). All data and scripts are available online in our GitHub repository. More precisely, we focused on the Wolof language, explaining our text and speech collection methodology. We trained two language models: one from data we already owned, and another with the addition of data crawled from the Web. Finally, we presented the first ASR system ever built for this language. The system that obtains the best score is the one using LM2 and DNNs, with a WER of 27.21%.
Perspectives. In the short run, we intend to improve the quality of LM2 by using neural networks. We are also currently working on a duration model for the Wolof and Hausa ASR systems.

Analysis and modeling of non-native speech for automatic speech recognition

Since we only have manual word transcriptions for the utterances in the corpus, we create forced phonetic transcriptions, or forced paths: we begin with a set of exi…
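The core of producing such forced paths is a constrained Viterbi pass: the phone sequence is fixed by the transcription, and only the frame-to-phone boundaries are free. A self-contained sketch of that dynamic program (the thesis itself would use a full ASR toolkit; this illustration assumes frame log-probabilities are already computed and that there are at least as many frames as phones):

```python
import numpy as np

def forced_align(log_probs, phones):
    """Viterbi forced alignment. log_probs[t, i] is the log-probability
    of frame t under the i-th phone of the known transcription `phones`.
    Returns, for each frame, the index of the phone it is aligned to."""
    T, N = log_probs.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = log_probs[0, 0]          # must start in the first phone
    for t in range(1, T):
        for i in range(N):
            stay = score[t - 1, i]                       # remain in phone i
            advance = score[t - 1, i - 1] if i > 0 else -np.inf  # move on
            if advance > stay:
                score[t, i], back[t, i] = advance + log_probs[t, i], i - 1
            else:
                score[t, i], back[t, i] = stay + log_probs[t, i], i
    # Trace back from the final phone at the last frame.
    path, i = [N - 1], N - 1
    for t in range(T - 1, 0, -1):
        i = back[t, i]
        path.append(i)
    return path[::-1]
```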


Natural Interaction for Serious Games: Enhancing Training Simulators through Automatic Speech Recognition

Publisher's version: 2010 Interservice/Industry Training, Simulation and Education Conference (I/ITSEC 2010) Proceedings, 2010-12-02. NRC Publications Archive record.


Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition

Since the phase difference is defined modulo 2π only, we use its cosine and sine as features, as was done in [17] for speaker localization and in [9] for speech separation. We refer to these features as cosine-sine interchannel phase difference (CSIPD) features. These features are given as inputs, along with the magnitude spectrum of $\hat{c}_{j,\mathrm{DS}}$, to train a neural network to estimate a mask. We highlight the fact that the dimension of the input features of the mask estimation network does not depend on the number of microphones, since the dimension of the CSIPD features is the same after DS beamforming for any number of microphones. In theory, we can use the same network for any number of microphones in the array.
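Computing CSIPD features is essentially a one-liner on complex STFTs. A sketch, with the function name and array shapes assumed for illustration:

```python
import numpy as np

def csipd(stft_ch1, stft_ch2):
    """Cosine-sine interchannel phase difference (CSIPD) features.
    Inputs are complex STFTs of shape (frames, bins) for two channels;
    the output stacks cos and sin of the phase difference along the
    last axis, giving shape (frames, 2 * bins)."""
    ipd = np.angle(stft_ch1) - np.angle(stft_ch2)
    # cos/sin remove the modulo-2*pi ambiguity of the raw phase difference.
    return np.concatenate([np.cos(ipd), np.sin(ipd)], axis=-1)
```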

Belief Hidden Markov Model for speech recognition

Index Terms—Speech recognition, HMM, belief functions, belief HMM.
I. INTRODUCTION
Automatic speech recognition is a field of science that attracts wide public attention: who has never dreamed of talking with a machine, or at least of controlling an apparatus or a computer by voice? Speech processing comprises two major disciplines: speech recognition and speech synthesis. Automatic speech recognition allows the machine to understand and process oral information provided by a human. It uses matching techniques to compare a sound wave to a set of samples, generally composed of words or sub-words. Conversely, automatic speech synthesis allows the machine to produce the speech sounds of a given text. Nowadays, most speech recognition systems are based on the modelling of speech units known as acoustic units. Indeed, speech is composed of a sequence of elementary sounds, and these sounds put together make up words. From these units, one seeks to derive a model (one model per unit), which is then used to recognize the continuous speech signal. Hidden Markov Models (HMMs) are very often used to recognize these units. HMM-based recognizers are a widely used technique that can recognize about 80% of a given speech signal, but this recognition rate is still not satisfactory. Moreover, the method needs many hours of speech for training, which makes automatic speech recognition very expensive.
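For reference, the classical decoding step these HMM-based recognizers rely on is the Viterbi algorithm; the Belief HMM replaces its probabilities with belief functions, which this sketch does not attempt. A standard log-domain Viterbi over one acoustic unit:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Standard Viterbi decoding for an HMM.
    log_init: (S,) initial state log-probabilities.
    log_trans: (S, S) transition log-probabilities (from -> to).
    log_emit: (T, S) frame-level emission log-probabilities.
    Returns the most likely state sequence."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans   # cand[i, j]: best score into j via i
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_emit[t]
    # Trace back from the best final state.
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]
```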

Automatic Assessment of Speech Capability Loss in Disordered Speech

1. INTRODUCTION
In speech disorders, an assessment of a person's communication ability is often needed to complement clinical evaluations at the physiological level. However, assessing speech abilities is very time-consuming, which does not fit well within clinical practice. From this perspective, automatic tools constitute convenient solutions for gathering information about each person's speech impairments. Such tools have been developed and are broadly used in Computer-Assisted Language Learning (CALL) systems, with early works reported in the 1990s [Bernstein et al. 1990]. In order to evaluate nonnative pronunciation at the segmental level (individual error detection) and/or at the suprasegmental level (overall pronunciation assessment), these tools rely on Automatic Speech Recognition (ASR) techniques [Eskenazi 2009]. Concerning individual error detection, several approaches are used to identify phoneme mispronunciations. They range from the analysis of raw recognition scores [Sevenster et al. 1998], likelihood ratios such as native-likeness, and Goodness of Pronunciation (GOP), to scores derived from classification methods such as linear discriminant analysis and the like [Strik et al. 2007]. Contrary to native-likeness scores, which rely on comparing speakers' productions with nonnative acoustic models, the GOP algorithm makes use of native phone models only. The algorithm calculates a ratio representing the likelihood that a phone is the realization of a specific phoneme in the target language [Witt 1999; Witt and Young 2000]. Since GOP scores rely solely on native phone models, their scope need not be limited to assessing foreign learners' pronunciation skills, but can extend to all kinds of nontypical speech productions, such as disordered speech.
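For reference, the GOP score cited above is usually defined (Witt 1999; Witt and Young 2000) as a duration-normalised log-likelihood ratio computed from native phone models only:

```latex
\mathrm{GOP}(p) \;=\; \frac{1}{N_f(p)}
\left|\,\log \frac{P\!\left(\mathbf{O}^{(p)} \mid p\right)}
                  {\max_{q \in Q} P\!\left(\mathbf{O}^{(p)} \mid q\right)}\,\right|
```

where $\mathbf{O}^{(p)}$ is the acoustic segment aligned to phone $p$, $Q$ is the native phone inventory, and $N_f(p)$ is the number of frames in the segment.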

Semantic Context Model for Efficient Speech Recognition

Introduction
An automatic speech recognition (ASR) system contains three main parts: an acoustic model, a lexicon, and a language model. ASR in noisy environments is still a challenging goal because the acoustic information is unreliable, which decreases recognition accuracy. A better language model gives only limited performance improvement, since it mainly models local syntactic information. In this paper, we propose a new semantic model that takes long-term semantic context into account and thus resolves the acoustic ambiguities of noisy ASR.
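The excerpt does not detail the model, but the simplest way to see how long-span semantic context can break acoustic ties is N-best rescoring by semantic coherence. A purely illustrative sketch (the scoring scheme is an assumption, not the paper's model):

```python
import numpy as np

def semantic_coherence(words, embeddings):
    """Average pairwise cosine similarity of the word vectors in a
    hypothesis: a crude long-span semantic score that can be combined
    with acoustic and language-model scores to rerank noisy N-best
    lists. `embeddings` maps words to vectors; OOV words are skipped."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if len(vecs) < 2:
        return 0.0
    vecs = np.array(vecs, dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(vecs)
    # Average over off-diagonal entries only (self-similarity is 1).
    return float((sims.sum() - n) / (n * (n - 1)))
```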


Improving Speech Recognition through Automatic Selection of Age Group Specific Acoustic Models

5 Conclusions and Future Work
This paper presented an age-group classification system that automatically determines the age group of a speaker from an input speech signal. There were three possible age groups: children, young to middle-aged adults, and the elderly. What sets our study apart from other studies on age classification is that we used our age-group classifier together with an automatic speech recogniser. More specifically, we carried out ASR experiments in which the automatically determined age group of a speaker was used to select age-group-specific acoustic models, i.e., acoustic models optimised for children's, young to middle-aged adults', and elderly people's speech. The ASR results showed that using the output of the age-group classifier to select age-group-specific acoustic models for children and the elderly leads to considerable gains in automatic speech recognition performance, compared with using "standard" acoustic models trained on young to middle-aged adults' speech. This finding can be used to improve the speech recognition performance of speech-enabled applications that are used by people of widely varying ages. What makes the approach particularly interesting is that it is a user-friendly alternative to speaker adaptation, which requires the user to spend time training the system.
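The described pipeline, classify the speaker's age group, then route the utterance to the matching acoustic model, amounts to a simple dispatch step. A sketch with hypothetical names (the `classifier` and `models` interfaces are assumptions, not the paper's API):

```python
def select_acoustic_model(speech_signal, classifier, models):
    """Route an utterance to an age-group-specific acoustic model.
    `classifier.predict` is assumed to return one of 'children',
    'adults', or 'elderly'; `models` maps each group to its model."""
    group = classifier.predict(speech_signal)
    # Fall back to the standard adult-trained model for unknown groups.
    return models.get(group, models["adults"])
```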

Amharic Speech Recognition for Speech Translation

michael.melese@aau.edu.et, laurent.besacier@imag.fr, million.meshesha@aau.edu.et
ABSTRACT
State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition, Statistical Machine Translation, and Text-To-Speech synthesis. In this study, an attempt is made to experiment with Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since no Amharic speech corpus existed, we developed a 7.43-hour read-speech corpus in the tourism domain. The Amharic speech corpus was recorded, after translating the standard Basic Traveler Expression Corpus (BTEC), in a normal working environment. In our ASR experiments, phoneme and syllable units are used for acoustic models, while morphemes and words are used for language models. Encouraging ASR results are achieved using morpheme-based language models and phoneme-based acoustic models, with recognition accuracies of 89.1%, 80.9%, 80.6%, and 49.3% at the character, morpheme, word, and sentence levels, respectively. We are now working towards designing Amharic-English speech translation by cascading components with different error-correction algorithms.


Using Speech for Handwritten Mathematical Expression Recognition Disambiguation

More recently, the speech recognition community has become interested in the problem of mathematical expression recognition (MER) using automatic speech recognition (ASR) [3], [4]. Most of the works rely on an ASR system that provides a basic automatic transcription of the speech signal. The latter is then sent to a parsing module that converts the plain text describing the mathematical expression (1D) into its mathematical notation (2D) [5], [4]. Here again, the resulting systems are far from one hundred percent reliable. In addition to the errors arising during the recognition step (common to all ASR systems), the transition from the textual description of the expression to its 2D notation is not obvious at all (Fig. 1). The example in Fig. 1 not only shows cases where both systems fail, but also that the two modalities are complementary. One can see this complementarity inasmuch as the problems encountered by the two modalities are of different kinds, so the information missing in one modality is generally available in the other.

The Airbus Air Traffic Control speech recognition 2018 challenge: towards ATC automatic transcription and call sign detection

… limited vocabulary, was shown to be easier to transcribe than Approach and Tower speech interactions. Transcribing pilots' speech was found to be twice as hard as controllers' speech. Some participants attempted to use external ATC speech data for semi-supervised acoustic model training, but this turned out to be unsuccessful. This technique usually brings performance gains, as in [22]; the failure may be due to the fact that the evaluation subset is very close to the training one, so that adding external data just adds noise. This outcome reveals a robustness issue that needs to be addressed. A large-scale speech data collection is very much needed to solve ATC ASR. Several criteria should be considered for this data collection: diversity in the airports where speech is collected, diversity in foreign accents, and diversity in the acoustic devices used for ATC, among others.

Consonant landmark detection for speech recognition

Stevens first defines three features that specify the broad classes of segments: consonants, vowels, and glides. Consonantal segments can be further classified with additional distinctive features: sonorant, continuant, and strident. The sonorant feature contrasts sounds produced with spontaneous vibration of the vocal folds against those produced with suppressed vibration or without vibration. The continuant feature distinguishes speech sounds produced with a complete closure inside the oral tract from those produced with a narrow constriction, which results in turbulence noise during the sound. The strident feature contrasts the continuant non-sonorant sounds based on their high-frequency amplitude: when the cavities and obstacles around the constriction are positioned such that the spectrum amplitude in the high-frequency region is higher than that of the adjacent vowel, the sound is said to be strident. When a speech sound has the characteristic defined by a feature, the feature is written with a + sign in front of it, and when a speech sound lacks the charac…
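The feature system described above is naturally represented as a small table of feature bundles. An illustrative encoding for a few consonants (the segment choices are examples, not taken from the excerpt):

```python
# Distinctive-feature bundles following the sonorant/continuant/strident
# contrasts described above: '+' marks a feature the segment has,
# '-' one it lacks.
FEATURES = {
    "m":  {"sonorant": "+", "continuant": "-", "strident": "-"},  # nasal
    "t":  {"sonorant": "-", "continuant": "-", "strident": "-"},  # stop
    "s":  {"sonorant": "-", "continuant": "+", "strident": "+"},  # strident fricative
    "th": {"sonorant": "-", "continuant": "+", "strident": "-"},  # non-strident fricative
}

def is_strident(segment):
    """Strident only applies to continuant non-sonorant segments."""
    f = FEATURES[segment]
    return f["sonorant"] == "-" and f["continuant"] == "+" and f["strident"] == "+"
```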
