A non linear analysis for clean and noisy speech

Jean Rouat, Yong Chun Liu and Sylvain Lemieux

Dépt des sciences appliquées, Université du Québec à Chicoutimi,

CHICOUTIMI, Québec, Canada, G7H 2B1

1. Introduction

Research in speech analysis is recognized as an important aspect of speech processing, with applications in speech coding, speech recognition, etc. Depending on the application, the speech analyzer has to extract the most appropriate parameters. This paper focuses on the problem of speech analysis with possible applications in speech recognition. It is known that speaker-independent recognition of continuous speech is a very complicated task which has not yet been fully mastered. The better the quality of the analysis, the easier it becomes to recognize what has been spoken. The automatic "demodulation" of speech with nonlinear operators, based on perceptive knowledge, is a problem which has not yet been fully addressed, and speech "demodulation" might assist the researcher in the understanding of speech and/or in the design of a simple and efficient speech analysis.

2. Modulation

Research work on automatic demodulation of speech can be justified by the hypothesis that the human brain has neural cells specialized in Amplitude Modulation (AM) and Frequency Modulation (FM) detection [1] [7] [8].

It is known from previous studies (see [5], for example) that fluctuations in the envelope, rather than in the fine structure, of the speech signal carry the information regarding speech intelligibility. The fluctuations of the envelope of a speech signal depend on the relative frequency, the relative energy and the relative phase of the formants (for voiced speech). This again suggests an analysis based on a model of speech "modulation".
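As a rough illustration of this dependence (not the authors' model), consider the simplest case of two sinusoidal components falling in the same analysis band. Their sum has the envelope sqrt(a1^2 + a2^2 + 2 a1 a2 cos(2 pi (f1 - f2) t + p1 - p2)), which depends only on the relative amplitude, relative frequency and relative phase of the two components. The function names and the numerical values (two harmonics of a 130 Hz pitch near 2300 Hz) are assumptions chosen for illustration.

```python
import math

def two_component_signal(a1, f1, p1, a2, f2, p2, t):
    """Sum of two sinusoids with amplitudes a, frequencies f (Hz), phases p (rad)."""
    return (a1 * math.cos(2.0 * math.pi * f1 * t + p1)
            + a2 * math.cos(2.0 * math.pi * f2 * t + p2))

def two_component_envelope(a1, f1, p1, a2, f2, p2, t):
    """Envelope of the two-component sum at time t (seconds)."""
    dphi = 2.0 * math.pi * (f1 - f2) * t + (p1 - p2)
    return math.sqrt(a1 * a1 + a2 * a2 + 2.0 * a1 * a2 * math.cos(dphi))

# Hypothetical example: 17th and 18th harmonics of a 130 Hz pitch.
PITCH = 130.0
ARGS = (1.0, 17 * PITCH, 0.0, 0.6, 18 * PITCH, 0.0)

env_at_peak = two_component_envelope(*ARGS, 0.0)                   # components in phase
env_at_trough = two_component_envelope(*ARGS, 1.0 / (2.0 * PITCH))  # components in opposition
```

With these values the envelope beats at the 130 Hz difference frequency, swinging between a1 + a2 and |a1 - a2| once per pitch period. Changing the relative phase p1 - p2 shifts the position of the pulses within the period without changing their shape.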

3. Nonlinear operators

Recently, Kaiser [2] proposed a nonlinear operator (called the Teager energy operator) able to extract the energy of a signal based on mechanical and physical considerations. It has been shown [3] that this operator is able to track the envelope of an amplitude- or frequency-modulated signal.
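A minimal sketch of the discrete form of this operator, psi[x(n)] = x(n)^2 - x(n-1) x(n+1), may help fix ideas. For a pure tone A cos(Omega n + phi) the output is exactly A^2 sin^2(Omega); for a slowly amplitude-modulated tone it approximately follows the squared envelope. The sampling rate, carrier and modulation frequencies below are assumptions chosen for illustration, not values from the experiments reported here.

```python
import math

def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]**2 - x[n-1]*x[n+1], interior samples only."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Illustrative values (assumptions, not taken from the paper's experiments)
FS = 8000.0                           # sampling rate in Hz
OMEGA = 2.0 * math.pi * 2300.0 / FS   # carrier frequency in rad/sample
BETA = 2.0 * math.pi * 130.0 / FS     # modulation frequency in rad/sample
A = 0.5                               # amplitude of the unmodulated tone

def am_envelope(n):
    """Slowly varying amplitude used for the AM test signal."""
    return 1.0 + 0.5 * math.cos(BETA * n)

# Pure tone: the operator output is constant and equals A^2 * sin^2(OMEGA).
tone = [A * math.cos(OMEGA * n + 0.3) for n in range(200)]
psi_tone = teager_energy(tone)

# AM tone: the output approximately follows the squared envelope times sin^2(OMEGA).
am = [am_envelope(n) * math.cos(OMEGA * n) for n in range(200)]
psi_am = teager_energy(am)
expected_am = [am_envelope(n) ** 2 * math.sin(OMEGA) ** 2 for n in range(1, 199)]
```

Each output sample costs two multiplications and one subtraction, and it depends on only three consecutive input samples, which is what gives the operator its fine time resolution.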

Another nonlinear operator has been proposed [6] (called "Dyn") based on perceptive considerations. This operator shows a surprising ability to enhance the FM-AM modulation in speech, and can be compared to the Teager energy operator. One could consider that the Teager operator is the envelope of the Dyn operator for vowels. Figure 1 illustrates this idea. Depending on the application, one can use the Dyn or the Teager operator. The top section of Fig. 1 shows four pitch periods taken from the original speech (ftf) of a male speaker (pitch frequency = 130 Hz). The second section presents the band-pass-filtered speech [4] with a center frequency of 2300 Hz. The third section shows the output of the Dyn operator on the band-pass-filtered speech. The fourth section illustrates the output of the Teager energy operator on the same band-pass-filtered speech. The Dyn and the Teager operators show the modulated energy pulses characteristic of the speech signal.

Is there a link between these pulses and the formants? Are these pulses characteristic of a nonlinear coupling between the glottal source and the vocal tract? Is it possible to use these nonlinear operators to extract information suitable for speech recognition? Most of these questions do not yet have a clear answer. This paper proposes an analysis which can be used to answer the above questions.

4. The analysis

The current version of the analyzer comprises a bank of twenty filters from 300 Hz to 3300 Hz. These filters simulate the frequency analysis performed by the cochlea. They are rounded exponential filters with the Equivalent Rectangular Bandwidths (ERB) proposed by Moore and Glasberg [4]. The output of each filter is then processed by either the Dyn operator or the Teager energy operator, depending on the application; see [6] for more details.
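The filter bank can be sketched as follows, using the formulae of [4]: ERB = 6.23 f^2 + 93.39 f + 28.52 (in Hz, with f in kHz) and the ERB-rate scale 11.17 ln((f + 0.312) / (f + 14.675)) + 43.0. The rounded exponential weighting W(g) = (1 + p g) exp(-p g), with g the normalized deviation from the center frequency and p = 4 fc / ERB, is the usual roex(p) form. Spacing the twenty center frequencies uniformly on the ERB-rate scale, with the end filters centered exactly at 300 Hz and 3300 Hz, is an assumption made here for illustration; the text does not state how the filters are spaced. All function names are hypothetical.

```python
import math

def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency fc_hz."""
    f = fc_hz / 1000.0
    return 6.23 * f * f + 93.39 * f + 28.52

def erb_rate(fc_hz):
    """Position on the ERB-rate scale for a frequency in Hz."""
    f = fc_hz / 1000.0
    return 11.17 * math.log((f + 0.312) / (f + 14.675)) + 43.0

def erb_rate_to_hz(e):
    """Inverse of erb_rate: frequency in Hz for a given ERB-rate value."""
    r = math.exp((e - 43.0) / 11.17)
    return 1000.0 * (14.675 * r - 0.312) / (1.0 - r)

def center_frequencies(f_low=300.0, f_high=3300.0, n=20):
    """n center frequencies, uniformly spaced on the ERB-rate scale (assumed spacing)."""
    e_low, e_high = erb_rate(f_low), erb_rate(f_high)
    step = (e_high - e_low) / (n - 1)
    return [erb_rate_to_hz(e_low + k * step) for k in range(n)]

def roex_weight(f_hz, fc_hz):
    """Rounded exponential power weighting of frequency f_hz for a filter centered at fc_hz."""
    p = 4.0 * fc_hz / erb_hz(fc_hz)
    g = abs(f_hz - fc_hz) / fc_hz
    return (1.0 + p * g) * math.exp(-p * g)

cfs = center_frequencies()
bandwidths = [erb_hz(fc) for fc in cfs]
```

The weighting equals 1 at the center frequency and decays on both sides at a rate set by p, so the filters become broader in Hz as the center frequency increases.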

This model is very simple, and no attempt has been made to represent the exact transduction mechanism from the inner hair cells (for afferent information) or from the outer hair cells (for active feedback). The Dyn operator needs one multiplication and one addition per sample, and the Teager operator two multiplications and one addition per sample.

5. Comments

Figure 2 shows the output of the model for three pitch periods from the transition /d/-/i/ in the syllable "di" from a male speaker.

It can be seen that the formant information is preserved and that high-frequency components are enhanced in comparison with low-frequency components. The pulses of energy are better seen at high frequencies. From Figure 2, it can also be seen that the low-frequency components are not in phase with the high-frequency components.

6. Conclusion

A new analysis has been proposed, based on nonlinear transformations of the signal and on perceptive knowledge. It has been shown that this analysis preserves the formant structure [6] on the frequency dimension and has a very good time resolution. This analysis gives a three-dimensional representation of speech: "mechanical energy", Bark scale and time scale. In comparison with a spectrogram representation, this analysis shows greater redundancy in frequency (Bark scale) and in time (modulation) and consequently should be more resistant to noise.

7. References

[1] Gardner, R.B. and Wilson, J.P. (1979). Evidence for direction-specific channels in the processing of frequency modulation. J. Acoust. Soc. Amer. 66, p. 704.

[2] Kaiser, J.F. (1990). On a simple algorithm to calculate the 'energy' of a signal. Proceedings of IEEE-ICASSP'90, Albuquerque, pp. 381-384.

[3] Maragos, P., Quatieri, T. and Kaiser, J.F. (1991). Speech nonlinearities, modulations and energy operators. Proceedings of IEEE-ICASSP'91, Toronto, pp. 421-424.

[4] Moore, B.C.J. and Glasberg, B.R. (1983). Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Amer. 74, pp. 750-753.

[5] Plomp, R. (1988). Effect of amplitude compression in hearing aids in the light of the modulation-transfer function. J. Acoust. Soc. Amer. 83 (6), pp. 2322-2327.

[6] Rouat, J. (1991). Dyn: a nonlinear operator for speech analysis. Université du Québec à Chicoutimi, dépt des sciences appliquées, internal report.

[7] Tansley, B.W. and Suffield, J.B. (1983). Time course of adaptation and recovery of channels selectively sensitive to frequency and amplitude modulation. J. Acoust. Soc. Amer. 74, p. 765.

[8] Wakefield, G.H. and Viemeister, N.F. (1984). Selective adaptation to linear frequency-modulated sweeps: Evidence for direction-specific FM channels? J. Acoust. Soc. Amer. 75, p. 1588.

Fig. 1: nonlinear operators on band-pass-filtered speech, C.F. 2300 Hz

Fig. 2: three pitch periods of the "mechanical energy" [...] from a male [...] scale