• Aucun résultat trouvé

Design of a variable rate algorithm for CS-ACELP coder

N/A
N/A
Protected

Academic year: 2022

Partager "Design of a variable rate algorithm for CS-ACELP coder"

Copied!
5
0
0

Texte intégral

(1)

Design of a variable rate algorithm for CS-ACELP coder

G. Madre, E.H. Baghious, S. Azou and G. Burel

Laboratoire d’Electronique et Systèmes de Télécommunications - UMR CNRS 6165 6, avenue Le Gorgeu - BP 809 - 29285 BREST cedex - FRANCE

e-mail : [email protected]

Abstract - This paper is about the reduction of the com- putational complexity of the CS-ACELP codec, described in ITU recommendation G.729, and used for the transmis- sion of voice over IP. A Voice Activity Detection module is proposed to replace the G.729 Annex B algorithm. The new procedure was developed to allow its implementation with Number Theoretic Transforms. The use of Fermat Number Transforms can reduce the cost of variable rate algorithm implementation on Digital Signal Processor (DSP).

I. IN T R O D U C T IO N

Recently, the idea of transmitting the voice on Internet, on a large scale, has appeared. Since then, Voice over Internet Protocol (V oIP) has been a matter of very high interest and development. But the generalization of VoIP remains currently confronted with problems of rate.

International Telecommunication Union(IT U)recommenda- tionG.729standardizes the speech coding algorithm at8kbits/s, based on the conjugate-structure algebraic code excited linear prediction (CS-ACELP) and targeted for digital simultaneous voice and data (DSV D) applications [1] [2] . Its relative low complexity makes it an attractive choice for Internet telephony.

To consider the discontinuous voice activity in a conversa- tion, a low bit rate silence compression scheme, defined inG.729 Annex B, can be employed [3]. However, to design an efficient Voice Activity Detector(V AD)infixed-point arithmetic, a new procedure based on work of Aksuet al. [4] is introduced and im- plemented with Number Theoretic Transforms (N T T), which present the following advantages compared to Discrete Fourier Transform(DF T)[5] :

•They require few or no multiplications

•They suppress the use of floating point complex numbers and allow error-free computation

• All calculations are executed on a finite ring of integers, which is interesting for implementation into DSP

Hence, use of Number Theoretic Transform will reduce the delay features, by minimizing the computational complexity.

The special case of Fermat Number Transforms (F N T), with arithmetic carried out modulo Fermat numbers, is particularly appropriate for digital computation. Its application to different functions (filter, correlation) can provide real benefits for low computational complexity.

The rest of the paper is organized as follows. In section II, we will introduce the CS-ACELP coder of ITU recommenda-

P res enta tio n o f th is w o rk w a s p a rtia lly s u p p o rted by th e Fre n ch B ritta ny R eg io n .

tionG.729. In a third part, we will present a variable rate al- gorithm. Section IV presents the concept of Number Theoretic Transform and details, more particularly, the Fermat Number Transform, which will be implemented in Voice Activity Detec- tion procedure. In thefinal part, numerical results for the new algorithm are given.

II. CS-ACELP CO D E R G.729

The analog voice signal, sampled 8000 times per second, is taken as input signal of the coder G.729 [1]. Figure II-1 shows its principal blocks. The coder operates on frames of 10 ms.

For each frame, the speech signal is analyzed to extract the parameters of the CELP model (Linear Prediction (LP) filter coefficients, adaptive andfixed codebook index), which are en- coded and transmitted.

F ig u re I I-1 . P rin c ip a l b lo ck s o f th e c o d e r G .7 2 9

The input signal is high-pass filtered in the preprocessing block. A 10th order linear prediction analysis yields a set of

LP filter coefficients, which are converted to Line Spectrum

Pairs (LSP) and quantized using Vector Quantization (V Q).

The excitation signal is chosen and an open-loop pitch delay is estimated with a perceptually weighted and low-passfiltered speech signal.

The excitation parameters are determined and the gains of the adaptive and fixed codebook contributions are quantized, for subframes of 5 ms each. Finally, the filter memories are updated using the excitation signal.

III. VA R IA B L E RAT E AL G O R IT H M

The problem of discriminating between speech and silence is one of the most difficult problems in speech analysis, due to large dynamic range of the speech signals and degradations of telephone lines. The standard solution to this problem is to use level tests to discriminate silence from speech and improve the decision with some measured features of the signal (energy, pitch calculation, ...).

(2)

F ig u re I I I-1 . S p ee ch C o m m u n ic a tio n sy s tem w ith VA D

Energy measurements and zero-crossings are used in VAD of G.729 annex B [3]. This algorithm performs well at high Signal- to-Noise-Ratio (SNR) levels, however its performance degrades in the presence of noise.

A. Voice Activity Detector

To replace the G.729 algorithm VAD, our study is drawn from previous results of Aksu et al. [4] that we adapted to CS-ACELP codec.

The VAD proposed is based on distance computations between output signals, from different filters, and references.

The algorithm tracks the variation of energy levels overfive sub- bands, belonging to frequency band voice [0,4000] Hz.

The frequency responses of various sub-bands [0,500], [500,1000], [1000,2000], [2000,3000] and [3000,4000] Hz are shown infigure III-2.

0 500 1000 1500 2000 2500 3000 3500

0 0.5 1

0 500 1000 1500 2000 2500 3000 3500

0 0.5 1

0 500 1000 1500 2000 2500 3000 3500

0 0.5 1

0 500 1000 1500 2000 2500 3000 3500

0 0.5 1

0 500 1000 1500 2000 2500 3000 3500

0 0.5 1

F ig u re I I I-2 . Fre q u e n c y R e sp o n se s o f th e5s u b -b a n d s

The algorithm, detailed by the flowchart in figure III-3, re- quires two major steps and a smoothing decision.

Step 1 - The VAD searches two consecutive frames with gains Gk1 and Gk, smaller than a preset value Ga. If the condition is not met, the current frame is labelled as an ordinary CELP frame.

The gain of the kthframe, of the sampled input signals, is calculated as :

Gk= 10 log10 Ã80

X

i=1

sk(i)2

!

(1) For a better comparison, Ga can be made adaptive. If the current frame is classified as a voice frame, a new reference valueGais calculated :

Ga= 0.95Ga+ 0.05 (Gk1−5) (2) The initial value forGa is equal to40dB.

F ig u re I I I-3 . F low ch a rt fo r VA D

Step 2 - If condition 1 is met, an energy parameter is cal- culated for the current frame as follows :

• Computation of the five outputs (denoted sfn) of sub- bands filters, whose impulse responses are noted hn, with n = 1, ...,5. The signalssfn are obtained by a convolution sfn=hn⊗s(⊗denotes the convolution operator).

•Calculation offive energies : En=

X80 i=1

sfn(i)2 withn= 1, ...,5 (3) For each new silent frame (following a voice frame), thefive energies will serve as a reference, denoted Ern. These values are used for all the following frames of silence. In this case, the decision is made compared to a long term gainGltof the signal, with initial value equal to20dB :

Glt= 0.95Glt+ 0.05Gk1 (4)

•Forcount= 0, thekthframe will be labelled as ”silence” if Gk< Glt. For the following silent frame, a distance between energies is calculated as :

D= 10 log10 Ã 5

X

n=1

En

Ern

!

−7 (5)

If the distanceDbetweenEnandErnis smaller than10dB, the current frame is declared as ”silence”.

(3)

Smoothing decision -The initial VAD decision is smoothed to avoid discontinuities in output signal. This smoothing and correction uses the value of Glttoo.

• In afirst case, an active voice decision is extented to the

current frame if the previous frame was active voice frame and the current frame energy is aboveGlt.

•In a second case, an inactive voice decision is extended if the three previous frame are inactive voice and if the current frame energy is underGlt.

B. Sub Rate Coder

If voice activity is detected, all information is transmitted (LSFs, pitch, codebook index, gains) at 8kbits/s [2] . The bit allocation of the speech coder parameters is shown in Table I.

TA B L E I

BI T AL L O C A T I O N O F T H E CO D E R G .7 2 9 S u b fra m e To ta l P a ra m ete rs n1 n2 p e r fra m e

L S F s 1 + 7 + 5 + 5 1 8

A d a p t i v e C o d e b o o k D e l a y 8 5 1 3

P i t c h D e l a y P a r i t y 1 - 1

F i x e d C o d e b o o k I n d e x 1 3 1 3 2 6

F i x e d C o d e b o o k S i g n 4 4 8

C o d e b o o k G a i n s 3 + 4 3 + 4 1 4

To ta l 8 0

For silent frame, LSF and gain information is transmitted only for thefirst frame. The LSF index, obtained by two-stage vector quantization, is represented with 13 bits. The gain is scalar quantized using 7 bits nonuniform quantizer.

These parameters, in the decoder, are repeated for the rest of the region labeled as ”silence” unless the variation of energy, between two consecutive silent frames, is significant. In this case, the gain is transmitted again for a better characterization of the ”comfort noise”.

IV. NU M B E R TH E O R E T IC TR A N S F O R M

To develop the previous VAD and CS-ACELP codec infixed point arithmetic with low computational complexity, we pro- pose to use Number Theoretic Transforms (NTT).

An NTT [5] presents the same form as the DFT but is defined

over finite rings. All arithmetic must be carried out modulo

M. The moduloM may be equal to a prime number or to a multiple of primes, since NTTs are defined over Galois Field (GF(M)). TheNthroot of the unit inC,ejN , is replaced by the Nth root of the unit in GF(M) represented by the term hαiM, whereh.iM denotes the moduloM operation.

An NTT of a discrete time signalxand its inverse are given respectively by :

X(k) =

*N1 X

n=0

x(n)αnk +

M

(6)

x(n) =

* N1

NX1 k=0

X(k)αnk +

M

(7) withn, k= 0,1, ..., N−1. αrepresents the generating term and N the length of the tranform. N being a prime number, there exists an integerN1 such­

N.N1®

M = 1.

Different conditions must be satisfied for an NTT to exist over a Galois FieldGF(M)[6] [7] :

•gcd (α, M) = gcd (N, M) = 1i.e. ­

αN= 1®

M

•N|gcd(pi−1, pj−1),∀(i, j6=i)∈[1, k]2forM=Qk i=1pi

•gcd((αi−1), M) = 1,∀i∈[1, N−1]

(gcd is the greatest common divisor anda|bmeans that the remainder of ab is equal to zero)

For an efficient implementation of Number Theoretic Trans- form on processor, the choice of parameters is important. If possible, the values N and α are chosen as a power of 2 to allow replacement of multiplications by bit shifts.

A. Cyclic Convolution Property

The Number Theoretic Transforms, originaly developed for rapid computation of convolution, have all the Cyclic Convolu- tion Property (CCP) [6] :

U⊗V =TN1{TN(U)•TN(V)} (8) whereU andV represent the two sequences whose convolu- tion is desired, TN an NTT of length N and TN1 its inverse.

The two operators ⊗, •denote the convolution and the term by term multiplication, respectively.

B. Fermat Number Transform

The choice of a modulo equal to a Fermat number [7] , Ft= 22t+ 1witht∈N, offers numerous possibilities for length N of the transform (see Table II). The values ofN andαasso- ciated to a Number Theoretic Transform, defined for a modulo Ft, are given byN= 2t+1iandα= 22i withi < t. The NTT defined over the Galois FieldGF(Ft), is called Fermat Number Transform (FNT) :

X(k) =

*2t+1−iX1 n=0

x(n)22ink +

Ft

(9)

x(n) =

*

−22t−i1(ti)

2t+1−iX1

k=0

X(k)22ink +

Ft

(10) withk, n= 0,1, ...,2t+1i−1.

TA B L E I I

PO S S I B L E CO M B I N A T I O N S O F PA R A M E T E R S O F A N F N T

m o d u lo N N

t F t fo rα= 2 fo rα=

2

1 22+ 1 = 5 4 -

2 24+ 1 = 1 7 8 1 6

3 28+ 1 = 2 5 7 1 6 3 2

4 21 6+ 1 = 6 5 5 3 7 3 2 6 4

5 23 2+ 1 6 4 1 2 8

6 26 4+ 1 1 2 8 2 5 6

A Fermat Number Transform satisfies the Cyclic Convolu- tion Property and requires about Nlog2N simple operations (bit shifts, additions) but no multiplication, while a DFT re- quires a number of multiplications of about Nlog2N. More- over, this transform admits a fast NTT-type computational structure (butterfly implementation).

(4)

C. Convolution analysis

The convolution computation of two sequences ofNsamples, using the Cyclic Convolution Property, requires the use of three NTTs of length2N (the sequences are extended by zeros). To avoid zeros addition, an algorithm is summarized below [8].

To calculate the correlation of x and h, two sequences of length N, the following procedure is applied :

•First, new sequences are formulated fromx:







x1(k) =

½ x(k) 06k < N2 0 N2 6k < N x2(k) =

½ 0 06k < N2 x(k) N2 6k < N

(11)

•Two other sequences, denotedh1 andh2, are in the same way defined fromh, withk= 0, ..., N−1.

•LetX1,X2,H1 andH2, be the respective NTTs ofx1,x2, h1 andh2. Then,X,H are determined as :

½ X(k) =X1(k) +X2(k)

H(k) =H1(k) +H2(k) (12)

•Now, define three new sequences :



U(k) =H(k)X(k) V(k) =H1(k)X1(k) W(k) =H2(k)X2(k)

(13)

• Finally the convolution y = x⊗h is obtained from the inverse NTTs ofU,V andW :







y(n) =v(n) 06n < N2 y(n) =u(n)−w(n) N2 6n < N y(n) =u(n−N)−v(n−N) N6n <3N2 y(n) =w(n) 3N2 6n <2N

(14)

V. NU M E R IC A L RE S U LT S

Numerical simulations have been conducted to evaluate the performances of the presented VAD. A comparison with the standard VAD, described in G.729 annex B, has also been realized under different environmental conditions.

In these tests, we compare the decisions to a database which was hand-labeled as voice or silence.

Averaged results are computed as a function of SNR, across a Monte Carlo simulation consisting of 3000 runs, for a same sentence pronounced by a female and a male talker. The noise is additive, white and gaussian.

0 2000 4000 6000 8000 10000 12000 14000

-3 -2 -1 0 1 2 3x 104

Amplitude

0 2000 4000 6000 8000 10000 12000 14000

0 0.2 0.4 0.6 0.8 1

Samples

Decision

Speech

Silence

F ig u re V -1 . Fem a le Ta lker (u p p e r p lo t) S ile n c e/ Vo ic e Fra m e s (lowe r p lo t)

0 2000 4000 6000 8000 10000 12000 14000

-3 -2 -1 0 1 2 3x 104

Amplitude

0 2000 4000 6000 8000 10000 12000 14000

0 0.2 0.4 0.6 0.8 1

Samples

Decision

Speech

Silence

F ig . V -2 . M a le Ta lke r (u p p e r p lo t) S ile n c e / Vo ice Fra m e s (low e r p lo t)

As shown by Figures V-1 and V-2, the right decisions the VADs are expected to do, for recording extracts without noise, are known in advance.

A. Performances Comparisons

Both VAD algorithms give similar performances for clean speech or high SNR. On the other hand, as the SNR decreases, the difference between the VADs becomes more pronounced.

In Figures V-3 and V-4, the dotted curves represent the stan- dard deviation of the results around the average values shown by the solid curves. The dashed lines correspond to the mis- classification percentage observed and to the theoretical rate for nondisturbed signals.

0 10 20 30

0 2 4 6 8 10 12 14 16 18 20

SNR (dB)

% of misclassification

0 10 20 30

4.2 4.4 4.6 4.8 5 5.2 5.4 5.6 5.8

SNR (dB)

Variable rate (kbits/s)

F ig u re V -3 . C o m p a riso n s o f VA D fo r th e fe m a le vo ice

G .7 2 9 B VA DP ro p o se d VA D

0 10 20 30

0 2 4 6 8 10 12 14 16 18 20

SNR (dB)

% of misclassification

0 10 20 30

5 5.2 5.4 5.6 5.8 6 6.2 6.4

SNR (dB)

Variable rate (kbits/s)

F ig u re V -4 . C o m p a ris o n s o f VA D fo r th e m a le vo ice

G .7 2 9 B VA DP ro p o se d VA D

(5)

The rate evolutions can be explained by the misclassification increases of both VADs. The less the number of voice frames are detected, the more the rate decreases. For SNRs lower than 10dB, the rates are nonsignificant of VAD performances.

B. Implementation of FNT-based VAD

Variousfiltering involved in the described VAD are easily im- plemented using FNTs. The length of thefive sub-bandfilters impulse responses, illustrated by Figure V-5, has been chosen equal to128, with samples16bits quantized.

0 20 40 60 80 100 120

-0.20 0.2

0 20 40 60 80 100 120

-0.2 0 0.2

0 20 40 60 80 100 120

-0.2 0 0.2

0 20 40 60 80 100 120

-0.2 0 0.2

0 20 40 60 80 100 120

-0.2 0 0.2

F ig u re V -5 . Im p u lse R es p o n se s o f th efive su b -b a n d sfilte rs

Comparing with a DFT implementation, the proposed FNT solution, using the convolution computation presented pre- viously, requires a reduced number of multiplications to obtain

the fivefilter outputs. For each filter processing, a frame ad-

mitting gains{Gk1, Gk}smaller than thresholdGa, the FNT algorithm requires about400multiplications against3100for a DFT implentation.

In the other functions of VAD, the computational complexity

offloating point code is slightly smaller, but comparable, than

32bitsfixed-point arithmetic algorithm (a table for logarithm

operations, already existing in the coder G.729, is used) [9].

1 2 3 4 5 6 7 8 9 10

0 0.5 1 1.5 2 2.5

3x 105

Number of operations

Number of computed frames

F ig u re V -6 . O p e ra tio n s re q u ire d fo r th e5filte rs

A lg o rith m w ith D F Tw ith F N T

The computation gain is more significant for the filtering process. In Figure V-6, the operations number, required for

the five filters, is represented in dotted lines for simple opera-

tions (additions, bit shifts) and in solid lines for multiplications.

VI. CO N C L U S IO N

In this paper, an efficient implementation of a variable rate algorithm for CS-ACELP codec is proposed. The input speech frames are classified into voice and silence parts by a novel VAD algorithm, which replaces the G.729 Annex B algorithm. This robust VAD tracks the variation of energy levels offive spectral

sub-bands and makes a decision with moderate complexity. Fol- lowing stages would be to make similar tests for different types of noise (nonstationary noise) and to evaluate the performances of both VADs comparing the subjective speech quality, in the form of Mean Opinion Score (MOS).

Although the selected application is the CS-ACELP coder G.729, the proposed silence detection procedure can be adapted to other types of coder using a VAD.

To limit the cost implementation of the variable rate proce- dure, Number Theoretic Transforms have been used. In parti- cular, it is shown that Fermat Number Transform reduces the computational complexity of different algorithms involved in the VAD procedure.

Moreover, the FNT implementation could benefit to other functions of associated speech coder (autocorrelation [10], LP coefficients computation [9],filters, ...).

RE F E R E N C E S

[1 ] IT U -T R ec o m m en d a tio n . G .7 2 9 , ” C o d in g o f S p e e ch a t 8 k b its / s u s in g C o n ju g a te A lg e b ra ic C o d e -E x c ite d L in e a r P re d ic tio n (C S - A C E L P )” , J u n e 1 9 9 5 .

[2 ] R . S a la m i, C . L afla m m e, B . B e sse tte , J .P. A d o u l, ” IT U -T G .7 2 9 A n n ex A : R ed u c ed c o m p le x ity 8 k b its/ s C S -A C E L P c o d ec fo r d ig ita l sim u lta n e o u s vo ic e a n d d a ta ” , IE E E C o m m u n ica tio n s M a g a z in e , vo l. 3 5 , p p . 5 6 -6 3 , S e p te m b e r 1 9 9 7 .

[3 ] A . B enya ssin e , E . S h lo m o t, H .-Y . S u , ” IT U -T re c o m m en d a tio n G .7 2 9 A n n ex B : A sile n c e c o m p re s sio n sch e m e fo r u se w ith G .7 2 9 o p tim iz ed fo r V .7 0 d ig ita l s im u lta n e o u s vo ic e a n d d a ta a p p lic a tio n s ” , IE E E C o m m u n ic a tio n s M a g ., vo l. 3 5 , p p . 6 4 -7 3 , S e p te m b e r 1 9 9 7 .

[4 ] E .B . A k su , A .E . E rta n , H .G . Ilk , H . K a rci, O . K a rp a t, T . K o l- c a k , L . S en d u r, M . D e m ire k le r, E . C e tin , ” Im p le m e nta tio n o f a Va ria b le -B it R a te M E L P Vo co d e r o n T M S 3 2 0 C 5 4 8 ” ,2ndE u ro - p e a n D S P E d u c a tio n a n d R e se a rch C o n fere n c e , N o isy -le -G ra n d , S e p te m b e r 1 9 9 8 .

[5 ] G .A . J u lie n , ” N u m b e r T h eo re tic Tech n iq u e s in D ig ita l S ig n a l P ro c es sin g ” , B o o k C h a p ter A d va n ce s in E le c tro n ics a n d E le c- tro n P hy sic s, A c a d e m ic P res s In c ., vo l. 8 0 , C h a p ter 2 , p p . 6 9 -1 6 3 , 1 9 9 1 .

[6 ] R . B la h u t, ” Fa st a lg o rith m s fo r d ig ita l sig n a l p ro c e ssin g ” , A d d iso n -Wes le y P u b lis h in g C o m p a ny, 1 9 8 5 .

[7 ] R .C . A g a rw a l, C .S . B u rru s, ” Fa s t c o n vo lu tio n u s in g Fe rm a t nu m b e r tra n sfo rm w ith a p p lic a tio n to d ig ita l filte- rin g ” , IE E E Tra n s . o n A c o u stic s, S p ee ch a n d S ig n a l P ro ce s- sin g , vo l. A S S P -2 2 , N2, p p . 8 7 -9 7 , 1 9 7 4 .

[8 ] W . S hu , Y . T ia n ren , ” A lg o rith m fo r lin ea r C o nvo lu tio n u sin g N u m b er T h eo re tic Tra n sfo rm s ” , E le ctro n ics L e tters , vo l. 2 4 , N5, M a rch 1 9 8 8 .

[9 ] A .V . A h o , J .E . H o p c ro ft, J .D . U llm a n , ” T h e d es ig n a n d a n a ly sis o f c o m p u te r a lg o rith m s” , A d d is o n -We sle y P u b lish in g C o m p a ny, 1 9 7 4 .

[1 0 ] S . X u , L . D a i, S .C . L ee , ” A u to c o rrela tio n A n a ly s is o f S p e e ch S ig n a ls U s in g Fe rm a t N u m b er Tra n s fo rm ” , IE E E Tra n s.

o n S ig n a l P ro c ., vo l. 4 0 , N8 , A u g u st 1 9 9 2 .

Références

Documents relatifs

conjugate gradient, deflation, multiple right-hand sides, Lanczos algorithm AMS subject classifications.. A number of recent articles have established the benefits of using

The atomic decomposition technique [14] has been used in [3, 8] to divide the sets of role fillers local to a domain element into mutually disjoint atomic sets and these

When no active speech is detected (VAD=0) the regular coder operation is switched off, and a simpler coder produces a 47-bit silence description frame which is used by the decoder

Due to the fixed approximation of the quantization values reported in Table 3, this multiplierless implementation produces only one degree of image compression, i.e., al- ways with

Unité de recherche INRIA Rennes : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France) Unité de recherche INRIA Rhône-Alpes : 655, avenue de l’Europe -

The main feature of the proposed algorithm is that it ensures that the consistency constraint (the sum of the solution components summing to one) is satisfied at every step, and

According to the voiced and unvoiced speech features, we propose a new Pitch Modelling procedure using linear- Filtering (P M F ), which splits the excitation signal up into

The research objective provides for development of the algorithm for selecting candi- dates to an IT project implementation team with use of the classic simplex method for reduction