
Strategies for Audio Communication


Dinei Florêncio

3.1 INTRODUCTION

In this chapter we review the main techniques for error concealment in packet audio. As explained in Chapters 7–10, forward error correction (FEC) or repeat-request solutions are often adequate for streaming media and broadcast. These can virtually eliminate information loss, guaranteeing that every bit is actually received at the decoder side. Nevertheless, these techniques also require the introduction of additional delay, and the higher the desired protection level, the higher the required delay. Real-time communication (RTC) applications are very delay sensitive and cannot fully exploit these techniques to eliminate losses entirely. For this reason, the needs of RTC are quite particular: we need error concealment, and we need FEC techniques that can be applied without an excessive increase in delay. In this chapter we look at some of the techniques used in error concealment for speech, and at media-aware FEC techniques, with particular interest in RTC.

Compression and error concealment are tightly related. Compression tries to remove as much redundancy from the signal as possible, but the more redundancy is removed, the more important each piece of information becomes, and therefore the harder it is to conceal lost packets. More specifically, speech is a dynamic but slowly varying signal; the key way of compressing speech is to transmit only signal changes relative to the previous or expected state. Nevertheless, transmitting these changes in differential form means that if some information is lost (e.g., due to a packet loss), the decoder no longer knows the current state of the signal. It is always expected that the segment corresponding to the missing data will not be properly decoded, but with differential coding, subsequent frames may also be affected. Furthermore, it is easier to replace a missing speech segment if the correct signal has been received in its vicinity. For all these reasons, error concealment may depend significantly on the compression technology used.
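To make this state dependence concrete, consider the toy differential (DPCM-style) coder sketched below in Python. It is an illustration of the concept only, not any standardized codec; the function names and the choice of an unquantized difference signal are assumptions made for clarity.

import numpy as np

def dpcm_encode(x: np.ndarray) -> np.ndarray:
    # Transmit only the change relative to the previous sample.
    return np.diff(x, prepend=0.0)

def dpcm_decode(deltas: np.ndarray, state: float = 0.0) -> np.ndarray:
    # Reconstruct by accumulating changes on top of the last decoded
    # sample (`state`); this is exactly the state a packet loss destroys.
    return state + np.cumsum(deltas)

# Three frames of a slowly varying signal, one packet per frame.
signal = np.linspace(0.0, 1.0, 240)
frames = np.split(dpcm_encode(signal), 3)

# Decode frames 0 and 2 with frame 1 lost: every sample of frame 2
# is offset by the missing accumulated changes, so the damage
# outlives the gap itself.
out0 = dpcm_decode(frames[0])
out2 = dpcm_decode(frames[2], state=out0[-1])  # stale decoder state

With PCM, by contrast, losing frame 1 would leave frame 2 perfectly decodable.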

We will start this chapter by looking at some of the basic ideas behind packet loss concealment for speech. With that objective, in Section 3.2 we introduce the basic concealment techniques used in nonpredictive speech codecs. The job of concealing losses becomes harder as the codec removes more and more redundancy from the signal. In Section 3.3, we discuss some of the techniques used to reduce the impact of the feedback loop in CELP (Code-Excited Linear Prediction) and other predictive codecs. In Section 3.4, we present some recent results in loss concealment for transform coders, which are used both in speech and in audio applications. Finally, in Section 3.5 we discuss recent research in media-aware FEC techniques. Particular attention is paid to speech, due to its importance in RTC, but many of the recent advances in loss concealment techniques we will discuss apply also to audio. For example, the same principles used in the overlapped transform concealment techniques can be used for most audio codecs, and media-aware FEC can be applied to most audio or video coders. We also point out that this chapter is closely related to the ideas presented in Chapters 15 and 16.

3.2 LOSS CONCEALMENT FOR WAVEFORM SPEECH CODECS

When digital systems started replacing analog equipment a few decades ago, processing power was scarce and expensive, and coding techniques were still primitive. For those reasons, most early digital systems used a very simple coding scheme: PCM (Pulse Code Modulation). In this digital representation of speech, there isn't really any coding in the compression sense. The signal is simply sampled and quantized. More specifically, the speech signal is typically sampled at 8 kHz, and each sample is encoded with 8-bit precision, using one of two quantization schemes, usually referred to as A-law and μ-law. This gives a total rate of 64 Kbps. The PCM system used in telephony has been standardized by the ITU (International Telecommunication Union) in the standard G.711 [1]. For Voice over Internet Protocol (VoIP) or other packet network applications, the speech samples are grouped into frames (typically 10 ms in duration) and sent as packets across the network, one frame per packet. Note that a frame corresponds to a data unit in the terminology of Chapter 2. Note also that, since there is no real coding, there is no dependence across packets: packets can be received and decoded independently.

When G.711 was first adopted, the main motivation was quality: a digital signal was not subject to degradation. At the same time, a 64-Kbps digital channel had a significant cost, and there was a strong push toward increased compression. With the evolution of speech compression technology and increased processing power, more complex speech codecs were also standardized (e.g., [3–6]), providing better compression. Curiously, today, in many applications bandwidth is no longer a significant constraint, and we are starting to see basic PCM-coded speech increase in usage again. Furthermore, many error concealment techniques operate in the time domain, and are therefore best understood as applying to PCM-coded speech. For this reason, in this section we review the basic concept of packet loss as applied to speech and look at some common techniques to conceal loss in PCM-coded speech.
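As a concrete illustration, the sketch below packetizes a speech signal the way a simple VoIP sender might: μ-law companding followed by grouping into 10-ms, 80-sample frames. The companding here uses the textbook continuous formula, whereas actual G.711 specifies a segmented (piecewise-linear) approximation of it, and the function names are our own.

import numpy as np

FS = 8000                          # G.711 sampling rate (Hz)
FRAME_LEN = FS * 10 // 1000        # 10 ms per packet = 80 samples

def mu_law_encode(x: np.ndarray, mu: float = 255.0) -> np.ndarray:
    # Continuous mu-law companding to 8 bits per sample.
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1.0) * 127.5).astype(np.uint8)

def packetize(samples: np.ndarray) -> list:
    # Group coded bytes into independent 10-ms frames, one per packet.
    # No state is shared between frames: each packet decodes on its own.
    coded = mu_law_encode(samples)
    n_full = len(coded) // FRAME_LEN
    return [coded[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n_full)]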

We assume speech samples are PCM coded and grouped in 10-ms frames before transmission. Since we assume packets are either received error free or not received at all, any loss incurred in the transmission process implies a missing segment of 10 ms (or a multiple thereof). Figure 3.1 shows a segment of a speech signal. The signal is typical of a voiced phoneme. Figure 3.1(a) shows the original signal, whereas 3.1(b) shows a plot where 20 ms (i.e., two packets) is missing. As can be inferred from the picture, a good concealment algorithm would try to replace the missing segment by extending the prior signal with new periods of similar waveforms. This can be done with different levels of complexity, yielding correspondingly different levels of artifacts. We will now investigate a simple concealment technique, described in Appendix I of Recommendation G.711 [2]; the results of applying that algorithm are illustrated in Figure 3.1(c).

FIGURE 3.1: (a) A typical speech signal. (b) Original signal with two missing frames. (c) Concealed loss using Appendix I of G.711.


3.2.1 A Simple Algorithm for Loss Concealment: G.711 Appendix I

The first modification needed in the G.711 decoder to allow for error concealment is the introduction of a 30-sample delay. This delay is used to smooth the transition between the end of the original (received) segment and the start of the synthesized segment. The second modification is that we maintain a circular buffer containing the last 390 samples (48.75 ms). The signal in this buffer is used to select a segment for replacing the lost frame(s).

When a loss is detected, the concealment algorithm starts by estimating the pitch period of the speech. This is done by finding the peak of the normalized cross-correlation between the most recent 20 ms of signal and the signal stored in the buffer. The peak is searched over lags of 40 to 120 samples, corresponding to pitch frequencies from 200 Hz down to about 66 Hz.
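The sketch below illustrates this pitch search in Python. It is an illustration of the idea rather than the Recommendation's reference code; we assume the history buffer is a NumPy array of at least 280 samples at 8 kHz, and the function name is our own.

import numpy as np

FS = 8000
MIN_LAG, MAX_LAG = 40, 120   # search range: 200 Hz down to ~66 Hz

def estimate_pitch(history: np.ndarray) -> int:
    # Template: the most recent 20 ms (160 samples) before the loss.
    template = history[-160:]
    best_lag, best_score = MIN_LAG, -np.inf
    for lag in range(MIN_LAG, MAX_LAG + 1):
        # The same 20-ms window, shifted back in time by `lag` samples.
        candidate = history[-160 - lag:-lag]
        denom = np.sqrt(np.dot(template, template) *
                        np.dot(candidate, candidate))
        score = np.dot(template, candidate) / denom if denom > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag   # pitch period estimate, in samples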

After the pitch period has been estimated, a segment corresponding to 1.25 periods is taken from the buffer and used to conceal the missing segment. More specifically, the selected segment is overlap-added with the existing signal, with the overlap spanning 0.25 of the pitch period. Note that this overlap starts in the last few samples of the good frame (which is why we had to insert the 30-sample delay in the signal). The process is repeated until enough samples have been produced to fill the gap. The transition between the synthesized signal and the first good frame is also smoothed by an overlap-add with the first several samples of the received frame.
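A minimal sketch of this replication loop follows, reusing estimate_pitch() from above. The triangular cross-fade and the bookkeeping are simplifying assumptions; the Recommendation differs in detail. Note that a quarter-period overlap is at most 120/4 = 30 samples, consistent with the 30-sample decoder delay introduced earlier.

import numpy as np

def conceal_gap(history: np.ndarray, gap_len: int) -> np.ndarray:
    period = estimate_pitch(history)
    overlap = period // 4                       # 0.25 of a pitch period
    fade_out = np.linspace(1.0, 0.0, overlap)   # triangular cross-fade
    fade_in = 1.0 - fade_out

    # A segment of 1.25 pitch periods taken from the end of the
    # received signal.
    segment = history[-(period + overlap):].astype(float)

    out = history.astype(float)
    while len(out) < len(history) + gap_len:
        # Cross-fade the head of the segment over the current tail;
        # the first splice lands on the last samples of the good frame,
        # which is what the decoder delay makes possible.
        out[-overlap:] = out[-overlap:] * fade_out + segment[:overlap] * fade_in
        out = np.concatenate([out, segment[overlap:]])
    return out[len(history):len(history) + gap_len]

A complete implementation would also write the smoothed tail back into the output, and would overlap-add the synthesized signal with the first good frame after the gap, as described above.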

Special treatment is given to a number of situations. For example, if two or more consecutive frames are missing, the method replicates a segment several pitch periods long, instead of repeating the same pitch period several times. Also, after the first 10 ms, the signal is progressively attenuated, such that after 60 ms the synthesized signal is zero. This can be seen in Figure 3.1(c), where the amplitude of the synthesized signal starts to decrease slightly after 160 samples, even though the synthesized signal is still based on the same (preceding) data segment. Also, note that since the period of the missing segment is not identical to that of the synthesized segment, the transition to the next good frame may present a very atypical pitch period, which can be observed in Figure 3.1(c) around sample 1000.
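The muting schedule can be expressed as a per-sample gain applied to the synthesized signal. The linear ramp below is an assumption consistent with the description (unity gain for the first 10 ms, silence from 60 ms on); the exact attenuation schedule in the Recommendation may differ.

import numpy as np

def attenuate(synth: np.ndarray, fs: int = 8000) -> np.ndarray:
    # Unity gain up to 10 ms, linear decay to zero at 60 ms, muted after.
    start, end = int(0.010 * fs), int(0.060 * fs)   # 80 and 480 samples
    n = np.arange(len(synth))
    gain = np.clip((end - n) / (end - start), 0.0, 1.0)
    return synth * gain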

The reader is directed to the ITU Recommendation [2] for more details of the algorithm. Results of the subjective tests performed with the algorithm, as well as some considerations about bandwidth expansion, can be found in [7]. Alternatively, the reader may refer to Chapter 16, which gives details of a related timescale modification procedure. For our purposes, it suffices to understand that the algorithm works by replicating pitch periods. Other important elements are the gradual muting when the loss is too long and the overlap-add to smooth transitions. These elements will be present in most other concealment algorithms.

By the nature of the algorithm, it is easy to understand why it works well for single losses in the middle of voiced phonemes. As expected, the level of artifacts is higher for unvoiced phonemes and transitions. More elaborate concealment techniques address each of these issues more carefully, further reducing the level of artifacts at the cost of complexity. One possibility is to use an LPC filter and do the concealment in the "residual domain" [8,9]. Note that this is unrelated to the concealment of CELP codecs (which we will investigate in the next section). Here we simply use LPC to improve the extrapolation of the signal; the coefficients are actually computed at the decoder. In CELP codecs, we have to handle the problem of lost LPC coefficients.
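As an illustration of this idea, the sketch below computes the LPC coefficients at the decoder, flattens the received signal into its residual, extrapolates the residual by replicating its last pitch period, and re-synthesizes through the all-pole filter 1/A(z). This is one plausible way to realize residual-domain concealment, not the specific methods of [8,9]; estimate_pitch() is reused from the earlier sketch.

import numpy as np
from scipy.signal import lfilter

def lpc(x: np.ndarray, order: int = 10) -> np.ndarray:
    # Autocorrelation method with the Levinson-Durbin recursion.
    # Returns a = [1, a1, ..., ap], the analysis filter A(z).
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def conceal_residual(history: np.ndarray, gap_len: int,
                     order: int = 10) -> np.ndarray:
    a = lpc(history, order)
    # Analysis: whiten the received signal into its LPC residual.
    residual = lfilter(a, [1.0], history)
    # Extrapolate the residual by repeating its last pitch period.
    period = estimate_pitch(history)
    reps = -(-gap_len // period)                     # ceiling division
    ext = np.tile(residual[-period:], reps)[:gap_len]
    # Synthesis: run the extended residual through 1/A(z), continuing
    # from the filter state reached at the end of the received signal.
    _, zf = lfilter([1.0], a, residual, zi=np.zeros(order))
    synth, _ = lfilter([1.0], a, ext, zi=zf)
    return synth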

3.3 LOSS CONCEALMENT FOR CELP SPEECH CODECS

In the previous section we looked at error concealment for PCM-coded speech. In PCM-coded speech, each speech frame is encoded independently (in fact, each sample is encoded independently). For this reason, the loss of one packet does not impair the decoding of subsequent frames. However, since no redundancy is removed from the signal, toll-quality speech using G.711 requires 64 Kbps.

Many other codecs remove more redundancy from the signal, and therefore require a lower rate. More recent codecs are actually quite aggressive in removing redundancy. For example, several flavors of CELP coding have been used in speech codecs standardized by the ITU, including G.728 [3], G.729 [4], and G.722.2 [6]. Other organizations have also standardized several other CELP codecs, including the European Telecommunications Standards Institute (ETSI), which standardized several GSM (Global System for Mobile Communications) codecs [10] and the 3GPP (Third Generation Partnership Project) AMR (Adaptive Multi-Rate) codec [11], as well as the US Department of Defense (DoD), which standardized one of the first LPC codecs, the DoD FS-1016 [12], and more recently a 2.4-Kbps mixed excitation linear prediction (MELP) codec, the MIL-STD-3005 [14].

While a full understanding of a CELP codec is outside the scope of this chapter, we will need a basic understanding in order to deal with the concealment techniques used in association with these codecs. We will now present a quick summary of the important elements of a CELP codec.

Figure 3.2 shows a block diagram of a typical CELP decoder. The first important element in these codecs is the use of a Linear Prediction (LP) filter, indicated as "LPC Synthesis Filter" in the figure. The second element is the use of a codebook as the input to the filter (thus the name "code-excited linear prediction," CELP). We are mostly concerned with the decoding operation.

FIGURE 3.2: Block diagram of a typical CELP decoder. A fixed codebook (scaled by gain Gf) and an adaptive codebook (scaled by gain Ga) are summed to form the excitation of the LPC synthesis filter, which produces the output.
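Stated as code, the decoder structure of Figure 3.2 amounts to scaling and summing the two codebook contributions and filtering the result. The sketch below is schematic, not any particular standard; the names and the subframe interface are our assumptions.

import numpy as np
from scipy.signal import lfilter

def celp_decode_subframe(fixed_vec: np.ndarray, adaptive_vec: np.ndarray,
                         gf: float, ga: float,
                         lpc_a: np.ndarray, zi: np.ndarray):
    # Excitation: fixed-codebook entry scaled by Gf plus the
    # adaptive-codebook entry (drawn from past excitation) scaled by Ga.
    excitation = gf * fixed_vec + ga * adaptive_vec
    # Shape the excitation through the LPC synthesis filter 1/A(z);
    # zi/zf carry the filter memory from one subframe to the next.
    out, zf = lfilter([1.0], lpc_a, excitation, zi=zi)
    # The returned excitation updates the adaptive codebook. Both it
    # and the filter state are decoder state that a packet loss
    # corrupts, which is the crux of CELP loss concealment.
    return out, excitation, zf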
