
FIGURE 5.22: A three-band lifting-like scheme

In the document MULTIMEDIA OVER IP AND WIRELESS NETWORKS (Pages 168-173)

However, the simplest choice, corresponding to a Haar-like transform, is to have identity predict operators and linear update operators. In this case, the analysis equations become

$$
\begin{cases}
H_t^{+}(n) = x_{3t+1}(n) - x_{3t}\!\left(n - v_{3t\to 3t+1}\right),\\[4pt]
H_t^{-}(n) = x_{3t-1}(n) - x_{3t}\!\left(n - v_{3t\to 3t-1}\right),\\[4pt]
L_t(p) = x_{3t}(p) + \dfrac{1}{4}\left[H_t^{+}\!\left(p + v_{3t\to 3t+1}\right) + H_t^{-}\!\left(p + v_{3t\to 3t-1}\right)\right].
\end{cases}
$$
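Assuming zero motion vectors ($v = 0$ everywhere) for readability, these lifting steps and their inversion can be sketched as follows; the function names are illustrative, not from the text:

```python
import numpy as np

def three_band_analysis(x_prev, x_cur, x_next):
    """Haar-like three-band lifting analysis: identity predict operators,
    linear update operator. x_prev, x_cur, x_next play the roles of the
    frames x_{3t-1}, x_{3t}, x_{3t+1}; zero motion is assumed.
    """
    h_plus = x_next - x_cur                    # H_t^+(n) = x_{3t+1}(n) - x_{3t}(n)
    h_minus = x_prev - x_cur                   # H_t^-(n) = x_{3t-1}(n) - x_{3t}(n)
    low = x_cur + 0.25 * (h_plus + h_minus)    # L_t(p), update step
    return h_plus, h_minus, low

def three_band_synthesis(h_plus, h_minus, low):
    """Invert the lifting steps in reverse order (perfect reconstruction)."""
    x_cur = low - 0.25 * (h_plus + h_minus)
    x_next = h_plus + x_cur
    x_prev = h_minus + x_cur
    return x_prev, x_cur, x_next
```

Because lifting steps are inverted one by one, perfect reconstruction holds regardless of the chosen predict/update operators.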

More complex lifting-like schemes (as in Figure 5.22) have been proposed in [26], as well as other possible M-band motion-compensated temporal structures.

These structures allow a frame rate adaptation from 30 to 10 fps, for example, or from 60 to 20 fps. Flexible frame rate changes can be achieved by cascading dyadic and M-band filter banks.
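The frame rates reachable by decoding only a prefix of such a cascade can be tabulated with a small helper (the function is illustrative, not from the text):

```python
def reachable_rates(base_rate, stages):
    """Frame rates obtainable from a cascade of temporal filter-bank
    stages, where each stage divides the rate by its band count M.
    E.g. a single 3-band stage takes 30 fps to 10 fps; cascading a
    3-band and a dyadic (2-band) stage takes 60 fps to 20 and 10 fps.
    """
    rates = [base_rate]
    for m in stages:
        rates.append(rates[-1] // m)
    return rates
```

For instance, `reachable_rates(60, [3, 2])` yields the rate ladder 60, 20, 10 fps.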

Another direction for the extension of spatiotemporal transforms is to replace the 2D wavelet decomposition by other spatial representations, such as wavelet packets [27] or general filter banks [29], which also allow for more flexible spatial scalability factors [28].

5.4.4 Switching Spatial and Temporal Transforms

The interframe wavelet video coding schemes presented in the previous sections employ MCTF before the spatial wavelet decomposition is performed. Throughout the chapter we refer to this class of interframe wavelet video coding schemes as t+2D MCTF. Despite their good coding efficiency and low complexity, these types of MCTF structures also have several drawbacks.

1. Limited motion-estimation efficiency. t+2D MCTF schemes are inherently limited by the quality of the matches provided by the employed motion estimation algorithm. For instance, discontinuities at motion boundaries are represented as high frequencies in the wavelet subbands, and the “Intra/Inter” mode switch for motion estimation is not very efficient in t+2D MCTF schemes, as the spatial wavelet transform is applied globally and cannot encode the resulting discontinuities efficiently. Moreover, the motion estimation accuracy, motion model, and adopted motion estimation block size are fixed for all spatial resolutions, thereby leading to suboptimal implementations compared with nonscalable coding, which can adapt the motion estimation accuracy to the encoded resolution. Also, because the motion vectors are not naturally spatially scalable in t+2D MCTF, a large set of vectors must be decoded even at lower resolutions.

2. Limited-efficiency spatial scalability. If the motion reference during t+2D MCTF is, for example, at HD resolution and decoding is performed at a low resolution (e.g., QCIF), this leads to “subsampling phase drift” in the low-resolution video.

3. Limited spatiotemporal decomposition structures. In t+2D MCTF, the same temporal decomposition scheme is applied to all spatial subbands. Hence, the same levels of temporal scalability are provided independent of the spatial resolution.

A possible solution for the aforementioned drawbacks is to employ “in-band temporal filtering” schemes, where the order of motion estimation and compensation and that of the spatial wavelet transform (2D-DWT) are interchanged, which we denote as 2D+t MCTF schemes. The spatial wavelet transform for each frame is entirely performed first, and multiple separate motion compensation loops are used for the various spatial wavelet bands in order to exploit the temporal correlation present in the video sequence (see Figure 5.23). In contrast to the method of Figure 5.15, where spatial decomposition steps were interleaved with the temporal tree, MCTF can now also be applied to spatial high-pass (wavelet) bands. Subsequently, coding of the wavelet bands after temporal decorrelation can be done using spatial-domain coding techniques such as bit plane coding followed by arithmetic coding, or transform-domain coding techniques based on DCT, wavelets, and so on.
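A minimal sketch of this order swap, assuming one spatial decomposition level, Haar filters, two input frames, and zero motion (all names are illustrative):

```python
import numpy as np

def haar_dwt2(frame):
    """One-level 2D Haar transform, splitting a frame into four
    spatial subbands (LL, LH, HL, HH); orthonormal scaling."""
    a = (frame[0::2] + frame[1::2]) / np.sqrt(2)   # vertical low-pass
    d = (frame[0::2] - frame[1::2]) / np.sqrt(2)   # vertical high-pass
    return {
        "LL": (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2),
        "LH": (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2),
        "HL": (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2),
        "HH": (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2),
    }

def in_band_mctf(frame0, frame1):
    """2D+t sketch: spatial DWT first, then Haar temporal filtering
    (zero motion assumed) separately within each spatial subband."""
    b0, b1 = haar_dwt2(frame0), haar_dwt2(frame1)
    out = {}
    for name in ("LL", "LH", "HL", "HH"):
        out[name] = {
            "t_low": (b0[name] + b1[name]) / np.sqrt(2),   # temporal average
            "t_high": (b1[name] - b0[name]) / np.sqrt(2),  # temporal detail
        }
    return out
```

For two identical frames, every temporal high-pass band is exactly zero, which is what the temporal decorrelation step is meant to achieve.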

5.4.5 Motion Estimation and Compensation in the Overcomplete Wavelet Domain

Due to the decimation procedure in the spatial wavelet transform, the wavelet coefficients are not shift invariant with reference to the original signal resolution. Hence, translational motion in the spatial domain cannot be accurately estimated and compensated from the wavelet coefficients, thereby leading to a significant


FIGURE 5.23: Multiresolution motion compensation coder using “in-band prediction.”

FIGURE 5.24: Shift variance of the Haar wavelet transform. Right: signal shifted by one sample to the right; low-pass and high-pass coefficients in Haar DWT and Haar ODWT.

coding efficiency loss (see the Haar 1D-DWT case in Figure 5.24). To avoid this inefficiency, motion estimation and compensation should be performed in the overcomplete wavelet domain rather than in the critically sampled domain (see the Haar 1D-ODWT case in Figure 5.24). Overcomplete discrete wavelet transform (ODWT) data can be obtained through a process similar to that for the critically sampled discrete wavelet signals, by omitting the subsampling step. Consequently, the ODWT generates more samples than the DWT, but it enables accurate wavelet-domain motion compensation for the high-frequency components, and the signal does not bear frequency-inversion alias components.

Despite the fact that the ODWT generates more samples, an ODWT-based encoder still only needs to encode the critically sampled coefficients. This is because the overcomplete transform coefficients can be generated locally within the decoder.

Moreover, when the motion shift is known before analysis and synthesis filtering are performed, it is only necessary to compute those samples of the overcomplete representation that correspond to the actual motion shift.
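A one-dimensional Haar sketch (illustrative helper names, periodic signal extension assumed) makes the shift behavior concrete: the undecimated (overcomplete) coefficients simply shift along with the input, while the decimated coefficients do not:

```python
import numpy as np

def haar_dwt(x):
    """Critically sampled one-level Haar DWT (decimation by 2)."""
    return (x[0::2] + x[1::2]) / 2.0, (x[0::2] - x[1::2]) / 2.0

def haar_odwt(x):
    """Overcomplete (undecimated) Haar transform: same filters, the
    subsampling step omitted; periodic extension at the boundary."""
    xs = np.roll(x, -1)                    # x[n+1]
    return (x + xs) / 2.0, (x - xs) / 2.0
```

Shifting the input by one sample shifts the ODWT coefficients by one sample, so wavelet-domain motion compensation stays consistent; the DWT low-pass band of the shifted signal is, in general, not any shift of the original band.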

The t+2D MCTF schemes (Figure 5.25a) can easily be modified into 2D+t MCTF (Figure 5.25b).

FIGURE 5.25: (a) The encoding structure that performs open-loop encoding in the spatial domain – t+2D MCTF. (b) The encoding structure that performs open-loop encoding in the wavelet domain (in-band) – 2D+t MCTF.

More specifically, in 2D+t MCTF, the video frames are spatially decomposed into multiple subbands using wavelet filtering, and the temporal correlation within each subband is removed using MCTF (see [19,20]). The residual signal after the MCTF is coded subband by subband using any desired texture coding technique (DCT based, wavelet based, matching pursuit, etc.). Also, all the recent advances in MCTF can be employed for the benefit of 2D+t schemes, which were first introduced in [46–48].

5.5 MPEG-4 AVC/H.264 SCALABLE EXTENSION

As with scalable modes in other standards, the MPEG-4 AVC/H.264 scalable extension enables scalability while maintaining the compatibility of the base layer with the MPEG-4 AVC/H.264 standard. The MPEG-4 AVC/H.264 scalable extension provides temporal, spatial, and quality scalability, and these scalabilities can be applied simultaneously. In MPEG-4 AVC/H.264, any frame can be marked as a reference frame that can be used for motion prediction of the following frames. Such flexibility enables various motion-compensated prediction structures (see Figure 5.26).

The common prediction structure used in the MPEG-4 AVC/H.264 scalable extension is the hierarchical-B structure, as shown in Figure 5.26. Frames are categorized into different levels: B-frames at level i use neighboring frames at level i−1 as references. Except for the update step, MCTF and hierarchical-B have the same prediction structure; indeed, at the decoder, the decoding process of hierarchical-B and that of MCTF without the update step are the same. Such a hierarchical prediction structure exploits both short-term and long-term temporal correlations, as in MCTF. Another advantage is that such a structure inherently provides multiple levels of temporal scalability. Other temporal scalability schemes compliant with MPEG-4 AVC/H.264 have been presented in [25] and are shown to provide increased efficiency and robustness on error-prone networks.
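The level assignment in such a hierarchical-B GOP can be sketched as follows (illustrative helper, power-of-two GOP size assumed; this is a sketch of the structure, not the standard's syntax):

```python
def hierarchical_b_levels(gop_size):
    """Assign each frame index of a GOP to its temporal level in a
    hierarchical-B structure. Level 0 holds the key frames; a frame
    at level i is bi-predicted from its nearest neighbors at lower
    levels. Dropping the highest level halves the frame rate, which
    is how temporal scalability falls out of the structure."""
    levels = {0: [0, gop_size]}
    step, level = gop_size, 1
    while step > 1:
        half = step // 2
        levels[level] = list(range(half, gop_size, step))
        step, level = half, level + 1
    return levels
```

For a GOP of 8 frames this yields level 0 = {0, 8}, level 1 = {4}, level 2 = {2, 6}, level 3 = {1, 3, 5, 7}; discarding level 3 alone reduces 30 fps to 15 fps.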

To achieve SNR scalability, enhancement layers, which have the same motion-compensated prediction structure as the base layer, are generated with finer quantization step sizes. At each enhancement layer, the differential signal relative to the previous layer is coded. Basically, this follows the scheme shown in Figure 5.26.
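The layered requantization of the residual can be sketched as follows (the step sizes and helper name are illustrative, not taken from the standard):

```python
import numpy as np

def snr_layers(coeffs, steps=(16.0, 4.0, 1.0)):
    """SNR-scalable coding sketch: the base layer quantizes the
    coefficients coarsely; each enhancement layer quantizes the
    residual left by the previous layers with a finer step size.
    Returns the per-layer quantization indices and the final
    decoder-side reconstruction."""
    layers, recon = [], np.zeros_like(coeffs)
    for q in steps:
        idx = np.round((coeffs - recon) / q)   # code the differential signal
        layers.append(idx)
        recon = recon + idx * q                # decoder reconstruction so far
    return layers, recon
```

Decoding more layers simply adds finer-grained corrections, so truncating the layer stack trades bit rate against reconstruction quality.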

To achieve spatial scalability, the lower resolution signals and the higher resolution signals are coded into different layers. Also, coding of the higher resolution signals uses the coded lower resolution signals as a prediction. In contrast to previous coding schemes, the MPEG-4 AVC/H.264 scalable extension can set a constraint on
