

In the document MULTIMEDIA OVER IP AND WIRELESS NETWORKS (pages 139-144)

5 Scalable Video Coding for Adaptive Streaming

5.2 SCALABILITY MODES IN CURRENT VIDEO CODING STANDARDS

5.2.1 Spatial, Temporal, and SNR Coding Structures

There are three basic types of scalability in scalable video coding: spatial, temporal, and quality (or SNR) scalabilities. In a spatial scalable scheme, full decoding leads to high spatial resolution, while partial decoding leads to reduced spatial resolutions (reduction of the format). In a temporal scalable scheme, partial decoding provides lower decoded frame rates (temporal resolutions). In an SNR scalable scheme, temporal and spatial resolutions are kept the same, but the video quality (SNR) varies depending on how much of the bit stream is decoded.
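The three modes can be contrasted with a toy sketch (not a real codec; the two-layer stream and all parameter values below are assumptions for illustration): partial decoding degrades exactly one dimension, depending on the scalability mode.

```python
# Toy illustration of the three basic scalability modes: what partial
# decoding of a hypothetical 2-layer stream yields in each case.

def decode(layers_received, mode):
    """Return the decoded parameters given how many layers were received."""
    # Assumed full-resolution parameters of the toy stream.
    full = {"width": 704, "height": 576, "fps": 30, "quality": "high"}
    if layers_received == 2:          # full decoding: full parameters
        return dict(full)
    out = dict(full)
    if mode == "spatial":             # partial decoding -> reduced format
        out["width"] //= 2
        out["height"] //= 2
    elif mode == "temporal":          # partial decoding -> lower frame rate
        out["fps"] //= 2
    elif mode == "snr":               # same format and rate, lower quality
        out["quality"] = "low"
    return out

print(decode(1, "spatial"))   # {'width': 352, 'height': 288, 'fps': 30, 'quality': 'high'}
print(decode(1, "temporal"))  # fps drops to 15, format unchanged
print(decode(1, "snr"))       # quality drops, format and rate unchanged
```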

Current standards, such as H.263, H.264, MPEG-2, and MPEG-4 (both part 2 and part 10), are based on a predictive video coding scheme (see Figure 5.1).

FIGURE 5.1: Predictive (hybrid) video coding scheme.

Although they were not initially designed to address these issues, current standards tried to upgrade their video coding schemes in order to include scalability functionalities. However, this integration generally came at the expense of coding efficiency (performance).

In a standard environment, scalability is achieved through a layered structure, where the encoded video information is divided into two or more separated bit streams corresponding to the different layers (see Figure 5.2).

• The base layer (BL) is generally highly and efficiently compressed by a nonscalable standard solution.

• The enhancement layer(s) (EL) encode(s) the residual signal to produce the expected scalability (it delivers, when combined with the base layer decoding, a progressive quality improvement in case of SNR scalability, a higher spatial resolution for spatial scalability, and a higher frame rate for temporal scalability).
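For the SNR case, the BL/EL split can be sketched with scalar quantization (a minimal sketch; the quantizer steps and the 8x8 test frame are assumptions, not a standard codec): the base layer carries a coarsely quantized signal and the enhancement layer carries the quantized residual, so adding the EL improves only the quality.

```python
import numpy as np

# SNR-scalability sketch: base layer = coarse quantization of the frame,
# enhancement layer = quantized residual. Decoding BL+EL refines quality
# at the same spatial and temporal resolution.

rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(8, 8))   # assumed toy frame

Q_BASE, Q_ENH = 32.0, 4.0                  # assumed coarse / fine quantizer steps

bl = np.round(frame / Q_BASE) * Q_BASE     # base layer reconstruction
residual = frame - bl
el = np.round(residual / Q_ENH) * Q_ENH    # enhancement layer (coded residual)

bl_only = bl                               # partial decoding: BL alone
bl_plus_el = bl + el                       # full decoding: BL + EL

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# The enhancement layer strictly reduces the reconstruction error.
print(mse(frame, bl_only) > mse(frame, bl_plus_el))  # True
```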

To achieve spatial scalability in the hybrid scheme presented in Figure 5.3, the input video sequence is first spatially decimated to yield the lowest resolution layer, which is encoded by a standard encoder. A similar coding scheme is employed for the enhancement layer. To transmit a higher resolution version of the current frame, two predictions are formed: one is obtained by spatially interpolating the decoded lower resolution image of the current frame (spatial prediction) and the other by temporally compensating the higher resolution image of the predicted frame with motion information (temporal prediction). The two predictions are then adaptively combined for a better prediction and the residue after prediction is coded and transmitted. In Figure 5.3, a scheme with two resolution levels is depicted, but the same principle can be used to produce several spatial resolution enhancement levels. This solution corresponds to a Laplacian pyramid and is noncritically sampled, or redundant (the number of output samples is higher than the number of input samples).
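The spatial-prediction path of this scheme can be sketched as a two-level Laplacian pyramid (the simple 2x2 averaging and nearest-neighbour filters below are assumptions; real encoders use better down/upsampling filters, and the temporal-prediction path is omitted):

```python
import numpy as np

# Two-level Laplacian pyramid: base = decimated frame, detail = residual
# after spatial prediction (upsampled base). Together they reconstruct the
# frame exactly, but the representation is redundant.

def downsample(x):
    """2x2 block averaging (assumed decimation filter)."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour interpolation (assumed upsampling filter)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

frame = np.arange(64, dtype=float).reshape(8, 8)   # assumed toy frame

base = downsample(frame)              # low-resolution layer (standard encoder)
detail = frame - upsample(base)       # enhancement residual (spatial prediction)

# Perfect reconstruction from the two layers:
assert np.allclose(upsample(base) + detail, frame)

# Noncritically sampled: more output samples than input samples.
print(base.size + detail.size, ">", frame.size)   # 80 > 64
```

The redundancy (80 samples for a 64-sample frame) is the price paid for the freedom to choose the filters, which is the advantage noted below.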

The drawback of this approach is that the different encoding loops with their own motion estimation steps are used in parallel, at the encoder side, and several motion compensation loops are necessary at the decoder side, thus increasing the computational complexity both at the encoder and at the decoder.

A possible advantage of this scheme is the flexibility in choosing the downsam-pling/upsampling filters, in particular for reducing aliasing at lower resolutions.

Related to the spatial scalability, there is the issue of motion vector scalability.

Indeed, the different resolution levels will need motion vector fields with different resolutions and, possibly, accuracies. For the aforementioned Laplacian pyramid coding, the simplest approach is to estimate and encode the motion vectors, starting from the lowest resolution and going to the highest. From one layer to the next, the motion vector size needs to be doubled. Additionally, a refinement of the motion vector can be performed at higher resolutions. At this point, the precision and the accuracy of the motion can also be increased at higher levels. By precision we understand here the size of the block considered for motion estimation and compensation. When doubling the resolution, the dimensions of the block also double, and the motion representation loses precision. Therefore, it may be convenient to split the block into smaller subblocks (two rectangular or four square ones) and look for refinement vectors in the subblocks. The decision to split or to keep the lower resolution precision may be taken based on a rate–distortion criterion. Once the lowest resolution motion vector field is encoded, the next levels can be either encoded independently, with a possible loss in efficiency, or only the refinement vector(s) can be encoded in the refinement layer. The interested reader is referred to [22] for a more detailed discussion on motion vector scalability and its impact on the prediction complexity.

FIGURE 5.2: Global structure of a layered scalable video coding scheme.

FIGURE 5.3: Layered spatial scalability.

Temporal scalability involves partitioning of the group of pictures (GOP) into layers having the same spatial resolution. A simple way to achieve temporal scalability is to put some of the B frames from an IBBP... stream into one or several enhancement layers. This solution comes at no cost in terms of coding efficiency.

In a more general setting, the base layer may contain I, P, and B frames at the low frame rate, while the enhancement layers can only use frames from the immediately lower temporal layer and previous frames from the same enhancement layer for temporal prediction. Generally, temporal prediction from future frames in the same enhancement layer is prohibited in order to avoid reordering in the enhancement layers. An example with one enhancement layer is presented in Figure 5.4.
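The simple B-frame split described above can be sketched directly (the 10-frame IBBP GOP below is an assumed example):

```python
# Temporal scalability on an IBBP... GOP: B frames go to the enhancement
# layer; dropping that layer reduces the frame rate without breaking the
# base-layer prediction chain, since I/P frames never reference B frames.

gop = ["I", "B", "B", "P", "B", "B", "P", "B", "B", "P"]   # assumed GOP

base_layer = [(i, t) for i, t in enumerate(gop) if t in ("I", "P")]
enhancement = [(i, t) for i, t in enumerate(gop) if t == "B"]

print([i for i, _ in base_layer])    # [0, 3, 6, 9] -> low frame rate
print([i for i, _ in enhancement])   # [1, 2, 4, 5, 7, 8] -> droppable

# Every frame lands in exactly one layer.
assert len(base_layer) + len(enhancement) == len(gop)
```

This is why the split costs nothing in coding efficiency: the prediction structure of the base layer is exactly what a nonscalable encoder would have produced.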

The layered solution can be seen as an upgrade of standard solutions in order to provide scalability. The main shortcoming of these schemes comes from the fact that the information redundancy between the different layers cannot be fully exploited. This functionality is thus achieved at the expense of implementation complexity and coding efficiency.
