
FIGURE 5.4: General framework for layered temporal scalability

In the document MULTIMEDIA OVER IP AND WIRELESS NETWORKS (Page 144-149)

A general problem with introducing scalability in a predictive video coding scheme is the so-called drift effect. It occurs when the reference frame used for motion compensation in the encoding loop is not available, or not completely available, at the decoder side. Therefore the encoder and the decoder have to maintain their synchronization on the same bit rate in the case of SNR scalability, on the same resolution level for spatial scalability, and on the same frame rate in the case of temporal scalability.

For SNR scalability, a layered encoder exploits correlations across subflows to achieve better overall compression: the input sequence is compressed into a number of discrete layers arranged in a hierarchy that provides progressive refinement. A strategy often used in the scalable extensions of current standards (i.e., in MPEG-2 and H.263) is to encode the base layer using a large quantization step, whereas the enhancement layers have a refinement goal and use finer quantizers to encode the base layer coding error. This solution is illustrated in Figure 5.5 and is discussed in more detail later.
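A toy numeric sketch of this base-plus-refinement idea (the step sizes and coefficient values are hypothetical, and a plain rounding quantizer stands in for the standards' actual quantizers):

```python
def quantize(x, step):
    """Mid-tread uniform quantizer: return the reconstructed value idx * step."""
    return round(x / step) * step

coeffs = [12.7, -3.2, 45.1, 0.4]      # hypothetical transform coefficients

# Base layer: coarse quantizer (large step size).
base_rec = [quantize(x, 16.0) for x in coeffs]
# Enhancement layer: a finer quantizer encodes the base layer coding error.
enh_rec = [quantize(x - b, 4.0) for x, b in zip(coeffs, base_rec)]
# A decoder holding both layers reconstructs base + refinement.
two_layer = [b + e for b, e in zip(base_rec, enh_rec)]

base_err = max(abs(x - b) for x, b in zip(coeffs, base_rec))
full_err = max(abs(x - t) for x, t in zip(coeffs, two_layer))
```

Decoding only the base layer leaves a quantization error bounded by the coarse step; adding the enhancement layer shrinks that error toward the finer step, which is exactly the progressive refinement the hierarchy provides.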

5.2.2 Successive Approximation Quantization and Bit Planes

To realize the SNR scalability concept discussed earlier, an important category of embedded scalar quantizers is the family of embedded dead zone scalar quantizers. For this family, each transform coefficient x is quantized to an integer

i_b = Q_b(x) = sign(x) ⌊ |x| / (2^b Δ) + ξ ⌋,

where ⌊a⌋ denotes the integer part of a; ξ < 1 determines the width of the dead zone; Δ > 0 is the basic quantization step size (basic partition cell size) of the quantizer family; and b ∈ Z+ indicates the quantizer level (granularity), with higher values of b indicating coarser quantizers. In general, b is upper bounded by a value B_max, selected to cover the dynamic range of the input signal. The reconstructed value is given by the inverse operation,

y_{i_b} = Q_b^{-1}(i_b) =
    0,                                  if i_b = 0,
    sign(i_b) (|i_b| − ξ + δ) 2^b Δ,    otherwise,

where 0 ≤ δ < 1 specifies the placement of the reconstructed value y_{i_b} within the corresponding uncertainty interval (partition cell), defined as C_b^i, and i is the partition cell index, which is bounded by a predefined value for each quantizer level (i.e., 0 ≤ i ≤ M_b − 1, for each b). Based on the aforementioned formulation, it is rather straightforward to show that the quantizer Q_0 has embedded within it


FIGURE 5.5: Layered SNR scalability.

all the uniform dead zone quantizers with step sizes 2^b Δ, b ∈ Z+. Moreover, it can be shown that, under the appropriate settings, the quantizer index obtained by dropping the b least-significant bits (LSBs) of i_0 is the same as the one that would be obtained if the quantization were performed using a step size of 2^b Δ, b ∈ Z+, rather than Δ. This means that if the b LSBs of i_0 are not available, one can still dequantize at a lower level of quality using the inverse quantization formula.
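A small sketch of this quantizer family and its embedding property (function names are illustrative; ξ defaults to 0, and the reconstruction offset δ is named `rec` in code to avoid clashing with the step size; the LSB-dropping check is shown for a nonnegative coefficient, where dropping LSBs of the index magnitude is a plain right shift):

```python
import math

def Q(x, b, step=1.0, xi=0.0):
    """Dead zone quantizer index: i_b = sign(x) * floor(|x| / (2^b * step) + xi)."""
    return int(math.copysign(math.floor(abs(x) / ((2 ** b) * step) + xi), x))

def Q_inv(i, b, step=1.0, xi=0.0, rec=0.5):
    """Inverse: 0 for the dead zone, else sign(i) * (|i| - xi + rec) * 2^b * step."""
    if i == 0:
        return 0.0
    return math.copysign((abs(i) - xi + rec) * (2 ** b) * step, i)

x = 45.7
i0 = Q(x, b=0)                  # finest-level index
for b in range(4):
    # Dropping the b LSBs of i_0 gives the same index as quantizing
    # directly with step size 2^b * step.
    assert i0 >> b == Q(x, b)
    # The inverse formula still dequantizes, at coarser quality: the error
    # stays within half the (now wider) partition cell.
    assert abs(x - Q_inv(Q(x, b), b)) <= (2 ** b) / 2 + 1e-9
```

This is the mechanism behind bit plane scalability: truncating the bit stream amounts to discarding index LSBs, and the decoder simply dequantizes at the coarser level that remains.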

The most common option for embedded scalar quantization is successive approximation quantization (SAQ). SAQ is a particular instance of the generalized family of embedded dead zone scalar quantizers defined earlier. For SAQ, M_{B_max} = M_{B_max−1} = ··· = M_0 = 2 and ξ = 0, which determines a dead zone width twice as wide as the other partition cells, and δ = 1/2, which implies that the output levels y_{i_b} are in the middle of the corresponding uncertainty intervals C_b^i. SAQ can be implemented via thresholding by applying a monotonically decreasing set of thresholds of the form

T_{b−1} = T_b / 2,

with B_max ≥ b ≥ 1. The starting threshold T_{B_max} is of the form T_{B_max} = α x_max, where x_max is the highest coefficient magnitude in the input transform decomposition, and α is a constant that is taken as α ≥ 1/2.
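Under these settings the entire threshold set is fixed by x_max alone; a minimal sketch (the function name and the choice α = 1/2 are illustrative):

```python
def saq_thresholds(x_max, b_max, alpha=0.5):
    """SAQ thresholds: T_{B_max} = alpha * x_max, then T_{b-1} = T_b / 2."""
    thresholds = [alpha * x_max]
    for _ in range(b_max):
        thresholds.append(thresholds[-1] / 2.0)
    return thresholds  # [T_{B_max}, ..., T_0]

# Hypothetical: largest coefficient magnitude 48, four bit planes.
print(saq_thresholds(48.0, b_max=3))  # [24.0, 12.0, 6.0, 3.0]
```

Because each threshold is half the previous one, transmitting the significance decisions plane by plane delivers one extra magnitude bit of every coefficient per plane.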

Let us consider the case of using a spatial transform for the compression of the frames. By using SAQ, the significance of the transform coefficients with respect to any given threshold T_b is indicated in a corresponding binary map, denoted by W_b, called the significance map. Denote by x(k) the transform coefficient with coordinates k = (κ1, κ2) in the two-dimensional transform domain of a given input. The significance operator s_b(·) maps any value x(k) in the transform domain to a corresponding binary value w_b(k) in W_b, according to the rule

w_b(k) = s_b(x(k)) =
    0, if |x(k)| < T_b,
    1, if |x(k)| ≥ T_b.

In general, embedded coding of the input coefficients translates into coding the significance maps W_b, for every b with B_max ≥ b ≥ 0.
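The significance rule above is a simple elementwise threshold; a sketch on a hypothetical 2×2 block (values and function name are illustrative):

```python
def significance_map(block, t):
    """s_b applied elementwise: w_b(k) = 1 if |x(k)| >= T_b, else 0."""
    return [[1 if abs(x) >= t else 0 for x in row] for row in block]

block = [[34.1, -2.0],
         [-17.3, 6.2]]          # hypothetical 2x2 transform coefficients

# As T_b is halved plane by plane, coefficients can only turn significant,
# never the reverse, so each map W_b refines the previous one.
maps = {t: significance_map(block, t) for t in (32.0, 16.0, 8.0, 4.0)}
```
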

In most state-of-the-art embedded coders, for every b this is effectively performed in several encoding passes, which can be summarized as follows:

Nonsignificance pass: encodes s_b(x(k)) for the coefficients in the list of nonsignificant coefficients (LNC). If a coefficient becomes significant, its coordinates k are transferred into the refinement list (RL).

Block significance pass: for a block of coefficients with coordinates k_block, this pass encodes s_b(x(k_block)) and sign(x(k_block)) if the block has descendant blocks (under a quad tree decomposition structure) that were not significant compared to the previous bit plane.

Coefficient significance pass: if the coordinates of the coefficients of a significant block are not in the LNC, this pass encodes the significance of coefficients in blocks containing at least one significant coefficient. Also, the coordinates of new significant coefficients are placed into the RL. This pass also moves the coordinates of nonsignificant coefficients found in the block into the LNC for the next bit plane level(s).

Refinement pass: for each coefficient in the RL (except those newly put into the RL during the last block pass), encode the next refinement bit of its magnitude.
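The interplay of these passes can be sketched in miniature. The version below keeps only a significance pass and a refinement pass over a 1-D coefficient list and omits the quad tree block passes, so it illustrates the bookkeeping rather than any standardized coder; all names and values are hypothetical:

```python
def bitplane_encode(coeffs, n_planes):
    """Toy bit-plane coder: one significance pass and one refinement pass
    per plane, emitting (pass, index, bit) symbols."""
    t = max(abs(x) for x in coeffs) / 2.0   # starting threshold, alpha = 1/2
    significant = set()                     # stands in for the RL
    symbols = []
    for _ in range(n_planes):
        newly = set()
        # Significance pass: code coefficients that are not yet significant.
        for k, x in enumerate(coeffs):
            if k in significant:
                continue
            sig = 1 if abs(x) >= t else 0
            symbols.append(("sig", k, sig))
            if sig:
                symbols.append(("sign", k, 0 if x >= 0 else 1))
                newly.add(k)
        # Refinement pass: one more magnitude bit for previously significant
        # coefficients; those newly added this plane are skipped, as in the text.
        for k in sorted(significant):
            bit = 1 if (abs(coeffs[k]) % (2 * t)) >= t else 0
            symbols.append(("ref", k, bit))
        significant |= newly
        t /= 2.0
    return symbols

symbols = bitplane_encode([23.0, -7.0, 14.0, 3.0], n_planes=2)
```

Truncating `symbols` at any point still leaves a decodable prefix, which is the embedded property the passes are designed to preserve.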

5.2.3 Other Types of Scalability

In addition to the aforementioned scalabilities, other types of scalability have been proposed.

• Complexity scalability: the encoding/decoding algorithm has less complexity (CPU/memory requirements or memory access) with decreasing temporal/spatial resolution or decreasing quality [40].

• Content (or object) scalability: a hierarchy of relevant objects is defined in the video scene and a progressive bit stream is created following this importance order. Such methods of content selection may be related to arbitrarily shaped objects or even to rectangular blocks in block-based coders. The main problem of such techniques is how to automatically select and track visually important regions in video.

• Frequency scalability: this technique, popular in the context of transform coding, consists of allocating coefficients to different layers according to their frequency. Data partitioning techniques may be used to implement this functionality. The interested reader is referred to Chapter 2 of this book for more information on data partitioning.

Among existing standards, the first ones (MPEG-1 and H.261) did not provide any kind of scalability. H.263+ and H.264 provide temporal scalability through B-frame skipping.

5.3 MPEG-4 FINE GRAIN SCALABLE (FGS) CODING AND ITS NONSTANDARDIZED VARIANTS

5.3.1 SNR FGS Structure in MPEG-4

The conventional scalable coding schemes discussed previously cannot easily and efficiently adapt to time-varying network conditions or device characteristics. The reason for this is that they provide only coarse-granularity rate adaptation, and their coding efficiency often decreases due to the overhead associated with an increased number of layers.

To address this problem, FGS coding has been standardized in the MPEG-4 standard, as it is able to provide fine-grain scalability to easily adapt to various time-varying network and device resource (e.g., power) constraints [6,44]. Moreover, FGS can enable a streaming server to perform minimal real-time processing and rate control when outputting a very large number of simultaneous unicast (on-demand) streams, as the resulting bit stream can be easily truncated to fulfill various (network) rate requirements. Also, FGS is easily adaptable to unpredictable bandwidth variations due to heterogeneous access technologies (Internet, wireless cellular, or wireless LANs) or to dynamic changes in network conditions (e.g., congestion events). Moreover, FGS enables low-complexity decoding and low-memory requirements that provide common receivers (e.g., set top boxes and digital televisions), in addition to powerful computers, the opportunity to stream and decode any desired streamed video content. Hence, receiver-driven streaming solutions can select only the portions of the FGS bit stream that fulfill these constraints [40,45].

In MPEG-4 FGS, a video sequence is represented by two layers of bit streams with identical spatial resolution, which are referred to as the base layer bit stream and the fine granular enhancement layer bit stream, as illustrated in Figure 5.6.
