
1.1 Fundamentals of Video Compression


In this section, we explore the operation of the basic video codec. The major video coding standards released since the early 1990s have been based on the same generic design (or model) of a video CODEC that incorporates a motion estimation and compensation front end (sometimes described as Differential Pulse Code Modulation (DPCM)), a transform stage, and an entropy encoder. The model is often described as a hybrid DPCM/DCT CODEC [1, 2]. Any CODEC that is compatible with H.261/263/264/265 or MPEG-1/2/4 has to implement a similar set of basic coding and decoding functions (although there are many differences of detail between the standards and their actual implementations). Figure 1.1 shows the structure of an encoder for video compression (a generic DPCM/DCT hybrid encoder). In the encoder, video frame $n$ ($F_n$) is processed to produce a coded (compressed) bitstream, while in the decoder, the compressed bitstream is decoded to produce a reconstructed video frame $\hat{F}_n$, not usually identical to the source frame. The functionality of each block is explained in the following subsections.

Fig. 1.1 Block diagram of a generic video encoder

1.1.1 Transform Block

The first step in Discrete Cosine Transform (DCT) based image/video coding is to divide the image into small blocks, usually of size 8×8 pixels. A DCT operation is then performed on each block to convert the pixel values into the frequency domain: it takes 64 input values and yields 64 frequency-domain coefficients. This transform is fully reversible; the original block can be reconstructed by applying an Inverse DCT (IDCT). The DCT arranges the frequency content such that the lower frequencies appear at the top left of the block, while the higher frequencies appear at the bottom right. As the human eye is more sensitive to low frequencies, subsequent steps tend to discard the high-frequency values to achieve compression. Hence, the DCT helps separate more perceptible information from less perceptible information. The DCT converts a block of image pixels into a block of transform coefficients of the same dimension [3]. These DCT coefficients represent the original pixel values in the frequency domain. Any gray-scale 8×8 pixel block can be fully represented by a weighted sum of 64 DCT basis functions, where the weights are just the corresponding DCT coefficients [1, 2]. The two-dimensional DCT of an N×N pixel block is described in Eq. (1.1), where $f(j,k)$ is the pixel value at position $(j,k)$ and $F(u,v)$ is the transform coefficient at position $(u,v)$.

$$F(u,v) = \frac{2}{N}\,C(u)\,C(v)\sum_{j=0}^{N-1}\sum_{k=0}^{N-1} f(j,k)\,\cos\!\left[\frac{(2j+1)u\pi}{2N}\right]\cos\!\left[\frac{(2k+1)v\pi}{2N}\right] \tag{1.1}$$

where $C(u) = 1/\sqrt{2}$ for $u = 0$ and $C(u) = 1$ otherwise (and similarly for $C(v)$). The corresponding Inverse DCT is given by

$$f(j,k) = \frac{2}{N}\sum_{u=0}^{N-1}\sum_{v=0}^{N-1} C(u)\,C(v)\,F(u,v)\,\cos\!\left[\frac{(2j+1)u\pi}{2N}\right]\cos\!\left[\frac{(2k+1)v\pi}{2N}\right] \tag{1.2}$$

Using formulae (1.1) and (1.2), the forward and inverse transforms require a large number of floating-point computations. A simple calculation shows that a total of 64×64 = 4,096 multiply-accumulate operations are needed to transform a block of 8×8 pixels, since each of the 64 coefficients is a weighted sum over all 64 pixels.
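A direct implementation of Eqs. (1.1) and (1.2) makes this cost explicit. The following Python sketch (our own illustration; the function names are ours) mirrors the quadruple summation, so each 8×8 block costs 64×64 multiply-accumulates in each direction:

```python
import numpy as np

def C(x):
    """Normalization factor from Eqs. (1.1) and (1.2)."""
    return 1.0 / np.sqrt(2.0) if x == 0 else 1.0

def dct2_direct(f):
    """Direct evaluation of Eq. (1.1): N^2 coefficients, N^2 pixels each."""
    n = f.shape[0]
    F = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            s = 0.0
            for j in range(n):
                for k in range(n):
                    s += (f[j, k]
                          * np.cos((2 * j + 1) * u * np.pi / (2 * n))
                          * np.cos((2 * k + 1) * v * np.pi / (2 * n)))
            F[u, v] = (2.0 / n) * C(u) * C(v) * s
    return F

def idct2_direct(F):
    """Direct evaluation of Eq. (1.2); recovers the original pixels
    (up to floating-point error) since the DCT is fully reversible."""
    n = F.shape[0]
    f = np.zeros((n, n))
    for j in range(n):
        for k in range(n):
            s = 0.0
            for u in range(n):
                for v in range(n):
                    s += (C(u) * C(v) * F[u, v]
                          * np.cos((2 * j + 1) * u * np.pi / (2 * n))
                          * np.cos((2 * k + 1) * v * np.pi / (2 * n)))
            f[j, k] = (2.0 / n) * s
    return f
```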

It may be noted that the forward and inverse DCTs are separable, implying that the two-dimensional transform coefficients can be obtained by first applying a one-dimensional transform along the horizontal direction and then along the vertical direction. This reduces the number of computations required for each 8×8 block from 4,096 to 1,024: each pass applies eight 8-point 1-D transforms costing 8×8 = 64 operations each, and two passes give 2×8×64 = 1,024.
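Continuing the sketch above (it reuses NumPy and the C helper), the separable form applies a 1-D DCT to the rows and then to the columns; dct2_separable agrees with dct2_direct up to floating-point error:

```python
def dct1d(x):
    """N-point 1-D DCT with the orthonormal scaling sqrt(2/N) * C(u);
    composing two passes reproduces the (2/N) * C(u) * C(v) factor of Eq. (1.1)."""
    n = len(x)
    return np.array([
        np.sqrt(2.0 / n) * C(u)
        * sum(x[j] * np.cos((2 * j + 1) * u * np.pi / (2 * n)) for j in range(n))
        for u in range(n)
    ])

def dct2_separable(f):
    """Row transform followed by column transform: 2 * N * N^2 operations."""
    rows = np.apply_along_axis(dct1d, 1, f)     # 1-D DCT of each row
    return np.apply_along_axis(dct1d, 0, rows)  # then of each column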

1.1.2 Quantization

For a typical block in a photographic image, most of the high-frequency DCT coefficients will be nearly zero. On average, the DC coefficient and the other low-frequency coefficients have relatively large amplitudes. This is because, in an image of a smooth natural scene, most blocks contain little high-frequency content; in general, only a few of the DCT coefficients have significant values [1, 2].

The DCT coefficients are quantized so that the near-zero coefficients are set to zero and the remaining coefficients are represented with reduced precision. To quantize a coefficient, it is divided by the quantizer step size and the result is rounded to the nearest integer; larger quantizer step sizes therefore mean coarser quantization. Although this results in information loss, compression is achieved since most of the coefficient values in each block are now zero. Coarser quantization (i.e., a larger quantizer step size) gives higher compression and poorer decoded image quality.
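A minimal sketch of such a uniform quantizer (the function names are ours; practical codecs typically use frequency-dependent step sizes, which we omit here):

```python
import numpy as np

def quantize(F, step):
    """Divide each coefficient by the step size and round to the nearest
    integer; near-zero coefficients become exactly zero."""
    return np.rint(F / step).astype(np.int32)

def dequantize(levels, step):
    """Decoder side: rescaling yields only an approximation of the original
    coefficients -- the rounding above is where information is lost."""
    return levels.astype(np.float64) * step
```

With step = 16, for instance, a coefficient of 7 quantizes to 0 while one of 100 becomes level 6 and is reconstructed as 96; doubling the step zeroes more coefficients and coarsens those that survive.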

4 1 Introduction

1.1.3 Entropy Coding

After quantization, the nonzero coefficients are further encoded using an entropy coder such as a Huffman coder. In Huffman coding [4] (and in other entropy coding schemes [1, 2]), the more frequent values are represented with shorter codes and the less frequent values with longer codes. Zero coefficients can be efficiently encoded using run-length encoding: instead of transmitting all the zero values one by one, run-length coding simply transmits the length of the current run of zeros. The result is a compressed representation of the original image. To decode the image, the reverse procedure is carried out. First, the variable-length (entropy) codes are decoded to recover the quantized coefficients. These are then multiplied by the appropriate quantizer step size to obtain an approximation of the original DCT coefficients. These coefficients are put through the inverse DCT to recover pixel values in the spatial domain. The decoded pixel values will not be identical to the original image pixels, since a certain amount of information is lost during quantization.
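The zero-run idea can be sketched as follows (a deliberately simplified scheme of our own; actual standards combine runs and values into joint symbols before entropy coding):

```python
def run_length_encode(coeffs):
    """Encode a 1-D sequence of quantized coefficients as (zero_run, value)
    pairs; trailing zeros are left implicit (end of block)."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1          # extend the current run of zeros
        else:
            pairs.append((run, int(c)))
            run = 0
    return pairs

def run_length_decode(pairs, length):
    """Reverse procedure: re-expand the runs of zeros."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (length - len(out)))  # restore trailing zeros
    return out
```

For example, run_length_encode([35, 0, 0, -3, 0, 1, 0, 0, 0, 0]) yields [(0, 35), (2, -3), (1, 1)], and run_length_decode with length=10 restores the original sequence.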

A lossy DCT CODEC produces characteristic distortions due to the quantization process. These include "blocking" artifacts [5], where the block structure used by the encoder becomes apparent in the decoded image, and "mosquito noise" [5], where lines and edges in the image are surrounded by fine lines. DCT-based image coding systems can provide compression ratios ranging between 10 and 20 while maintaining reasonably good image quality. The actual efficiency depends to some extent on the image content: since images with considerable detail contain many nonzero high-frequency DCT coefficients, they need to be coded at higher rates than images with less detail.

Compression can be improved by increasing the quantization step size. In general, higher compression is obtained at the expense of poorer decoded image quality.

1.1.4 Motion Estimation and Compensation

Motion Estimation (ME) and Motion Compensation (MC) play an important role in video coding systems. Motion estimation has been adopted in many video coding standards, such as the MPEG-x and H.26x series [6]. In motion estimation and motion compensation, a previous or a future frame is used as the reference frame to predict the current frame [7]. Both frames are divided into blocks of fixed size, usually 16×16 pixels, known as Macroblocks (MBs). For each MB in the current frame, its motion is estimated by searching for the best-matched MB (within a search range) in a previously available reference frame.

The displacement between the MB in the current frame and the best-matched one in the reference frame is known as the Motion Vector (MV). Only the difference between the two MBs, together with the displacement describing the position of the reference MB, needs to be coded. The amount of coded data is therefore much less than the original, and a high compression ratio can be achieved. Moreover, modern coding standards like H.264 allow ME on blocks of variable size to achieve better prediction for objects with complex motion [1, 2].

As mentioned previously, the purpose of motion estimation is to search for the most similar reference block for a given block in the current frame and to find the corresponding MV. This process is repeated for all the blocks in the current frame, and a separate motion vector is found for each. Motion compensation, on the other hand, is concerned with reconstructing the current frame from the MVs obtained from the motion estimator, as the sketch below illustrates.
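A minimal full-search sketch under the common Sum of Absolute Differences (SAD) matching criterion (our own illustration, assuming grayscale frames stored as NumPy arrays; full_search and motion_compensate are hypothetical names):

```python
import numpy as np

def full_search(cur, ref, by, bx, bsize=16, srange=7):
    """Exhaustively test every displacement within +/- srange pixels and
    return the motion vector (dy, dx) minimizing the SAD."""
    h, w = ref.shape
    cur_blk = cur[by:by + bsize, bx:bx + bsize].astype(np.int64)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > h or x + bsize > w:
                continue  # candidate falls outside the reference frame
            ref_blk = ref[y:y + bsize, x:x + bsize].astype(np.int64)
            sad = int(np.abs(cur_blk - ref_blk).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

def motion_compensate(ref, by, bx, mv, bsize=16):
    """Motion compensation: fetch the predicted block the MV points to;
    the encoder then codes only the residual (current block - prediction)."""
    dy, dx = mv
    return ref[by + dy:by + dy + bsize, bx + dx:bx + dx + bsize]
```

With a ±7 search range, this tests 15×15 = 225 candidate positions per macroblock, which makes plain why motion estimation dominates encoder complexity.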

Motion estimation happens to be the most computationally expensive and resource-hungry operation in the entire video compression process. Hence, this field has seen intense activity and research interest over the past two decades.

Many fundamental block matching algorithms [7–21] are available in the literature, from the mid-1980s to recent years. Efficient motion estimation algorithms and architectures are therefore indeed necessary for the design of an efficient video encoder. The following chapters of this book present an in-depth discussion of some important block motion estimation algorithms and their hardware implementations.
