Lightweight real-time decoding of video on J2ME devices

Nabil Madrane, LIADS Labs, Faculty of Ain Chock, Casablanca, Morocco

Abderrahim Sekkaki, LIADS Labs, Faculty of Ain Chock, Casablanca, Morocco

Said Jai-Andaloussi, LIADS Labs, Faculty of Ain Chock, Casablanca, Morocco

Abstract—This paper presents a video coding and decoding technique based on vector quantization. The objective is to develop a lightweight and efficient video decoder able to decode several video streams simultaneously, in real time, on low-end hardware. We show how to lay out the output bitstream in order to facilitate decoding on highly constrained devices. Experiments show that it is possible to bring high-quality, full-screen rendering of concurrent video clips to devices that have very limited processing power.

I. INTRODUCTION

Most video decoding algorithms require significant resources in terms of CPU and memory. Even if today's "low-end" devices tend to have more and more processing capabilities, battery consumption is directly proportional to CPU usage, so it is important to minimize the energy required for decoding a video stream.

The field of video coding/decoding based on vector quantization has received special attention over the last few years [1]. This paper addresses the problem of "encoding for low complexity decoding" in the context of vector quantization.

Several proposals have emerged in the field of "encoding for low complexity decoding" [2], while interesting work is being conducted on lightweight coding [3].

Vector quantization codecs were popular in the early days of computer multimedia when common PCs were not yet fast enough for codecs that were based on methods such as transform coding (discrete cosine transform for example).

Today, computers are so much faster that vector quantization codecs have fallen out of favor compared to transform codecs, which compress far more efficiently.

However, a huge number of embedded and personal devices have a Java virtual machine (Java Platform, Micro Edition, or Java ME) which is not powerful enough to decode a video in real time in pure software. On these Java-enabled devices, playing a video stream is performed by the underlying operating system, which in turn may rely on hardware acceleration.

A pure J2ME video decoder could solve one of the major problems for developers and service providers: the "fragmentation" of the handset market makes it almost impossible to have a universal video format that can be decoded anywhere, even on the most constrained device. And even if the J2ME specification proposes an API for playing video (in the form of a JSR, Java Specification Request), it is very limited by the fact that the manufacturer's implementation is often incomplete. In particular, features such as HTTP or RTSP streaming are missing on most mass-market devices.

II. OBJECTIVES

The encoding/decoding scheme presented in this article focuses on providing the following features:

• No floating-point operations for decoding video. Vector quantization is a powerful technique when applied to video compression, because video can be decoded and rendered by using lookup indexes (i.e. integer numbers) into a predefined codebook.

• Ability to decode several video streams at the same time. The consequence of this feature is that we can create multi-angle videos, and give the end user the possibility to instantly switch between two or more angles. Under the hood, all streams are decoded at the same time but only one of them is rendered on the user's screen. Switching from one stream to another can thus be achieved without any delay. Sports events and surveillance are typical sample applications.

• Streaming-capable bitstream. The output bitstream generated by our encoding scheme is optimized for streaming applications.

III. ENCODING ALGORITHM

The first step of the encoding process is the extraction of structure information by segmenting the video clip into scenes. Then, each scene is processed separately by a clustering algorithm so that we can determine a codebook of representative image blocks for the whole scene. Finally, we encode each scene as a series of indexes, each index pointing to a given block of the codebook.

A number of techniques are used at each level of this process in order to ensure a good compression ratio, while keeping the decoding phase fast and lightweight.

A. Scene cut detection

The first step of the encoding process is the segmentation of the video into scenes. A number of approaches have been proposed for effective detection of scene cuts ([4] and [5]). Our algorithm compares the intensity histograms of two consecutive images and applies an adaptive threshold, which allows us to detect different kinds of scene transitions (sharp scene cuts but also fade-in/fade-out transitions). The distance between two consecutive frames f and g, with histograms $H_f$ and $H_g$ respectively, is:

$$d(H_f, H_g) = \sqrt{\sum_{i=0}^{255} \left(H_f(i) - H_g(i)\right)^2}$$

At the end of the segmentation process, the resulting set of scenes is denoted $S^i$, and the total number of images in scene $S^i$ is denoted $L^i$ (the length of the scene).
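
The histogram distance of this section can be illustrated with a minimal sketch. The paper does not specify its adaptive threshold, so the rule used here (a cut is declared when the distance exceeds `factor` times the mean of the last `window` distances, with a floor `min_distance`) is only an assumption for illustration, as are all function and parameter names. This simple rule targets sharp cuts and does not attempt to handle fades.

```python
import math

def intensity_histogram(frame):
    """256-bin histogram of a grayscale frame given as rows of 0..255 values."""
    hist = [0] * 256
    for row in frame:
        for value in row:
            hist[value] += 1
    return hist

def histogram_distance(hf, hg):
    """Euclidean distance between two histograms (the distance d defined above)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hf, hg)))

def detect_cuts(frames, window=5, factor=3.0, min_distance=1.0):
    """Return the indexes of the frames that start a new scene.

    Assumed rule (not from the paper): frame t starts a new scene when
    d(t-1, t) > max(min_distance, factor * mean of recent distances).
    """
    cuts, recent = [], []
    prev = intensity_histogram(frames[0])
    for t in range(1, len(frames)):
        cur = intensity_histogram(frames[t])
        d = histogram_distance(prev, cur)
        baseline = sum(recent) / len(recent) if recent else 0.0
        if d > max(min_distance, factor * baseline):
            cuts.append(t)
            recent = []          # restart the statistics for the new scene
        else:
            recent = (recent + [d])[-window:]
        prev = cur
    return cuts
```

Each returned index marks the first frame of a new scene, so consecutive indexes delimit the scenes $S^i$ and their lengths $L^i$.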

B. Blocks extraction

Vector quantization is performed in an n-dimensional space. Each point in this space represents an image block of b × b pixels (typical values for b are 2 and 3). If the video size is W × H pixels, then there are $L^i \times N \times M$ blocks of pixels in scene $S^i$, where $N = \frac{H}{b}$ and $M = \frac{W}{b}$.

Fig. 1. Image blocks

If we work in the RGB space, then each point of the n-dimensional space has n = b × b × 3 components. This is the most naive approach, where each vector of the n-dimensional space represents the raw RGB data of the pixels of a b × b block.
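
As an illustration of this naive RGB representation, the sketch below cuts one frame into b × b blocks and flattens each block into a vector of b × b × 3 integers. The frame layout (rows of `(r, g, b)` tuples) and the function name are assumptions made for this example, and the sketch assumes that W and H are multiples of b.

```python
def extract_blocks(frame, b):
    """Split an RGB frame into b x b blocks, in raster order.

    `frame` is a list of H rows, each a list of W (r, g, b) tuples.
    Each block is returned as a flat list of b*b*3 integers.
    Assumption of this sketch: H and W are multiples of b.
    """
    height, width = len(frame), len(frame[0])
    n_rows, n_cols = height // b, width // b   # N and M in the text
    blocks = []
    for by in range(n_rows):
        for bx in range(n_cols):
            vec = []
            for y in range(by * b, by * b + b):
                for x in range(bx * b, bx * b + b):
                    vec.extend(frame[y][x])    # append r, g, b of one pixel
            blocks.append(vec)
    return blocks
```

Applying this function to every image of a scene yields the $L^i \times N \times M$ vectors that are passed to the clustering step.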

C. Clustering

The next step of the encoding process is the clustering. Since our main objective is to focus on how to encode for lightweight decoding, we have used standard techniques for clustering. In particular, we implemented a variant of K-Means in order to cluster the set of blocks extracted from all the images of scene $S^i$. Dimension reduction, prior to clustering, is performed with Principal Component Analysis (PCA).

The number of clusters is fixed to K, and the objective is to find the cluster centers and assign each block to its nearest cluster center, such that the sum of squared distances between the blocks and their cluster centers is minimized. For a given scene $S^i$, the set of resulting clusters is denoted $C_j^i$ with $j = 1 \ldots K$.

A Best Bin First algorithm is used in order to efficiently find an approximate solution to the nearest neighbor search problem.
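
The clustering objective can be sketched with plain Lloyd-style K-Means iterations. This is not the variant used in the paper: the sketch omits the PCA step, replaces Best Bin First with an exhaustive nearest-center search, and uses a random initialization; all names and default parameters are assumptions chosen for illustration.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_center(vec, centers):
    """Exhaustive nearest-center search (stands in for Best Bin First)."""
    best, best_d = 0, squared_distance(vec, centers[0])
    for j in range(1, len(centers)):
        d = squared_distance(vec, centers[j])
        if d < best_d:
            best, best_d = j, d
    return best

def kmeans(vectors, k, iterations=20, seed=0):
    """Return (centers, assignment) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        assignment = [nearest_center(v, centers) for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                # integer mean, so that codebook entries stay integer pixels
                centers[j] = [sum(m[d] for m in members) // len(members)
                              for d in range(dim)]
    assignment = [nearest_center(v, centers) for v in vectors]
    return centers, assignment
```

The returned `centers` play the role of the codebook of the scene, and `assignment` gives the codebook index of every block.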

D. Codebook and indexes encoding

The codebook for a given scene $S^i$ consists of the centers of the clusters, i.e. K blocks, each block having a size of b × b pixels. Typical values of K that we have experimented with are 1024, 2048 and 4096. They are all of the form $2^p$, so that no bit pattern is wasted when the indexes are binary encoded.
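
Because K is a power of two, every index fits in exactly $\log_2 K$ bits. The following sketch packs a list of indexes into such fixed-width fields and reads them back using integer shifts and masks only. The bit order (most significant bit first) and the zero padding of the last byte are assumptions of this example, not details given in the paper.

```python
def pack_indexes(indexes, k):
    """Pack each index into exactly k bits, most significant bit first."""
    acc, nbits, out = 0, 0, bytearray()
    for index in indexes:
        acc = (acc << k) | index
        nbits += k
        while nbits >= 8:                 # flush every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1           # keep only the pending bits
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)

def unpack_indexes(data, k, count):
    """Read back `count` indexes of k bits each."""
    acc, nbits, out = 0, 0, []
    mask = (1 << k) - 1
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= k and len(out) < count:
            nbits -= k
            out.append((acc >> nbits) & mask)
        acc &= (1 << nbits) - 1
    return out
```

With K = 1024, for instance, each index takes 10 bits, so five indexes occupy 50 bits and are stored in 7 bytes.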

E. Output bitstream

1) Encoded frames: Two kinds of frames are defined: I (integral) frames and D (differential) frames. An encoded scene $S^i$ contains one I frame and several D frames.

• Integral frames contain the indexes of all the blocks. Each integral frame I thus consists of M × N indexes, and each index is encoded on k bits, where $k = \log_2 K$.

• Differential frames contain only the indexes of the blocks that have changed between the previous frame and the current frame. Two blocks at the same position in frames j and j+1 are considered different if $\mathrm{dist}(B_j, B_{j+1}) > \sigma$, where σ is estimated on a database of various types of video streams. The positions of the blocks that have changed are encoded in a bitmask of N × M bits, where bitmask(z) = 1 if the z-th block has changed between frames j and j+1. Then, each changed block is represented by its index (encoded on k bits) in the codebook.
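
The D frame layout can be sketched as follows. The paper does not say how dist is computed, so this example assumes a squared Euclidean distance between codebook entries; the function names and the list-based representation of the bitmask are likewise illustrative assumptions.

```python
def block_distance(u, v):
    """Assumed distance: squared Euclidean distance between two blocks."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode_d_frame(prev_indexes, cur_indexes, codebook, sigma):
    """Return (bitmask, changed) describing the current frame.

    bitmask[z] is 1 when block z differs from the previous frame by more
    than sigma; `changed` lists the new indexes of those blocks in order.
    """
    bitmask, changed = [], []
    for old, new in zip(prev_indexes, cur_indexes):
        if block_distance(codebook[old], codebook[new]) > sigma:
            bitmask.append(1)
            changed.append(new)
        else:
            bitmask.append(0)
    return bitmask, changed

def decode_d_frame(prev_indexes, bitmask, changed):
    """Rebuild the frame indexes from the previous frame and a D frame."""
    it = iter(changed)
    return [next(it) if bit else old
            for old, bit in zip(prev_indexes, bitmask)]

def d_frame_bits(bitmask, changed, k):
    """Size of a D frame: one mask bit per block plus k bits per change."""
    return len(bitmask) + len(changed) * k
```

The decoder only performs comparisons and copies of integers; blocks whose change is below σ simply keep their previous index.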

Fig. 2. Representation of an encoded scene $S^i$

2) Interleaving: The encoding represented in Figure 2 is not suitable for real-time decoding: when the last D frame of scene $S^i$ has been decoded and rendered, it takes a significant time to read the codebook of scene $S^{i+1}$ before actually rendering the first I frame of $S^{i+1}$. This introduces a small delay in decoding and prevents smooth playback of the video.

We solve this problem by interleaving the codebook of $S^{i+1}$ inside the bitstream of $S^i$, as shown in Figure 3.

Since the decoding of a D frame is very fast, we insert a portion of the codebook of $S^{i+1}$ between consecutive D frames of $S^i$. The codebook is evenly distributed among the $L^i$ frames of $S^i$.
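
A minimal sketch of this interleaving is given below. It assumes that the next codebook is available as a byte string and that one chunk is emitted after each frame of the current scene; the tuple-based stream representation and the function names are hypothetical and do not describe the actual bitstream syntax.

```python
def split_evenly(codebook_bytes, parts):
    """Cut the codebook into `parts` chunks whose sizes differ by at most one byte."""
    base, extra = divmod(len(codebook_bytes), parts)
    chunks, pos = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(codebook_bytes[pos:pos + size])
        pos += size
    return chunks

def interleave(frames, next_codebook):
    """Emit frame, chunk, frame, chunk, ... for one scene.

    `frames` holds the encoded frames of the current scene (I frame first),
    `next_codebook` the serialized codebook of the following scene.
    """
    chunks = split_evenly(next_codebook, len(frames))
    stream = []
    for frame, chunk in zip(frames, chunks):
        stream.append(("frame", frame))
        stream.append(("codebook", chunk))
    return stream
```

By the time the last frame of the scene has been rendered, the decoder has received every chunk and can start the next scene without a pause.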

Fig. 3. Representation of an encoded scene $S^i$

3) Well-known blocks: It is possible to enhance the compression ratio by using what we call "well-known blocks" (WKB). A WKB is a block with a known pattern. For example, a block filled with a single color (completely black or white, say) is a well-known block. A block with a color gradient is also a WKB, so we can define 4 additional well-known blocks corresponding to the following gradient schemes:

• From WEST to EAST

• From NORTH to SOUTH

• From NORTH-WEST to SOUTH-EAST

• From NORTH-EAST to SOUTH-WEST

Because the decoder has a built-in codebook of these 5 well-known blocks, they do not have to be included in the output bitstream. However, using well-known blocks implies that each codebook index must be prefixed with an additional bit (flag bit) so that we can differentiate between "normal" codebook indexes (which point to codebook entries and are encoded on k bits) and "WKB" codebook indexes (which point to well-known blocks and are encoded on 3 bits, since we have 5 different types of WKBs). Furthermore, each WKB must be parameterized with either one RGB color (for plain WKBs) or two colors (for gradient-based WKBs).

This yields additional compression in the overall encoding algorithm: instead of storing 1 bit (for the flag) + b × b × 24 bits for a normal RGB codebook entry, a WKB codebook entry uses only 1 bit (for the flag) + 3 bits (index of the WKB) + 2 × 24 bits (for the gradient colors).

Depending on how many WKBs are detected in the video clip, we decide if it is advantageous or not to use this additional compression technique.
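
The bit counts quoted above can be checked with a few lines of integer arithmetic. The helper names are hypothetical, and the sketch only compares entry sizes; it does not model the extra flag bit added to every ordinary index, which is the cost that must be weighed when deciding whether WKBs are worthwhile for a given clip.

```python
FLAG_BITS = 1        # distinguishes normal entries from well-known blocks
WKB_INDEX_BITS = 3   # enough to address the 5 well-known blocks
COLOR_BITS = 24      # one RGB color

def normal_entry_bits(b):
    """Flag plus raw RGB data of a b x b block."""
    return FLAG_BITS + b * b * COLOR_BITS

def wkb_entry_bits(n_colors):
    """Flag, WKB index, and one (plain) or two (gradient) colors."""
    return FLAG_BITS + WKB_INDEX_BITS + n_colors * COLOR_BITS

def total_saving_bits(b, n_plain, n_gradient):
    """Bits saved when n_plain + n_gradient entries become WKBs."""
    normal = normal_entry_bits(b)
    return (n_plain * (normal - wkb_entry_bits(1))
            + n_gradient * (normal - wkb_entry_bits(2)))
```

For b = 2, a normal entry takes 97 bits, a gradient WKB 52 bits and a plain WKB 28 bits; for b = 3, a normal entry takes 217 bits, so the saving per WKB grows with the block size.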

IV. EXPERIMENTS

A. Video material

Three sets of video sequences are used for our experiments. The first consists of a single CIF (352×288) video clip called DANCE. The second consists of a single CIF (352×288) video clip called CARTOON. The third, called FOOTBALL, is a multi-angle video made of 2 QCIF (176×144) video clips recorded by 2 cameras at different angles.

B. Software implementation

The compiled decoder has a footprint of about 8 KB and, as expected, uses only lookup operations and copies of blocks of pixels to decode the video.
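
To make the "lookup and copy" nature of the decoder concrete, the sketch below rebuilds a frame from its block indexes. It is written in Python for brevity rather than as J2ME code, and it assumes that each codebook entry is stored as b × b packed 0xRRGGBB integers in row-major order and that the framebuffer is a flat list of such integers; these layouts and the function name are illustrative choices, not the paper's implementation.

```python
def render_frame(indexes, codebook, n_rows, n_cols, b):
    """Build a flat framebuffer of (n_rows*b) x (n_cols*b) packed pixels.

    indexes  : one codebook index per block, in raster order
    codebook : list of blocks, each a flat list of b*b packed pixels
    """
    width = n_cols * b
    framebuffer = [0] * (n_rows * b * width)
    for z, index in enumerate(indexes):
        block = codebook[index]                      # table lookup
        top, left = (z // n_cols) * b, (z % n_cols) * b
        for row in range(b):                         # copy b rows of b pixels
            start = (top + row) * width + left
            framebuffer[start:start + b] = block[row * b:(row + 1) * b]
    return framebuffer
```

No multiplication of pixel values and no floating-point operation is involved: the only arithmetic is the integer computation of copy offsets.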

Fig. 4. Samples used for the experiments

C. Target hardware

In order to validate our approach, we have chosen to run our tests on a mass-market phone, the Nokia 6300, which is compliant with MIDP 2.0 and CLDC 1.1. Java applications running on that phone can only use up to 1 MB, so this phone is a good candidate to determine how many full-screen video streams can be decoded at the same time on such hardware. It is important to note that we do not benefit from any hardware acceleration: the decoding process is done entirely in software, with no floating-point operations and without any help from the GPU of the phone.

D. Compression ratio

Obviously, the compression ratio depends on the type of video. Long scenes give a good compression ratio since the same codebook is used for a longer sequence of the video.

TABLE I. COMPRESSION RATIO

Video clip   Size      Average scene length   Encoded bitrate
DANCE        352×288   1.5 s                  538 kbit/s
CARTOON      352×288   3 s                    415 kbit/s
FOOTBALL     176×144   2 s                    300 kbit/s

E. Simultaneous decoding

We achieved real-time decoding of 3 video streams at the same time on the Nokia 6300 (see Figure 5). The upper video is the DANCE sample. The lower video is the FOOTBALL sample, which is a two-angle video stream. Internally, 3 streams are decoded at the same time and the user can instantly switch between the 2 camera angles on the lower video.
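
The instant switch between angles follows from keeping the decoded state of every stream up to date, whether or not it is displayed. The following sketch illustrates this idea with a hypothetical `MultiAngleDecoder` class that tracks the block indexes of each stream; the class, its methods and the list-based D frame representation are assumptions made for this example.

```python
class MultiAngleDecoder:
    """Decodes all streams but exposes only the active one for rendering."""

    def __init__(self, n_streams):
        self.state = [None] * n_streams   # current block indexes per stream
        self.active = 0                   # stream currently shown on screen

    def feed_i_frame(self, stream, indexes):
        self.state[stream] = list(indexes)

    def feed_d_frame(self, stream, bitmask, changed):
        """Apply a D frame: replace the indexes flagged in the bitmask."""
        it = iter(changed)
        current = self.state[stream]
        for z, bit in enumerate(bitmask):
            if bit:
                current[z] = next(it)

    def switch(self, stream):
        self.active = stream              # no re-decoding is needed

    def visible_indexes(self):
        return self.state[self.active]
```

Since the hidden stream has already received every D frame, `switch` only changes which state is rendered, so the new angle appears on the very next frame.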

F. HTTP streaming

The particular format of our output bitstream allows us to efficiently stream the video data over HTTP. As previously mentioned, interleaving the codebooks allows a smooth transition between scenes.

V. CONCLUSION

In this paper we have shown how to efficiently encode a video by using vector quantization techniques so that we can achieve lightweight decoding in pure software. This enables new features at the decoding stage, such as concurrent decoding (for multi-angle clips) and HTTP streaming capability, together with a good compression ratio thanks to several techniques used at each stage of the encoder.

Fig. 5. Real-time decoding of 3 streams at the same time: 1 for the upper video and 2 for the lower video (two-angle video stream)

The visual quality can be significantly enhanced by using better vector quantization techniques, but at the cost of an increased encoding time. We are currently working on this problem.

REFERENCES

[1] J. E. Fowler Jr., K. C. Adkins, S. B. Bibyk, and S. C. Ahalt, Real-time video compression using differential vector quantization, IEEE Transactions on Circuits and Systems for Video Technology, 1995, DOI: 10.1109/76.350774. Video Compression Using Multiwavelet and Multistage Vector Quantization, 2007.

[2] K. Ugur, J. Lainela, A. Hallapuro, and M. Gabbouj, Generating H.264/AVC compliant bitstreams for lightweight decoding operation suitable for mobile multimedia systems, ICASSP, 2006.

[3] M. Wu, G. Hua, and C. Wen Chen, Syndrome-based light-weight video coding for mobile wireless application, ICME, 2006.

[4] Z. Cernekova, C. Nikou, and I. Pitas, Shot detection in video sequences using entropy-based metrics, Proc. 2002 IEEE Int. Conf. Image Processing, Rochester, N.Y., USA, 22-25 September 2002.

[5] M. S. Drew, Z.-N. Li, and X. Zhong, Video dissolve and wipe detection via spatio-temporal images of chromatic histogram differences, Proc. 2000 IEEE Int. Conf. on Image Processing, 2000, vol. 3, pp. 929-932.