
1 Raster images


A vector image comprises data describing a set of geometric primitives, each of which is associated with grey or colour values.

A process of interpretation – rasterizing, or raster image processing, or ripping – is necessary to convert a vector image to a raster. Vector suggests a straight line, but paradoxically, “vector” images commonly contain primitives describing curves.

A digital image is represented by a rectangular array (matrix) of picture elements (pels, or pixels). Pixel arrays of several image standards are sketched in Figure 1.1. In a greyscale system each pixel comprises a single component whose value is related to what is loosely called brightness. In a colour system each pixel comprises several components – usually three – whose values are closely related to human colour perception.

Historically, a video image was acquired at the camera, conveyed through the channel, and displayed using analog scanning; there was no explicit pixel array.

Modern cameras and modern displays directly represent the discrete elements of an image array having fixed structure. Signal processing at the camera, in the pipeline, or at the display may perform spatial and/or temporal resampling to adapt to different formats.

Figure 1.1 Pixel arrays of several imaging standards are shown, with their counts of image columns and rows. The 640×480 square-sampled structure common in computing is included; however, studio and consumer 480i standards are sampled 704×480 or 720×480 with nonsquare sampling.

In art, the frame surrounds the picture; in video, the frame is the picture.

The pixel array for one image is a frame. In video, digital memory used to store one image is called a framestore; in computing, it’s a framebuffer. The total pixel count in an image is the number of image columns NC (or in video, samples per active line, SAL) times the number of image rows NR (or active lines, LA). The total pixel count is usually expressed in megapixels (Mpx).
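As a quick worked example of the column-count times row-count relationship, a minimal sketch (the two rasters are taken from formats mentioned in this chapter; any raster would do):

```python
# Total pixel count is columns (N_C) times rows (N_R), reported in megapixels.
formats = {
    "480i (720x480)": (720, 480),
    "HD (1920x1080)": (1920, 1080),
}
for name, (n_c, n_r) in formats.items():
    pixels = n_c * n_r
    print(f"{name}: {pixels} px = {pixels / 1e6:.2f} Mpx")
```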

In video and in computing, a pixel comprises the set of all components necessary to represent colour (typically red, green, and blue). In the mosaic sensors typical of digital still cameras (DSCs) a pixel is any colour component individually; the process of demosaicking interpolates the missing components to create a fully populated image array. In digital cinema cameras the DSC interpretation of pixel is used; however, in a digital cinema projector, a pixel is a triad.

A computer enthusiast refers to the image column and row counts (width×height) as resolution. An image engineer reserves the term resolution for the image detail that is acquired, conveyed, and/or delivered. Pixel count imposes an upper limit to the image detail; however, many other factors are involved.

The value of each pixel component represents brightness and colour in a small region surrounding the corresponding point in the sampling lattice.

Pixel component values are quantized, typically to an integer value that occupies between 1 and 16 bits – and often 8 or 10 bits – of digital storage. The number of bits per component, or per pixel, is called the bit depth. (We use bit depth instead of width to avoid confusion: The term width refers to the entire picture.)
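To get a sense of what bit depth means for storage, a minimal sketch (three components per pixel and tight packing are assumed; the 1920×1080 raster is just an example):

```python
# Uncompressed size of one frame: pixels x components x bits per component.
N_C, N_R, COMPONENTS = 1920, 1080, 3

for bit_depth in (8, 10, 12):
    megabytes = N_C * N_R * COMPONENTS * bit_depth / 8 / 1e6
    print(f"{bit_depth} bits per component: {megabytes:.1f} MB per frame")
```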

Aspect ratio

Aspect ratio is simply the ratio of an image’s width to its height. Standard aspect ratios for film and video are sketched, to scale, in Figure 1.2. What I call simply aspect ratio is sometimes called display aspect ratio (DAR) or picture aspect ratio (PAR). Standard-definition (SD) television has an aspect ratio of 4:3.

Figure 1.2 Aspect ratios of video, HD, and film are compared (SD video is 4:3, that is, 1.33:1; 35 mm still film is 3:2).

Aspect ratio is properly written width:height (not height:width).

Conversion among aspect ratios is fraught with difficulty.

Equation 1.1 relates picture and sample aspect ratios.

To assign n square-sampled pixels to a picture having aspect ratio AR, choose image column and image row counts (c and r, respectively) according to Equation 1.2: c = √(n·AR) and r = √(n/AR).
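Equations 1.1 and 1.2 did not survive extraction intact here, but the relationships they express follow directly from the definitions above; a minimal sketch (the function names are mine):

```python
import math

def picture_aspect_ratio(n_c, n_r, sample_ar=1.0):
    """Picture aspect ratio from column and row counts and the sample
    (pixel) aspect ratio: AR = (N_C / N_R) * sample aspect ratio."""
    return (n_c / n_r) * sample_ar

def square_pixel_counts(n, ar):
    """Columns and rows that assign n square-sampled pixels to a picture
    of aspect ratio ar: c = sqrt(n * ar), r = sqrt(n / ar)."""
    return round(math.sqrt(n * ar)), round(math.sqrt(n / ar))

print(picture_aspect_ratio(1920, 1080))        # 1.777..., i.e., 16:9
print(square_pixel_counts(2_000_000, 16 / 9))  # about (1886, 1061)
```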

Cinema film commonly uses 1.85:1 (which for historical reasons is called either flat or spherical), or 2.4:1 (“CinemaScope,” or colloquially, ’scope). Many films are 1.85:1, but “blockbusters” are usually 2.4:1. Film at 2.4:1 aspect ratio was historically acquired using an aspherical lens that squeezes the horizontal dimension of the image by a factor of two. The projector is equipped with a similar lens, to restore the horizontal dimension of the projected image. The lens and the technique are called anamorphic. In principle, an anamorphic lens can have any ratio; in practice, a ratio of exactly two is ubiquitous in cinema.

Widescreen refers to an aspect ratio wider than 4:3.

High-definition (HD) television is standardized with an aspect ratio of 16:9. In video, the term anamorphic usually refers to a 16:9 widescreen variant of a base video standard, where the horizontal dimension of the 16:9 image occupies the same width as the 4:3 aspect ratio standard. Consumer electronic equipment rarely recovers the correct aspect ratio of such conversions (as we will explore later in the chapter).
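A minimal sketch of the sample-aspect-ratio arithmetic behind such anamorphic SD (the 720×480 raster is used as the example; exact active-image conventions vary, so treat the numbers as illustrative):

```python
# The same 720x480 raster can carry a 4:3 or a 16:9 ("anamorphic") picture;
# only the sample (pixel) aspect ratio changes: sample AR = picture AR / (N_C / N_R).
N_C, N_R = 720, 480

for picture_ar in (4 / 3, 16 / 9):
    sample_ar = picture_ar / (N_C / N_R)
    print(f"picture {picture_ar:.3f}:1 -> sample aspect ratio {sample_ar:.3f}")

# A display that ignores the wider sample aspect ratio shows the 16:9
# programme horizontally squeezed into a 4:3 frame.
```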

HD is standardized with an aspect ratio of 16:9 (about 1.78:1), fairly close to the 1.85:1 ordinary movie aspect ratio. Figure 1.3 below illustrates the origin of the 16:9 aspect ratio. Through a numerological coincidence apparently first revealed by Kerns Powers, the geometric mean of 4:3 (the standard aspect ratio of conventional television) and 2.4 (the aspect ratio of a CinemaScope movie) is very close – within a fraction of a percent – to 16:9. (The calculation is shown in the lower right corner of the figure.) A choice of 16:9 for HD meant that SD, HD, and CinemaScope shared the same “image circle”: 16:9 was a compromise between the vertical cropping required for SD and the horizontal cropping required for CinemaScope.

In Europe and Asia, 1.66:1 was the historical standard for cinema, though 1.85:1 is increasingly used owing to the worldwide market for entertainment imagery.

FHA: Full-height anamorphic

Schubin, Mark (1996), “Searching for the perfect aspect ratio,” SMPTE Journal 105 (8): 460–478 (Aug.).

Figure 1.3 The choice of 16:9 aspect ratio for HD came about because 16:9 is very close to the geometric mean of the 4:3 picture aspect ratio of conventional television and the 2.4:1 picture aspect ratio of CinemaScope movies.
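A quick numerical check of the geometric-mean coincidence (the same calculation shown in the lower right of Figure 1.3):

```python
import math

# Geometric mean of the SD (4:3) and CinemaScope (2.4:1) aspect ratios.
mean = math.sqrt((4 / 3) * 2.4)
print(f"geometric mean = {mean:.4f}; 16:9 = {16 / 9:.4f}")   # 1.7889 vs 1.7778
```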

Geometry

In mathematics, coordinate values of the (two-dimensional) plane range both positive and negative. The plane is thereby divided into four quadrants (see Figure 1.4). Quadrants are denoted by Roman numerals in the counterclockwise direction. In the continuous image plane, locations are described using Cartesian coordinates [x, y] – the first coordinate is associated with the horizontal direction, the second with the vertical. When both x and y are positive, the location is in the first quadrant (quadrant I). In image science, the image lies in this quadrant. (Adobe’s PostScript system uses first-quadrant coordinates.)

In matrix indexing, axis ordering is reversed from Cartesian coordinates: A matrix is indexed by row then column. The top row of a matrix has the smallest index, so matrix indices lie in quadrant IV. In mathematics, matrix elements are ordinarily identified using 1-origin indexing. Some image processing software packages use 1-origin indexing – in particular, Matlab and Mathematica, both of which have deep roots in mathematics.

The scan line order of conventional video and image processing usually adheres to the matrix convention, but with zero-origin indexing: Rows and columns are usually numbered [r, c] from [0, 0] at the top left. In other words, the image lies in quadrant IV (eliding the negative sign on the y-coordinate), ordinarily with zero-origin indexing.

Digital image sampling structures are denoted width×height. For example, a 1920×1080 system has columns numbered 0 through 1919 and rows (historically, “picture lines”) numbered 0 through 1079.
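A minimal sketch of the raster indexing convention just described (zero-origin [row, column], row 0 at the top left):

```python
# A 1920x1080 raster: columns 0..1919, rows 0..1079, indexed [row, column].
N_C, N_R = 1920, 1080

top_left = (0, 0)
bottom_right = (N_R - 1, N_C - 1)
print(top_left, bottom_right)          # (0, 0) (1079, 1919)

# Scan-line traversal: rows outermost (top to bottom), columns innermost.
for r in range(N_R):
    for c in range(N_C):
        pass                           # process pixel [r, c]
```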

Figure 1.4 Cartesian coordinates [x, y] define four quadrants. Quadrant I contains points having positive x and y values. Coordinates in quadrant I are used in some imaging systems. Quadrant IV contains points having positive x and negative y. Raster image coordinates are ordinarily represented with image row numbers increasing down the height of the image – that is, in quadrant IV, but omitting the negative sign on the y values.


Image capture

In human vision, the three-dimensional world is imaged by the lens of the eye onto the retina, which is populated with photoreceptor cells that respond to light having wavelengths ranging from about 400 nm to 700 nm. In video and in film, we build a camera having a lens and a photosensitive device, to mimic how the world is perceived by vision. Although the shape of the retina is roughly a section of a sphere, it is topologically two dimensional. In a camera, for practical reasons, we employ a flat image plane, sketched in Figure 1.5 above, instead of a section of a sphere. Image science involves analyzing the continuous distribution of optical power that is incident on the image plane.

Digitization

Signals captured from the physical world are translated into digital form by digitization, which involves two processes: sampling (in time or space) and quantization (in amplitude), sketched in Figure 1.6 below. The operations may take place in either order, though sampling usually precedes quantization.

Figure 1.5 Scene, lens, image plane

Figure 1.6 Digitization comprises sampling and quantization, in either order.

Sampling density, expressed in units such as pixels per inch (ppi), relates to resolution. Quantization relates to the number of bits per pixel (bpp) or bits per component/channel (bpc). Total data rate or data capacity depends upon the product of these two factors.
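A minimal sketch of that product relationship (the scan size and parameters are illustrative examples, not values from the text):

```python
# Data capacity = (number of samples) x (bits per sample).
width_in, height_in, ppi = 8, 10, 300      # an 8x10-inch scan at 300 ppi
bpc, components = 8, 3                     # 8 bits per component, RGB

pixels = (width_in * ppi) * (height_in * ppi)
megabytes = pixels * components * bpc / 8 / 1e6
print(f"{pixels / 1e6:.1f} Mpx, {megabytes:.1f} MB uncompressed")   # 7.2 Mpx, 21.6 MB
```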


Quantization

Quantization assigns an integer to signal amplitude at an instant of time or a point in space, as I will explain in Quantization, on page 37. Virtually all image exchange standards – TIFF, JPEG, SD, HD, MPEG, H.264 – involve pixel values that are not proportional to light power in the scene or at the display: With respect to light power, pixel values in these systems are nonlinearly quantized.
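As a minimal illustration of quantization on its own (a plain uniform quantizer; as noted above, real image-exchange standards apply a nonlinear transfer function with respect to light power before this step):

```python
def quantize(value, bit_depth=8):
    """Uniformly quantize a signal value in [0.0, 1.0] to an integer code."""
    max_code = (1 << bit_depth) - 1            # 255 for 8 bits, 1023 for 10
    return round(min(max(value, 0.0), 1.0) * max_code)

print(quantize(0.18), quantize(0.18, bit_depth=10))   # 46 184
```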

1-D sampling

A continuous one-dimensional function of time, such as audio sound pressure level, is sampled by forming a series of discrete values, each of which is a function of the distribution of a physical quantity (such as intensity) across a small interval of time. Uniform sampling, where the time intervals are of equal duration, is nearly always used. (Details will be presented in Filtering and sampling, on page 191.)

2-D sampling

A continuous two-dimensional function of space is sampled by assigning, to each element of the image matrix, a value that is a function of the distribution of intensity over a small region of space. In digital video and in conventional image processing, the samples lie on a regular, rectangular grid.
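A minimal sketch of 2-D sampling onto a regular rectangular grid (the synthetic scene function is mine, chosen only for illustration; strictly, each value should reflect the distribution of intensity over a small region rather than a single point, as the text notes):

```python
import math

def scene(x, y):
    """A continuous 2-D function standing in for optical power on the image plane."""
    return 0.5 + 0.5 * math.cos(2 * math.pi * x) * math.cos(2 * math.pi * y)

# Sample onto a regular rectangular grid, one value per grid point,
# stored in matrix order: image[row][column].
N_C, N_R = 8, 6
image = [[scene(c / N_C, r / N_R) for c in range(N_C)] for r in range(N_R)]
print(image[0][:4])
```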

Analog video was not sampled horizontally; however, it was sampled vertically by scanning and sampled temporally at the frame rate. Historically, samples were not necessarily digital: CCD and CMOS image sensors are inherently sampled, but they are not inherently quantized. (On-chip analog-to-digital conversion is now common in CMOS sensors.) In practice, though, sampling and quantization generally go together.

Perceptual uniformity

A perceptual quantity is encoded in a perceptually uniform manner if a small perturbation to the coded value is approximately equally perceptible across the range of that value. Consider the volume control on your radio. If it were physically linear, the roughly logarithmic nature of loudness perception would place most of the perceptual “action” of the control at the bottom of its range. Instead, the control is designed to be perceptually uniform. Figure 1.7 shows the transfer function of a potentiometer with standard audio taper:

Angle of rotation is mapped to sound pressure level such that rotating the knob 10 degrees produces a similar perceptual increment in volume across the range of the control. This is one of many examples of perceptual considerations built into the engineering of electronic systems. (For another example, see Figure 1.8.)

Figure 1.7 Audio taper imposes perceptual uniformity on the adjustment of volume (the graph plots relative sound pressure level, 0 to 1, against angle of rotation, 0 to 300 degrees). I use the term perceptual uniformity instead of perceptual linearity: Because we can’t attach an oscilloscope probe to the brain, we can’t ascribe to perception a mathematical property as strong as linearity. This graph is redrawn from Bourns, Inc. (2005), General Application Note – Panel Controls – Taper.
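A minimal sketch of a perceptually uniform volume mapping in the spirit of Figure 1.7, assuming an idealized logarithmic taper; the 300-degree travel matches the graph, but the 60 dB span is my assumption, not a figure from the text or from the Bourns note:

```python
FULL_ROTATION_DEG = 300
RANGE_DB = 60                 # assumed overall control range (illustrative)

def relative_spl(angle_deg):
    """Relative sound pressure level (1.0 at full rotation) for a knob angle,
    with equal rotation increments giving equal steps in decibels."""
    db = (angle_deg / FULL_ROTATION_DEG - 1) * RANGE_DB
    return 10 ** (db / 20)

for angle in (0, 100, 200, 300):
    print(angle, f"{relative_spl(angle):.3f}")   # 0.001, 0.010, 0.100, 1.000
```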

Compared to linear-light encoding, a dramatic improvement in signal-to-noise performance can be obtained by using nonlinear image coding that mimics human lightness perception. Ideally, coding for distribution should be arranged such that the step between pixel component values is proportional to a just-noticeable difference (JND) in physical light power. The CIE standardized the L* function in 1976 as its best estimate of the lightness sensitivity of human vision.

Although the L* equation incorporates a cube root, L* is effectively a power function having an exponent of about 0.42; 18% “mid grey” in relative luminance corresponds to about 50 on the L* scale from 0 to 100.

The inverse of the L* function is approximately a 2.4-power function. Most commercial imaging systems incorporate a mapping from digital code value to linear-light luminance that approximates the inverse of L*.
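A quick check of those figures against the CIE 1976 L* definition (including its standard linear segment near black):

```python
def cie_lightness(y):
    """CIE 1976 L* as a function of relative luminance Y in 0..1."""
    if y > (6 / 29) ** 3:                   # about 0.008856
        return 116 * y ** (1 / 3) - 16
    return (29 / 3) ** 3 * y                # about 903.3 * y, linear near black

print(f"L*(0.18) = {cie_lightness(0.18):.1f}")                 # about 49.5
print(f"0.42-power check: 0.18 ** 0.42 = {0.18 ** 0.42:.3f}")  # about 0.49, near mid scale
```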

EOCF: Electro-optical conversion function. See Chapter 27, Gamma, on page 315.

Different EOCFs have been standardized in different industries:

• In digital cinema, DCI/SMPTE standardizes the reference (approval) projector; that standard is closely approximated in commercial cinemas. The standard digital cinema reference projector has an EOCF that is a pure 2.6-power function.

• In SD and HD, EOCF was historically poorly standardized, or not standardized at all. Consistency has been achieved only through use of de facto industry-standard CRT studio reference displays having EOCFs well approximated by a 2.4-power function. In 2011, BT.1886 was adopted, formalizing the 2.4-power function, but reference white luminance and viewing conditions are not [yet] standardized.

• In high-end graphic arts, the Adobe RGB 1998 industry standard is used. That standard establishes a reference display and its viewing conditions. Its EOCF is a pure 2.2-power function.

• In commodity desktop computing and low-end graphic arts, the sRGB standard is used. The sRGB standard establishes a reference display and its viewing conditions. Its EOCF is a pure 2.2-power function.

Figure 1.8 Grey paint samples exhibit perceptual uniformity: The goal of the manufacturer is to cover a reasonably wide range of reflectance values such that the samples are uniformly spaced as judged by human vision. The manufacturer’s code for each chip typically includes an approximate L* value. In image coding, we use a similar scheme, but with code (pixel) value V instead of L*, and a hundred or a thousand codes instead of six.

CIE: Commission Internationale de l’Éclairage. See Chapter 25, on page 265.
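Numerically, the power-function EOCFs listed above behave as in this minimal sketch (normalizing codes to 0..1 is my simplification; it ignores the reference white luminance and viewing conditions that the standards also specify):

```python
def eocf_power(code, bit_depth=8, gamma=2.4):
    """Relative luminance produced by a display whose EOCF is a pure power function."""
    v = code / ((1 << bit_depth) - 1)      # normalized code value, 0..1
    return v ** gamma

for gamma in (2.2, 2.4, 2.6):              # sRGB/Adobe RGB, BT.1886, digital cinema
    print(gamma, f"{eocf_power(128, gamma=gamma):.4f}")   # mid-scale code
```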

Colour

Vision when only the rod cells are active is termed scotopic. When light levels are sufficiently high that the rod cells are inactive, vision is photopic. In the mesopic realm, both rods and cones are active.

To be useful for colour imaging, pixel components represent quantities closely related to human colour vision. There are three types of photoreceptor cone cells in the retina, so human vision is trichromatic: Three components are necessary and sufficient to represent colour for a normal human observer. Rod cells constitute a fourth photoreceptor type, responsible for what can loosely be called night vision. When you see colour, cone cells are responding. Rod (scotopic) vision is disregarded in the design of virtually all colour imaging systems.

Colour images are generally best captured with sensors having spectral responsivities that peak at about 630, 540, and 450 nm – loosely, red, green, and blue – and having spectral bandwidths of about 50, 40, and 30 nm respectively. Details will be presented in Chapters 25 and 26.

The term multispectral refers to cameras and scanners, or to their data representations. Display systems using more than three primaries are called multiprimary.

In multispectral and hyperspectral imaging, each pixel has 4 or more components, each representing power from a different wavelength band. Hyperspectral refers to a device having more than a handful of spectral components. There is currently no widely accepted definition of how many components constitute multispectral or hyperspectral. I define a multispectral system as having between 4 and 10 spectral components, and a hyperspectral system as having 11 or more.

Hyperspectral systems may be described as having colour, but they are usually designed for purposes of science, not vision: A set of pixel component values in a hyperspectral system usually has no close relationship to colour perception. Apart from highly specialized applications such as satellite imaging and the preservation or reproduction of fine art, multispectral and hyperspectral techniques are not used in commercial imaging.

Luma and colour difference components

Some digital video equipment uses R’G’B’ components directly. However, human vision has considerably less ability to sense detail in colour information than in lightness. Provided achromatic detail is maintained, colour detail can be reduced by subsampling, which is a form of spatial filtering (or averaging).

A colour scientist might implement subsampling by forming relative luminance as a weighted sum of linear RGB tristimulus values, then imposing a nonlinear transfer function approximating CIE lightness (L*). In video, we depart from the theory of colour science, and implement an engineering approximation that I will describe in Constant luminance, on page 107. Briefly, component video systems convey image data as a luma component, Y’, approximating lightness and coding the achromatic component, and two colour difference components – in the historical analog domain, PB and PR, and in digital systems, CB and CR – that represent colour disregarding lightness. The colour difference components are subsampled (horizontally, or both horizontally and vertically) to reduce their data rate. Y’CBCR and Y’PBPR components are explained in Introduction to luma and chroma, on page 121.
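A minimal sketch of the two ideas, assuming BT.709 luma coefficients and simple averaging of horizontal pairs as the subsampling filter (the actual Y’CBCR coefficients, offsets, and filters are covered in the chapters cited above):

```python
def luma(r_p, g_p, b_p):
    """Luma Y' as a weighted sum of nonlinear (primed) R'G'B',
    here using BT.709 coefficients."""
    return 0.2126 * r_p + 0.7152 * g_p + 0.0722 * b_p

def subsample_pairs(samples):
    """Halve horizontal resolution by averaging adjacent samples,
    a crude stand-in for the filtering applied to CB and CR."""
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]

print(luma(1.0, 1.0, 1.0))                         # 1.0 for reference white
print(subsample_pairs([0.10, 0.12, 0.50, 0.52]))   # approximately [0.11, 0.51]
```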

Digital image representation

Many different file, memory, and stream formats are used to convey still digital images and motion sequences. Most formats have three components per pixel (representing additive red, green, and blue colour components). In consumer electronics and commodity computing, most formats have 8 bits per component. In professional applications such as studio video and digital cinema, 10, 12, or more bits per component are typically used.

Imaging systems are commonly optimized for other aspects of human perception; for example, the JPEG and MPEG compression systems exploit the spatial frequency characteristics of vision.

Such optimizations can also be referred to as perceptual coding.

Virtually all commercial imaging systems use perceptual coding, whereby pixel values are disposed along a scale that approximates the capability of human vision to distinguish greyscale shades. In colour science, capital letter symbols R, G, and B are used to denote tristimulus values that are proportional to light power in three wavelength bands. Tristimulus values are not perceptually uniform. It is explicit or implicit in nearly all commercial digital imaging systems that pixel component values are coded as the desired display RGB tristimuli raised to a power between about 1/2.2 (that is, about 0.45) and 1/2.6 (that is, about 0.38). Pixel values so constructed are denoted with primes: R’G’B’ (though the primes are often omitted, causing confusion).
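A minimal sketch of the primed-value convention, using an exponent of 1/2.4 purely as one point in the 1/2.2 to 1/2.6 range quoted above:

```python
def encode(tristimulus, exponent=1 / 2.4):
    """Map a linear-light tristimulus value in 0..1 to a primed value (R', G', or B')."""
    return tristimulus ** exponent

def decode(primed, exponent=1 / 2.4):
    """Inverse mapping back to a linear-light tristimulus value."""
    return primed ** (1 / exponent)

print(f"{encode(0.18):.3f}")              # about 0.49: mid grey lands near mid scale
print(f"{decode(encode(0.18)):.3f}")      # 0.180, round trip
```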

In order for image data to be exchanged and interpreted reasonably faithfully, digital image standards define pixel values for reference black and reference white. Digital image standards typically specify a target luminance for reference white. Most digital image standards offer no specific reflectance or relative luminance for reference black; it is implicit that the display system will make reference black as dark as possible.

It is a mistake to place a linear segment at the bottom of the sRGB EOCF. (A linear segment is called for in the OECF defined in sRGB, but that’s a different matter.)

In computing, 8-bit digital image data ranges from reference black at code 0 to reference white at code 255. The sRGB standard calls for an exponent
