Dewarping - Pre-processing for Digitized Documents

5.2 Pre-processing for Digitized Documents

5.2.2 Dewarping

Dewarping is the term usually used to denote a recent area of interest in digitized document pre-processing, aimed at compensating for the noise introduced by the flat (2D) representation of spatially spread (3D) documents. In fact, when scanning an open book the pages often cannot be completely flattened on the scanner bed, and in the spine region the page shape appears deformed, so that document components, e.g., text lines, become curved rather than straight. The problem is even worse if documents are photographed rather than scanned, because this effect is amplified and additional perspective distortions are introduced. Nevertheless, this latter option is being given more and more attention, due both to the availability of cheap and compact high-resolution cameras and to the reduced stress it places on the document under digitization (an extremely important requirement in the case of historical and fragile items).

Dewarping is particularly crucial for ensuring a sufficiently good outcome when OCR is applied to the digitized document. Indeed, although one option would be to develop OCR systems that can directly deal with the problem and fix it, a more straightforward approach consists in specific pre-processing of the page to remove this kind of noise before applying standard OCR technologies. The approaches proposed so far as potential solutions to restore the original flat version of a warped document have focused either on the hardware side (by exploiting dedicated acquisition devices that can better capture the 3D nature of the digitized object) or on the software side (by working on a standard picture of the item obtained using traditional means). The latter option is cheaper and ensures wider applicability, which is why it is being given much attention. In particular, two kinds of approaches exist for identifying the deformations to be corrected: one based on the geometrical features of the overall document image, and one based on the distortions detected on specific document components, such as text lines (a quite significant and straightforward indicator of the problem). Techniques that have been tried for the image dewarping task include splines, snakes, linear regression, grid modeling and even fuzzy sets.

Segmentation-Based Dewarping

A dewarping technique that exploits segmentation of the document image and linear regression was proposed in [25]. Starting from a black-and-white image, it consists of two macro-steps whose algorithms are sketched below. In the following, elements having the same superscript, subscript or bars are to be interpreted as associated to the same object. The former step preliminarily identifies words and text lines:

1. Find the most frequent height h among the connected components of black pixels in the page (since most of them will correspond to single characters, it is likely to represent the average character height);

2. Remove the non-text and noise components (in [25], all components having height or width less than h/4, or height greater than 3h);

3. Apply horizontal smoothing (see Sect. 5.3.2) with threshold h (the resulting connected components should correspond to words);

4. Consider the set P of smoothed connected components (words) in the page (assume a word w has bounding box coordinates (x_l, y_t, x_r, y_b), such that x_l < x_r and y_t < y_b) and set a threshold T (= 5h in [25]) for linking adjacent words in a line;

5. while P ≠ ∅:

   (a) Initialize a new line L ← {w}, where w is the uppermost word in the page;

   (b) Set the current pivot word w^p ← w;

   (d) while ∃ w ∈ O whose distance from the right bound of w^p is x_l − x_r^p < T:

       (i) L ← L ∪ {w^r}, where w^r is the closest such word;

       (ii) w^p ← w^r;

   (e) w^p ← w;

   (f) while ∃ w ∈ O whose distance from the left bound of w^p is x_l^p − x_r < T:

       (i) L ← L ∪ {w^r}, where w^r is the closest such word;

       (ii) w^p ← w^r;

   (g) P ← P \ L;

   (h) L ← L \ {w ∈ L | x_r − x_l < 2h} (too small a word width is insufficient to provide a significant estimate of the word's actual slope);

   (i) Merge the two leftmost words in L if the first one has width less than T.

The latter step estimates the slope of single words, where the deformation is likely to be quite limited (a simple rotation), and based on that restores the straight page image by translating each pixel of each word therein to its correct position:

1. Initialize an empty (white) upright page image I′;

2. for each smoothed connected component (word) w, determine its upper and lower baselines y_⊤ = a_⊤x + b_⊤ and y_⊥ = a_⊥x + b_⊥, respectively, with corresponding slopes θ_⊤ = arctan a_⊤ and θ_⊥ = arctan a_⊥, by linear regression of the sets of top and bottom pixels in each of its columns;

3. for each line:

   (a) Assign to its leftmost word w the slope θ = min(|θ_⊤|, |θ_⊥|) (supposed to be the most representative), and to each of the remaining words w^c the slope θ_⊤^c or θ_⊥^c, whichever is closer to θ;

   (b) Straighten each word w^c in turn according to w and to the immediately previous word w^p by moving each of its pixels (x, y) to I′(x, r + d), such that

       r = (x − x_l) · sin(−θ) + y · cos θ applies rotation, and

       d = y_* − y_*^c applies a vertical translation that restores the lines straight, where

$$
y_* =
\begin{cases}
(a_\top x_l + b_\top) \cdot \cos\theta & \text{if } |\theta_\top - \theta^p| < |\theta_\bot - \theta^p|, \\
(a_\bot x_l + b_\bot) \cdot \cos\theta & \text{otherwise.}
\end{cases}
$$

That is, each word is rotated according to the closest slope to that of the previous word and moved so that it becomes aligned to either the upper or the lower baseline of the first word in the line.
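The baseline fit and the roto-translation above can be illustrated with a minimal pure-Python sketch. It is not the implementation of [25]: the function names (`fit_line`, `word_baselines`, `anchor_row`, `straighten_word`) and the representation of a word as a list of (x, y) pixel coordinates are assumptions made here for illustration, the choice between upper and lower baseline is left to the caller, and the returned rows are left as floats (a real implementation would round them to pixel rows).

```python
import math

def fit_line(points):
    """Least-squares line y = a*x + b through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx else 0.0  # assumption: a one-column word is flat
    return a, mean_y - a * mean_x

def word_baselines(pixels):
    """Fit upper and lower baselines to the top/bottom pixel of each column."""
    columns = {}
    for x, y in pixels:
        top, bottom = columns.get(x, (y, y))
        columns[x] = (min(top, y), max(bottom, y))
    upper = fit_line([(x, tb[0]) for x, tb in columns.items()])
    lower = fit_line([(x, tb[1]) for x, tb in columns.items()])
    return upper, lower

def anchor_row(a, b, x_l, theta):
    """y_* = (a*x_l + b) * cos(theta): the row the rotated baseline lands on."""
    return (a * x_l + b) * math.cos(theta)

def straighten_word(pixels, a, b, reference_row):
    """Rotate a word by the slope of baseline (a, b), then shift it vertically
    so that this baseline ends up on reference_row."""
    x_l = min(x for x, _ in pixels)
    theta = math.atan(a)
    d = reference_row - anchor_row(a, b, x_l, theta)  # vertical translation
    straightened = []
    for x, y in pixels:
        r = (x - x_l) * math.sin(-theta) + y * math.cos(theta)  # rotation
        straightened.append((x, r + d))
    return straightened

# Usage: a synthetic word whose top and bottom edges both have slope 0.1.
word = [(x, 0.1 * x + 5) for x in range(10, 31)] + \
       [(x, 0.1 * x + 9) for x in range(10, 31)]
(upper_a, upper_b), _ = word_baselines(word)
flat = straighten_word(word, upper_a, upper_b, reference_row=20.0)
```

When a = tan θ, every pixel lying on the chosen baseline is rotated onto the same row, (a·x_l + b)·cos θ, which is why a single vertical offset d per word is enough to align it with the first word of the line.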

If the original image I is not in black-and-white:

1. Compute a binarization B of I;

2. M ← ∅ (initialization of a model to be used for dewarping);

3. for each black pixel B(x, y) in B, M ← M ∪ {((x̄, ȳ), (d, θ, x_l))}, where (x̄, ȳ) are the straightened coordinates computed as before for pixel (x, y), and (d, θ, x_l) are the parameters used for such a roto-translation, associated to the word to which B(x, y) belongs;

4. for each pixel (x, y) of the dewarped page I′, set

$$
I'(x, y) = I\left(x,\ \frac{y - d - (x - x_l)\sin(-\theta)}{\cos\theta}\right)
$$

   where

$$
\big((\bar{x}, \bar{y}), (d, \theta, x_l)\big) = \arg\min_{((\bar{x}, \bar{y}), (d, \theta, x_l)) \in M} \big(|x - \bar{x}| + 2 \cdot |y - \bar{y}|\big).
$$

That is, an inverse roto-translation is computed to find in the original image the value to be assigned to each dewarped pixel, according to the parameters associated to its closest black pixel inB.
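A minimal pure-Python sketch of this inverse mapping follows. The names (`build_model`, `source_row`, `dewarp`), the list-of-rows image representation and the white `background` value of 255 are illustrative assumptions, not part of [25]. The nearest black pixel is found by brute force over the whole model, which is only practical for small images; a spatial index would be needed in practice.

```python
import math

def build_model(words):
    """Build the model M from words given as (pixels, d, theta, x_l) tuples.

    Each entry pairs the straightened coordinates of a black pixel with the
    roto-translation parameters of the word it belongs to."""
    model = []
    for pixels, d, theta, x_l in words:
        for x, y in pixels:
            y_bar = (x - x_l) * math.sin(-theta) + y * math.cos(theta) + d
            model.append(((x, y_bar), (d, theta, x_l)))
    return model

def source_row(x, y, model):
    """Row of the original image whose value goes to dewarped pixel (x, y)."""
    # Closest straightened black pixel under the distance |dx| + 2*|dy|.
    _, (d, theta, x_l) = min(
        model, key=lambda e: abs(x - e[0][0]) + 2 * abs(y - e[0][1]))
    # Inverse of y_bar = (x - x_l)*sin(-theta) + y*cos(theta) + d.
    return (y - d - (x - x_l) * math.sin(-theta)) / math.cos(theta)

def dewarp(image, model, background=255):
    """Fill the dewarped page by sampling the original image row by row."""
    height, width = len(image), len(image[0])
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src = round(source_row(x, y, model))
            if 0 <= src < height:  # leave background where the source falls outside
                out[y][x] = image[src][x]
    return out

# Usage: one "word" made of a single dark pixel, shifted down by one row.
page = [[255, 0, 255],
        [255, 255, 255],
        [255, 255, 255]]
model = build_model([([(1, 0)], 1.0, 0.0, 0)])
result = dewarp(page, model)
```

Weighting the vertical distance twice as much as the horizontal one makes a pixel more likely to borrow its parameters from a word in the same text line than from one in the line above or below.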

From the document Advances in Pattern Recognition (pages 185-188)