
5.3 Segmentation

5.3.2 Pixel-Based Segmentation

Some algorithms work directly on the digitized image at the pixel level, ignoring any possible structure of pixel aggregates. These strategies aim at obtaining high-level components directly, without first identifying intermediate pixel aggregates that may play a role in the document (e.g., single characters or straight lines).

RLSA (Run Length Smoothing Algorithm) A classical and efficient segmentation technique is the Run Length Smoothing (sometimes called Smearing) Algorithm (RLSA) [55]. Given a sequence of black and white pixels, a run is defined as a sequence of adjacent pixels of the same kind, delimited by pixels of the opposite color. The run length is the number of pixels in a run, and ‘smoothing’ a run means changing the color of its pixels so that they become of the same color as the pixels delimiting the run. RLSA identifies runs of white pixels in the document image and fills them with black pixels whenever they are shorter than a given threshold. In particular, the RLSA works in four steps, each of which applies an operator:

Fig. 5.6 Application of the first three steps of RLSA to a sample document. Top-left: original paper; Top-right: horizontal smoothing; Bottom-left: vertical smoothing; Bottom-right: logical AND. Note the presence of a false block in the ANDed image

1. Horizontal smoothing, carried out by rows on the image with threshold t_h;
2. Vertical smoothing, carried out by columns on the image with threshold t_v;
3. Logical AND of the images obtained in steps 1 and 2 (the outcome has a black pixel in positions where both input images have one, or a white pixel otherwise);
4. New horizontal smoothing with threshold t_a on the image obtained in step 3, to fill white runs inside the discovered blocks (a minimal code sketch of the four steps follows this list).
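As an illustration, the following Python sketch implements these four steps on a binary NumPy image in which 1 denotes a black pixel and 0 a white one. The helper names (smooth_runs, rlsa) are invented here, and the decision to leave white runs touching the image border unfilled is an assumption of this sketch rather than a prescription of the original algorithm.

```python
import numpy as np

def smooth_runs(line, threshold):
    """Fill every white run shorter than `threshold` with black pixels.
    1 = black, 0 = white; runs touching the border are left untouched here."""
    out = line.copy()
    n = len(line)
    i = 0
    while i < n:
        if line[i] == 0:                      # start of a white run
            j = i
            while j < n and line[j] == 0:
                j += 1
            # fill only runs delimited by black pixels on both sides
            if i > 0 and j < n and (j - i) < threshold:
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def rlsa(image, t_h, t_v, t_a):
    """Four-step RLSA on a binary image given as a 2-D NumPy array."""
    horiz = np.array([smooth_runs(row, t_h) for row in image])      # step 1
    vert = np.array([smooth_runs(col, t_v) for col in image.T]).T   # step 2
    anded = horiz & vert                                            # step 3
    return np.array([smooth_runs(row, t_a) for row in anded])       # step 4
```

For instance, rlsa(img, 300, 500, 30) corresponds to the thresholds reported below for document images scanned at 240 dpi.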

A top-down strategy based on RLSA consists of repeatedly applying it, working at each round with progressively smaller thresholds on the zones identified as blocks in the previous round. Much work in the literature is based on RLSA, exploiting it or trying to improve its performance by modifying it [11,46,52].
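Under the same assumptions, this top-down strategy can be sketched as a recursion over the blocks found at each round, with the threshold triples shrinking as the recursion goes deeper. The rlsa function is the one from the sketch above, and SciPy's connected-component labelling is used to extract the blocks; both choices are illustrative, not part of the cited approaches.

```python
from scipy import ndimage

def topdown_rlsa(image, threshold_triples):
    """Recursively segment `image`: apply RLSA with the first (largest)
    (t_h, t_v, t_a) triple, then re-segment each block found with the
    remaining, smaller triples. Returns a list of (row_slice, col_slice)."""
    if not threshold_triples:
        return [(slice(0, image.shape[0]), slice(0, image.shape[1]))]
    t_h, t_v, t_a = threshold_triples[0]
    smoothed = rlsa(image, t_h, t_v, t_a)      # rlsa() from the sketch above
    labels, _ = ndimage.label(smoothed)        # each connected component is a block
    blocks = []
    for rows, cols in ndimage.find_objects(labels):
        # recurse inside the block with progressively smaller thresholds
        for sub_rows, sub_cols in topdown_rlsa(image[rows, cols],
                                               threshold_triples[1:]):
            blocks.append((slice(rows.start + sub_rows.start,
                                 rows.start + sub_rows.stop),
                           slice(cols.start + sub_cols.start,
                                 cols.start + sub_cols.stop)))
    return blocks
```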

The result of the first three steps of RLSA on a sample document is shown in Fig. 5.6, where two shortcomings of the technique are evident. First, due to the thin black lines produced along the right border of the image by scanning or photocopying, the horizontal smoothing covers most of the right margin of the page. Although step 3 fixes this in many cases, spurious blocks that do not correspond to any actual content in the original document can nevertheless be returned, as happens here in the first column to the right of the ‘Abstract’ heading. A further shortcoming of this technique lies in its inability to handle documents having a non-Manhattan layout, as shown in Fig. 5.7.

The assessment of suitable thresholds is a critical problem, directly affecting the overall effectiveness of the technique. Manually setting such thresholds is not trivial, both because it is not an activity that (even expert) humans are used to carrying out on documents, and because there is no single threshold that fits all documents equally well. Studies on RLSA suggest setting large thresholds, in order to keep the number of black pixels in the smoothed images large enough to prevent the AND operator from dropping most of them again (indeed, this is why an additional horizontal smoothing is provided for). The original paper uses t_h = 300, t_v = 500 and t_a = 30 for document images scanned at 240 dpi, based on the assumption that t_h and t_v should be set to the length in pixels of long words, while t_a should be wide enough to cover a few characters. However, the generic claim that large thresholds are better is not very helpful in practice. Hence there is a strong motivation for the development of methods that can automatically assess proper values for the t_h, t_v and t_a parameters, possibly based on the specific document at hand; e.g., [41] proposes to set

t_h = 2 · mcl,    t_v = mtld,

where

• mcl ∈ [M · 3.24, M · 6.3] is the mean character length of the document, computed according to the maximum M = max_i H_hb(i) of the histogram of horizontal black runs H_hb(i), with suitable tolerance;

• mtld = arg max_{i ∈ [0.8 · mcl, 80]} H_vw(i) is the mean text line distance of the document, located as the position of the global maximum of the vertical white run histogram H_vw(i) in the specified range;

using a mix of statistical considerations on the histograms of horizontal/vertical black/white run lengths and empirical definitions of multiplicative constants.
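A rough sketch of how such document-specific estimates might be computed is given below. The run-histogram helper is ours; we read "the maximum of the histogram" as the position of its peak (i.e., the most frequent run length), since a length in pixels is needed, and we pick a single constant inside the [3.24, 6.3] interval. All of these choices are assumptions for illustration, not the exact procedure of [41].

```python
import numpy as np

def run_length_histogram(image, color, axis):
    """Histogram of run lengths of the given color (1 = black, 0 = white),
    counted along rows (axis=1) or along columns (axis=0)."""
    lines = image if axis == 1 else image.T
    counts = {}
    for line in lines:
        length = 0
        for px in line:
            if px == color:
                length += 1
            elif length:
                counts[length] = counts.get(length, 0) + 1
                length = 0
        if length:                                   # run ending at the border
            counts[length] = counts.get(length, 0) + 1
    hist = np.zeros((max(counts) if counts else 0) + 1, dtype=int)
    for run_len, c in counts.items():
        hist[run_len] = c
    return hist

def estimate_thresholds(image, char_factor=4.5):
    """Document-specific t_h and t_v in the spirit of [41] (sketch only)."""
    H_hb = run_length_histogram(image, color=1, axis=1)   # horizontal black runs
    H_vw = run_length_histogram(image, color=0, axis=0)   # vertical white runs
    M = int(np.argmax(H_hb))            # peak position, read as a run length
    mcl = char_factor * M               # mean character length, factor in [3.24, 6.3]
    lo, hi = int(0.8 * mcl), min(80, len(H_vw) - 1)
    window = H_vw[lo:hi + 1]
    mtld = lo + int(np.argmax(window)) if window.size else int(mcl)
    return int(2 * mcl), int(mtld)      # t_h, t_v
```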

RLSO (Run-Length Smoothing with OR) RLSO [23] is a variant of the RLSA that performs:

1. Horizontal smoothing of the image, carried out by rows with threshold t_h;
2. Vertical smoothing of the image, carried out by columns with threshold t_v;
3. Logical OR of the images obtained in steps 1 and 2 (the outcome has a black pixel in positions where at least one of the input images has one, or a white pixel otherwise).

Each connected component in the resulting image is considered a frame, and exploited as a mask to filter the original image through a logical AND operation in order to obtain the frame content. The result on a sample document having a non-Manhattan layout, along with the corresponding frames extracted by ANDing the original image with the smoothed one, is shown in Fig. 5.7.
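Continuing the earlier sketches (smooth_runs is the run-filling helper from the RLSA example), RLSO and the frame extraction just described might look as follows; the function names and the use of SciPy's labelling are again our own choices rather than the formulation of [23].

```python
import numpy as np
from scipy import ndimage

def rlso(image, t_h, t_v):
    """RLSO: OR of the row-wise and column-wise smoothed images."""
    horiz = np.array([smooth_runs(row, t_h) for row in image])
    vert = np.array([smooth_runs(col, t_v) for col in image.T]).T
    return horiz | vert

def extract_frames(image, smoothed):
    """Each connected component of the smoothed image is a frame; ANDing the
    component mask with the original image yields the frame content."""
    labels, n = ndimage.label(smoothed)
    return [image & (labels == k).astype(image.dtype) for k in range(1, n + 1)]
```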

Compared to RLSA, RLSO has one step fewer (no final horizontal smoothing is performed, since the OR operation, unlike the AND, preserves everything from the two smoothed images) and requires smaller thresholds (and hence fills fewer runs) to merge the original connected components (e.g., characters) into larger ones (e.g., frames). Thus, it is more efficient than RLSA, and can be further sped up by skipping the third step and applying the vertical smoothing directly to the horizontally smoothed image obtained from the first step. This does not significantly affect, and may even improve, the quality of the result (e.g., adjacent rows whose inter-word spacings are vertically aligned would not be captured by the original version).

Fig. 5.7 Application of RLSA (on the left) and of RLSO (on the right) to the non-Manhattan digitized document in the middle. Below, the three text frames identified by RLSO

Fig. 5.8 Iterated RLSO on a document with different spacings. Each step shows the segmentation outcome and the horizontal and vertical cumulative histograms (scaled down to 10% of the original values) according to which the thresholds are automatically assessed

However, the OR makes every merge of components irreversible, which can be a problem when logically different components are very close to each other and might be erroneously merged if the threshold is too high. Conversely, too low a threshold might result in an excessively fragmented layout. Thus, as for RLSA, the choice of proper horizontal/vertical thresholds is a very important issue for effectiveness.

A technique to automatically assess document-specific thresholds on pages using a single font size was proposed in [22] and is illustrated in Fig. 5.8. It is based on the distribution of white run lengths in the image, represented by cumulative histograms where each bar reports the number of runs having length larger than or equal to the corresponding value.
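As a small sketch, and reusing the run_length_histogram helper introduced earlier, such a cumulative histogram of white run lengths can be obtained as follows; the reversed cumulative sum makes bar i count all runs of length at least i. How the thresholds are then derived from these histograms is beyond this sketch.

```python
def cumulative_white_run_histogram(image, axis):
    """Cumulative histogram of white run lengths along the given axis:
    bar i counts the runs whose length is >= i (white = 0).
    run_length_histogram is the helper from the earlier sketch."""
    plain = run_length_histogram(image, color=0, axis=axis)
    return plain[::-1].cumsum()[::-1]
```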
