
9.2 Content-Based Retrieval

This section reviews the current state of the art in the image retrieval field. Content-based image retrieval principally requires two modules: logical indexation and retrieval. The first extracts the metadata (textual descriptors, visual features) associated with images and stores it in the database. The second helps end users, even those unfamiliar with the database domain, to retrieve images efficiently from their visual and/or textual descriptions. Figure 9.1 illustrates the classic architecture of an image retrieval and access system.

In this context, the scope of queries addressed to multimedia databases is very large and may include objective and subjective content. Three levels of abstraction are distinguished [6]: (1) the syntactic level, where visual features, spatial localization, and spatial constraints are used to retrieve images; this level is purely algorithmic, and a typical query is "Give me all the images containing red circles"; (2) the semantic level restricted to objective content, where objects appearing in images have to be identified; a query example is "Give me all the images in which my children appear"; (3) the semantic level with subjective content, which involves complex reasoning about objects or scenes using human perception; such a query may be "Give me all the images representing the notion of liberty" or "Give me all the images in which my friends are happy." This last level is also related to scene recognition; for example, a child's birthday scene can be characterized by balloons, young faces, candles, cakes, etc.
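To make the purely algorithmic character of the syntactic level concrete, here is a minimal Python sketch of the "red circles" query; the thresholds, the crude boundary-pixel perimeter estimate, and the single-region simplification are all illustrative choices, not something prescribed by the chapter:

```python
import numpy as np

def contains_red_circle(image, red_ratio=0.01, min_compactness=0.6):
    """Toy syntactic-level predicate: does `image` (H x W x 3 RGB,
    values 0-255) contain a red, roughly circular region?"""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    mask = (r > 150) & (g < 80) & (b < 80)   # illustrative "red" thresholds
    if mask.mean() < red_ratio:              # not enough red pixels at all
        return False
    # Compactness 4*pi*area / perimeter**2 equals 1 for a perfect disc.
    # Simplification: the whole red mask is treated as a single region.
    area = mask.sum()
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~interior).sum()     # crude boundary-pixel count
    return 4 * np.pi * area / max(perimeter, 1) ** 2 >= min_compactness
```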

Both the logical indexation and query modules are briefly presented in the following subsections.

Fig. 9.1. The classic image media retrieval and access system architecture.

9.2.1 Logical Indexation Process

In the image retrieval system domain, retrieval may be performed from textual descriptions (keywords, annotations, etc.) and from visual descriptions (color, shape, texture, spatial localization, spatial constraints). Consequently, these features must be modelled in a well-suited representation to allow efficient and relevant retrieval of image media in both heterogeneous (image databases with no specific application domain) and homogeneous (face or fingerprint databases) voluminous databases. Moreover, the features are not computed during each retrieval process, because this is a time-consuming task; they are represented beforehand in an appropriate form (logical indexation) and then stored in the database [7].

On the one hand, the textual descriptors may be divided into content metadata, such as keywords, and database modelling. In systems based on keyword search, one or more keywords are associated with an image, and the system selects the images related to the searched keyword. This process is strongly limited because the semantic relationships that exist between keywords, such as the fact that a horse is an animal, are not taken into account: horse images annotated only with the keyword "animal" will be missed by a keyword lookup that ignores this relationship.

A significant improvement consists in using an ontology to capture the semantic relationships between objects. This point will be discussed in Section 9.3.1.
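A minimal sketch of the difference, with a hypothetical annotation table and a two-entry is-a ontology (both invented purely for illustration):

```python
# Hypothetical annotation table: image id -> associated keywords.
ANNOTATIONS = {
    "img001.jpg": {"animal", "meadow"},
    "img002.jpg": {"horse", "rider"},
}

# Minimal is-a ontology: concept -> its direct hypernyms (more general terms).
ONTOLOGY = {"horse": {"animal"}, "dog": {"animal"}}

def expand(keyword):
    """The keyword plus every concept it subsumes (one level, for brevity)."""
    return {keyword} | {k for k, parents in ONTOLOGY.items() if keyword in parents}

def search(keyword, use_ontology=True):
    terms = expand(keyword) if use_ontology else {keyword}
    return sorted(img for img, kws in ANNOTATIONS.items() if kws & terms)

print(search("animal", use_ontology=False))  # ['img001.jpg'] only
print(search("animal"))                      # both images are returned
```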

From a database point of view, many proposals have been developed to model and query image databases, and most systems rely on a manual description of image semantics. We can cite the DISIMA model [8], based on an object-oriented database, and the ICDM model [9, 10], based on an object-relational database. In the first prototype, a class hierarchy is defined by the user; an image is composed of an OID, a set of physical representations (raster or vector), and a content (spatial relationships, salient objects). The link between an object detected in an image and an object of the salient objects class is established manually. In the second prototype, the data model is divided into four abstraction levels: (1) the image level, containing the global properties of an image; (2) the syntactic level, capturing local characteristics, in which syntactic objects are created; (3) the content level, in which syntactic objects are grouped; and (4) the semantic level, in which semantic hierarchies are defined. The major interest of this model lies in this semantic level, which uses relationships such as synonymy, hyponymy, etc.

Semistructured models are also well suited to model image databases. MPEG-7 (Multimedia Content Description Interface) [11] is a standard dedicated to multimedia content description. This description is written in XML and corresponds to the semistructured approach.
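As a flavour of the semistructured approach, the following Python sketch emits a small XML description; the element names are schematic placeholders, not the actual MPEG-7 descriptors:

```python
import xml.etree.ElementTree as ET

# Schematic MPEG-7-like description; element names are placeholders,
# not the real MPEG-7 schema.
root = ET.Element("Mpeg7")
image = ET.SubElement(root, "Image", id="img002.jpg")
ET.SubElement(image, "DominantColor").text = "128 64 32"
ET.SubElement(image, "TextAnnotation").text = "horse, rider"
print(ET.tostring(root, encoding="unicode"))
```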

On the other hand, extracting the visual descriptors of an image is a major step in content-based image retrieval, since these visual features model the visual image content. Moreover, an interesting representation of these descriptors must be compact, reliable, and accurate. Without a real synergy between digital image processing, signal processing, and mathematics, such an objective cannot be reached. One or several representations are associated with each extracted feature.

The most widely used descriptor for content-based access today is the color feature. Well-known color representations are based on color histograms and their extensions, such as weighted color histograms [12], statistical moments [13], the dominant color, and color sets [14]. Color is considered in a color space like RGB or HSV [15]. A well-suited similarity distance is necessary for each feature representation. To estimate the similarity between two color representations, several distances may be used, such as the L1 metric [12], the quadratic distance [16], the Euclidean distance [16], etc. Each distance has its advantages and its limits: for example, the quadratic distance estimates color similarity well, but it is time-consuming.
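A minimal sketch of these representations and distances, assuming images as H x W x 3 uint8 RGB arrays; the quadratic form shown is the usual textbook formulation, with the bin-similarity matrix A supplied by the caller:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count pixels,
    giving a normalized bins**3-dimensional color histogram."""
    q = (image // (256 // bins)).reshape(-1, 3).astype(int)
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def l1_distance(h1, h2):
    return np.abs(h1 - h2).sum()

def euclidean_distance(h1, h2):
    return np.sqrt(((h1 - h2) ** 2).sum())

def quadratic_distance(h1, h2, A):
    """Quadratic-form distance: A[i, j] encodes the perceptual similarity
    of bins i and j, hence its better quality and higher cost."""
    d = h1 - h2
    return np.sqrt(d @ A @ d)
```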

While texture is usually represented by a grey-level histogram, statistical moments, and/or a co-occurrence matrix, and more recently by the Fourier transform, wavelets, or Gabor modelling [17–20], shape representations are classified into two categories: boundary-based (perimeter, curvature, Freeman's code, Fourier descriptors, etc.) and region-based (area, compactness, bounding box, moments, etc.) [15, 17, 18, 21, 22]. As far as texture is concerned, its modelling is hard since it is expressed by qualitative criteria like contrast, directionality, coarseness, etc. As far as shape is concerned, it represents significant regions or relevant objects in images. Ideally, shape segmentation would be automatic and efficient, but this is difficult or even impossible with heterogeneous images. Segmentation algorithms generally compute regions sharing several properties; nevertheless, a computed region does not necessarily correspond to an entire relevant object. To obtain relevant objects, well-suited representations must be chosen.
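Two illustrative sketches of such features, one for texture and one for region-based shape, assuming a greyscale array and a non-empty binary region mask:

```python
import numpy as np

def cooccurrence_matrix(gray, levels=8, dx=1, dy=0):
    """Normalized grey-level co-occurrence counts for one displacement
    (dx, dy); Haralick-style texture statistics derive from this matrix."""
    q = (gray * levels // 256).astype(int)
    mat = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            mat[q[y, x], q[y + dy, x + dx]] += 1
    return mat / mat.sum()

def region_shape_descriptors(mask):
    """Region-based shape features of a non-empty binary mask."""
    ys, xs = np.nonzero(mask)
    area = int(mask.sum())
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    extent = area / ((x1 - x0 + 1) * (y1 - y0 + 1))  # fill of bounding box
    return {"area": area, "bounding_box": (x0, y0, x1, y1), "extent": extent}
```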

Other important features are commonly considered in content-based retrieval systems, like spatial localization (for a semiautomatically or automatically extracted object) and spatial constraints between two regions, called spatial relationships. Spatial constraints are based on extended Allen's relationships [23]. All the previous visual feature representations correspond to psychovisual human criteria. However, some CBIR systems do not use such criteria; they use only one or several whole-image signatures. For example, in [19] these features are calculated by means of the Fourier transform, the wavelet transform, etc., while in [24] the extracted features contain a lot of photometric information.
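For illustration, here is a sketch of the classic one-dimensional Allen relations on intervals; extensions such as [23] typically combine one such relation per axis for 2-D spatial constraints, though their exact formulation is not reproduced here:

```python
def allen_relation(a, b):
    """Relation between 1-D intervals a = (a1, a2) and b = (b1, b2);
    a 2-D spatial constraint combines one relation per axis."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1 and a2 < b2:
        return "starts"
    if a1 > b1 and a2 == b2:
        return "finishes"
    if a1 > b1 and a2 < b2:
        return "during"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    # The remaining six cases are the inverses of the above.
    return "inverse-of-" + allen_relation(b, a)
```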

To conclude this visual descriptor discussion, it may be noticed that, for each visual feature, many representations have been proposed in the last decade. Choosing which representation to use is very sensitive and closely connected to the considered application domain and the intended aims. Nevertheless, these properties are often stored in a numeric vector, called a visual descriptor, that summarizes the photometric image information. This vector has a high dimensionality, which raises physical indexing problems in databases, since these extracted features are excellent candidates for physical indexation. Despite their importance, these aspects will not be presented in this chapter; for more details, the reader may refer to [23, 25, 26].
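A sketch of how such a vector may be assembled, reusing the hypothetical color_histogram and cooccurrence_matrix helpers from the sketches above; with 8 bins per channel the color part alone is already 512-dimensional, which illustrates the dimensionality problem just mentioned:

```python
import numpy as np

def visual_descriptor(image):
    """Concatenate feature representations into one numeric vector,
    normalizing so that no single feature dominates the distances."""
    parts = [
        color_histogram(image),                           # 512-D color part
        cooccurrence_matrix(image.mean(axis=2)).ravel(),  # 64-D texture part
    ]
    vec = np.concatenate(parts)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```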

9.2.2 Retrieval Process

Visual retrieval systems usually include a visual query tool that lets users form a query by painting, sketching, and selecting colors, shapes, textures, etc., or by selecting one or several interesting images from the database. Thus, through a user-friendly and intuitive interface, end users formulate their queries from both textual and visual descriptions (from features previously extracted and stored in the database during the logical indexation process). These two kinds of metadata are necessary to improve retrieval effectiveness, since text or visual information alone is not enough to fully describe the semantic content of images: each form has limited expressive power on its own. For instance, colors and shapes are good at capturing the visual aspects of a region but are not very accurate at representing high-level concepts, while textual annotations are good at describing high-level concepts but weak at capturing visual content. These features also tend to be incomplete and not fully effective. Moreover, content-based image retrieval systems must be flexible, since images may belong to different fields.

The retrieval process then computes, by means of a suited distance for each feature, the similarity between the user's query and the images in the database. The most similar images are then displayed in decreasing order of similarity.
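A minimal sketch of this ranking step, assuming descriptors are numeric vectors and taking the Euclidean distance as the overall distance:

```python
import numpy as np

def retrieve(query_vec, database, k=10):
    """Rank images by distance to the query descriptor and return the
    k most similar ids, best first. `database` maps image id -> vector."""
    ranked = sorted(database.items(),
                    key=lambda item: np.linalg.norm(item[1] - query_vec))
    return [img_id for img_id, _ in ranked[:k]]
```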

Because some of the features extracted from images may be imprecise, the retrieval process generally involves probabilistic querying and relevance feedback. This means that the user gets ranked results; if he is not satisfied with the result images of his query, he may refine the query and resubmit it. To refine the query, the user chooses positive and negative examples from the returned image set; new results are obtained using this information, and the process may be iterated. The reader can refer to [27] for a complete review of visual information retrieval.
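One classic way to implement such a feedback loop is Rocchio-style query refinement, borrowed from text retrieval; the chapter does not prescribe a particular algorithm, so the following is only an illustrative sketch:

```python
import numpy as np

def refine_query(query_vec, positives, negatives,
                 alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward the descriptors of images the user marked
    positive and away from the negative ones, then resubmit."""
    new_q = alpha * query_vec
    if positives:
        new_q = new_q + beta * np.mean(positives, axis=0)
    if negatives:
        new_q = new_q - gamma * np.mean(negatives, axis=0)
    return new_q
```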

From a database point of view, queries are expressed using extensions of OQL. New predicates like contains and similar are introduced. These queries can be qualified as exact queries. Fuzzy predicates can be introduced [28] to allow the specification of imprecise queries.
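As an illustration of a fuzzy predicate, the following sketch turns a feature distance into a membership degree; the linear membership function and its parameters are assumptions made for illustration, not the formulation of [28]:

```python
def similar(distance, threshold=0.4, softness=0.2):
    """Fuzzy `similar` predicate: a membership degree in [0, 1] that
    decreases linearly with the feature distance (illustrative shape)."""
    if distance <= threshold:
        return 1.0
    if distance >= threshold + softness:
        return 0.0
    return 1.0 - (distance - threshold) / softness
```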

However, combinations of textual and visual features are sometimes not sufficient, particularly when semantic querying is predominant, that is, when the image and its context are both necessary (for instance, retrieving audiovisual sequences about unemployment, or retrieving images in which a feeling of sadness is predominant). This limit is known as the semantic gap between the visual appearance of an image and the idea the user has in mind of which images he wants to retrieve, including their semantics.

Thus, content-based retrieval systems suffer from a lack of expressive power because they do not integrate enough semantics. This problem prevents end users from properly exploring and exploiting the image database. This is the reason why much research work has recently been devoted to image semantics, which is the scope of the following section.

9.3 Ontology and Data Mining Against Semantics
