
2.4 IMAGE EXTRACTION

Image extraction has received increased attention as vast collections of industrial, personal, and web images (e.g., Flickr) inspire the detection and classification of objects, people, and events in images. Early image processing was motivated by challenges such as character recognition, face recognition, and robot guidance.

Significant interest has focused on imagery retrieval. For example, early applications, such as IBM Research Almaden's Query by Image Content (QBIC), analyzed color, shape, and texture feature similarity and allowed users to query by example or by drawing, selecting, or other graphical means in a graphical query language (Flickner et al. 1995; Flickner 1997). Early feature-based methods found their way into Internet search engines (e.g., Webseek, Webseer) and later into databases (e.g., Informix, IBM DB2, and Oracle). Feature-based methods proved practical for such tasks as searching trademark databases, blocking pornographic content, and medical image retrieval. Researchers sought improved methods that were translation, rotation, and scale invariant, as well as ways to overcome occlusion and lighting variations. Of course, storage and processing efficiency were desirable.

While feature-based image retrieval provided a great leap beyond text retrieval, very soon researchers recognized the need to bridge the semantic gap from low-level feature recognition (e.g., color, shape, and texture) to high-level semantic representations (e.g., queries or descriptions about people, locations, and events).

Along the way, a number of researchers explored mathematical properties that reflect visual phenomena (e.g., fractals capture visual roughness, graininess reflects coarseness, and entropy reflects visual disorder). Recently, the Large Scale Concept Ontology for Multimedia (LSCOM, http://www.lscom.org) was created to provide common terms, properties, and a taxonomy for the manual annotation and automated classification of visual material. Common ontologies also enable the possibility of using semantic relations between concepts for search. Originally designed for news video, LSCOM needs to be extended to new genres, such as home video, surveillance, and movies.
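As a small illustration of one such property (a sketch assuming the Pillow and NumPy libraries, not code taken from the chapter), the following computes the Shannon entropy of an image's grey-level histogram, a rough proxy for visual disorder; the file name is hypothetical.

```python
# Sketch: grey-level histogram entropy as a simple measure of visual disorder.
import numpy as np
from PIL import Image

def grey_level_entropy(path, bins=256):
    """Shannon entropy (bits) of the image's grey-level distribution."""
    grey = np.asarray(Image.open(path).convert("L")).ravel()
    hist, _ = np.histogram(grey, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                     # ignore empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

print(grey_level_entropy("photo.jpg"))   # hypothetical file name
```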

Another issue that arose early was the need (and desire) to process cross-media information, for example, the use by web image search engines (e.g., Google and Yahoo!) of the text surrounding an image to index searches, or the way sites such as YouTube and Flickr leverage user-generated tags to support search. Researchers have also used the exchangeable image file format (EXIF, exif.org) standard adopted by digital camera manufacturers to help process images; it includes metadata such as the camera make and model, key parameters for each photo (e.g., orientation, aperture, shutter speed, focal length, metering mode, and ISO speed), the time and place of the photo, a thumbnail, and any human tags or copyright information. In addition to exploiting related streams of data and metadata, other researchers considered user interactions and relevance feedback as additional sources of information to improve performance.
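As an illustration of the kind of metadata available, the following is a minimal sketch (assuming the Pillow library, not code from the chapter) that reads EXIF tags from a photo; the file name is hypothetical, and tag coverage varies by camera and file.

```python
# Sketch: reading EXIF metadata (make, model, capture parameters) with Pillow.
from PIL import Image, ExifTags

def read_exif(path):
    """Return a dict mapping human-readable EXIF tag names to values."""
    exif = Image.open(path).getexif()          # base IFD: Make, Model, DateTime, Orientation, ...
    tags = {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}
    # Detailed capture parameters (exposure, aperture, ISO, focal length)
    # live in the Exif sub-IFD, reached through pointer tag 0x8769.
    for tag_id, value in exif.get_ifd(0x8769).items():
        tags[ExifTags.TAGS.get(tag_id, tag_id)] = value
    return tags

if __name__ == "__main__":
    for name, value in read_exif("photo.jpg").items():   # hypothetical file name
        print(f"{name}: {value}")
```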

Scientific research requires access to realistic and accessible data along with ground truth. Researchers (e.g., Muller et al. 2002) found that working with artificial data sets, such as the Corel Photo CD images, could actually do more harm than good by misleading research, because such sets do not represent realistic tasks (they cover a narrow, unrealistically easy domain, leading to overgeneralization) and lack a query set and associated relevance judgments. The SPIE Benchathlon (www.benchathlon.net) was an early (2001) benchmarking effort associated with the SPIE Electronic Imaging conference. It included specifying common tasks (e.g., query by example, sketch), a publicly available annotated data set, and software. More recently, the multimedia image retrieval Flickr collection (Huiskes and Lew 2008) consists of 25,000 images and image tags (an average of about nine per image) that are realistic, redistributable (under the Creative Commons license), and accompanied by relevance judgments (press.liacs.nl/mirflickr) for visual concept/topic and subtopic classification (e.g., animal [cat, dog], plant [tree, flower], water [sea, lake, river]) and tag propagation tasks. The 2011 ImageCLEF visual concept detection and annotation task used one million Flickr images under the Creative Commons license.

Deselaers et al. (2008) quantitatively compared the performance of a broad range of features on five publicly available, mostly thousand-image data sets in four distinct domains (stock photos, personal photos, building images, and medical images). They found that color histograms, local-feature SIFT (Scale Invariant Feature Transform) global search, local-feature patch histograms, local-feature SIFT histograms, and invariant feature histogram methods performed the best across the five data sets, with average error rates of less than 30% and mean average precisions of over 50%. Local features capture image patches, or small subimages, and are promising for extraction tasks (e.g., faces and objects), although they are more computationally expensive. Notably, color histograms were by far the most time efficient in terms of feature extraction and retrieval. The processing- and space-efficient Moving Picture Experts Group (MPEG)-7 scalable color descriptor also had excellent overall performance. For texture, the authors found that a combination of features usually improved results.
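The following is a minimal sketch, assuming the Pillow and NumPy libraries, of the kind of global color-histogram feature and query-by-example matching described above; the 8x8x8 RGB binning, the histogram-intersection similarity, and the file names are illustrative assumptions rather than the setup used in the cited study.

```python
# Sketch: global RGB color histogram feature and query-by-example ranking.
import numpy as np
from PIL import Image

def rgb_histogram(path, bins=8):
    """L1-normalised joint RGB histogram (bins**3 dimensions)."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    return float(np.minimum(h1, h2).sum())

# Query by example: rank a small collection by similarity to a query image.
# File names are hypothetical.
query = rgb_histogram("query.jpg")
scores = {name: histogram_intersection(query, rgb_histogram(name))
          for name in ["img1.jpg", "img2.jpg", "img3.jpg"]}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Local-feature methods such as SIFT replace this single global vector with descriptors computed around many detected interest points, which is what makes them more expressive but also more computationally expensive.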

2.4.1 ImageCLEF and ImageCLEFmed

The first broad community evaluation of progress on common, realistic data sets arose out of the Text Retrieval Conference (TREC). As illustrated initially in Figure 2.1, starting from a track in TREC, the Cross Language Evaluation Forum (CLEF) initiated a multilingual image retrieval track in 2003 (ImageCLEF, http://www.imageclef.org) and then a medical image retrieval track in 2004 (ImageCLEFmed). ImageCLEF contains a number of tasks, such as visual concept detection in photos, medical image retrieval, photo retrieval, and robot vision. In 2009, a record 85 research groups registered for the seven subtasks of ImageCLEF. For example, for the photo annotation task, 5000 annotated images from the 25,000-image Multimedia Image Retrieval Flickr collection are used to train image classifiers for 53 "concepts," such as abstract category (e.g., landscape and party), season, time of day, person or group, image quality, and so on. These visual concepts are organized into a small ontology, including hierarchy and relations. Systems are then tested on 13,000 images.

For the visual concept detection task, 19 groups submitted a total of 73 runs in 2009, and the best equal error rate (EER), the point at which false rejects equal false positives, was as low as 0.23. The organizers also created a hierarchical measure that considered the relations between concepts and the agreement of annotators on concepts, reporting a best area under the curve (AUC) score of 0.84 (where 1 is perfect).
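As a concrete illustration of the EER and AUC measures (not code from the evaluation itself), the sketch below estimates both from a toy set of per-image concept scores, assuming the NumPy and scikit-learn libraries.

```python
# Sketch: estimating equal error rate (EER) and AUC from detector scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def equal_error_rate(y_true, scores):
    """EER: the operating point where the false positive rate equals the
    false negative rate (1 - true positive rate)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # closest crossing point on the ROC curve
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy scores for one visual concept; a real run scores thousands of images.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.1, 0.05, 0.7, 0.6])
print("EER:", equal_error_rate(y_true, scores))
print("AUC:", roc_auc_score(y_true, scores))
```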

The 2009 photo retrieval task of ImageCLEF, in contrast, uses 50 topics (e.g., Olympic games, Hillary Clinton, beach soccer, stock exchange, Bulgarian churches, Brussels airport, flood, and demonstrations) based on an analysis of 2008 Belga News Agency query logs to reflect more realistic tasks. The job of systems is to provide a rank-ordered list of photo IDs most relevant to the query. Evaluation aims to maximize precision and recall, and retrieval is evaluated on almost a half-million images from the Belga News Agency. Diversity of retrieval is important, so evaluation also measures how many relevant images representative of the different subtopics are included in the top 20 hits returned. Eighty-four runs were submitted from 19 different groups. The top system achieved an 81% F-score (harmonic mean of precision and recall), although interestingly, the best image-only system achieved only a 21% F-score.
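The following is a small illustrative sketch of the two measures just mentioned: the F-score and a simple subtopic-coverage measure over the top 20 hits. The exact ImageCLEF diversity scoring is not reproduced here; the function names and toy data are assumptions.

```python
# Sketch: F-score and a simple top-20 subtopic-coverage (diversity) measure.
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def subtopic_coverage_at_20(ranked_subtopics, all_subtopics):
    """ranked_subtopics: the subtopic label (or None if irrelevant) of each
    returned image, in rank order; returns the fraction of subtopics covered
    in the first 20 results."""
    covered = {s for s in ranked_subtopics[:20] if s is not None}
    return len(covered) / len(all_subtopics)

print(f_score(0.85, 0.78))
print(subtopic_coverage_at_20(["churches", None, "airport", "churches"],
                              {"churches", "airport", "flood"}))
```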

For the first time in 2009, ImageCLEF hosted the Robot Vision task. Given individual pictures or a sequence of pictures, systems must report the room location of an image within a five-room subsection of an office environment (e.g., kitchen, corridor, printer area, and one- or two-person office), taken under three different illumination settings (night, sunny, cloudy) over a time frame of 6 months. Nineteen groups registered for the Robot Vision task, and seven submitted at least one run, for a total of 27 runs. A point is given for each correctly annotated frame and a half-point is deducted for each misannotation, over about a thousand training and test images spanning the three lighting conditions. The highest-scoring systems achieved approximately 70% accuracy.

Each year, about 12-15 groups participate in ImageCLEFmed. In 2007, 31 groups from 25 countries registered, but in the end 13 groups submitted 149 runs (see ir.ohsu.edu/image) using a consolidated test collection consisting of 66,662 images (5.15 GB) from 47,680 cases and 85 topics in English, French, and German. The images, typically with associated clinical case descriptions as annotations (55,485 in total), come from a broad range of medical collections in radiology, nuclear medicine, pathology, and endoscopy. Interestingly, topics are visual, textual, or both, for example, referring to an imaging modality (e.g., photograph, computed tomography, magnetic resonance imaging, and x-ray), anatomical location, view, and/or disease or finding. As in TREC, test collections are built on realistic samples of tasks that serve as topics; these are submitted to systems as queries to retrieve images.

Human-created relevance judgments, which indicate which documents in the collection are relevant to which topics (about 800-1200 per topic), are used to measure recall and precision, although the mean average precision (MAP) across all topics is the most frequently used aggregate performance measure.
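A minimal sketch of how average precision can be computed from a ranked result list and binary relevance judgments, and how MAP averages it across topics, is shown below; the toy topics and IDs are hypothetical.

```python
# Sketch: average precision (AP) per topic and mean average precision (MAP).
def average_precision(ranked_ids, relevant_ids):
    """AP over a ranked list, given the set of relevant document IDs."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank      # precision at each relevant hit
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two topics.
print(mean_average_precision([
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z"], {"z"}),
]))
```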

One observation about the current state of the art from ImageCLEFmed is that text-based retrieval methods (based on the image annotations) fare better than image processing methods; however, combined methods do better still. Groups apply a range of text and image processing methods, such as single- or multilingual term frequency/inverse document frequency (TF/IDF), bag of words, and query expansion for text, and color, shape, and texture processing for images. Results from the open source GIFT (GNU Image Finding Tool) and FIRE (Flexible Image Retrieval Engine) were made available to participants who did not have their own visual retrieval engine. The best methods on textual topics achieved about 40% MAP, whereas the best methods on visual topics achieved only about 24%, suggesting both the value of better language processing and the importance of visual processing. In an unrelated study, when users were asked to browse images organized by visual similarity or by text caption similarity in order to find pictures to illustrate a travel website, users preferred the text caption view in 40 of 54 searches (Rodden 2001).
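As an illustration of combining textual and visual evidence (not the method of any particular participant), the sketch below ranks cases by a weighted late fusion of TF-IDF text similarity over case annotations and a precomputed visual similarity score, assuming the scikit-learn library; the annotations, scores, and 0.7/0.3 weighting are invented for the example.

```python
# Sketch: late fusion of text (TF-IDF) and visual similarity for retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

annotations = [
    "chest x-ray frontal view pneumonia",      # hypothetical case annotations
    "mri brain axial slice tumour",
    "ct abdomen contrast liver lesion",
]
visual_scores = [0.42, 0.10, 0.55]             # e.g., histogram similarity to the query image

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(annotations)
query_vec = vectorizer.transform(["x-ray of the chest showing pneumonia"])
text_scores = cosine_similarity(query_vec, doc_matrix).ravel()

# Weighted late fusion; the 0.7/0.3 split is an illustrative assumption.
fused = [0.7 * t + 0.3 * v for t, v in zip(text_scores, visual_scores)]
ranking = sorted(range(len(annotations)), key=lambda i: fused[i], reverse=True)
print(ranking)
```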

A second ImageCLEFmed task focused on automatic annotation of medical images into 120 classes. Systems must identify the body orientation, body region, and biological system captured in a particular image, as well as what type of image it is (e.g., an x-ray of the front of a cranium showing the musculoskeletal system).

About 10 of the 29 groups who registered participated, submitting a total of 68 runs.

A broad range of image processing methods was applied. One observation was that methods using local, as opposed to global, image descriptions performed better, and the more training data available, the more likely images were to be classified correctly. Error rates ranged from a high of 86.8 to a low of 10.3.

Extending the TREC-style evaluations, the ImagEVAL workshop focuses on user-centered evaluation of content-based image retrieval (CBIR), including measures such as the quality of the user interface, response time, and adaptability to a new domain. Four tasks included object (e.g., tree, cow, glasses) and attribute (e.g., indoor/outdoor, night/day, natural/urban) detection, with mean average precision used for basic performance assessment.

2.4.2 Object Detection

In the past few years, the computer vision and image understanding communities have developed a number of standard data sets for evaluating the performance of general object detection and recognition algorithms. For example, the PASCAL VOC (Visual Object Classes) evaluation challenge (Everingham et al. 2009, 2010) (http://www.pascal-network.org/challenges/VOC) has grown from the detection of four object classes (motorcycle, bicycle, people, and car) in a few hundred images and a few thousand objects in 2005 to a 2009 competition with over 30,000 images evaluated against data sets for 20 object classes, including person, animal (bird, cat, cow, dog, horse, and sheep), vehicle (aeroplane, bicycle, boat, bus, car, motorbike, and train), and indoor object (bottle, chair, dining table, potted plant, sofa, and TV/monitor). The three main tasks were classification (predicting the presence/absence of an object class), detection (the bounding box and label of each object), and segmentation (pixel-wise segmentation of objects and "background"). A single smaller-scale "taster" competition looked at person layout, that is, predicting the bounding box and label of each part of a person (head, hands, and feet).

In 2005, there were 12 participants and two major test sets: an "easier" challenge drawn from the PASCAL image databases and a second, "harder" one from freshly collected Google Images. Performance measures included the standard receiver operating characteristic (ROC) measures of equal error rate (EER) and AUC. A variety of methods were applied (e.g., interest points, region features), and the most successful for classification used interest points plus SIFT plus clustering (histogram) plus SVMs. The mean EER of "best" results across all classes (motorcycle, bicycle, people, and car) was more or less uniform across the classes, achieving 0.946 on the easy test and 0.741 on the more difficult one. For the detection task, a 50% overlap in bounding boxes was considered a success, with multiple detections of the same object counted as one true positive plus false positives, and with average precision (AP) as defined by TREC (mean precision interpolated at various recall levels). The mean AP of "best" results across classes was 0.408 on the easy test and 0.195 on the hard test, with significant variance across the classes. Detection was poorest for people; AP for bicycles was around 0.1, for cars about 0.6 and 0.3 on the easy and difficult data, respectively, and for motorbikes 0.9 and just over 0.3. One entry used a group's own training data and raised AP to 0.4.

In summary, performance was more encouraging on cars and motorbikes than on people and bicycles.

By 2009, the competition attracted 12 groups who tested 18 methods representing a variety of approaches (e.g., sliding window, combination with whole-image classifiers, segmentation-based). In total, 17,218 objects were annotated in 7,054 images randomly selected from 500,000 images downloaded from Flickr. Separately, about 15,829 objects in 6,650 images were annotated for a test set. Objects were annotated with bounding boxes, as well as degree of occlusion, truncation, and pose (e.g., facing left). A 50% area of overlap (AO) between the detected and ground-truth bounding boxes counted as a correct detection. Figure 2.5 shows the average precision results for the various object classes. Unlike the 2005 case, use of external training data based on 3D annotations for "person" detection showed only modest improvement (43.2 vs. 41.5% AP) over methods using VOC training data. Also, the increased number of object classes in the 2009 data set not only increased the annotators' cognitive load, making it difficult to maintain quality, but was also expensive, costing around 700 person-hours.
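A minimal sketch of the 50% area-of-overlap criterion used to score detections is given below, implemented as the usual intersection-over-union of axis-aligned boxes given as (xmin, ymin, xmax, ymax); this is a generic illustration, not the challenge's own evaluation code.

```python
# Sketch: intersection-over-union and the 50% overlap detection criterion.
def intersection_over_union(box_a, box_b):
    """Boxes are (xmin, ymin, xmax, ymax); returns a value in [0, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(predicted, ground_truth, threshold=0.5):
    """True when the predicted box overlaps the ground truth by at least 50%."""
    return intersection_over_union(predicted, ground_truth) >= threshold

print(is_correct_detection((10, 10, 60, 60), (20, 20, 70, 70)))
```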

2.4.3 Face Recognition

In addition to general image retrieval, face recognition is an important specialized task. While early methods searched for whole faces, subsequent approaches integrated separate detectors and extractors for eyes and noses, as well as new feature similarity measures based on color, texture, and shape. And while early systems performed very well at detecting frontal face views, nonfrontal views, low-quality images, and occluded faces remain challenging. More powerful methods, such as relevance feedback and the use of machine learning methods (e.g., hidden Markov models and support vector machines), improved retrieval performance.

The National Institute of Standards and Technology (NIST)-managed Face Recognition Grand Challenge (FRGC) pursued the development and independent evaluation of face recognition technologies, including high-resolution still images, three-dimensional face scans, and multiple-sample still imagery (http://www.nist.gov/itl/iad/ig/frvt-home.cfm). Face recognition evaluations started with three FERET evaluations (1994, 1995, and 1996), followed by the Face Recognition Vendor Tests (FRVT) of 2000, 2002, and 2006. By 2004, the FRGC had 42 participants, including 13 universities.

By providing 50,000 recordings, including not only still images but also high-resolution (5-6 megapixel) still images, 3D images, and multiple images per person, the evaluation helped advance metrology and the state of the art in face recognition.

For example, the FRGC high-resolution images consist of facial images with, on average, 250 pixels between the centers of the eyes (in contrast to the 40-60 pixels in current images). Also, three-dimensional recognition ensures robustness to variations in lighting (illumination) and pose. Finally, this was the first time that a computational-experimental environment was used to support a challenge problem in face recognition or biometrics, through the use of an XML-based framework for describing and documenting computational experiments called the Biometric Experimentation Environment (BEE).

Figure 2.5. Average precision (AP, %) for the visual object classes (VOC), showing the maximum and median AP per class. (Source: Everingham et al. 2009.)

Experimental data for validation included images from 4003 subject sessions exhibiting two expressions (smiling and neutral) under controlled illumination conditions. An order-of-magnitude performance improvement was obtained in 2006 over 2002, one of the goals of the FRGC. For example, the false rejection rate at a 0.001 (1 in 1000) false acceptance rate dropped from nearly 79% in 1993 to 1% in FRVT 2006.

The related Iris Challenge Evaluation (iris.nist.gov/ICE) in 2006 reported iris recognition performance from left and right iris images. In this independent, large-scale performance evaluation of three algorithms on 59,558 samples from 240 subjects, NIST observed false nonmatch rates (FNMRs) from 0.0122 to 0.038 at a false match rate (FMR) of 0.001. In a comparison of recognition performance from very high-resolution still face images, 3D face images, and single-iris images (Phillips et al. 2007), recognition performance was comparable for all three biometrics. Notably, the best-performing face recognition algorithms were found to be more accurate than humans.
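The sketch below illustrates, on synthetic match scores, how a false nonmatch rate at a fixed false match rate of 0.001 can be estimated: choose the score threshold that impostor comparisons exceed at the target rate, then measure the fraction of genuine comparisons falling below it. The score distributions are invented for the example, and NumPy is assumed; this is not the evaluation's own protocol code.

```python
# Sketch: false nonmatch rate (FNMR) at a fixed false match rate (FMR).
import numpy as np

def fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.001):
    """Threshold the impostor distribution at the target FMR, then count the
    genuine comparisons that would be (wrongly) rejected."""
    threshold = np.quantile(impostor_scores, 1.0 - target_fmr)
    return float(np.mean(np.asarray(genuine_scores) < threshold))

# Toy synthetic score distributions; a real evaluation uses tens of
# thousands of genuine and impostor comparisons.
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 10_000)
impostor = rng.normal(0.3, 0.1, 100_000)
print(fnmr_at_fmr(genuine, impostor, target_fmr=0.001))
```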

2.4.4 Graphics Extraction

Just as facial recognition and extraction emerged as an important, distinct task from general image processing, so too extraction from data graphics (e.g., tables, charts, maps, and networks) has been identified as a special case. An early effort, SageBook (Chuah et al. 1997), enables search for and customization of stored data graphics (e.g., charts, tables). Nonexpert users can pose graphical queries, browse results, and adapt and reuse previously successful graphics for data analysis. Users formulate queries with a graphical direct manipulation interface (SageBrush) by selecting and arranging spaces (e.g., charts and tables), objects contained within those spaces (e.g., marks and bars), and object properties (e.g., color, size, shape, and position). SageBook represents the syntax and semantics of data graphics, including spatial relationships between objects, relationships between data domains (e.g., interval and 2D coordinate), and the various graphic and data attributes. As in document and imagery retrieval, representation and reasoning about both data and graphical properties enables similarity matching as well as clustering to support search and browsing of large collections. Finally, automated adaptation methods support customization of the retrieved graphic (e.g., eliminating graphical elements that do not match the specified query).

As an illustration of the practical utility of content-based graphics retrieval, the SlideFinder tool (Niblack 1999) used the IBM QBIC system described above to perform image similarity matching together with text matching, enabling a user to index and browse Microsoft PowerPoint and Lotus Freelance Graphics presentations. However, in spite of the exciting possibilities, no community-based evaluations have been conducted, so common tasks, shared data sets, and practical performance benchmarks do not exist. The Chapter 15 contribution by Carberry et al. in this collection explores the automated extraction of information graphics in multimodal documents.

In conclusion, there are important data, algorithmic, and method gaps for a number of challenges, including image query context, results presentation, and representation and reasoning about visual content. While methods such as SIFT (Scale Invariant Feature Transform) features have become widespread, new methods are needed to advance beyond the current state of the art.
