Medical Content Based Image Retrieval by Using the HADOOP Framework
Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane, Abderrahim Sekkaki
Abstract— Most medical images are now digitized and stored in large image databases, and retrieving the desired images becomes a challenge. In this paper, we address the challenge of building a content-based image retrieval system by applying the MapReduce distributed computing model and the HDFS storage model. Two methods are used to characterize the content of images: the first is the BEMD-GGD method (Bidimensional Empirical Mode Decomposition with Generalized Gaussian Density functions) and the second is the BEMD-HHT method (Bidimensional Empirical Mode Decomposition with the Huang-Hilbert Transform, HHT). To measure similarity between images, we compute the distance between image signatures, using the Kullback-Leibler Divergence (KLD) to compare the BEMD-GGD signatures and the Euclidean distance to compare the HHT signatures. Through experiments on the DDSM mammography image database, we confirm that the results are promising, and this work has allowed us to verify the feasibility and efficiency of applying CBIR to large medical image databases.
I. INTRODUCTION
Nowadays, medical imaging systems produce more and more digitized images in all medical fields. Most of these images are stored in image databases, and there is great interest in using them for diagnosis and clinical decision support such as case-based reasoning [1]. The purpose is to retrieve desired images from a large image database using only the numerical content of the images. A CBIR (Content-Based Image Retrieval) system is one of the possible solutions to effectively manage such image databases [2].
Furthermore, fast access to such huge databases requires an efficient computing model. The Hadoop framework is one such solution, built on the MapReduce [3] distributed computing model. Lately, MapReduce has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Google, Amazon, and Facebook are among the biggest users of the MapReduce programming model, and it has recently been adopted by several universities. It allows distributed processing of data-intensive computations over many machines.
In CBIR systems, requests (the system inputs) are images and answers (outputs/results) are all the similar images in the database. A typical CBIR system can be decomposed into three steps: first, the characteristic features of each image in the database are extracted and used to index the images; second, the feature vector of a query image is computed; and third, the feature vector of the query image is compared to those of each image in the database.

Said Jai-Andaloussi, Nabil Madrane and Abderrahim Sekkaki are with LIAD Lab, Casablanca, Kingdom of Morocco; said.jaiandaloussi@etude.univcasa.ma
Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane and Abderrahim Sekkaki are with the Faculty of Science Ain-Chok, Casablanca, Kingdom of Morocco.
For the definition and extraction of image characteristic features, many methods have been proposed, including image segmentation and image characterization using the wavelet transform and Gabor filter banks [4, 5]. In this work, we use the MapReduce computing model to extract image features by applying the BEMD-GGD and BEMD-HHT methods [2]. We then write the feature files into HBase [6] (HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable). The Kullback-Leibler divergence (KLD) and the Euclidean distance are used to compute the similarity between image features.
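As an illustration of this indexing step, the sketch below shows how a computed signature could be written into HBase from within a MapReduce task. It is a minimal example under assumptions, not the paper's implementation: the table name `signatures`, the column family `f`, the qualifier `bemd_ggd`, the CSV encoding of the vector, and the use of the HBase 1.x+ `ConnectionFactory` client API are all choices made for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SignatureWriter {

  /** Stores one image signature as a row keyed by the image identifier (illustrative schema). */
  public static void writeSignature(String imageId, double[] signature) throws Exception {
    Configuration conf = HBaseConfiguration.create();               // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("signatures"))) { // assumed table
      StringBuilder sb = new StringBuilder();
      for (double v : signature) sb.append(v).append(',');          // simple CSV encoding of the vector
      Put put = new Put(Bytes.toBytes(imageId));
      put.addColumn(Bytes.toBytes("f"),                             // assumed column family
                    Bytes.toBytes("bemd_ggd"),                      // assumed qualifier
                    Bytes.toBytes(sb.toString()));
      table.put(put);
    }
  }
}
```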
The remainder of the paper is organized as follows: Section II-a describes the database we used for evaluation. In Section II-b we present the components of the Hadoop framework. Section II-c describes the BEMD, BEMD-GGD and BEMD-HHT methods. In Section II-d we present the similarity measures. Section III describes the architecture of the CBIR system based on the Hadoop framework, and results are given in Section IV. We end with a discussion and conclusion in Section V.
II. MATERIAL AND METHODS
A. DDSM Mammography database
The DDSM project [7] is a collaborative effort involving the Massachusetts General Hospital, the University of South Florida and Sandia National Laboratories. The database contains approximately 2,500 patient files. Each patient file includes two images of each breast (4 images per patient, 10,000 images in total), along with some associated patient information (age at time of study, ACR breast density rating) and image information (scanner, spatial resolution). Images have a resolution of 2000 by 5000 pixels. The database is classified into 3 levels of diagnosis ('normal', 'benign' or 'cancer'). An example of an image series is given in Figure 1.
Fig. 1. Image series from a mammography study
B. Hadoop Framework
Hadoop is a distributed master-slave architecture¹ that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities.
Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volume sizes in the petabytes on clusters with thousands of hosts [8].
1) MapReduce: MapReduce is a batch-based, distributed computing framework modeled after Google's paper on MapReduce³. It allows work to be parallelized over large amounts of raw data. MapReduce decomposes the work submitted by a client into small parallelized map and reduce tasks, as shown in Figure 2 (taken from [8]). The map and reduce constructs used in MapReduce are borrowed from those found in the Lisp functional programming language, and follow a shared-nothing model⁴ to remove any parallel-execution interdependencies that could add unwanted synchronization points.
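To make the map and reduce constructs concrete, the following sketch shows a minimal Hadoop job structured the way such an indexing step could be: the mapper receives (image name, image bytes) pairs, e.g. from a SequenceFile, and emits a per-image signature string; the reducer simply forwards each signature. The class names and the trivial mean/variance "signature" are placeholders for the BEMD-based signatures described later, not the authors' code.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper: called once per input image, emits (imageName, signatureString). */
class SignatureMapper extends Mapper<Text, BytesWritable, Text, Text> {
  @Override
  protected void map(Text imageName, BytesWritable imageBytes, Context context)
      throws IOException, InterruptedException {
    byte[] pixels = imageBytes.copyBytes();
    if (pixels.length == 0) return;
    // Placeholder statistics; a real job would compute the BEMD-GGD or BEMD-HHT signature here.
    double mean = 0.0, var = 0.0;
    for (byte b : pixels) mean += (b & 0xFF);
    mean /= pixels.length;
    for (byte b : pixels) { double d = (b & 0xFF) - mean; var += d * d; }
    var /= pixels.length;
    context.write(imageName, new Text(mean + "," + var));
  }
}

/** Reducer: identity pass-through, one output record per image. */
class SignatureReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text imageName, Iterable<Text> signatures, Context context)
      throws IOException, InterruptedException {
    for (Text s : signatures) context.write(imageName, s);
  }
}
```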
2) HDFS: HDFS is the storage component of Hadoop.
It is a distributed filesystem modeled after the Google File System (GFS). HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS leverages unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O). Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance.
Like traditional file systems, HDFS can create, move, delete and rename files; the difference is the method of storage, which involves two actors: the NameNode and the DataNode. A DataNode stores data in the Hadoop file system, while the NameNode is the centerpiece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster each file's data is kept.
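The short sketch below illustrates these operations through the HDFS Java API: the client only talks to the NameNode for metadata operations (creation, renaming, lookups), while the file bytes themselves are written to DataNodes. The file paths and the explicitly set block size and replication factor are illustrative values (property names from Hadoop 2.x), not those used in the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    conf.set("dfs.blocksize", "134217728");          // 128 MB blocks, large compared to local filesystems
    conf.set("dfs.replication", "3");                // each block is replicated on three DataNodes
    FileSystem fs = FileSystem.get(conf);            // the NameNode is resolved from fs.defaultFS

    Path tmp = new Path("/cbir/signatures/_tmp.sig");            // illustrative path
    try (FSDataOutputStream out = fs.create(tmp, true)) {        // create: metadata on NameNode, data on DataNodes
      out.writeBytes("image0001 0.73,1.21,0.55\n");
    }
    fs.rename(tmp, new Path("/cbir/signatures/image0001.sig"));  // rename: a pure metadata operation
    System.out.println("exists: " + fs.exists(new Path("/cbir/signatures/image0001.sig")));
    fs.close();
  }
}
```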
C. Numerical image characterization: signatures
The BEMD [9, 10] is an adaptive decomposition that decomposes any image into a set of functions denoted BIMFs and a residue; these BIMFs are obtained by means of an algorithm called the sifting process [11]. This decomposition makes it possible to extract local features (phase, frequency) of the input image. In this work, we describe the image by generating a numerical signature based on the BIMF contents [12, 13].
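The skeleton below conveys the structure of the sifting process: each BIMF is obtained by repeatedly subtracting the mean of the upper and lower envelopes, and the residue is passed on to extract the next BIMF. For compactness, the envelopes are approximated here by windowed maxima/minima followed by box smoothing; this is a simplification of the extrema detection and surface interpolation used in the BEMD literature, and the window size and iteration counts are arbitrary illustrative parameters.

```java
/** Simplified BEMD sifting sketch: envelopes approximated by windowed max/min plus box smoothing. */
public class BemdSketch {

  /** Returns numBimfs BIMFs; the final residue is what remains of img after subtracting them all. */
  public static double[][][] decompose(double[][] img, int numBimfs, int siftIters, int win) {
    int h = img.length, w = img[0].length;
    double[][] residue = copy(img);
    double[][][] bimfs = new double[numBimfs][][];
    for (int k = 0; k < numBimfs; k++) {
      double[][] comp = copy(residue);
      for (int it = 0; it < siftIters; it++) {
        double[][] upper = boxSmooth(windowExtreme(comp, win, true), win);   // upper envelope
        double[][] lower = boxSmooth(windowExtreme(comp, win, false), win);  // lower envelope
        for (int y = 0; y < h; y++)
          for (int x = 0; x < w; x++)
            comp[y][x] -= 0.5 * (upper[y][x] + lower[y][x]);                 // subtract envelope mean
      }
      bimfs[k] = comp;
      for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
          residue[y][x] -= comp[y][x];                                       // residue feeds the next BIMF
    }
    return bimfs;
  }

  /** Windowed maximum (max = true) or minimum (max = false) with clamped borders. */
  private static double[][] windowExtreme(double[][] a, int win, boolean max) {
    int h = a.length, w = a[0].length;
    double[][] out = new double[h][w];
    for (int y = 0; y < h; y++)
      for (int x = 0; x < w; x++) {
        double v = max ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        for (int dy = -win; dy <= win; dy++)
          for (int dx = -win; dx <= win; dx++) {
            int yy = Math.min(h - 1, Math.max(0, y + dy));
            int xx = Math.min(w - 1, Math.max(0, x + dx));
            v = max ? Math.max(v, a[yy][xx]) : Math.min(v, a[yy][xx]);
          }
        out[y][x] = v;
      }
    return out;
  }

  /** Box (moving-average) smoothing used as a crude stand-in for surface interpolation. */
  private static double[][] boxSmooth(double[][] a, int win) {
    int h = a.length, w = a[0].length;
    double[][] out = new double[h][w];
    for (int y = 0; y < h; y++)
      for (int x = 0; x < w; x++) {
        double s = 0; int n = 0;
        for (int dy = -win; dy <= win; dy++)
          for (int dx = -win; dx <= win; dx++) {
            int yy = y + dy, xx = x + dx;
            if (yy >= 0 && yy < h && xx >= 0 && xx < w) { s += a[yy][xx]; n++; }
          }
        out[y][x] = s / n;
      }
    return out;
  }

  private static double[][] copy(double[][] a) {
    double[][] c = new double[a.length][];
    for (int i = 0; i < a.length; i++) c[i] = a[i].clone();
    return c;
  }
}
```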
The usual approach used in CBIR systems to characterize an image in a generic way is to define a global representation of the whole image, or to compute statistical parameters such as the co-occurrence matrix and Gabor filter bank
¹ A model of communication where one process, called the master, has control over one or more other processes, called slaves.
³ See MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html.
⁴ A shared-nothing architecture is a distributed computing concept in which each node is independent and self-sufficient.