Medical Content Based Image Retrieval by Using the HADOOP Framework
Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane, Abderrahim Sekkaki
Abstract— Most medical images are now digitized and stored in large image databases, and retrieving the desired images becomes a challenge. In this paper, we address the challenge of building a content-based image retrieval system by applying the MapReduce distributed computing model and the HDFS storage model. Two methods are used to characterize the content of images: the first is the BEMD-GGD method (Bidimensional Empirical Mode Decomposition with Generalized Gaussian Density functions) and the second is the BEMD-HHT method (Bidimensional Empirical Mode Decomposition with the Huang-Hilbert Transform, HHT). To measure similarity between images, we compute the distance between image signatures, using the Kullback-Leibler Divergence (KLD) to compare the BEMD-GGD signatures and the Euclidean distance to compare the HHT signatures. Through experiments on the DDSM mammography image database, we confirm that the results are promising, and this work has allowed us to verify the feasibility and efficiency of applying CBIR to large medical image databases.
I. INTRODUCTION
Nowadays, medical imaging systems produce more and more digitized images in all medical fields. Most of these images are stored in image databases, and there is great interest in using them for diagnosis and clinical decision support such as case-based reasoning [1]. The purpose is to retrieve desired images from a large image database using only the numerical content of the images. A CBIR (Content-Based Image Retrieval) system is one of the possible solutions to effectively manage such image databases [2].
Furthermore, fast access to such huge databases requires an efficient computing model. The Hadoop framework is one such solution, built on the MapReduce [3] distributed computing model. Lately, MapReduce has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Google, Amazon, and Facebook are among the biggest users of the MapReduce programming model, and it has recently been adopted by several universities. It allows distributed processing of data-intensive computations over many machines.
In CBIR systems, requests (the system inputs) are images and answers (outputs/results) are all the similar images in the database. A typical CBIR system can be decomposed into three steps: first, the characteristic features of each image in the database are extracted and used to index the images; second, the feature vector of a query image is computed; and third, the feature vector of the query image is compared to those of each image in the database.

Said Jai-Andaloussi, Nabil Madrane and Abderrahim Sekkaki are with LIAD Lab, Casablanca, Kingdom of Morocco; said.jaiandaloussi@etude.univcasa.ma
Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane and Abderrahim Sekkaki are with the Faculty of Science Ain-Chok, Casablanca, Kingdom of Morocco.
For the definition and extraction of image characteristic features, many methods have been proposed, including image segmentation and image characterization using the wavelet transform and Gabor filter banks [4, 5]. In this work, we use the MapReduce computing model to extract image features by applying the BEMD-GGD and BEMD-HHT methods [2]. We then write the feature files into HBase [6] (HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable). The Kullback-Leibler divergence (KLD) and the Euclidean distance are used to compute the similarity between image features.
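As an illustration of this indexing step, the sketch below shows how a computed signature could be written into HBase from within a MapReduce task. It is a minimal example under assumptions, not the paper's implementation: the table name `signatures`, the column family `f`, the qualifier `bemd_ggd`, the CSV encoding of the vector, and the use of the HBase 1.x+ `ConnectionFactory` client API are all choices made for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SignatureWriter {

  /** Stores one image signature as a row keyed by the image identifier (illustrative schema). */
  public static void writeSignature(String imageId, double[] signature) throws Exception {
    Configuration conf = HBaseConfiguration.create();               // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("signatures"))) { // assumed table
      StringBuilder sb = new StringBuilder();
      for (double v : signature) sb.append(v).append(',');          // simple CSV encoding of the vector
      Put put = new Put(Bytes.toBytes(imageId));
      put.addColumn(Bytes.toBytes("f"),                             // assumed column family
                    Bytes.toBytes("bemd_ggd"),                      // assumed qualifier
                    Bytes.toBytes(sb.toString()));
      table.put(put);
    }
  }
}
```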
The remainder of the paper is organized as follows: Section II-a describes the database we used for evaluation. In Section II-b we present the components of the Hadoop framework. Section II-c describes the BEMD, BEMD-GGD and BEMD-HHT methods. In Section II-d we present the similarity measures. Section III describes the architecture of the CBIR system based on the Hadoop framework, and results are given in Section IV. We end with a discussion and conclusion in Section V.
II. MATERIAL AND METHODS
A. DDSM Mammography database
The DDSM project [7] is a collaborative effort involving the Massachusetts General Hospital, the University of South Florida and Sandia National Laboratories. The database contains approximately 2,500 patient files. Each patient file includes two images of each breast (4 images per patient, 10,000 images in total), along with some associated patient information (age at time of study, ACR breast density rating) and image information (scanner, spatial resolution). Images have a resolution of 2000 by 5000 pixels. The database is classified into 3 levels of diagnosis ('normal', 'benign' or 'cancer'). An example of an image series is given in Figure 1.
Fig. 1. Image series from a mammography study
B. Hadoop Framework
Hadoop is a distributed master-slave architecture¹ that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities.
Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volume sizes in the petabytes on clusters with thousands of hosts [8].
1) MapReduce: MapReduce is a batch-based, distributed computing framework modeled after Google's paper on MapReduce³. It allows work to be parallelized over large amounts of raw data. MapReduce decomposes the work submitted by a client into small parallelized map and reduce tasks, as shown in Figure 2 (taken from [8]). The map and reduce constructs used in MapReduce are borrowed from those found in the Lisp functional programming language, and follow a shared-nothing model⁴ to remove any parallel-execution interdependencies that could add unwanted synchronization points.
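To make the map and reduce constructs concrete, the following sketch shows a minimal Hadoop job structured the way such an indexing step could be: the mapper receives (image name, image bytes) pairs, e.g. from a SequenceFile, and emits a per-image signature string; the reducer simply forwards each signature. The class names and the trivial mean/variance "signature" are placeholders for the BEMD-based signatures described later, not the authors' code.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper: called once per input image, emits (imageName, signatureString). */
class SignatureMapper extends Mapper<Text, BytesWritable, Text, Text> {
  @Override
  protected void map(Text imageName, BytesWritable imageBytes, Context context)
      throws IOException, InterruptedException {
    byte[] pixels = imageBytes.copyBytes();
    if (pixels.length == 0) return;
    // Placeholder statistics; a real job would compute the BEMD-GGD or BEMD-HHT signature here.
    double mean = 0.0, var = 0.0;
    for (byte b : pixels) mean += (b & 0xFF);
    mean /= pixels.length;
    for (byte b : pixels) { double d = (b & 0xFF) - mean; var += d * d; }
    var /= pixels.length;
    context.write(imageName, new Text(mean + "," + var));
  }
}

/** Reducer: identity pass-through, one output record per image. */
class SignatureReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text imageName, Iterable<Text> signatures, Context context)
      throws IOException, InterruptedException {
    for (Text s : signatures) context.write(imageName, s);
  }
}
```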
2) HDFS: HDFS is the storage component of Hadoop.
It is a distributed filesystem modeled after the Google File System (GFS). HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS leverages unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O). Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance.
Like traditional file systems, HDFS can create, move, delete and rename files; the difference is the method of storage, which involves two actors: the NameNode and the DataNode. A DataNode stores data in the Hadoop file system, while the NameNode is the centerpiece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster each file's data is kept.
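The short sketch below illustrates these operations through the HDFS Java API: the client only talks to the NameNode for metadata operations (creation, renaming, lookups), while the file bytes themselves are written to DataNodes. The file paths and the explicitly set block size and replication factor are illustrative values (property names from Hadoop 2.x), not those used in the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    conf.set("dfs.blocksize", "134217728");          // 128 MB blocks, large compared to local filesystems
    conf.set("dfs.replication", "3");                // each block is replicated on three DataNodes
    FileSystem fs = FileSystem.get(conf);            // the NameNode is resolved from fs.defaultFS

    Path tmp = new Path("/cbir/signatures/_tmp.sig");            // illustrative path
    try (FSDataOutputStream out = fs.create(tmp, true)) {        // create: metadata on NameNode, data on DataNodes
      out.writeBytes("image0001 0.73,1.21,0.55\n");
    }
    fs.rename(tmp, new Path("/cbir/signatures/image0001.sig"));  // rename: a pure metadata operation
    System.out.println("exists: " + fs.exists(new Path("/cbir/signatures/image0001.sig")));
    fs.close();
  }
}
```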
C. Numerical image characterization: signatures
The BEMD [9, 10] is an adaptive decomposition that decomposes any image into a set of functions denoted BIMFs and a residue; these BIMFs are obtained by means of an algorithm called the sifting process [11]. This decomposition makes it possible to extract local features (phase, frequency) of the input image. In this work, we describe the image by generating a numerical signature based on the BIMF contents [12, 13].
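The skeleton below conveys the structure of the sifting process: each BIMF is obtained by repeatedly subtracting the mean of the upper and lower envelopes, and the residue is passed on to extract the next BIMF. For compactness, the envelopes are approximated here by windowed maxima/minima followed by box smoothing; this is a simplification of the extrema detection and surface interpolation used in the BEMD literature, and the window size and iteration counts are arbitrary illustrative parameters.

```java
/** Simplified BEMD sifting sketch: envelopes approximated by windowed max/min plus box smoothing. */
public class BemdSketch {

  /** Returns numBimfs BIMFs; the final residue is what remains of img after subtracting them all. */
  public static double[][][] decompose(double[][] img, int numBimfs, int siftIters, int win) {
    int h = img.length, w = img[0].length;
    double[][] residue = copy(img);
    double[][][] bimfs = new double[numBimfs][][];
    for (int k = 0; k < numBimfs; k++) {
      double[][] comp = copy(residue);
      for (int it = 0; it < siftIters; it++) {
        double[][] upper = boxSmooth(windowExtreme(comp, win, true), win);   // upper envelope
        double[][] lower = boxSmooth(windowExtreme(comp, win, false), win);  // lower envelope
        for (int y = 0; y < h; y++)
          for (int x = 0; x < w; x++)
            comp[y][x] -= 0.5 * (upper[y][x] + lower[y][x]);                 // subtract envelope mean
      }
      bimfs[k] = comp;
      for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
          residue[y][x] -= comp[y][x];                                       // residue feeds the next BIMF
    }
    return bimfs;
  }

  /** Windowed maximum (max = true) or minimum (max = false) with clamped borders. */
  private static double[][] windowExtreme(double[][] a, int win, boolean max) {
    int h = a.length, w = a[0].length;
    double[][] out = new double[h][w];
    for (int y = 0; y < h; y++)
      for (int x = 0; x < w; x++) {
        double v = max ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        for (int dy = -win; dy <= win; dy++)
          for (int dx = -win; dx <= win; dx++) {
            int yy = Math.min(h - 1, Math.max(0, y + dy));
            int xx = Math.min(w - 1, Math.max(0, x + dx));
            v = max ? Math.max(v, a[yy][xx]) : Math.min(v, a[yy][xx]);
          }
        out[y][x] = v;
      }
    return out;
  }

  /** Box (moving-average) smoothing used as a crude stand-in for surface interpolation. */
  private static double[][] boxSmooth(double[][] a, int win) {
    int h = a.length, w = a[0].length;
    double[][] out = new double[h][w];
    for (int y = 0; y < h; y++)
      for (int x = 0; x < w; x++) {
        double s = 0; int n = 0;
        for (int dy = -win; dy <= win; dy++)
          for (int dx = -win; dx <= win; dx++) {
            int yy = y + dy, xx = x + dx;
            if (yy >= 0 && yy < h && xx >= 0 && xx < w) { s += a[yy][xx]; n++; }
          }
        out[y][x] = s / n;
      }
    return out;
  }

  private static double[][] copy(double[][] a) {
    double[][] c = new double[a.length][];
    for (int i = 0; i < a.length; i++) c[i] = a[i].clone();
    return c;
  }
}
```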
The usual approach used in CBIR systems to characterize an image in a generic way is to define a global representation of the whole image, or to compute statistical parameters such as the co-occurrence matrix and Gabor filter bank
¹ A model of communication where one process, called the master, has control over one or more other processes, called slaves.
³ See MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html.
⁴ A shared-nothing architecture is a distributed computing concept in which each node is independent and self-sufficient.