
3.3.2 Vector Model for Images

Image clustering is based on image similarity. To calculate image similarity, we construct an image vector W_l = (w_{1,l}, w_{2,l}, ..., w_{i,l}, ..., w_{t,l}) rather than compare images directly. The size of the vector W_l equals the total number of object clusters. Each item in the image vector is the weight of the corresponding object cluster in image l; for example, w_{i,l} is the weight of object cluster C_i in image l.

To obtain the image vector, we borrow the idea of the vector model for text information [2]. Images correspond to documents, and object clusters correspond to terms. Let N be the total number of images and n_i be the number of images in which objects from cluster C_i ever appear. Define the normalized frequency f_{i,l} as below.

f_{i,l} = \frac{freq_{i,l}}{\max_h freq_{h,l}}    (3.6)

freq_{i,l} is the number of times cluster C_i appears in image l. The maximum is calculated over all clusters that ever appear in image l. This frequency is normally referred to as the tf factor, which indicates how well cluster i describes image l. However, if the objects from one cluster appear in most of the images, that cluster is not very useful for distinguishing a relevant image from a nonrelevant one. So, we need to define an inverse document frequency idf_i for C_i as follows.

idf_i = \log \frac{N}{n_i}    (3.7)

Balancing the above two factors, we have the weight of a cluster.

w_{i,l} = f_{i,l} \times idf_i = f_{i,l} \times \log \frac{N}{n_i}    (3.8)

After computing the image vectors, we can obtain the similarity between any two images by calculating the degree of similarity between their image vectors.

sim_{img}(i, j) = \frac{W_i \cdot W_j}{|W_i| \times |W_j|} = \frac{\sum_{h=1}^{t} w_{h,i} \times w_{h,j}}{\sqrt{\sum_{h=1}^{t} w_{h,i}^2} \times \sqrt{\sum_{h=1}^{t} w_{h,j}^2}}    (3.9)
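To make Equations (3.6) through (3.9) concrete, the following is a minimal Python sketch of the weighting and similarity computation. The input format (a per-image list of object-cluster counts) and the function names are illustrative assumptions, not part of the original method description.

```python
import math

def image_vectors(cluster_counts):
    """cluster_counts[l][i] = freq_{i,l}: how many objects of cluster C_i appear in image l.
    Returns w[l][i] = f_{i,l} * log(N / n_i), i.e., Equations (3.6)-(3.8)."""
    N = len(cluster_counts)                       # total number of images
    t = len(cluster_counts[0])                    # total number of object clusters
    # n_i: number of images in which objects of cluster C_i ever appear
    n = [sum(1 for l in range(N) if cluster_counts[l][i] > 0) for i in range(t)]
    weights = []
    for counts in cluster_counts:
        max_freq = max(counts) or 1               # max_h freq_{h,l}; guard for an empty image
        row = []
        for i, freq in enumerate(counts):
            f = freq / max_freq                                  # tf factor f_{i,l} (3.6)
            idf = math.log(N / n[i]) if n[i] > 0 else 0.0        # idf_i (3.7)
            row.append(f * idf)                                  # weight w_{i,l} (3.8)
        weights.append(row)
    return weights

def sim_img(w_i, w_j):
    """Cosine similarity between two image vectors, Equation (3.9)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0
```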

3.3.3 Dynamic Growing Self-Organizing Tree (DGSOT) Algorithm

The DGSOT is a tree-structured self-organizing neural network. It is designed to discover the correct hierarchical structure of the underlying data set. The DGSOT grows in two directions: vertical and horizontal. In the direction of vertical growth, the DGSOT adds children, and in the direction of horizontal growth, it adds more siblings. In the vertical growth of a node, only two children are added to the node. In horizontal growth, we strive to determine a suitable number of children to represent the data that are associated with the node. Thus, the DGSOT chooses the right number of subclusters at each hierarchical level during the tree construction process. During tree growth, a learning process similar to the self-organizing tree algorithm (SOTA) is adopted.

The pseudo code of DGSOT is shown below:

Step 1: [Initialization] initially the tree has only one root node. The reference vector of this root node is initialized with the centroid of the entire collection, and all data will be associated with the root. The time parameter t is initialized to 1.

Step 2: [Vertical Growing] any leaf whose heterogeneity is greater than a threshold turns itself into a node x and creates two descendant leaves. The reference vector of a new leaf is initialized with the node's reference vector.

Step 3: [Learning] for each input data associated with node x, the winner is found (using KLD; see Sections 3.3.3.1 and 3.3.3.4), and then the reference vectors of the winner and its neighborhood are updated. If the relative error of the entire tree is larger than a threshold, called the error threshold (TE), t is increased by 1, i.e., t = t + 1, and then Step 3 is repeated.

Step 4: [Horizontal Growing] for each lowest nonleaf node, if the horizontal growing stop rule is unsatisfied, a child leaf is added to this node; if it is satisfied, a child leaf is deleted from this node, and the horizontal growth is terminated.

Step 5: [Learning] for each input data, the winner is found (using KLD, see Section 3.3.3.4), and then the reference vectors of the winner and its neighborhood are updated. If the relative error of the entire tree is larger than a threshold, called the error threshold (TE), t is increased by 1, i.e., t = t + 1, and then Step 5 is repeated.

Step 6: [Expansion] if more levels are necessary in the hierarchy (i.e., the vertical growing stop rule is not yet reached), then return to Step 2; otherwise, stop.

Step 7: [Pruning] any leaf node with no data associated with it is deleted.

Let d be the branch factor of the DGSOT; then the height of the DGSOT will be log_d N, where N is the number of data. Let K be the average number of learning iterations to expand the tree one more level. Because in each learning iteration all data need to be distributed, the time complexity to build a full DGSOT will be O(K N log_d N).
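As a quick illustration of this bound (an aside added here, not part of the algorithm itself), the height for a given data set size and branch factor can be estimated directly:

```python
import math

def dgsot_height(N, d):
    """Approximate height of a DGSOT over N data items with branch factor d: log_d(N)."""
    return math.ceil(math.log(N, d))

# e.g., 10,000 images with branch factor 4 give a tree of roughly 7 levels,
# so the build cost scales as O(K * 10,000 * 7) distance computations.
print(dgsot_height(10_000, 4))   # -> 7
```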

3.3.3.1 Learning Process

In the DGSOT, the learning process consists of a series of procedures to distribute all the data to the leaves and to update the reference vectors. Each procedure is called a cycle. Each cycle contains a series of epochs. Each epoch consists of a presentation of all input data, and each presentation has two steps: finding the best match node and updating the reference vector. Similar to SOTA, the input data are compared only to the leaf nodes to find the best match node, which is known as the winner. The leaf node c that has the minimum distance to the input data x is the best match node/winner.

c : \|x - n_c\| = \min_i \{\|x - n_i\|\}    (3.10)

Fig. 3.2. Different neighborhoods of winner (indicated by arrow) in DGSOT.

After a winner c is found, the reference vectors of the winner and its neighborhood will be updated using the following function:

Δw_i = φ(t) × (x − w_i)    (3.11)

where φ(t) is the learning function:

φ(t) = α × η(t)    (3.12)

η(t) is the learning rate function, α is the learning constant, and t is the time parameter. The convergence of the algorithm depends on a proper choice of α and η(t). At the beginning of learning, η(t) should be chosen close to 1; thereafter, it decreases monotonically. One choice is η(t) = 1/t.
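The winner search (Equation 3.10) and this update rule (Equations 3.11 and 3.12) can be sketched as follows. The leaf objects with a NumPy reference vector in a `.ref` attribute, the function names, and the η(t) = 1/t choice are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def find_winner(x, leaves):
    """Winner c: the leaf whose reference vector n_c is closest to input x (Equation 3.10)."""
    return min(leaves, key=lambda leaf: np.linalg.norm(x - leaf.ref))

def update_reference(ref, x, alpha, t):
    """One update step: Δw = φ(t)(x − w), with φ(t) = α · η(t) and η(t) = 1/t
    (Equations 3.11 and 3.12)."""
    eta = 1.0 / t            # learning rate: close to 1 early on, then decreasing monotonically
    phi = alpha * eta        # learning function φ(t)
    return ref + phi * (x - ref)
```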

Since the neighborhood will be updated, there are two types of neighborhoods of a winning cell: if the sibling of the winner is a leaf node, the neighborhood includes the winner, the parent node, and the sibling nodes; on the other hand, if the sibling of the winning cell is not a leaf node, the neighborhood includes only the winner and its parent node (see Figure 3.2). Updating the reference vectors of siblings is important so that data similar to each other are brought into the same subtree.

The α values for the winner, the parent node, and the sibling nodes are different. Parameters α_w, α_m, and α_s are used for the winner, the parent node, and the sibling node, respectively. Note that the parameter values are not equal; the order of the parameters is set as α_w > α_s > α_m. This is different from the SOTA setting, which is α_w > α_m > α_s. In SOTA, an inorder traversal strategy is used to determine the topological relationship in the neighborhood, whereas in DGSOT, a postorder traversal strategy is used. In our opinion, only the nonequal α_w and α_s are critical to partitioning the input data set into different leaf nodes. The goal of updating the parent node's reference vector is to make it represent all the data associated with its children more precisely (see KLD, Section 3.3.3.4).
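Continuing the sketch above, a neighborhood update with the three learning constants might look as follows. The node attributes (`.ref`, `.parent`, `.children`) and the particular α values are assumptions made for illustration; only the ordering α_w > α_s > α_m comes from the text.

```python
def update_neighborhood(winner, x, t, alpha_w=0.8, alpha_s=0.3, alpha_m=0.05):
    """Update the winner, its parent, and leaf siblings with alpha_w > alpha_s > alpha_m."""
    winner.ref = update_reference(winner.ref, x, alpha_w, t)
    parent = winner.parent
    if parent is None:
        return
    parent.ref = update_reference(parent.ref, x, alpha_m, t)
    for sibling in parent.children:
        # Only leaf siblings belong to the neighborhood (see Figure 3.2).
        if sibling is not winner and not sibling.children:
            sibling.ref = update_reference(sibling.ref, x, alpha_s, t)
```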

The Error of the tree, which is defined as the summation of the distances of each input data to its corresponding winner, is used to monitor the convergence of the learning process. A learning process has converged when the relative increase of the Error of the tree falls below a given threshold:

\frac{Error_{t+1} - Error_t}{Error_t} < ErrorThreshold    (3.13)

It is easier to avoid overtraining the tree in the early stages by controlling the value of the Error of the tree during training.
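A minimal check of this convergence criterion, reusing `find_winner` from the sketch above; the error-accumulation loop is an assumed, simplified form.

```python
def tree_error(data, leaves):
    """Error of the tree: sum of distances from each input to its winner."""
    return sum(np.linalg.norm(x - find_winner(x, leaves).ref) for x in data)

def learning_converged(error_prev, error_curr, error_threshold):
    """Converged when the relative increase of the Error falls below the threshold (3.13)."""
    return (error_curr - error_prev) / error_prev < error_threshold
```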

3.3.3.2 Vertical Growing

In DGSOT, a nongreedy vertical growing is used. During vertical growing, any leaf whose heterogeneity is greater than a threshold turns itself into a node and creates two descendant leaves. The reference vectors of these leaves are initialized with their parent's reference vector.

There are several ways to determine the heterogeneity of a leaf. One simple way is to use the number of data associated with a leaf as its heterogeneity. This simple approach controls the number of elements that appear in each leaf node. The other way is to use the average error e of a leaf as its heterogeneity. This approach does not depend on the number of data associated with a leaf node. The average error e of a leaf is defined as the average distance between the leaf node and the input data associated with the leaf:

e_i = \frac{\sum_{j=1}^{D} d(x_j, n_i)}{|D|}    (3.14)

where D is the total number of input data assigned to the leaf node i, d(x_j, n_i) is the distance between data j and leaf node i, and n_i is the reference vector of the leaf node i. For the first approach, the data will be evenly distributed among the leaves, but the average error of the leaves may differ. For the second approach, the average error of each leaf will be similar to that of the other leaves, but the number of data associated with the leaves may vary substantially. In both approaches, the DGSOT can easily stop at higher hierarchical levels, which saves computational cost.
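Both heterogeneity options can be sketched directly; `leaf_data` (the inputs assigned to a leaf) and `ref` (its reference vector) are assumed NumPy-friendly inputs, reusing the earlier imports.

```python
def heterogeneity_by_count(leaf_data):
    """First option: the number of data associated with the leaf."""
    return len(leaf_data)

def heterogeneity_by_error(leaf_data, ref):
    """Second option: average error e_i, the mean distance between the leaf's
    reference vector and its associated input data (Equation 3.14)."""
    if not leaf_data:
        return 0.0
    return sum(np.linalg.norm(x - ref) for x in leaf_data) / len(leaf_data)
```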

3.3.3.3 Horizontal Growing

In each vertical growing, only two leaves are created for a growing node. In each horizontal growing, the DGSOT tries to find the optimal number of leaf nodes (subclusters) of a node to represent the clusters in that node's expansion phase. Therefore, DGSOT adopts a dynamically growing scheme in each horizontal growing stage. For a lowest nonleaf node (a heterogeneous one), a new child (subcluster) is added to the existing children. This process continues until a certain stop rule is reached; once the stop rule is reached, the number of child nodes is optimized. For example, suppose a nonleaf node in a tree has three children: if the stop rule is not satisfied, a new child is added; conversely, if the addition of a new child satisfies the stop rule, that child is deleted and the horizontal growing is stopped. After each addition/deletion of a node, a learning process is performed (see Sections 3.3.3.1 and 3.3.3.4).

Determining the true number of clusters, known as the cluster validation problem, is a fundamental problem in cluster analysis; it serves as the stop rule here. Several kinds of indexes have been used to validate the clustering [46], for example, indexes based on external and internal criteria. Such approaches rely on statistical tests and come with a high computational cost. Since the DGSOT algorithm tries to optimize the number of clusters for a node in each expansion phase, cluster validation is used heavily.

Therefore, the validation algorithm used in DGSOT must have a light computational cost and be easily evaluated. Here we suggest the average distortion (AD) measure.

AD is used to minimize intracluster distance. The average distortion of a subtree is defined as

AD = \frac{1}{N} \sum_{i=1}^{N} |d(x_i, w)|^2    (3.15)

where N is the total number of input data assigned to the subtree, and w is the reference vector of the winner of input data x_i. The AD is the average distance between the input data and their winners. During DGSOT learning, the total distortion is already calculated, so the AD measure is easily computed after the learning process is finished. If the AD is plotted versus the number of clusters, the curve is monotonically decreasing.

There will be a much smaller drop after the number of clusters exceeds the "true" number of clusters, because beyond this point we add clusters that simply partition within, rather than between, "true" clusters. After a certain point the curve becomes flat; this point indicates the optimal number of clusters in the original data set. The AD can therefore be used as a criterion to choose the optimal number of clusters. In the horizontal growing phase, if the relative change of AD after adding a new sibling is less than a threshold ε (Equation 3.16), the new sibling is deleted and the horizontal growing stops.

\frac{|AD_{K+1} - AD_K|}{AD_K} < ε    (3.16)

where K is the number of siblings in the subtree and ε is a small value, generally less than 0.1.
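A sketch of the AD-based stop rule (Equations 3.15 and 3.16), again reusing `find_winner` from the earlier sketch; the default value of ε is an illustrative assumption.

```python
def average_distortion(subtree_data, leaves):
    """AD: mean squared distance between each input and its winner (Equation 3.15)."""
    return sum(np.linalg.norm(x - find_winner(x, leaves).ref) ** 2
               for x in subtree_data) / len(subtree_data)

def horizontal_growing_stops(ad_prev, ad_new, eps=0.05):
    """Stop when the relative change in AD after adding a sibling drops below eps (3.16)."""
    return abs(ad_new - ad_prev) / ad_prev < eps
```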

3.3.3.4 K-Level up Distribution (KLD)

Clustering in a self-organizing neural network is distortion-based competitive learning. The nearest-neighbor rule is used to make the clustering decision. In SOTA, data associated with the parent node will be distributed only among its children. If data are incorrectly clustered in an early stage, these errors cannot be corrected in the later learning process.

To improve the clustering result, we propose a new distribution approach called K-level up distribution (KLD). Data associated with a parent node will be distributed not only to its child leaves but also to its neighboring leaves. The KLD strategy is as follows:

• For a selected node, its K-level ancestor node is determined.

• The subtree rooted by the ancestor node is determined.

• Data assigned to the selected node will be distributed among all leaves of the subtree.

Fig. 3.3. One level up distribution (K = 1), showing the one-level-up redistribution scope.

For example, Figure 3.3 shows the scope of K = 1. Here, the data associated with node M need to be distributed to the newly created leaves. For K = 1, the immediate ancestor of M is determined, which is A. The data associated with node M will be distributed to leaves B, C, D, and E of the subtree rooted by A. For each data item, the winning leaf will be determined among B, C, D, and E using Equation (3.10). Note that if K = 0, the data of M would be distributed only between leaves B and C; this latter approach is identical to the conventional approach.
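The KLD strategy can be sketched as follows. The node attributes (`.parent`, `.children`, `.data`, `.ref`) and the tree-walking details are assumptions made for illustration, and `find_winner` is the winner search from the sketch in Section 3.3.3.1.

```python
def kld_distribute(node, K):
    """K-level up distribution: re-assign the data of `node` among all leaves of the
    subtree rooted by its K-level ancestor (K = 0 reduces to the conventional
    children-only distribution)."""
    # 1. Find the K-level ancestor (stopping at the root if the tree is shallower).
    ancestor = node
    for _ in range(K):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    # 2. Collect all leaves of the subtree rooted by the ancestor.
    leaves, stack = [], [ancestor]
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            leaves.append(n)
    # 3. Each data item goes to the winning leaf, found with Equation (3.10).
    pending, node.data = list(node.data), []
    for x in pending:
        find_winner(x, leaves).data.append(x)
```

With K = 1 and the tree of Figure 3.3, calling `kld_distribute(M, 1)` climbs to A and redistributes M's data among B, C, D, and E; with K = 0 the same call considers only B and C.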