
4.3 Case studies

4.3.3 Relational illustration: the lesmis data

The lesmis dataset⁷ is a graph that represents the links between the characters of Victor Hugo's novel Les Misérables. Each vertex corresponds to a character, and an edge means that the two characters it connects appear in the same chapter.

Figure 4.20 Co-appearance graph of the characters of Les Misérables

To use these data with the 'relational' case of the algorithm, a dissimilarity matrix was computed with the function shortest.paths of the igraph package. From the graph in figure 4.20, the dissimilarity between two characters is defined as the length of the shortest path between them; it therefore takes non-negative integer values.
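As a rough illustration of this construction, the base-R sketch below computes shortest-path lengths on a small toy graph with a Floyd-Warshall loop. The adjacency matrix `adj` and the helper `shortestPathDissim` are hypothetical names introduced here; they only stand in for what igraph computes on the real graph and are not part of SOMbrero.

```r
# Toy undirected graph on 4 vertices: a path A - B - C - D (hypothetical data)
adj <- matrix(0, nrow = 4, ncol = 4,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))
adj["A", "B"] <- adj["B", "A"] <- 1
adj["B", "C"] <- adj["C", "B"] <- 1
adj["C", "D"] <- adj["D", "C"] <- 1

# Hypothetical helper: all-pairs shortest path lengths (Floyd-Warshall)
shortestPathDissim <- function(adj) {
  n <- nrow(adj)
  d <- ifelse(adj > 0, 1, Inf)  # one step along each existing edge
  diag(d) <- 0                  # a vertex is at distance 0 from itself
  for (k in seq_len(n)) {
    # allow paths that go through vertex k
    d <- pmin(d, outer(d[, k], d[k, ], "+"))
  }
  d
}

dissim.toy <- shortestPathDissim(adj)
dissim.toy
```

The result is a symmetric matrix of non-negative integers with a zero diagonal, which is the kind of input expected by the 'relational' case.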

mis.som <- trainSOM(x.data=dissim.lesmis, type="relational")

The distribution of the characters on the map can be obtained either as a table, with the R function table:

⁷ Source: http://people.sc.fsu.edu/~jburkardt/datasets/sgb/jean.dat

table(mis.som$clustering)

##  1  2  3  4  5  6  7  9 11 12 14 15 18 20 21 22 24 25
##  6  7  3  2  8  1  3  3  3  7  3  3  2  6  6  5  3  6
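The table only lists neurons that contain at least one character; neurons 8 and 10, for instance, do not appear. The sketch below shows one way to make the empty neurons visible by declaring every neuron as a factor level. The clustering vector is a made-up example, and the 25-neuron (5 x 5) grid size is an assumption made for the illustration.

```r
# Hypothetical clustering vector: neuron index of each observation
clustering <- c(1, 1, 2, 4, 4, 4, 9, 25)
n.neurons <- 25  # assumption: 5 x 5 grid

# Plain table(): neurons without observations are silently dropped
table(clustering)

# Declaring all neurons as factor levels keeps the empty ones (count 0)
counts <- table(factor(clustering, levels = seq_len(n.neurons)))
counts

# Indices of the empty neurons
empty <- which(counts == 0)
```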

or graphically, either by calling the function plot.somRes (see figure 4.21) or by applying a color palette to the original graph (figure 4.22).

Figure 4.21 Distribution of the characters on the map

Figure 4.22 Visualization of the classification on the graph

Figure 4.23 shows the graph projected onto the grid: the size of each vertex is proportional to the number of characters assigned to the neuron, and the width of the edge between two neurons depends on the number of edges, in figure 4.20, between the characters of the two classes.
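The two quantities behind this projected graph can be sketched in base R as follows. This is only an illustration on a made-up graph and clustering; the names `adj`, `memb` and `between` are introduced here and are not taken from SOMbrero.

```r
# Toy graph (hypothetical): 5 characters, undirected edges
adj <- matrix(0, nrow = 5, ncol = 5)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # symmetrize

# Hypothetical clustering: neuron assigned to each character
clustering <- c(1, 1, 2, 2, 3)
neurons <- sort(unique(clustering))

# Vertex sizes: number of characters per neuron
sizes <- table(clustering)

# Indicator matrix: one row per character, one column per neuron
memb <- outer(clustering, neurons, "==") * 1

# Edge widths: number of original edges between the characters of two neurons
between <- t(memb) %*% adj %*% memb
dimnames(between) <- list(neurons, neurons)
diag(between) <- 0  # within-neuron edges are not drawn on the projected graph
between
```

Here `sizes` would drive the vertex sizes and `between` the edge widths of the projected graph.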

Figure 4.23 Graph projected onto the grid

The profiles of the observations can be displayed, among other options, with a barplot-type graphic, shown in figure 4.24.

Figure 4.24 Profiles of the characters

Given the small number of observations in most classes, it is relevant to perform a super classification:

sc.mis <- superClass(mis.som, k=6)

This yields a smaller number of classes, and hence results that are closer to the actual structure of the novel.
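The idea of a super classification can be sketched with base R alone: the neurons are grouped by a hierarchical clustering of their prototypes, and each observation inherits the super class of its neuron. The sketch below assumes numeric prototypes, a "complete" linkage and random toy data; it is not the code of superClass, whose details may differ.

```r
set.seed(1)
# Hypothetical prototypes: 25 neurons described by 3 numeric variables
prototypes <- matrix(rnorm(25 * 3), nrow = 25)

# Hierarchical clustering of the neurons from their pairwise distances
tree <- hclust(dist(prototypes), method = "complete")

# Cut the tree into k = 6 super classes
super.neuron <- cutree(tree, k = 6)

# Hypothetical clustering: neuron of each of 77 observations
clustering <- sample(1:25, 77, replace = TRUE)

# Each observation inherits the super class of its neuron
super.obs <- super.neuron[clustering]
table(super.obs)
```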

In addition to the traditional dendrogram produced by the plot type dendrogram (see figure iris-sc-dendro for an example), SOMbrero also offers a three-dimensional version of it through the plot type dendro3d (figure 4.25).

Figure 4.25 Three-dimensional dendrogram of the super classification

Adding the super classification information to the original graph makes it possible to identify (figure 4.26):

• two of the main characters, Javert and Valjean, who play a central role in the novel (super class 1);

• the characters who appear only in the story of Bishop Myriel (super class 2);

• Fantine and the characters of her story (super class 3);

• Marius and his family (Mme Pontmercy, his mother, Lieutenant Gillenormand, his cousin, etc.) as well as Cosette, with whom he will have a romance (super class 4);

• Gavroche, the abandoned child of the Thénardiers, and the characters related to his story (super class 5);

• the Thénardier family (super class 6).

Figure 4.26 Visualization of the super classification on the graph

Chapter 5

Conclusion

I completed my final-year internship for the Computer Engineering and Statistics (Génie Informatique et Statistique) program at the SAMM laboratory. This long immersion in a professional environment allowed me to put my theoretical knowledge into practice and to deepen it, particularly in programming and more specifically in R programming. During this period, I was also confronted with the realities of the working world.

Building the SOMbrero package gave me the opportunity to carry out a variety of tasks, mainly: learning the self-organizing map method, developing R scripts and documentation, writing user guides, and improving my oral communication through multiple in-house presentations of my work.

This rich and well-rounded experience prepared me well for starting my career and reassured me about my abilities in computing. In addition, I was able to experience an important aspect of today's working world, namely remote work, since one of my supervisors was based in Paris.

Finally, I would like to emphasize the good conditions, both material and human, in which my internship took place.


Appendix A

calculateRadius <- function(the.grid, radius.type, ind.t, maxit) {
  ## TODO: implement other radius types
  # ind.t: iteration index
  # [...]
                               method="euclidean"))[the.neuron,]
    the.nei <- which(the.dist<=1)
  } else {
    the.dist <- as.matrix(dist(the.grid$coord, diag=TRUE, upper=TRUE,
                               method="maximum"))[the.neuron,]
    the.nei <- which(the.dist<=radius)
  # [...]

# Functions to manipulate objects in the input space
distEuclidean <- function(x, y) {
  sqrt(sum((x-y)^2))
}

distRelationalProto <- function(proto1, proto2, x.data) {
  -0.5*t(proto1-proto2)%*%x.data%*%(proto1-proto2)
}
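The relational formula can be checked on a small example. When the dissimilarity matrix contains squared Euclidean distances and the prototypes are convex combinations of the observations, distRelationalProto returns the squared Euclidean distance between the two prototypes, computed from the dissimilarities alone. The sketch below uses made-up data; the names `x`, `d2`, `alpha1` and `alpha2` are introduced for the illustration only.

```r
# Definition copied from the listing above
distRelationalProto <- function(proto1, proto2, x.data) {
  -0.5 * t(proto1 - proto2) %*% x.data %*% (proto1 - proto2)
}

set.seed(42)
# Hypothetical data: 5 observations in the plane
x <- matrix(rnorm(10), nrow = 5)
# Dissimilarity matrix: squared Euclidean distances
d2 <- as.matrix(dist(x))^2

# Two prototypes given as convex combinations of the observations
alpha1 <- c(0.5, 0.5, 0, 0, 0)
alpha2 <- c(0, 0, 0.2, 0.3, 0.5)

# Relational formula, using only the dissimilarities
rel <- drop(distRelationalProto(alpha1, alpha2, d2))

# Direct computation, using the coordinates
p1 <- drop(alpha1 %*% x)
p2 <- drop(alpha2 %*% x)
direct <- sum((p1 - p2)^2)

all.equal(rel, direct)
```

For a dissimilarity that is not a squared Euclidean distance, such as the shortest-path length, the same quantity can become negative, which is what the warnings in calculateProtoDist guard against.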

calculateProtoDist <- function(prototypes, the.grid, type, complete=FALSE,
                               x.data=NULL) {
  if (!complete) {
    all.nei <- sapply(1:prod(the.grid$dim), selectNei, the.grid=the.grid,
                      radius=1)
    all.nei <- sapply(1:prod(the.grid$dim), function(neuron)
      setdiff(all.nei[[neuron]], neuron))
    if (type!="relational") {
      # euclidean case
      distances <- sapply(1:prod(the.grid$dim), function(one.neuron) {
        apply(prototypes[all.nei[[one.neuron]],], 1, distEuclidean,
              y=prototypes[one.neuron,])
      })
    } else {
      distances <- sapply(1:prod(the.grid$dim), function(one.neuron) {
        apply(prototypes[all.nei[[one.neuron]],], 1, distRelationalProto,
              proto2=prototypes[one.neuron,], x.data=x.data)
      })
      if (sum(unlist(distances)<0)>0)
        warning("some of the relational 'distances' are negatives\n plots, qualities, super-clustering... may not work!",
                immediate.=TRUE, call.=TRUE)
    }
  } else {
    if (type=="relational") {
      # non euclidean case
      distances <- apply(prototypes, 1, function(one.proto) {
        apply(prototypes, 1, distRelationalProto, proto2=one.proto,
              x.data=x.data)
      })
      if (sum(distances<0)>0)
        warning("some of the relational 'distances' are negatives\n plots, qualities, super-clustering... may not work!",
                immediate.=TRUE, call.=TRUE)
    } else distances <- as.matrix(dist(prototypes, upper=TRUE, diag=TRUE))
  }
  distances
}

## Functions used during training of SOM

# Step 2: Preprocess data ("korresp" case)
korrespPreprocess <- function(cont.table) {
  both.profiles <- matrix(0, nrow=nrow(cont.table)+ncol(cont.table),
  # [...]
  # Best column to complete row profiles
  best.col <- apply(both.profiles[1:nrow(cont.table), 1:ncol(cont.table)],
                    1, which.max)
  both.profiles[1:nrow(cont.table), (ncol(cont.table)+1):ncol(both.profiles)] <-
    both.profiles[best.col+nrow(cont.table),
                  (ncol(cont.table)+1):ncol(both.profiles)]
  # Best row to complete col profiles
  best.row <- apply(both.profiles[(nrow(cont.table)+1):
                                    (nrow(cont.table)+ncol(cont.table)),
                                  (ncol(cont.table)+1):
  # [...]
  rownames(both.profiles) <- c(rownames(cont.table), colnames(cont.table))
  colnames(both.profiles) <- c(colnames(cont.table), rownames(cont.table))
  return(both.profiles)
}

# Step 3: Initialize prototypes
initProto <- function(parameters, norm.x.data, x.data) {
  if (is.null(parameters$proto0)) {
    if (parameters$init.proto=="random") {
      if (parameters$type=="relational") {
        prototypes <- t(apply(matrix(runif(prod(parameters$the.grid$dim,
                                                nrow(norm.x.data))),
                                     nrow=prod(parameters$the.grid$dim)),
                              1, function(x) x/sum(x)))
      } else {
        # both numeric and korresp
        prototypes <- sapply(1:ncol(norm.x.data), function(ind) {
          runif(prod(parameters$the.grid$dim), min=min(norm.x.data[,ind]),
                max=max(norm.x.data[,ind]))
        })
      }
    } else if (parameters$init.proto=="obs") {
      if (parameters$type=="korresp" | parameters$type=="numeric") {
        prototypes <- norm.x.data[sample(1:nrow(norm.x.data),
                                         prod(parameters$the.grid$dim),
                                         replace=TRUE),]
      } else if (parameters$type=="relational") {
        prototypes <- matrix(0, nrow=prod(parameters$the.grid$dim),
                             ncol=ncol(norm.x.data))
        prototypes[cbind(1:nrow(prototypes),
                         sample(1:ncol(prototypes), nrow(prototypes),
                                replace=TRUE))] <- 1
      }
    }
  } else {
    prototypes <- switch(parameters$scaling,
                         "unitvar"=scale(parameters$proto0,
                                         center=apply(x.data,2,mean),
                                         scale=apply(x.data,2,sd)),
                         "center"=scale(parameters$proto0,
                                        center=apply(x.data,2,mean),
                                        scale=FALSE),
                         "none"=as.matrix(parameters$proto0),
                         "chi2"=as.matrix(parameters$proto0))
  }
  return(prototypes)
}

# Step 5: Randomly choose an observation
selectObs <- function(ind.t, ddim, type) {
  # [...]

oneObsAffectation <- function(x.new, prototypes, type, x.data=NULL) {
  if (type=="relational") {
    the.neuron <- which.min(prototypes%*%x.new-
                              0.5*diag(prototypes%*%x.data%*%t(prototypes)))
  } else the.neuron <- which.min(apply(prototypes, 1, distEuclidean, y=x.new))
  the.neuron
}

# Step 7: Update of prototypes
prototypeUpdate <- function(type, the.nei, epsilon, prototypes, rand.ind,
                            sel.obs) {
  # [...]

# Step 8: calculate intermediate energy
# TODO: It would probably be better to implement a function 'distEltProto'
calculateClusterEnergy <- function(cluster, x.data, clustering, prototypes,
                                   parameters, radius) {
  if (parameters$type=="numeric" || parameters$type=="korresp") {
    if (parameters$radius.type=="letremy") {
      the.nei <- selectNei(cluster, parameters$the.grid, radius)
      if (sum(clustering%in%the.nei)>0) {
        return(sum((x.data[which(clustering%in%the.nei),]-
                      outer(rep(1,sum(clustering%in%the.nei)),
                            prototypes[cluster,]))^2))
      }
    }
  } else if (parameters$type=="relational") {
    if (parameters$radius.type=="letremy") {
      the.nei <- selectNei(cluster, parameters$the.grid, radius)
      if (sum(clustering%in%the.nei)>0) {
        return(sum(prototypes%*%x.data[,which(clustering%in%the.nei)]-
                     0.5*diag(prototypes%*%x.data%*%t(prototypes))))
      }
    }
  }
}

calculateEnergy <- function(x.data, clustering, prototypes, parameters,
                            ind.t) {
  if (parameters$type=="numeric" || parameters$type=="korresp") {
    if (parameters$radius.type=="letremy") {
      radius <- calculateRadius(parameters$the.grid, parameters$radius.type,
                                ind.t, parameters$maxit)
      return(sum(unlist(sapply(1:nrow(prototypes), calculateClusterEnergy,
                               x.data=x.data, clustering=clustering,
                               prototypes=prototypes, parameters=parameters,
                               radius=radius)))/
               nrow(x.data)/nrow(prototypes))
    }
  } else if (parameters$type=="relational") {
    if (parameters$radius.type=="letremy") {
      radius <- calculateRadius(parameters$the.grid, parameters$radius.type,
                                ind.t, parameters$maxit)
      return(sum(unlist(sapply(1:nrow(prototypes), calculateClusterEnergy,
                               x.data=x.data, clustering=clustering,
                               prototypes=prototypes, parameters=parameters,
                               radius=radius)))/
               nrow(x.data)/nrow(prototypes))
    }
  }
}

##### Main function
################################################################################
trainSOM <- function (x.data, ...) {
  param.args <- list(...)
  ## Step 1: Parameters handling
  if (!is.matrix(x.data)) x.data <- as.matrix(x.data, rownames.force=TRUE)
  # Default dimension: nb.obs/10 with minimum equal to 5 and maximum to 10
  if (is.null(param.args$dimension)) {
    if (!is.null(param.args$type) && param.args$type=="korresp")
      param.args$dimension <-
        c(max(5,min(10,ceiling(sqrt((nrow(x.data)+ncol(x.data))/10)))),
  # [...]
    stop("data do not match chosen SOM type ('relational')\n", call.=TRUE)
  # [...]
  # Initialize parameters and print
  parameters <- do.call("initSOM", param.args)
  if (parameters$verbose) {
    cat("Self-Organizing Map algorithm...\n")
    print.paramSOM(parameters)
  }
  # Check proto0 also now that the parameters have been initialized
  if (!is.null(param.args$proto0)) {
    if ((param.args$type=="korresp")&&
          (!identical(dim(param.args$proto0),
                      as.integer(c(prod(param.args$dimension),
                                   ncol(x.data)+nrow(x.data)))))) {
      stop("initial prototypes dimensions do not match SOM parameters:
in the current SOM, prototypes must have ", prod(param.args$dimension),
           " rows and ", ncol(x.data)+nrow(x.data), " columns\n", call.=TRUE)
    } else if (!identical(dim(param.args$proto0),
                          as.integer(c(prod(param.args$dimension),
                                       ncol(x.data))))) {
      stop("initial prototypes dimensions do not match SOM parameters: in the current SOM, prototypes must have ",
           prod(param.args$dimension), " rows and ", ncol(x.data),
           " columns\n", call.=TRUE)
    }
  }

  ## Step 2: Preprocess the data
  # Scaling
  norm.x.data <- switch(parameters$scaling,
                        "unitvar"=scale(x.data, center=TRUE, scale=TRUE),
                        "center"=scale(x.data, center=TRUE, scale=FALSE),
                        "none"=as.matrix(x.data),
                        "chi2"=korrespPreprocess(x.data))

  ## Step 3: Initialize prototypes
  prototypes <- initProto(parameters, norm.x.data, x.data)

  # Step 4: Initialize backup if needed
  if (parameters$nb.save>1) {
    backup <- list()
    backup$prototypes <- list()
    backup$clustering <- matrix(ncol=parameters$nb.save,
                                nrow=nrow(norm.x.data))
    backup$energy <- vector(length=parameters$nb.save)
    backup$steps <- round(seq(1, parameters$maxit, length=parameters$nb.save),
                          0)
  }

  ## Main Loop: from 1 to parameters$maxit
  for (ind.t in 1:parameters$maxit) {
    ## Step 5: Randomly choose an observation
    rand.ind <- selectObs(ind.t, dim(x.data), parameters$type)
    sel.obs <- norm.x.data[rand.ind,]

    ## Step 6: Assignment step
    # For the "korresp" type, cut the prototypes and selected observation
    if (parameters$type=="korresp") {
      if (ind.t%%2==0) {
        cur.obs <- sel.obs[1:ncol(x.data)]
        cur.prototypes <- prototypes[,1:ncol(x.data)]
      } else {
        cur.obs <- sel.obs[(ncol(x.data)+1):ncol(norm.x.data)]
        cur.prototypes <- prototypes[,(ncol(x.data)+1):ncol(norm.x.data)]
      }
    } else {
      cur.prototypes <- prototypes
      cur.obs <- sel.obs
    }
    # Assign
    winner <- oneObsAffectation(cur.obs, cur.prototypes, parameters$type,
                                norm.x.data)

    ## Step 7: Representation step
    # Radius value
    radius <- calculateRadius(parameters$the.grid, parameters$radius.type,
                              ind.t, parameters$maxit)
    the.nei <- selectNei(winner, parameters$the.grid, radius)
    # TODO: scale epsilon with a parameter???
    epsilon <- 0.3/(1+0.2*ind.t/prod(parameters$the.grid$dim))
    # Update
    prototypes[the.nei,] <- prototypeUpdate(parameters$type, the.nei, epsilon,
                                            prototypes, rand.ind, sel.obs)

    ## Step 8: Intermediate backups (if needed)
    if (parameters$nb.save==1) {
      warning("nb.save can not be 1\n No intermediate backups saved",
              immediate.=TRUE, call.=TRUE)
    }
    if (parameters$nb.save>1) {
      if (ind.t %in% backup$steps) {
        out.proto <- switch(parameters$scaling,
                            "unitvar"=scale(prototypes,
                                            center=-apply(x.data,2,mean)/
                                              apply(x.data,2,sd),
                                            scale=1/apply(x.data,2,sd)),
                            "center"=scale(prototypes,
                                           center=-apply(x.data,2,mean),
                                           scale=FALSE),
                            "none"=prototypes,
                            "chi2"=prototypes)
        colnames(out.proto) <- colnames(norm.x.data)
        rownames(out.proto) <- 1:prod(parameters$the.grid$dim)
        res <- list("parameters"=parameters, "prototypes"=out.proto,
                    "data"=x.data)
        class(res) <- "somRes"
        ind.s <- match(ind.t, backup$steps)
        backup$prototypes[[ind.s]] <- out.proto
        backup$clustering[,ind.s] <- predict.somRes(res, x.data)
        backup$energy[ind.s] <- calculateEnergy(norm.x.data,
                                                backup$clustering[,ind.s],
                                                prototypes, parameters, ind.t)
      }
      if (ind.t==parameters$maxit) {
        clustering <- backup$clustering[,ind.s]
        if (parameters$type=="korresp") {
          names(clustering) <- c(colnames(x.data), rownames(x.data))
        } else names(clustering) <- rownames(x.data)
        energy <- backup$energy[ind.s]
      }
    } else if (ind.t==parameters$maxit) {
      out.proto <- switch(parameters$scaling,
                          "unitvar"=scale(prototypes,
                                          center=-apply(x.data,2,mean)/
                                            apply(x.data,2,sd),
                                          scale=1/apply(x.data,2,sd)),
                          "center"=scale(prototypes,
                                         center=-apply(x.data,2,mean),
                                         scale=FALSE),
                          "none"=prototypes,
                          "chi2"=prototypes)
      res <- list("parameters"=parameters, "prototypes"=out.proto,
                  "data"=x.data)
      class(res) <- "somRes"
      clustering <- predict.somRes(res, x.data)
      if (parameters$type=="korresp") {
        names(clustering) <- c(colnames(x.data), rownames(x.data))
      } else names(clustering) <- rownames(x.data)
      energy <- calculateEnergy(norm.x.data, clustering, prototypes,
                                parameters, ind.t)
    }
  }

  colnames(out.proto) <- colnames(norm.x.data)
  rownames(out.proto) <- 1:prod(parameters$the.grid$dim)
  if (parameters$nb.save<=1) {
    res <- list("clustering"=clustering, "prototypes"=out.proto,
                "energy"=energy, "data"=x.data, "parameters"=parameters)
  } else {
    if (parameters$type=="korresp") {
      rownames(backup$clustering) <- c(colnames(x.data), rownames(x.data))
    } else rownames(backup$clustering) <- rownames(x.data)
    res <- list("clustering"=clustering, "prototypes"=out.proto,
                "energy"=energy, "backup"=backup, "data"=x.data,
                "parameters"=parameters)
  }
  class(res) <- "somRes"
  return(res)
}

## S3 methods for somRes class objects
################################################################################
print.somRes <- function(x, ...) {
  cat(" Self-Organizing Map object...\n")
  cat(" ", x$parameters$mode, "learning, type:", x$parameters$type, "\n")
  cat(" ", x$parameters$the.grid$dim[1], "x",
      x$parameters$the.grid$dim[2],
      "grid with", x$parameters$the.grid$topo, "topology\n")
}

summary.somRes <- function(object, ...) {
  cat("\nSummary\n\n")
  cat(" Class : ", class(object), "\n\n")
  print(object)
  cat("\n Final energy :", object$energy, "\n")
  if (object$parameters$type=="numeric") {
    cat("\n ANOVA : \n")
    res.anova <- as.data.frame(t(sapply(1:ncol(object$data), function(ind) {
      c(round(summary(aov(object$data[,ind]~
                            as.factor(object$clustering)))[[1]][1,4],
              digits=3),
        round(summary(aov(object$data[,ind]~
                            as.factor(object$clustering)))[[1]][1,5],
              digits=8))
    })))
    names(res.anova) <- c("F", "pvalue")
    res.anova$significativity <- rep("", ncol(object$data))
    res.anova$significativity[res.anova$"pvalue"<0.05] <- "*"
    res.anova$significativity[res.anova$"pvalue"<0.01] <- "**"
    res.anova$significativity[res.anova$"pvalue"<0.001] <- "***"
    rownames(res.anova) <- colnames(object$data)
    cat("\n Degrees of freedom : ",
        summary(aov(object$data[,1]~as.factor(object$clustering)))[[1]][1,1],
        "\n\n")
    # [...]
    if (chisq.res$p.value<0.001) sig <- "***"
    cat("\n ", chisq.res$method, ":\n\n")
    cat(" X-squared : ", chisq.res$statistic, "\n")
    cat(" Degrees of freedom : ", chisq.res$parameter, "\n")
    cat(" p-value : ", chisq.res$p.value, "\n")
    cat(" significativity : ", sig, "\n")
  }
}

predict.somRes <- function(object, x.new, ...) {
  if (is.null(dim(x.new)))
    x.new <- matrix(x.new, nrow=1, dimnames=list(1, colnames(object$data)))
  if (object$parameters$type!="korresp") {
    norm.x.new <- switch(object$parameters$scaling,
                         "unitvar"=scale(x.new,
                                         center=apply(object$data,2,mean),
                                         scale=apply(object$data,2,sd)),
                         "center"=scale(x.new,
                                        center=apply(object$data,2,mean),
                                        scale=FALSE),
                         "none"=as.matrix(x.new))
    norm.proto <- switch(object$parameters$scaling,
                         "unitvar"=scale(object$prototypes,
                                         center=apply(object$data,2,mean),
                                         scale=apply(object$data,2,sd)),
                         "center"=scale(object$prototypes,
                                        center=apply(object$data,2,mean),
                                        scale=FALSE),
                         "none"=object$prototypes)
    winners <- apply(norm.x.new, 1, oneObsAffectation,
                     prototypes=norm.proto, type=object$parameters$type,
                     x.data=object$data)
  } else {
    if (!identical(as.matrix(x.new), object$data))
      warning("For 'korresp' SOM, predict.somRes function can only be called on the original data set\n'object' replaced",
              call.=TRUE)
    norm.x.data <- korrespPreprocess(object$data)
    winners.rows <- apply(norm.x.data[1:nrow(object$data),
                                      1:ncol(object$data)],
                          1, oneObsAffectation,
                          prototypes=object$prototypes[,1:ncol(object$data)],
                          type=object$parameters$type)
    winners.cols <- apply(norm.x.data[(nrow(object$data)+1):ncol(norm.x.data),
                                      (ncol(object$data)+1):ncol(norm.x.data)],
                          1, oneObsAffectation,
                          prototypes=object$prototypes[,(ncol(object$data)+1):
                                                         ncol(norm.x.data)],
                          type=object$parameters$type)
    winners <- c(winners.cols, winners.rows)
  }
  winners
}

protoDist.somRes <- function(object, mode=c("complete","neighbors"), ...) {
  mode <- match.arg(mode)
  complete <- (mode=="complete")
  if (object$parameters$type=="relational") {
    x.data <- object$data
  } else x.data <- NULL
  distances <- calculateProtoDist(object$prototypes,
                                  object$parameters$the.grid,
                                  object$parameters$type, complete, x.data)
  return(distances)
}

protoDist <- function(object, mode, ...) {
  UseMethod("protoDist")
}
