
All procedures for the cluster analyses were carried out in R.

6.1 Univariate Analyses – Within dimensions:

We performed univariate analyses for each dimension (positive, negative, disorganized) separately. For each dimension, rows represented participants (n = 109) and a single column contained each participant's raw score on the dimension studied. There were no missing values, and the data were standardized.

For the univariate analyses we used optimal k-means clustering in one dimension by dynamic programming, as implemented in the Ckmeans.1d.dp R package. This function performs optimal k-means clustering on one-dimensional data.

In contrast to heuristic k-means algorithms for multidimensional data, this function assigns the elements of a numeric vector to clusters by dynamic programming [10]. It minimizes the sum of squared within-cluster distances from each element to its cluster centre. When a range is provided for k (we used a range of 1 to 10 clusters), the exact number of clusters is selected by the Bayesian information criterion (BIC). An optimal solution of 2 clusters was returned for each dimension.
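As an illustration, a minimal sketch of this step in R, assuming the raw scores of one dimension are stored in a numeric vector spq_disorganized (an illustrative name, not necessarily that of the original scripts):

    library(Ckmeans.1d.dp)

    x <- as.numeric(scale(spq_disorganized))  # standardize the raw scores
    res <- Ckmeans.1d.dp(x, k = c(1, 10))     # k given as a range; BIC selects the optimum

    length(res$size)   # estimated number of clusters (2 for each dimension in our data)
    res$cluster        # cluster assignment of each participant
    res$centers        # centre of each cluster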

To validate the clustering solutions, we computed the elbow and Calinski-Harabasz criteria and plotted them.

The elbow method looks at the percentage of variance explained as the number of clusters increases: one should choose a number of clusters such that adding another cluster does not substantially improve the explained variance. Visually, the location of a bend (knee) in the plot is generally taken as an indicator of the appropriate number of clusters.
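A minimal sketch of this criterion for one dimension, using the same assumed vector x as above (the original plots may have been produced with different code):

    # Explained variance as a function of the number of clusters
    wss <- sapply(1:5, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
    expl_var <- 1 - wss / wss[1]   # wss for k = 1 equals the total sum of squares
    plot(1:5, expl_var, type = "b",
         xlab = "Nb. of groups", ylab = "% explained variance")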

The Calinski-Harabasz (CH) index is the ratio of between-cluster dispersion to within-cluster dispersion. A maximum value of the CH index indicates a suitable partition of the data set.
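One possible way to compute this index is the calinhara function from the fpc package (an assumption on our part; the original validation plots may rely on a different implementation):

    library(fpc)

    ch <- sapply(2:5, function(k) {
      cl <- kmeans(x, centers = k, nstart = 25)$cluster
      calinhara(as.matrix(x), cl)   # Calinski-Harabasz index for k clusters
    })
    plot(2:5, ch, type = "b",
         xlab = "Nb. of groups", ylab = "Calinski-Harabasz")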

Below we present, for each dimension, the number of clusters estimated by BIC with the Ckmeans.1d.dp function, the resulting partition into clusters, and the validity measures obtained.

For each dimension we examined both methods (elbow and Calinski-Harabasz). The elbow method suggests 2 clusters for every dimension, since a third cluster forms no elbow and thus adds little explained variance. The Calinski-Harabasz index for a third cluster is only slightly higher than for two. We therefore conclude that, for each dimension, the 2-cluster solution is optimal.

SPQ Disorganized dimension

Figure 4: Two clusters were obtained from applying the ckmeans.1d.dp function on x = SPQ_disorganized scores, sampled from a Gaussian mixture model estimated with two components. The horizontal axis represents the index number of each point in the data. The dotted lines represent the centres of each cluster.

[Figure 4 plot: "Optimal univariate clustering with k estimated"; number of clusters estimated to be 2; x-axis: Index, y-axis: x; Cluster 1 and Cluster 2 with their centres.]

Figure 5: Elbow method for validation of the 2-cluster solution.

[Figure 5 plot: % explained variance vs. number of groups (1–5).]

Figure 6: Calinski-Harabasz method for validation of the 2-cluster solution.

[Figure 6 plot: Calinski-Harabasz index vs. number of groups (1–5).]

SPQ Positive dimension

Figure 7: Two clusters were obtained from applying the ckmeans.1d.dp function on x = SPQ_positive scores, sampled from a Gaussian mixture model estimated with two components. The horizontal axis represents the index number of each point in the data. The dotted lines represent the centres of each cluster.

[Figure 7 plot: "Optimal univariate clustering with k estimated"; number of clusters estimated to be 2; x-axis: Index, y-axis: x; Cluster 1 and Cluster 2 with their centres.]

Figure 8: Elbow method for validation of the 2-cluster solution.

[Figure 8 plot: % explained variance vs. number of groups (1–5).]

Figure 9: Calinski-Harabasz method for validation of the 2-cluster solution.

[Figure 9 plot: Calinski-Harabasz index vs. number of groups (1–5).]

SPQ Negative dimension

Figure 10: Two clusters were obtained from applying the ckmeans.1d.dp function on x = SPQ_negative scores, sampled from a Gaussian mixture model estimated with two components. The horizontal axis represents the index number of each point in the data. The dotted lines represent the centres of each cluster.

[Figure 10 plot: "Optimal univariate clustering with k estimated"; number of clusters estimated to be 2; x-axis: Index, y-axis: x; Cluster 1 and Cluster 2 with their centres.]

Figure 11: Elbow method for validation of the 2-cluster solution.

[Figure 11 plot: % explained variance vs. number of groups (1–5).]

Figure 12: Calinski-Harabasz method for validation of the 2-cluster solution.

[Figure 12 plot: Calinski-Harabasz index vs. number of groups (1–5).]

6.2 Multivariate Analyses – Three dimensions:

We performed multivariate analyses on the three dimensions combined (positive, negative, disorganized). Rows represented participants (n = 109) and the three columns contained each participant's raw scores on the three dimensions. There were no missing values, and the data were standardized.

To determine the optimal number of clusters, we computed several internal validation indices, including connectivity, silhouette width and the Dunn index, using functions from the clValid R package [12]. The results supported a partition into 3 groups.
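A minimal sketch of this validation step, assuming the standardized participant-by-dimension matrix is called spq3 (an illustrative name) and that candidate numbers of clusters ranged from 2 to 6:

    library(clValid)

    internal <- clValid(spq3, nClust = 2:6,
                        clMethods = c("hierarchical", "kmeans", "pam"),
                        validation = "internal")
    summary(internal)   # connectivity, Dunn index and silhouette width for each k
    plot(internal)      # one plot per index against the number of clusters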


Figure 13: Internal validation indices supporting the 3-cluster solution.

The connectivity measure reflects the extent to which items are placed in the same cluster as their nearest neighbours in the data space. It ranges from 0 to infinity and should be minimized for an efficient clustering. In our case, the connectivity measure seems to favour a 2-cluster solution for the hierarchical method, although for the k-means method the value is the same for the 2- and 3-cluster solutions.

The Dunn index is the ratio of the smallest distance between observations in different clusters to the largest within-cluster distance; it should be maximised for a good cluster solution. For the hierarchical and k-means methods, the Dunn index is maximised for the 3-cluster solution, which is also the most consistent choice compared with the 2-cluster solution.

The average silhouette width (last graph) measures how well each object lies within its cluster; a high average silhouette width indicates good clustering. Here again one could hesitate between the 2- and 3-cluster solutions, but agreement between the k-means and hierarchical methods is only seen for the 3-cluster solution.
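Per-observation silhouette widths can also be inspected directly with the cluster package, for example (a sketch, assuming the same matrix spq3 and a 3-cluster k-means partition):

    library(cluster)

    fit <- kmeans(spq3, centers = 3, nstart = 25)
    sil <- silhouette(fit$cluster, dist(spq3))
    summary(sil)   # average silhouette width per cluster and overall
    plot(sil)      # silhouette plot of the individual observations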

Considering these observations, the 3-cluster solution appeared to be optimal compared with the alternatives.

Hierarchical clustering starts by treating each observation as a separate cluster and then repeatedly executes two steps: identify the two clusters that are closest together, and merge them. These steps are repeated until all clusters have been merged.

The k-means algorithm identifies k centroids and allocates every data point to the nearest centroid, updating the centroids iteratively so as to keep the within-cluster variation as small as possible.

PAM clustering (partitioning around medoids) searches for k medoids, i.e. observations that are centrally located within their clusters; the algorithm minimizes the average dissimilarity of objects to their closest medoid.

To further test the stability of the chosen 3-cluster solution, we computed stability measures.

This procedure compares the clustering based on the full data with clusterings obtained after removing each column, one at a time. The included measures were:

- Average proportion of non-overlap (APN) = 0.1419422
- Average distance (AD) = 1.7357395
- Average distance between means (ADM) = 0.5967227
- Figure of merit (FOM) = 0.8561013

The values of APN, ADM and FOM range from 0 to 1, with smaller values indicating highly consistent clustering results. AD ranges from 0 to infinity, and smaller values are also preferred.

APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.

AD measures the average distance between observations placed in the same cluster under both situations (full data and removal of one column).

ADM measures the average distance between the cluster centres of observations placed in the same cluster under both situations.

FOM measures the average intra-cluster variance of the deleted column (clustering is based on the remaining columns).
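A minimal sketch of how these stability measures can be obtained with clValid (same assumed matrix spq3 as above):

    stability <- clValid(spq3, nClust = 2:6,
                         clMethods = c("hierarchical", "kmeans", "pam"),
                         validation = "stability")
    summary(stability)   # APN, AD, ADM and FOM per method and number of clusters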

After establishing the stability of the 3-cluster solution, we computed the cluster analysis to partition the participants into groups, following the scripts provided at https://www.statmethods.net/advstats/cluster.html, and tried two different procedures. We used the two distinct procedures to check whether participants were partitioned into the same groups or clusters, which was the case.

We first partitioned the data using k-means clustering (the kmeans function in R with 3 clusters).

Secondly, we used Ward hierarchical agglomerative clustering and displayed a dendrogram cut into 3 clusters.
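A sketch of the two procedures, following the structure of the statmethods.net script (mydata stands for the standardized participant-by-dimension matrix; the exact options of the original run may differ):

    # 1) K-means partition into 3 clusters
    fit_km <- kmeans(mydata, centers = 3, nstart = 25)

    # 2) Ward hierarchical agglomerative clustering
    d <- dist(mydata, method = "euclidean")
    fit_hc <- hclust(d, method = "ward.D2")
    plot(fit_hc)                          # dendrogram (Figure 14)
    groups <- cutree(fit_hc, k = 3)       # cut the tree into 3 clusters
    rect.hclust(fit_hc, k = 3, border = "red")

    # Cross-tabulation to check that both procedures yield the same partition
    table(fit_km$cluster, groups)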

Figure 14: Dendrogram for the 3-cluster solution obtained with Ward's hierarchical agglomerative method.

Figure 15: The clusplot function in R creates a bivariate plot visualizing a partition of the data using principal component scaling. All observations are represented by the participants’ IDs and ellipses are drawn around each cluster.

The clusplot algorithm uses PCA to display the data: the first two principal components, i.e. the orthogonal axes along which the data show the most variability, are used to draw the plot [13].
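A sketch of the corresponding call (clusplot from the cluster package; fit_km is the assumed k-means partition from the sketch above):

    library(cluster)

    clusplot(mydata, fit_km$cluster, color = TRUE, shade = TRUE,
             labels = 2, lines = 0, main = "CLUSPLOT( mydata )")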

[Figure 15 plot: CLUSPLOT( mydata ); Component 1 vs. Component 2. These two components explain 88.97 % of the point variability.]
