
Analyzing the Data with K-Means


Like we did with the K-Nearest Neighbors algorithm, we need to figure out an optimal K. Unfortunately, with clustering there really isn’t much we can test except to simply see whether we can split the data into two different clusters.

But let’s say that we want to fit all of our records on a bookshelf and we have 25 slots.

We could run a clustering of all of our data using K = 25.

Doing so requires little code because we have the ai4r gem to rely on:

# lib/kmeans_clusterer.rb
require 'csv'
require 'ai4r'

data = []
artists = []

CSV.foreach('./annotated_jazz_albums.csv', :headers => true) do |row|
  @headers ||= row.headers[2..-1]
  artists << row['artist_album']
  data << row.to_h.values[2..-1].map(&:to_i)
end

ds = Ai4r::Data::DataSet.new(:data_items => data, :data_labels => @headers)
clusterer = Ai4r::Clusterers::KMeans.new
clusterer.build(ds, 25)

CSV.open('./clustered_kmeans.csv', 'wb') do |csv|
  csv << %w[artist_album year cluster]
  ds.data_items.each_with_index do |dd, i|
    csv << [artists[i], dd.first, clusterer.eval(dd)]
  end
end

That’s it! Of course, clustering without looking at what it actually tells us is useless.

This code does split the data into 25 different categories, but what does it all mean?
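One quick way to get a feel for the assignments, before plotting anything, is to load the output CSV and group it by cluster (a small inspection snippet, not part of the original listing):

require 'csv'

# Group album titles by assigned cluster and print a few from each
by_cluster = Hash.new {|h, k| h[k] = [] }
CSV.foreach('./clustered_kmeans.csv', :headers => true) do |row|
  by_cluster[row['cluster']] << "#{row['artist_album']} (#{row['year']})"
end

by_cluster.sort_by {|cluster, _| cluster.to_i }.each do |cluster, albums|
  puts "Cluster #{cluster}: #{albums.first(3).join(', ')}"
end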

Looking at the graphic in Figure 8-3, which compares year against assigned cluster number, yields interesting results.

Figure 8-3. K-Means applied to jazz albums

As you can see, jazz starts out in the big band era, pretty much in the same cluster, after which it transitions into cool jazz. Then, around 1959, it starts to go in all different directions until about 1990, when things cool down a bit.

What’s fascinating is how well the clustering syncs up with jazz history.

What happens when we cluster the data using EM clustering instead?

EM Clustering

With EM clustering, remember that we are assigning each data point to clusters probabilistically: it isn’t 100% one or the other. A given album might, say, end up 60 percent in one cluster and 40 percent in another rather than wholly in either. This could be highly useful for our purposes here, as jazz has so much crossover.

There are no Ruby gems that have EM clustering in them, so we’ll have to write our own version of the tool.

Let’s go through the process of building our own gem and then utilize it to map the same data that we have from our jazz data set.

Our first step is to initialize the cluster. Remember, we need to have indicator variables z_t, which follow a uniform distribution. These tell us the probability that each data point is in each cluster. To do this, we have:

# lib/em_clusterer.rb
require 'matrix'

class EMClusterer
  attr_reader :partitions, :data, :labels, :classes

  def initialize(k, data)
    @k = k
    @data = Matrix[*data]
    @width = @data.column_count
    @s = 0.2 # spread of the initial covariance matrices (value assumed; not shown in this excerpt)
    # Indicator variables start uniform: each row is equally likely to be in any cluster
    @labels = Array.new(@data.row_size) { Array.new(@k, 1.0 / @k) }
    @partitions = Array.new(@data.row_size)
    setup_cluster!
  end

  def setup_cluster!
    # Seed each cluster with a randomly chosen row as its mean and a scaled identity covariance
    pick_k_random_indices = @data.row_size.times.to_a.shuffle.sample(@k)
    @classes = @k.times.map do |cc|
      {
        :means => @data.row(pick_k_random_indices.shift),
        :covariance => @s * Matrix.identity(@width)
      }
    end
  end
end

At this point, we have set up all of our base case code. We have @k, which is the number of clusters; @data, the data we pass in that we want to cluster; @labels, an array holding each row’s probability of being in each cluster; @classes, which holds an array of means and covariances that tell us where the distribution of data is; and finally @partitions, which holds the assignment of each data row to a cluster index.
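As a quick sanity check of this setup (a hypothetical toy example, not from the text), we could hand the clusterer a few two-dimensional points and confirm that every row starts out equally likely to belong to either cluster:

points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
clusterer = EMClusterer.new(2, points)
clusterer.labels.first  # => [0.5, 0.5]
clusterer.classes.size  # => 2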

Now we need to build our expectation step, which is to figure out the probability of each data row in each cluster. To do this, we need to write a new method, expect:

# lib/em_clusterer.rb
class EMClusterer
  # initialize
  # setup_cluster!

  def expect
    @classes.each_with_index do |klass, i|
      puts "Expectation for class #{i}"

      # Invert the covariance, nudging the diagonal if the matrix is singular
      inv_cov = if klass[:covariance].regular?
        klass[:covariance].inv
      else
        puts "Applying shrinkage"
        (klass[:covariance] - (0.0001 * Matrix.identity(@width))).inv
      end

      d = Math::sqrt(klass[:covariance].det)

      @data.row_vectors.each_with_index do |row, j|
        rel = row - klass[:means]
        p = d * Math::exp(-0.5 * fast_product(rel, inv_cov))
        @labels[j][i] = p
      end
    end

    # Normalize each row of probabilities to sum to 1 and record the most likely cluster
    @labels = @labels.map.each_with_index do |probabilities, i|
      sum = probabilities.inject(&:+)
      @partitions[i] = probabilities.index(probabilities.max)
      sum.zero? ? probabilities : probabilities.map {|p| p / sum }
    end
  end

  def fast_product(rel, inv_cov)
    # Computes rel' * inv_cov * rel without allocating intermediate Matrix objects
    sum = 0
    inv_cov.column_count.times do |j|
      local_sum = 0
      (0 ... rel.size).each do |k|
        local_sum += rel[k] * inv_cov[k, j]
      end
      sum += local_sum * rel[j]
    end
    sum
  end
end

The first part iterates through all of the classes, which hold the means and covariances of each cluster. From there, we want to find the inverse covariance matrix as well as the determinant of the covariance. For each row, we calculate a value that is proportional to the probability that the row is in a cluster:

p_{ij} = \det(C) \, e^{-\frac{1}{2} (x_j - \mu_i)^{\top} C^{-1} (x_j - \mu_i)}

This is effectively a Gaussian distance metric to help us determine how far outside the mean our data is.

Let’s say that the row vector is exactly the mean. Then this expression reduces to p_ij = det(C), which is just the determinant of the covariance matrix; this is actually the highest value you can get out of this function. If, on the other hand, the row vector were far away from the mean vector, then p_ij would become smaller and smaller because of the negative exponent on the quadratic term.

The nice thing is that this is proportional to the Gaussian probability that the row vector belongs to the cluster centered at that mean. Because it is proportional (not equal), we end up normalizing each row of probabilities so that it sums to 1.
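For instance, if a row’s raw values across three clusters came out to 2.0, 1.0, and 1.0, normalizing turns them into proper probabilities (an illustrative example only):

probabilities = [2.0, 1.0, 1.0]
sum = probabilities.inject(&:+)    # => 4.0
probabilities.map {|p| p / sum }   # => [0.5, 0.25, 0.25]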

You’ll notice one last thing here: the introduction of the fast_product method. This is because the Matrix library in Ruby is slow and builds Array objects within Array objects, which is memory inefficient. In this case the result doesn’t change by computing the product manually, so we optimize for speed with a hand-rolled loop.
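For reference, the same quadratic form written directly against the Matrix API would look something like the following (a sketch of the slower equivalent, assuming rel is a Vector and inv_cov a Matrix, as in expect):

# Equivalent to fast_product(rel, inv_cov), but allocates intermediate matrices
(Matrix.row_vector(rel.to_a) * inv_cov * Matrix.column_vector(rel.to_a))[0, 0]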

Now we can move on to the maximization step:

# lib/em_clusterer.rb

def maximize
  @classes.each_with_index do |klass, i|
    puts "Maximizing for class #{i}"

    # Weighted sum of the rows, weighted by each row's probability of being in this cluster
    sum = Array.new(@width) { 0 }
    num = 0

    @data.row_vectors.each_with_index do |row, j|
      p = @labels[j][i]
      (0 ... row.size).each {|k| sum[k] += p * row[k] }
      num += p
    end

    mean = Vector[*sum.map {|s| s / num }]

    # Weighted covariance of the rows around the new mean
    covariance = Matrix.zero(@width, @width)
    @data.row_vectors.each_with_index do |row, j|
      p = @labels[j][i]
      rel = row - mean
      covariance += (p / num) * Matrix.build(@width, @width) {|r, c| rel[r] * rel[c] }
    end

    @classes[i][:means] = mean
    @classes[i][:covariance] = covariance
  end
end

Again, here we are iterating over the clusters held in @classes. We first build an array called sum, which accumulates the weighted sum of the data, and then normalize it to get a weighted mean. To calculate the covariance matrix, we start with a zero matrix that is square with the width of our data. We then iterate through all row_vectors and, for each cell of the matrix, incrementally add the weighted product of the differences between the row and the mean. Finally, we normalize and store the new mean and covariance.
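In equation form, the maximization step above recomputes each cluster’s mean and covariance as probability-weighted averages over the rows (this is just a restatement of what the loop does):

\mu_i = \frac{\sum_j p_{ij} x_j}{\sum_j p_{ij}}, \qquad C_i = \frac{\sum_j p_{ij} (x_j - \mu_i)(x_j - \mu_i)^{\top}}{\sum_j p_{ij}}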

Now we can get down to actually using this. To do that, we add two convenience methods that help in querying the data:

# lib/em_clusterer.rb

def cluster(iterations = 5)
  # Alternate the expectation and maximization steps for a fixed number of passes
  iterations.times do |i|
    expect
    maximize
  end
end

Back to our results using EM clustering with our jazz music. To actually perform the analysis, we run the following script:

require 'csv'

data = []
artists = []

CSV.foreach('./annotated_jazz_albums.csv', :headers => true) do |row|
  @headers ||= row.headers[2..-1]
  artists << row['artist_album']
  data << row.to_h.values[2..-1].map(&:to_i)
end
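The rest of the script isn’t reproduced in this excerpt. A minimal sketch of how the collected data might then be handed to our clusterer (the require path, the output filename, and the choice of 25 clusters and 5 iterations here are illustrative assumptions):

require_relative './em_clusterer'

# Run EM over the genre columns and write out each album's most likely cluster
clusterer = EMClusterer.new(25, data)
clusterer.cluster(5)

CSV.open('./clustered_em.csv', 'wb') do |csv|  # output filename assumed
  csv << %w[artist_album cluster]
  artists.each_with_index do |artist, i|
    csv << [artist, clusterer.partitions[i]]
  end
end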

The first thing you’ll notice about EM clustering is that it’s slow. It isn’t as quick as calculating new centroids and iterating; it has to calculate covariances and means, which is expensive. Occam’s Razor would tell us that EM clustering is most likely not a good choice for grouping large amounts of data.

The other thing you’ll notice is that clustering our annotated jazz data as is will not work: the covariance matrix is singular, meaning it has no inverse. This is not a good thing. Realistically, the problem as posed is ill suited to EM clustering for this reason, so we have to transform it into a different problem altogether.

We do that by collapsing the dimensions into the top two genres by index:

require 'csv'

CSV.open('./less_covariance_jazz_albums.csv', 'wb') do |csv|
  csv << %w[artist_album key_index year].concat(2.times.map {|a| "Genre_#{a}" })

  CSV.foreach('./annotated_jazz_albums.csv', :headers => true) do |row|
    genre_count = 0
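The excerpt cuts off here, so the body of this loop isn’t shown. A minimal sketch of how the collapsing might continue, assuming the genre columns are 0/1 flags starting at the fourth column and that we keep the column indices of the first two genres that are set (the column positions and padding value are assumptions, not the book’s code):

    # Indices of the first two genre columns flagged for this album (genre columns assumed to start at index 3)
    genre_indices = []
    row.headers[3..-1].each_with_index do |genre, idx|
      if row[genre].to_i == 1 && genre_count < 2
        genre_indices << idx
        genre_count += 1
      end
    end

    # Pad with a placeholder when fewer than two genres are flagged
    genre_indices << -1 while genre_indices.size < 2

    csv << [row['artist_album'], row['key_index'], row['year']].concat(genre_indices)
  end
end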
