
The Role of Dimensionality Reduction in Clustering

3.3.4. Clustering Algorithms

3.3.4.1. Greedy clustering

Given that we have insight suggesting that overlap in titles is important, let’s try to cluster job titles by comparing them to one another as an extension of Example 3-7 using Jaccard distance. Example 3-12 clusters similar titles and then displays your contacts accordingly. Skim the code—especially the nested loop invoking the DISTANCE function—and then we’ll discuss.

Example 3-12. Clustering job titles using a greedy heuristic

import os
import csv

from nltk.metrics.distance import jaccard_distance

# XXX: Place your "Outlook CSV" formatted file of connections from

# http://www.linkedin.com/people/export-settings at the following

# location: resources/ch03-linkedin/my_connections.csv

CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

# Tweak this distance threshold and try different distance calculations

# during experimentation

3.3. Crash Course on Clustering Data | 115

('VP', 'Vice President'), ]

separators = ['/', 'and', '&']

csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='"')
contacts = [row for row in csvReader]

# Normalize and/or replace known abbreviations
# and build up a list of common titles.

all_titles = list(set(all_titles))
clusters = {}

clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]

# Round up contacts who are in these clusters and group them together
clustered_contacts = {}

for cluster in clusters:
    clustered_contacts[tuple(cluster)] = []
    for contact in contacts:

    common_titles_heading = 'Common Titles: ' + ', '.join(titles)
    descriptive_terms = set(titles[0].split())
    for title in titles:
        descriptive_terms.intersection_update(set(title.split()))
    descriptive_terms_heading = 'Descriptive Terms: ' \
        + ', '.join(descriptive_terms)
    print descriptive_terms_heading
    print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))
    print '\n'.join(clustered_contacts[titles])
    print

The code listing starts by separating out combined titles using a list of common separators (conjunctions and symbols such as "/" and "&") and then normalizes common titles. Then, a nested loop iterates over all of the titles and clusters them together according to a thresholded Jaccard distance as computed by DISTANCE. The assignment of jaccard_distance to DISTANCE was chosen to make it easy to swap in a different distance calculation for experimentation.

This tight loop is where most of the real action happens in the listing: it’s where each title is compared to each other title.
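To make a single comparison concrete, here is a minimal Python 3 sketch of the steps described above: splitting a combined title, normalizing abbreviations, and scoring one pair of titles. It uses a hand-rolled jaccard_distance that follows the usual set-based definition (one minus the size of the intersection over the size of the union) so that it runs without NLTK. The TRANSFORMS values and the helper names split_title and normalize are illustrative assumptions, not the listing's own code.

```python
import re

# Illustrative abbreviation table (an assumption); extend it for your own data.
TRANSFORMS = [('Sr.', 'Senior'), ('VP', 'Vice President'),
              ('CTO', 'Chief Technology Officer')]

# Split on '/' and '&' anywhere, but on 'and' only when it stands alone as a
# word, so that a title like 'Brand Manager' is left intact.
SEPARATOR_PATTERN = r'\s*/\s*|\s*&\s*|\s+and\s+'


def split_title(raw_title):
    """Split a combined title such as 'CTO & Founder' into its parts."""
    parts = re.split(SEPARATOR_PATTERN, raw_title)
    return [part.strip() for part in parts if part.strip()]


def normalize(title):
    """Expand known abbreviations with naive substring replacement."""
    for abbreviation, expansion in TRANSFORMS:
        title = title.replace(abbreviation, expansion)
    return title


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two non-empty sets."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


titles = [normalize(t) for t in split_title('CTO & Founder')]
d = jaccard_distance(set('Chief Technology Officer'.split()),
                     set('Chief Executive Officer'.split()))
print(titles, d)
```

The two officer titles share two of four distinct terms, so their distance is 0.5. Whether that counts as "close enough" depends entirely on the threshold you choose.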

If the distance between any two titles as determined by a similarity heuristic is “close enough,” we greedily group them together. In this context, being “greedy” means that the first time we are able to determine that an item might fit in a cluster, we go ahead and assign it without further considering whether there might be a better fit, or making any attempt to account for such a better fit if one appears later. Although purely pragmatic, this approach produces reasonable results. Clearly, the choice of an effective similarity heuristic is critical to its success, but given the nature of the nested loop, the fewer times we have to invoke the scoring function, the faster the code executes (a principal concern for nontrivial sets of data). More will be said about this consideration in the next section, but do note that we use some conditional logic to try to avoid repeating unnecessary calculations if possible.
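The greedy grouping described above can be sketched as follows. This is a minimal Python 3 illustration rather than the book's listing: the threshold of 0.5 is an assumed value, jaccard_distance is a local stand-in for the NLTK function, and greedy_cluster is a hypothetical helper name.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two non-empty sets."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


DISTANCE = jaccard_distance  # swap in another function to experiment
DISTANCE_THRESHOLD = 0.5     # assumed value; tune it for your own data


def greedy_cluster(all_titles):
    """Group each title with every title that scores under the threshold."""
    clusters = {}
    for title1 in all_titles:
        clusters[title1] = []
        for title2 in all_titles:
            # Skip pairs that have already been grouped in either direction,
            # which avoids repeating the distance calculation.
            if title2 in clusters[title1] or (
                    title2 in clusters and title1 in clusters[title2]):
                continue
            distance = DISTANCE(set(title1.split()), set(title2.split()))
            if distance < DISTANCE_THRESHOLD:
                # Greedy step: assign on first fit, never revisit.
                clusters[title1].append(title2)
    # Keep only groups that contain more than one title.
    return [members for members in clusters.values() if len(members) > 1]


result = greedy_cluster(['Software Engineer',
                         'Senior Software Engineer',
                         'Chief Executive Officer'])
print(result)
```

Because assignment happens on first fit, the output depends on the order in which titles are visited. Shuffling the input is a quick way to see how stable your clusters are.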

The rest of the listing just looks up contacts with a particular job title and groups them for display, but there is one other nuance involved in computing clusters: you often need to assign each cluster a meaningful label. The working implementation computes labels by taking the setwise intersection of terms in the job titles for each cluster, which seems reasonable given that it’s the most obvious common thread. Your mileage is sure to vary with other approaches.
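The labeling step is small enough to isolate. A minimal Python 3 sketch, assuming each cluster is a non-empty tuple of title strings (cluster_label is a hypothetical helper name):

```python
def cluster_label(titles):
    """Return the terms shared by every title in a non-empty cluster."""
    terms = set(titles[0].split())
    for title in titles[1:]:
        terms.intersection_update(title.split())
    return terms


label = cluster_label(('Chief Technology Officer', 'Chief Executive Officer'))
print('Descriptive Terms: ' + ', '.join(sorted(label)))
```

Note that the intersection can be empty when a cluster was formed by chained near-matches, in which case the cluster has no descriptive terms and you may want a fallback label.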

The types of results you might expect from this code are useful in that they group together individuals who are likely to share common responsibilities in their job duties.

As previously noted, this information might be useful for a variety of reasons, whether you’re planning an event that includes a “CEO Panel,” trying to figure out who can best help you to make your next career move, or trying to determine whether you are really well enough connected to other similar professionals given your own job responsibilities and future aspirations. Abridged results for a sample professional network follow:

Common Titles: Chief Technology Officer,

This section contains a relatively advanced discussion about the computational details of clustering and should be considered optional reading, as it may not appeal to everyone. If this is your first reading of this chapter, feel free to skip this section and peruse it upon encountering it a second time.

In the worst case, the nested loop from Example 3-12 invokes the DISTANCE calculation a number of times that we’ve already described as O(n^2); in other words, len(all_titles)*len(all_titles) times. A nested loop that compares every single item to every single other item for clustering purposes is not a scalable approach for a very large value of n, but given that the unique number of titles for your professional network is not likely to be very large, it shouldn’t impose a performance constraint. It may not seem like a big deal (after all, it’s just a nested loop), but the crux of an O(n^2) algorithm is that the number of comparisons required to process an input set grows quadratically with the number of items in the set. For example, a small input set of 100 job titles would require only 10,000 scoring operations, while 10,000 job titles would require 100,000,000 scoring operations. The math doesn’t work out so well and eventually buckles, even when you have a lot of hardware to throw at it.
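The arithmetic above is easy to check. A short Python 3 sketch (scoring_operations is a hypothetical helper name) that counts the worst-case number of DISTANCE invocations for a few input sizes:

```python
def scoring_operations(n):
    """Worst-case number of pairwise scoring calls for n titles."""
    return n * n


for n in (100, 10000, 1000000):
    print('%d titles -> %d comparisons' % (n, scoring_operations(n)))

# Multiplying the input size by 10 multiplies the work by 100.
growth = scoring_operations(1000) // scoring_operations(100)
print(growth)
```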

Your initial reaction when faced with what seems like a predicament that doesn’t scale will probably be to try to reduce the value of n as much as possible. But most of the time you won’t be able to reduce it enough to make your solution scalable as the size of your input grows, because you still have an O(n^2) algorithm. What you really want to do is come up with an algorithm that’s on the order of O(k*n), where k is much smaller than n and represents a manageable amount of overhead that grows much more slowly than the rate of n’s growth. As with any other engineering decision, there are performance and quality trade-offs to be made in all corners of the real world, and it can be quite challenging to strike the right balance. In fact, many data mining companies that have successfully implemented scalable record-matching analytics at a high degree of fidelity consider their specific approaches to be proprietary information (trade secrets), since they result in definite business advantages.

For situations in which an O(n^2) algorithm is simply unacceptable, one variation to the working example that you might try is rewriting the nested loops so that a random sample is selected for the scoring function, which would effectively reduce it to O(k*n), if k were the sample size. As the value of the sample size approaches n, however, you’d expect the runtime to begin approaching the O(n^2) runtime. The following amendments to Example 3-12 show how that sampling technique might look in code; the key change is that the inner loop draws SAMPLE_SIZE random titles instead of iterating over every title. The core takeaway is that for each invocation of the outer loop, we’re executing the inner loop a much smaller, fixed number of times:

    for sample in range(SAMPLE_SIZE):
        title2 = all_titles[random.randint(0, len(all_titles)-1)]
        # Skip pairs that have already been grouped in either direction
        if title2 in clusters[title1] or \
                (title2 in clusters and title1 in clusters[title2]):
            continue
        distance = DISTANCE(set(title1.split()), set(title2.split()))
        if distance < DISTANCE_THRESHOLD:
            clusters[title1].append(title2)

# ... snip ...
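For experimentation, the sampled loop can be wrapped in a self-contained function that also counts how often the scoring function runs. This is a Python 3 sketch under stated assumptions: the sample size and threshold are placeholder values, jaccard_distance is a local stand-in for the NLTK function, and sampled_cluster and its rng parameter are hypothetical names introduced here.

```python
import random


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two non-empty sets."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


DISTANCE_THRESHOLD = 0.5  # assumed value
SAMPLE_SIZE = 3           # assumed value for k


def sampled_cluster(all_titles, sample_size=SAMPLE_SIZE, rng=random):
    """Cluster titles by scoring only sample_size random partners per title."""
    calls = 0
    clusters = {}
    for title1 in all_titles:
        clusters[title1] = []
        for _ in range(sample_size):
            title2 = rng.choice(all_titles)
            if title2 in clusters[title1] or (
                    title2 in clusters and title1 in clusters[title2]):
                continue
            calls += 1
            distance = jaccard_distance(set(title1.split()),
                                        set(title2.split()))
            if distance < DISTANCE_THRESHOLD:
                clusters[title1].append(title2)
    return clusters, calls


titles = ['Title %d' % i for i in range(50)]
clusters, calls = sampled_cluster(titles, rng=random.Random(42))
print('%d scoring calls for %d titles' % (calls, len(titles)))
```

The call count can never exceed sample_size * len(all_titles), which is the O(k*n) bound. The price is that any given pair of similar titles may simply never be sampled, so clusters are less complete than with the exhaustive loop.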

Another approach you might consider is to randomly sample the data into n bins (where n here is the number of bins, generally less than or equal to the square root of the number of items in your set), perform clustering within each of those individual bins, and then optionally merge the output. For example, if you had 1 million items, an O(n^2) algorithm would take a trillion logical operations, whereas binning the 1 million items into 1,000 bins containing 1,000 items each and clustering each individual bin would require only a billion operations. (That’s 1,000*1,000 comparisons for each bin for all 1,000 bins.) A billion is still a large number, but it’s three orders of magnitude smaller than a trillion, and that’s a substantial improvement (although it still may not be enough in some situations).
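The binning idea and its operation count can be sketched in a few lines of Python 3. The helper names bin_items and comparisons_with_bins are hypothetical, the count assumes the items divide evenly among the bins, and the per-bin clustering and the optional merge step are left out:

```python
import random


def bin_items(items, num_bins, rng=random):
    """Randomly distribute items into num_bins bins of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items round-robin so bin sizes differ by at most one.
    return [shuffled[i::num_bins] for i in range(num_bins)]


def comparisons_with_bins(n, num_bins):
    """Pairwise comparisons when each bin is clustered independently."""
    per_bin = n // num_bins  # assumes n divides evenly by num_bins
    return num_bins * per_bin * per_bin


bins = bin_items(range(10), 3)
binned = comparisons_with_bins(1000000, 1000)
unbinned = 1000000 ** 2
print(binned, unbinned // binned)
```

Items that land in different bins are never compared, so similar titles can end up in separate clusters. That is why a merge pass over the per-bin output is often worth the extra cost.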

There are many other approaches in the literature besides sampling or binning that could be far better at reducing the dimensionality of a problem. Ideally, you’d compare every item in a set to every other item, but the particular technique you end up using to avoid an O(n^2) situation for a large value of n will vary based upon real-world constraints and the insights you gain through experimentation and domain-specific knowledge. As you consider the possibilities, keep in mind that the field of machine learning offers many techniques that are designed to combat exactly these types of scale problems by using various sorts of probabilistic models and sophisticated sampling techniques. In Section 3.3.4.3 on page 124, you’ll be introduced to a fairly intuitive and well-known clustering algorithm called k-means, which is a general-purpose unsupervised approach for clustering a multidimensional space. We’ll be using this technique later to cluster your contacts by geographic location.