
3.3.4. Clustering Algorithms

3.3.4.4. Visualizing geographic clusters with Google Earth

A worthwhile exercise to see k-means in action is to use it to visualize and cluster your professional LinkedIn network by plotting it in two-dimensional space. In addition to the insight gained by visualizing how your contacts are spread out and noting any patterns or anomalies, you can analyze clusters by using your contacts, the distinct employers of your contacts, or the distinct metro areas in which your contacts reside as a basis. All three approaches might yield results that are useful for different purposes.

Recalling that through the LinkedIn API you can fetch location information that describes the major metropolitan area, such as “Greater Nashville Area,” we’ll be able to geocode the locations into coordinates and emit them in an appropriate format (such as KML) that we can plot in a tool like Google Earth, which provides an interactive user experience.
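Geocoding services tend to do better with a bare city name than with LinkedIn’s qualified strings, so a little string cleanup pays off first. The sketch below shows the kind of simple substitution-based normalization that the full listing later in this section applies before geocoding; the sample location string is illustrative.

```python
# Simple substitution-based cleanup for LinkedIn location strings.
# Each transform pair strips a qualifier that tends to confuse
# geocoders (e.g., "Greater Nashville Area" -> "Nashville").

transforms = [('Greater ', ''), (' Area', '')]

def clean_location(location):
    for old, new in transforms:
        location = location.replace(old, new)
    return location

print(clean_location('Greater Nashville Area'))  # Nashville
```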

Google’s new Maps Engine also provides various means of uploading data for visualization purposes.

The primary things that you must do in order to convert your LinkedIn contacts to a format such as KML include parsing out the geographic location from each of your connections’ profiles and constructing the KML for a visualization such as Google Earth.
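To make the KML construction concrete, here is a minimal, hand-rolled sketch of the sort of markup a helper like createKML produces; the real helper, included with this chapter’s IPython Notebook, is more elaborate. The placemark fields shown are the bare minimum Google Earth needs.

```python
# A minimal sketch of the KML that a helper like createKML might emit.
# Note that KML coordinates are "lon,lat" (longitude first), which is
# why the full listing flips the geocoder's (lat, lon) pairs.

def make_placemark(label, coords):
    # coords is a 'lon,lat' string
    return ('<Placemark><name>%s</name>'
            '<Point><coordinates>%s</coordinates></Point>'
            '</Placemark>') % (label, coords)

def make_kml(items):
    placemarks = ''.join(make_placemark(i['label'], i['coords'])
                         for i in items)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<kml xmlns="http://www.opengis.net/kml/2.2">'
            '<Document>%s</Document></kml>') % placemarks

kml = make_kml([{'label': 'Nashville', 'coords': '-86.78,36.17'}])
```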

Example 3-9 demonstrated how to geocode profile information and provides a working foundation for gathering the data we’ll need. The KMeansClustering class of the cluster package can calculate clusters for us, so all that’s really left is to munge the data and clustering results into KML, which is a relatively rote exercise with XML tools.

As in Example 3-12, most of the work involved in getting to the point where the results can be visualized is data-processing boilerplate. The most interesting details are tucked away inside of KMeansClustering’s getclusters method call, toward the end of Example 3-13, which illustrates k-means clustering. The approach demonstrated groups your contacts by location, clusters them, and then uses the results of the clustering algorithm to compute the centroids. Figure 3-8 illustrates sample results from running the code in Example 3-13.
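The essential munging trick is worth isolating: because each contact should count as one point, a location shared by f contacts is duplicated f times before clustering, and a centroid is then just the coordinate-wise mean of a cluster’s points. The sketch below uses made-up coordinates and a hand-rolled centroid function in place of the cluster package’s.

```python
# Hypothetical (lat, lon) coordinates and frequencies standing in for
# geocoded LinkedIn locations.
coords_freqs = {
    'Nashville': [(36.17, -86.78), 3],   # three contacts
    'Franklin':  [(35.93, -86.87), 1],   # one contact
}

# Duplicate each location's point once per contact, flipping to
# (lon, lat) order as Google Earth expects.
expanded_coords = []
for label in coords_freqs:
    (lat, lon), f = coords_freqs[label]
    expanded_coords.extend([(lon, lat)] * f)

def centroid(points):
    # The centroid is just the coordinate-wise mean of the points.
    n = float(len(points))
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n)

lon_c, lat_c = centroid(expanded_coords)
```

Because Nashville appears three times in the expanded list, the centroid is pulled toward it rather than sitting halfway between the two cities.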

The linkedin__kml_utility that provides the createKML function in Example 3-13 is a rote exercise that is omitted for brevity here, but it’s included with the IPython Notebook for this chapter as a turnkey example.

3.3. Crash Course on Clustering Data | 127

Example 3-13. Clustering your LinkedIn professional network based upon the locations of your connections and emitting KML output for visualization with Google Earth

import os
import sys
import json

from urllib2 import HTTPError
from geopy import geocoders

from cluster import KMeansClustering, centroid

# A helper function to munge data and build up an XML tree.
# It references some code tucked away in another directory, so we have to
# add that directory to the PYTHONPATH for it to be picked up.

sys.path.append(os.path.join(os.getcwd(), "resources", "ch03-linkedin"))
from linkedin__kml_utility import createKML

# XXX: Try different values for K to see the difference in clusters that emerge

K = 3

# XXX: Get an API key and pass it in here. See https://www.bingmapsportal.com.

GEO_API_KEY = ''
g = geocoders.Bing(GEO_API_KEY)

# Load this data from where you've previously stored it

CONNECTIONS_DATA = 'resources/ch03-linkedin/linkedin_connections.json'
OUT_FILE = "resources/ch03-linkedin/viz/linkedin_clusters_kmeans.kml"

# Open up your saved connections with extended profile information
# or fetch them again from LinkedIn if you prefer

connections = json.loads(open(CONNECTIONS_DATA).read())['values']

locations = [c['location']['name'] for c in connections if c.has_key('location')]

# Some basic transforms may be necessary for geocoding services to
# function properly. Here are a couple that seem to help.

transforms = [('Greater ', ''), (' Area', '')]

# Step 1 - Tally the frequency of each location

coords_freqs = {}
for location in locations:

    # Avoid unnecessary I/O and geo requests by building up a cache

    if coords_freqs.has_key(location):
        coords_freqs[location][1] += 1
        continue

    transformed_location = location
    for transform in transforms:
        transformed_location = transformed_location.replace(*transform)

    # Step 2 - Geocode the distinct locations. A minimal version is sketched
    # here; each result is of the form ("Description", (lat, lon)), and the
    # disambiguation strategy is simply "pick first"

    try:
        results = g.geocode(transformed_location, exactly_one=False)
    except HTTPError, e:
        print >> sys.stderr, 'Error geocoding %s: %s' % (location, e)
        continue

    if results:
        coords_freqs[location] = [results[0][1], 1]

# Here, you could optionally segment locations by continent or country
# so as to avoid potentially finding a mean in the middle of the ocean.

# The k-means algorithm will expect distinct points for each contact, so
# build out an expanded list to pass it.

expanded_coords = []
for label in coords_freqs:

    ((lat, lon), f) = coords_freqs[label]

    # Flip lat/lon for Google Earth
    expanded_coords.append((label, [(lon, lat)] * f))

# No need to clutter the map with unnecessary placemarks...

kml_items = [{'label': label, 'coords': '%s,%s' % coords[0]}
             for (label, coords) in expanded_coords]

# It would also be helpful to include names of your contacts on the map

for item in kml_items:
    item['contacts'] = '\n'.join(['%s %s.' % (c['firstName'], c['lastName'])
                                  for c in connections
                                  if c.has_key('location') and
                                     c['location']['name'] == item['label']])

# Step 3 - Cluster locations and extend the KML data structure with centroids

cl = KMeansClustering([coords for (label, coords_list) in expanded_coords
                       for coords in coords_list])

centroids = [{'label': 'CENTROID', 'coords': '%s,%s' % centroid(c)}
             for c in cl.getclusters(K)]

kml_items.extend(centroids)

# Step 4 - Create the final KML output and write it to a file

kml = createKML(kml_items)

f = open(OUT_FILE, 'w')
f.write(kml)
f.close()

print 'Data written to ' + OUT_FILE

Just visualizing your network can provide previously unknown insight, but computing the geographic centroids of your professional network can also open up some intriguing possibilities. For example, you might want to compute candidate locations for a series of regional workshops or conferences. Alternatively, if you’re in the consulting business and have a hectic travel schedule, you might want to plot out some good locations for renting a little home away from home. Or maybe you want to map out professionals in your network according to their job duties, or the socioeconomic bracket they’re likely to fit into based on their job titles and experience. Beyond the numerous options opened up by visualizing your professional network’s location data, geographic clustering lends itself to many other possibilities, such as supply chain management and travelling salesman types of problems in which it is necessary to minimize the expenses involved in travelling or moving goods from point to point.

Figure 3-8. Clockwise from top-left: 1) clustering contacts by location so that you can easily see who lives/works in what city; 2) finding the centroids of three clusters computed by k-means; 3) don’t forget that clusters could span countries or even continents when trying to find an ideal meeting location!

3.4. Closing Remarks

This chapter covered some serious ground, introducing the fundamental concept of clustering and demonstrating a variety of ways to apply it to your professional network data on LinkedIn. This chapter was without a doubt more advanced than the preceding chapters in terms of core content, in that it began to address common problems such as normalization of (somewhat) messy data, similarity computation on normalized data, and concerns related to the computational efficiency of approaches for a common data mining technique. Although it might be difficult to absorb all of the material in a single reading, don’t be discouraged if you feel a bit overwhelmed. It may take a couple of readings to fully absorb the details introduced in this chapter.

Also keep in mind that a working knowledge of how to employ clustering doesn’t necessarily require an advanced understanding of the theory behind it, although in general you should strive to understand the fundamentals that undergird the techniques you employ when mining the social web. As in the other chapters, you could easily make the case that we’ve just barely touched the tip of the iceberg; there are many other interesting things that you can do with your LinkedIn data that were not introduced in this chapter, many of which involve basic frequency analysis and do not require clustering at all. That said, you do have a pretty nice power tool in your belt now.

The source code outlined for this chapter and all other chapters is available at GitHub in a convenient IPython Notebook format that you’re highly encouraged to try out from the comfort of your own web browser.

3.5. Recommended Exercises

• Take some time to explore the extended profile information that you have available. It could be fun to try to correlate where people work versus where they went to school and/or to analyze whether people tend to relocate into and out of certain areas.

• Try employing an alternative visualization from D3, such as a choropleth map, to visualize your professional network.

• Read up on the new and exciting GeoJSON specification and how you can easily create interactive visualizations at GitHub by generating GeoJSON data. Try to apply this technique to your professional network as an alternative to using Google Earth.

• Visualize your LinkedIn network with the LinkedIn Labs InMaps. It constructs a graphical representation of your network using some additional information that isn’t directly available to you through the API and produces a compelling visualization.

• Take a look at geodict and some of the other geo utilities in the Data Science Toolkit. Can you extract locations from arbitrary prose and visualize them in a meaningful way to gain insight into what’s happening in the data without having to read through all of it?

• Mine Twitter or Facebook profiles for geo information and visualize it in a meaningful way. Tweets and Facebook posts often contain geocodes as part of their structured metadata.

• The LinkedIn API provides a means of retrieving a connection’s Twitter handle. How many of your LinkedIn connections have Twitter accounts associated with their professional profiles? How active are their accounts? How professional are their online Twitter personalities from the perspective of a potential employer?

• Apply clustering techniques from this chapter to tweets. Given a user’s tweets, can you extract meaningful tweet entities, define a meaningful similarity computation, and cluster tweets in a meaningful way?

• Apply clustering techniques from this chapter to Facebook data such as likes or posts. Given a collection of Facebook likes for a friend, can you define a meaningful similarity computation, and cluster the likes in a meaningful way? Given all of the likes for all of your friends, can you cluster the likes (or your friends) in a meaningful way?

3.6. Online Resources

The following list of links from this chapter may be useful for review:

• Bing maps portal

• Centroid

• D3.js examples gallery

• Data Science Toolkit

• Dendrogram

• Export LinkedIn Connections

• geopy GitHub code repository

• Keyhole Markup Language (KML)

• Levenshtein distance

• LinkedIn API rate-throttling limits

• LinkedIn API field selector syntax

• LinkedIn Developers

• LinkedIn InMaps

• Mapping GeoJSON files on GitHub

• python-linkedin GitHub code repository

• Travelling salesman problem

• Tutorial on Clustering Algorithms


1. This book avoids splitting hairs over exactly what differences could be implied by common phrases such as text mining, unstructured data analytics (UDA), or information retrieval, and simply treats them as essentially the same thing.

CHAPTER 4

Mining Google+: Computing Document