
Mining Google+: Computing Document Similarity, Extracting Collocations, and More

4.2. Exploring the Google+ API

4.2.1. Making Google+ API Requests

From a software development standpoint, Google+ leverages OAuth like the rest of the social web to enable an application that you’ll build to access data on a user’s behalf, so you’ll need to register an application to get appropriate credentials for accessing the Google+ platform. The Google API Console provides a means of registering an application (called a project in the Google API Console) to get OAuth credentials, but it also exposes an API key that you can use for “simple API access.” This API key is what we’ll use in this chapter to programmatically access the Google+ platform and just about every other Google service.
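To make “simple API access” concrete, here is a minimal illustrative sketch (not from the original listing) that passes the key along as an ordinary URL parameter on an HTTPS request to the people-search endpoint documented in the Google+ API reference, with no OAuth handshake at all; the query value is just a placeholder:

import json
import urllib # Python 2, to match the rest of this chapter

# XXX: Enter your API key from https://code.google.com/apis/console
API_KEY = ''

# "Simple API access": the key rides along as a URL parameter; no OAuth dance
url = 'https://www.googleapis.com/plus/v1/people?%s' % \
      urllib.urlencode({'query': 'any search term', 'key': API_KEY})

response = json.loads(urllib.urlopen(url).read())

print json.dumps(response, indent=1)

The Python client library that we use throughout the rest of the chapter wraps exactly this kind of request for you.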

Once you’ve created an application, you’ll also need to specifically enable it to use Google+ as a separate step. Figure 4-1 provides a screenshot of the Google+ API Console as well as a view that demonstrates what it looks like when you have enabled your application for Google+ API access.

Figure 4-1. Registering an application with the Google API Console to gain API access to Google services; don’t forget to enable Google+ API access as one of the service options


You can install a Python package called google-api-python-client for accessing Google’s API via pip install google-api-python-client. This is one of the standard Python-based options for accessing Google+ data. The online documentation for google-api-python-client is marginally helpful in familiarizing yourself with its capabilities, but in general, you’ll just be plugging parameters from the official Google+ API documents into some predictable access patterns with the Python package. Once you’ve walked through a couple of exercises, it’s a relatively straightforward process.

Don’t forget that pydoc can be helpful for gathering clues about a package, class, or method in a terminal as you are learning it. The help function in a standard Python interpreter is also useful. Recall that appending ? to a method name in IPython is a shortcut for displaying its docstring.
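For example, here are a few equivalent ways (illustrative, not from the original listing) to pull up the documentation for the build function that we’ll use momentarily:

import apiclient.discovery # pip install google-api-python-client

# In a standard Python interpreter, help displays the docstring
help(apiclient.discovery.build)

# In IPython, appending ? is a shortcut for the same information:
#     apiclient.discovery.build?

# From a terminal, pydoc works on any importable name:
#     $ pydoc apiclient.discovery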

As an initial exercise, let’s consider the problem of locating a person on Google+. Like any other social web API, the Google+ API offers a means of searching, and in particular we’ll be interested in the People: search API. Example 4-1 illustrates how to search for a person with the Google+ API. Since Tim O’Reilly is a well-known personality with an active and compelling Google+ account, let’s look him up.

The basic pattern that you’ll use repeatedly with the Python client is to create a service instance that’s parameterized for Google+ access with your API key, and then invoke particular platform services on it. Here, we create a connection to the People API by invoking service.people() and then chaining on some additional API operations deduced from reviewing the online API documentation. In a moment we’ll query for activity data, and you’ll see that the same basic pattern holds.

Example 4-1. Searching for a person with the Google+ API

import httplib2
import json
import apiclient.discovery # pip install google-api-python-client

# XXX: Enter any person's name
Q = "Tim O'Reilly"

# XXX: Enter in your API key from https://code.google.com/apis/console
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

people_feed = service.people().search(query=Q).execute()

print json.dumps(people_feed['items'], indent=1)

Searching for Tim O’Reilly does indeed return a list of people by that name, but how can we tell which of these results refers to the well-known Tim O’Reilly of technology fame that we are looking for? One option would be to request profile or activity information for each of these results and try to disambiguate them manually. Another option is to render the avatars included in each of the results, which is trivial to do within IPython Notebook. Example 4-2 illustrates how to display avatars and the corresponding ID values for each search result by generating HTML and rendering it inline as a result in the notebook.

Example 4-2. Displaying Google+ avatars in IPython Notebook provides a quick way to disambiguate the search results and discover the person you are looking for

from IPython.core.display import HTML
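The listing boils down to a loop over the search results. A minimal sketch, assuming the people_feed variable from Example 4-1 and the documented image.url, id, and displayName fields of each Google+ people result, follows:

html = []

for p in people_feed['items']:
    # Show each candidate's avatar alongside the ID and display name
    html += ['<p><img src="%s" /> %s: %s</p>' % \
             (p['image']['url'], p['id'], p['displayName'])]

# Render the generated markup inline in the notebook
HTML(''.join(html))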

Sample results are displayed in Figure 4-2 and provide the “quick fix” that we’re looking for in our search for the particular Tim O’Reilly of O’Reilly Media.

Figure 4-2. Rendering Google+ avatars as images allows you to quickly scan the search results to disambiguate the person you are looking for

Although there’s a multiplicity of things we could do with the People API, our focus in this chapter is on an analysis of the textual content in accounts, so let’s turn our attention to the task of retrieving activities associated with this account. As you’re about to find out, Google+ activities are the linchpin of Google+ content, containing a variety of rich content associated with the account and providing logical pivots to other platform objects such as comments. To get some activities, we’ll need to tweak the design pattern we applied for searching for people, as illustrated in Example 4-3.

Example 4-3. Fetching recent activities for a particular Google+ user

import httplib2
import json
import apiclient.discovery

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
# API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per API
).execute()

print json.dumps(activity_feed, indent=1)

Sample results for the first item (activity_feed['items'][0]) follow, with some fields elided, and illustrate the basic nature of a Google+ activity:

{
 "kind": "plus#activity",
 "provider": {
  "title": "Google+"
 },
 "title": "This is the best piece about privacy that I've read in a ...",
 "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
 "object": {
  "resharers": { ... },
  "attachments": [
   {
    "content": "Many governments (including our own, here in the US) ...",
    "url": "http://www.zdziarski.com/blog/?p=2155",
    "displayName": "On Expectation of Privacy | Jonathan Zdziarski's Domain",
    "objectType": "article"
   }
  ],
  "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
  "content": "This is the best piece about privacy that I've read ...",
  "plusoners": { ... },
  "replies": { ... },
  "objectType": "note"
 },
 "actor": {
  "url": "https://plus.google.com/107033731246200681024",
  "displayName": "Tim O'Reilly",
  "id": "107033731246200681024"
 },
 "verb": "post",
 ...
}

Each activity object follows a three-tuple pattern of the form (actor, verb, object). In this post, the tuple (Tim O’Reilly, post, note) tells us that this particular item in the results is a note, which is essentially just a status update with some textual content. A closer look at the result reveals that the content is something that Tim O’Reilly feels strongly about as indicated by the title “This is the best piece about privacy that I’ve read in a long time!” and hints that the note is active as evidenced by the number of reshares and comments.
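To make the three-tuple concrete, here is a short illustrative snippet (not from the original listing) that pulls the triple out of an activity via the actor.displayName, verb, and object.objectType fields of the activity resource, assuming the activity_feed variable from Example 4-3:

activity = activity_feed['items'][0]

# Who did it, what they did, and what kind of thing they did it to
actor = activity['actor']['displayName']
verb = activity['verb']
obj_type = activity['object']['objectType']

print (actor, verb, obj_type) # e.g. (u"Tim O'Reilly", u'post', u'note')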

If you reviewed the output carefully, you may have noticed that the content field for the activity contains HTML markup, as evidenced by the HTML entity I&#39;ve that appears in it. In general, you should assume that the textual data exposed as Google+ activities contains some basic markup, such as <br /> tags and escaped HTML entities for apostrophes, so as a best practice we need to do a little bit of additional filtering to clean it up. Example 4-4 provides an example of how to distill plain text from the content field of a note by introducing a function called cleanHtml. It takes advantage of a clean_html function provided by NLTK and another handy package for manipulating HTML, called BeautifulSoup, that converts HTML entities back to plain text. If you haven’t already encountered BeautifulSoup, it’s a package that you won’t want to live without once you’ve added it to your toolbox; it has the ability to process HTML in a reasonable way even if it is invalid and violates standards or other reasonable expectations (à la web data). You should install these packages via pip install nltk beautifulsoup4 if you haven’t already.

Example 4-4. Cleaning HTML in Google+ content by stripping out HTML tags and converting HTML entities back to plain-text representations

from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup

# Clean up markup and convert HTML entities back to plain text
def cleanHtml(html):
  if html == "": return ""

  return BeautifulStoneSoup(clean_html(html),
              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

print activity_feed['items'][0]['object']['content']

print

print cleanHtml(activity_feed['items'][0]['object']['content'])

The output from the note’s content, once cleansed with cleanHtml, is very clean text that can be processed without additional concerns about noise in the content. As we’ll learn in this chapter and follow-on chapters about text mining, reduction of noise in text content is a critical aspect of improving accuracy. The before and after content follows.

Here’s the raw content in activity_feed['items'][0]['object']['content']:

This is the best piece about privacy that I&#39;ve read in a long time!

If it doesn&#39;t change how you think about the privacy issue, I&#39;ll be surprised. It opens:<br /><br />&quot;Many governments (including our own, here in the US) would have its citizens believe that privacy is a switch (that is, you either reasonably expect it, or you don’t). This has been demonstrated in many legal tests, and abused in many circumstances ranging from spying on electronic mail, to drones in our airspace monitoring the movements of private citizens. But privacy doesn’t work like a switch – at least it shouldn’t for a country that recognizes that privacy is an inherent right. In fact, privacy, like other components to security, works in layers...&quot;<br /><br />

Please read!

And here’s the content rendered after cleansing with cleanHtml(activity_feed['items'][0]['object']['content']):

This is the best piece about privacy that I've read in a long time! If it doesn't change how you think about the privacy issue, I'll be surprised. It opens: "Many governments (including our own, here in the US) would have its

citizens believe that privacy is a switch (that is, you either reasonably expect it, or you don’t). This has been demonstrated in many legal tests, and abused

in many circumstances ranging from spying on electronic mail, to drones in our airspace monitoring the movements of private citizens. But privacy doesn’t work like a switch – at least it shouldn’t for a country that recognizes that privacy is an inherent right. In fact, privacy, like other components to security, works in layers..." Please read!


The ability to query clean text out of Google+ is the basis for the remainder of the text mining exercises in this chapter, but there is one additional consideration that you may find useful before we focus our attention elsewhere: a pattern for fetching multiple pages of content.

Whereas the previous example fetched 100 activities, the maximum number of results per query, you may want to iterate over an activities feed and retrieve more than that per-page maximum. The pattern for pagination is outlined in the HTTP API Overview, and the Python client wrapper takes care of most of the hassle.

Example 4-5 shows how to fetch multiple pages of activities and distill the text from them if they are notes and have meaningful content.

Example 4-5. Looping over multiple pages of Google+ activities and distilling clean text from notes

import os
import httplib2
import json
import apiclient.discovery

from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
# API_KEY = ''

MAX_RESULTS = 200 # Will require multiple requests

def cleanHtml(html):
  if html == "": return ""

  return BeautifulStoneSoup(clean_html(html),
              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per request
)

activity_results = []

while activity_feed != None and len(activity_results) < MAX_RESULTS:

    activities = activity_feed.execute()

    if 'items' in activities:

        for activity in activities['items']:

            # Keep only notes that have meaningful textual content
            if activity['object']['objectType'] == 'note' and \
               activity['object']['content'] != '':

                activity['title'] = cleanHtml(activity['title'])
                activity['object']['content'] = cleanHtml(activity['object']['content'])
                activity_results += [activity]

    # list_next requires the previous request and response objects
    activity_feed = service.activities().list_next(activity_feed, activities)

# Write the output to a file for convenience
f = open(os.path.join('resources', 'ch04-googleplus', USER_ID + '.json'), 'w')
f.write(json.dumps(activity_results, indent=1))
f.close()

print str(len(activity_results)), "activities written to", f.name
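Because later exercises pick up from this archived data, it’s worth noting that you can load it back into memory at any time. A quick sketch, assuming the same path and USER_ID used above:

import os
import json

# Load the archived activities back into memory for analysis
f = open(os.path.join('resources', 'ch04-googleplus', '107033731246200681024.json'))
activity_results = json.loads(f.read())
f.close()

print len(activity_results), "activities loaded from", f.name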

With the know-how to explore the Google+ API and fetch some interesting human language data from activities’ content, let’s now turn our attention to the problem of analyzing the content.