Visualizing Twitter Data

August 30, 2014

Goals: To use the Twitter Streaming API in Python to filter and extract tweets based on search criteria; to funnel these tweets into MongoDB and inspect them using the Robomongo GUI; to export tweet content to a CSV file and perform further data analysis using Pandas; to plot the results using the IPython Notebook

Tools Used: python, ipython, python idle, twitter streaming api, tweepy, mongodb, robomongo, pandas, vincent, sublime text 3

Installations & Setup

This article assumes that you have installed an up-to-date version of Python. To check your installation, open a new terminal window and type:

python

If Python is already installed on your system, you'll see something like:

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

If instead an error is thrown, head over to the Python Wiki and download the correct version for your system.

Now, let's make sure you can pull up IDLE, the editor and shell that ships with Python. From the console type:

idle

This should bring up a shell in which you can execute Python commands.
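
For example, typing a line of Python at the >>> prompt and pressing Return runs it immediately:

>>> print "Hello from IDLE"
Hello from IDLE
>>> 2 + 2
4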

We are also going to need pip, a tool used to install and manage Python packages. Visit their documentation pages and click the link to download get-pip.py. Save the .py file to your desktop.

Back at the terminal, change directories to your desktop using the cd command. For example, if you are currently at your main $USER directory (on a Mac), type:

cd Desktop

This will bring you inside your Desktop folder (and to the same level as the get-pip.py file that we just downloaded).

Now, type:

python get-pip.py

This should initialize setup of pip; progress will be displayed in your terminal window.

There are two main errors that you might encounter:

  1. Unable to locate file "get-pip.py": This results from not running the python get-pip.py command from the correct directory. You must cd into the directory in which the downloaded file resides; if you save the file to your Desktop but run the command from somewhere else, it will fail.
  2. Insufficient permissions: If the currently logged-in user does not have sufficient read/write access during the installation, you will get an error. Try prepending sudo to the command and running it again (i.e. sudo python get-pip.py).

If ever you encounter unexpected behavior at the command prompt, try closing the terminal window and opening a fresh one.
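
Once pip finishes installing, you can confirm it is on your path by asking it for its version:

pip --version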

Next, we'll need to install Tweepy, a beautiful package for Twitter data handling. Since we've already got pip, this terminal command will do the trick:

pip install tweepy

Like before, if you have trouble with insufficient permissions, prepend the command with sudo.

Do similar installs for Pandas and Vincent:

pip install pandas
pip install vincent

We'll also need Pymongo, the recommended package for dealing with MongoDB in Python. Again, install using:

pip install pymongo
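
Before moving on, it doesn't hurt to confirm that the packages actually import. Open a Python shell (type python at the terminal) and try the following; no ImportError means the installs worked:

>>> import tweepy
>>> import pandas
>>> import vincent
>>> import pymongo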

Pymongo interfaces with MongoDB, which requires there to be a data directory at a specific location on your hard drive. On a Mac, this means that you will need to create a new folder in your root directory called "data" (at the same level as your Applications, Library, System and Users folders). Inside this folder create another folder called "db". For more information visit this page.
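
Assuming MongoDB itself is already installed, you can create both folders from the terminal (sudo may be required depending on your account's permissions) and then start the database server with the mongod command. Keep that terminal window open: the server needs to be running whenever our scripts talk to the database.

sudo mkdir -p /data/db
sudo chown $(whoami) /data/db
mongod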

Fun fact: MongoDB got its name from the word "humongous". I found this amusing.

Next up: Robomongo, a handy graphical interface for MongoDB. Visit their official page and download the correct version.

Once the download is complete and the installer has finished, go ahead and open the Robomongo application. The initial launch screen will prompt you to create, edit, remove, clone or reorder database connections; this is where we will tie into our existing MongoDB installation.

Click "Create" to bring up a Connection Settings window. In the Connection tab, enter a name for the new connection and make sure the address is "localhost" (port 27017). This is the MongoDB default location. Since authentication isn't needed for this project, just click Save to store the connection in the main launch list.

Twitter Application Permissions

In order to use Tweepy and the underlying Twitter Streaming API, we need to create a new application and link it to our Python code using keys and secrets. Head to the Twitter Developer's portal and create a new account. Once logged in, navigate to the Application Management page and click the button to create a new app. Give your app a name and a description, and associate it with a website if desired. Don't worry about entering a callback URL. Finally, accept the Rules of the Road and click Create.

Click the name of the app you just created and then navigate to the API keys tab. Under the "Application Settings" section, you should see an API Key and an API Secret. Copy both of these long strings into a text editor for later.

Scroll down until you see the "Access Tokens" section and click the button to create a new token. This may take up to a minute to complete. Continue to refresh until you see an access token generated at the bottom of the page. Copy the Access Token and Access Token Secret into the text editor too.

Your keys and secrets are...well, secret. Don't go sharing them or accidentally pasting them online.
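
One way to keep them out of any code you share is to stash them in a separate module that never leaves your machine (a hypothetical credentials.py, for example) and import it instead of hard-coding the strings. The tutorial below pastes the keys directly into the script for simplicity, but a sketch of the alternative looks like this:

#credentials.py (hypothetical helper module; keep this file private)
consumer_key = "YOUR-API-KEY"
consumer_secret = "YOUR-API-SECRET"
access_token = "YOUR-ACCESS-TOKEN"
access_token_secret = "YOUR-ACCESS-TOKEN-SECRET"

#then, at the top of any script, you could replace the hard-coded strings with:
#from credentials import consumer_key, consumer_secret, access_token, access_token_secret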

Extracting Tweets

Back in the terminal, bring up the Python editor by typing:

idle

Create a new folder on your desktop named "TweetData". Then, in IDLE, create a new file by going to File->New File and save it into that folder as "extractor.py".

The extractor file we're about to build will contain all the authentication details (keys and secrets) needed to use Twitter's streaming functionality to pull real-time tweets. The very last line of the file is where you specify a filter for the extracted tweets (e.g. a specific keyword to search for; the default here is "dog"). Copy and paste the following code into "extractor.py" and enter your two sets of application credentials at the top of the file (between the empty quotation marks).

import tweepy
import sys

#user application credentials
consumer_key=""
consumer_secret=""

access_token=""
access_token_secret=""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()

    def on_status(self, status):
        print status.text, "\n"

    #handle errors without closing stream:
    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True 

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True 

sapi = tweepy.streaming.Stream(auth, CustomStreamListener(api))
sapi.filter(track=['dog'])

With your cursor in "extractor.py", click File->Save and then Run->Run Module. Depending on the search term, you should see tweets start rolling in. If your term is uncommon, you may have to wait a few minutes. If your term is used often, expect to be inundated with data almost immediately.

When you're done staring at tweets, close out the Python shell, clicking OK when asked if you want to kill the program.
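
If you'd rather not kill the shell by hand, Tweepy will disconnect the stream whenever a listener callback returns False. Here is a sketch of a self-limiting on_status; the tweet_count attribute is hypothetical (you would set self.tweet_count = 0 in __init__), and the limit of 100 is arbitrary:

    def on_status(self, status):
        print status.text, "\n"
        self.tweet_count += 1      #assumes self.tweet_count = 0 was added to __init__
        if self.tweet_count >= 100:
            return False           #returning False tells Tweepy to disconnect the stream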

Storing Tweets in Real Time

Now that we can successfully extract the data, we need to find a way to funnel it into a database. To do this, we will make several additions to the above code to import the necessary database packages, create an empty database and save tweets into it. Make sure your "extractor.py" code is identical to the following:

import tweepy
import sys
import pymongo

#user application credentials
consumer_key=""
consumer_secret=""

access_token=""
access_token_secret=""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()

        self.db = pymongo.MongoClient().Dog

    def on_status(self, status):
        print status.text, "\n"

        data = {}
        data['text'] = status.text
        data['created_at'] = status.created_at
        data['geo'] = status.geo
        data['source'] = status.source

        self.db.Tweets.insert(data)

    #handle errors without closing stream:
    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True

sapi = tweepy.streaming.Stream(auth, CustomStreamListener(api))
sapi.filter(track=['dog'])

What this script does is create a new MongoDB database called "Dog" with a new collection called "Tweets" inside of it.
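
If you'd like to double-check from Python rather than a GUI, a couple of standard pymongo calls in a fresh shell will confirm that documents are landing:

>>> import pymongo
>>> db = pymongo.MongoClient().Dog
>>> db.Tweets.count()      #number of tweets stored so far
>>> db.Tweets.find_one()   #one complete document, fields and all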

One thing that initially threw me for a loop was the population of the data dictionary in the above code (shown again here for clarity):

data = {}
data['text'] = status.text
data['created_at'] = status.created_at
data['geo'] = status.geo
data['source'] = status.source

It turns out that text, created_at, geo and source are default properties possessed by each and every tweet. They are just four examples from a long list of properties available for data mining.
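
If you wanted each document to capture a bit more, you could add extra assignments right alongside these in on_status. For example, screen_name and followers_count live on the status's user object in Tweepy:

data['screen_name'] = status.user.screen_name          #who sent the tweet
data['followers_count'] = status.user.followers_count  #how large their audience is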

While latitude/longitude information allows you to do very cool things with geolocation and heat-mapping, very few Twitter users enable location services. Hence, this parameter is almost always moot. Shame.

Anyway, for each incoming tweet that matches our search criteria we create a new entry in our database that contains the tweet's content, the date it was created, its geographic origin and its source.

Don't believe me? Open Robomongo and connect to the localhost connection we created earlier.

Once connected, expand the parent connection in the left pane by clicking the little black arrow. Then look for the database we created, "Dog". Expand "Dog" and then expand "Collections" within it. You should see a collection called "Tweets"; double click to open the collection in the Robomongo view panel and inspect the documents it contains.

You can use the little icons in the upper right to toggle between views (I find it helpful to view tweet data in text mode). Depending on how long you let the initial "extractor.py" file run, you may see anywhere from a handful to several thousand tweet entries.

Exporting from MongoDB to CSV

So we have all of this data; now how do we handle it? First, let's export it to a comma-separated values (CSV) file using the mongoexport command.

Head back to the terminal and type:

mongoexport --db Dog --collection Tweets --csv --fields text,created_at,geo,source --out output.csv

This command selects the database "Dog", selects the collection "Tweets" within "Dog", specifies an output type of csv, specifies that we want to export the text, created_at, geo and source fields and then names the output file "output.csv". Barring any errors, the file will be saved in whatever directory you ran the command from (your home directory, if you just opened a fresh terminal window). Find it, and move it into the TweetData folder on your desktop.

If you can't find "output.csv", try doing a system-wide search for it.

Go ahead and open the csv file in a text editor or spreadsheet to get an idea of its structure and contents.
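
If a spreadsheet feels like overkill, Python's built-in csv module will show you the header row and the first record just as well:

import csv

with open('/path/to/Desktop/TweetData/output.csv') as f:
    reader = csv.reader(f)
    print reader.next()   #header row: text,created_at,geo,source
    print reader.next()   #first exported tweet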

Analysis and Visualization

We will be using the IPython Notebook to analyze our Twitter data. Type at the command line:

ipython notebook

This should bring up a new browser window at the location http://localhost:8888/. Think of it as an interactive, browser-based environment for running Python code and displaying the results inline.

Click the New Notebook button and then enter the following code in the first cell:

import pandas as pd
from pandas.tseries.resample import TimeGrouper
from pandas.tseries.offsets import DateOffset
import vincent as v

To execute a cell, press Shift-Enter. This chunk imports the packages we'll need for analysis and plotting.

The next chunk reads in the csv file we created, parses the created_at field, sets it as the index and converts the timestamps from GMT to local (EST) time. Be sure to enter the correct path to your csv file in the TweetData folder in the first line.

dogs = pd.read_csv('/path/to/Desktop/TweetData/output.csv')
dogs['created_at'] = pd.to_datetime(pd.Series(dogs['created_at']))
dogs.set_index('created_at', drop=False, inplace=True)
dogs.index = dogs.index.tz_localize('GMT').tz_convert('EST')
dogs.index = dogs.index - DateOffset(hours = 12)
dogs.index

Next, we perform a resampling calculation to determine the number of tweets per minute. This is one of many, many functions supported by Pandas.

dogs1m = dogs['created_at'].resample('1t', how='count')
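
Before plotting, you can sanity-check the result in the next cell; it should be a Series of counts indexed by one-minute intervals:

dogs1m.head()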

Finally, we initialize the notebook, create the graph and display it on the screen:

v.core.initialize_notebook()
area = v.Area(dogs1m)
area.colors(brew='Spectral')
area.display()

If your IPython code executes but no graph appears, try running %pylab inline in a cell first and then re-entering the code above.

The graph below shows tweets per minute for the search term "bermuda" over a 15-minute period. And this only scratches the surface of what is possible. We haven't even labeled our graph's axes, for Pete's sake!

Graphing Tweet Data Using iPython
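
Speaking of axis labels, vincent does provide an axis_titles helper if you want to dress the chart up a bit. A quick sketch, run in the same notebook after the chart has been created (the label text is just a suggestion):

area.axis_titles(x='Time (EST)', y='Tweets per minute')
area.display()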

Notes

Embedded Code Snippets

The colorful code blocks in this article were created using the syntax highlighting plugin called highlight.js. The theme I'm using is "Codepen Embed", which is included by default. To use this theme, simply reference it in your CSS link string:

<link rel="stylesheet" href="path/to/style/codepen-embed.css">


Inspiration & Resources

This project was inspired by Daniel Forsythe's Ice Hockey Twitter Project. Unfortunately, my skill level was far below what was needed to follow Daniel's train of thought point for point. I decided to create a beginner's guide to enable Python newbies to interact with the amazing amount of data that can be harnessed using Tweepy.



I would not have been able to complete this without Sentdex on YouTube. His Twitter video series was absolutely invaluable.