Mining Twitter data using Python and GemFire XD

Python is a very popular and powerful programming language; that’s a fact. Some people may find it useful to write a Python client for GemFire XD, so this article describes one of the easiest ways to do that while also explaining a few details of the Twitter API.

The standard for database connectivity in Python is DB-API 2.0 (PEP 249), which is roughly the equivalent of JDBC for Java. There are plenty of implementations for traditional databases like Oracle and MySQL, but very few alternatives for NewSQL databases like GemFire XD. As you may already know, GemFire XD is based on Apache Derby (10.4) and, as very few will remember, Derby was originally distributed by IBM as Cloudscape until 2004, when the code was contributed to the Apache Software Foundation as an incubator project. Why is it important to remember this piece of history? Because we’re going to leverage this IBM background of Apache Derby and use the pyDB2 library for GemFire XD connectivity.

To make the tutorial a little more interesting we’re going to write a simple Twitter client that populates tweets into GemFire XD, followed by a few examples of how we can analyze the collected data.

Requirements and dependencies

Let’s get started by installing the dependencies:

  • DB2 Express-C – A no-charge (to develop/deploy) community edition of DB2, available for multiple platforms.
  • pyDB2 - Written by Man-Yong Lee, licensed under the LGPL; a DB-API 2.0 compliant interface for the IBM DB2 database.
  • TwitterAPI - A Python wrapper for the Twitter API with support for streaming.

Of course, Python (2.7+) and GemFire XD are also required; download links are available under References.

Note: Alternatively, you could use the GemFire XD ODBC drivers with the mxODBC module written by Marc-Andre Lemburg (which has a commercial license) or with pyODBC, an open source alternative, but both rely on platform-specific ODBC drivers (see the sketch after these notes).
Note 2: An alternative to DB2 Express-C is the DB2 Universal Run-Time client, but AFAIK it’s only available for Windows.
Note 3: I’m using OS X/Unix-based instructions here, but most of them can easily be adapted to a Windows environment.
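For completeness, here is roughly what a pyODBC-based connection would look like. This is only a hypothetical sketch: it assumes you have installed a platform ODBC driver for GemFire XD and registered a DSN named GFXD with the app/app credentials, none of which is covered in this article.

# Hypothetical pyODBC alternative -- assumes a GemFire XD ODBC driver is
# installed and a DSN named "GFXD" is configured on this machine.
import pyodbc

conn = pyodbc.connect('DSN=GFXD;UID=app;PWD=app')
cursor = conn.cursor()
cursor.execute('select * from sys.systables')
print cursor.fetchone()
conn.close()

The rest of this article sticks with the pyDB2 route.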

Installation

DB2 Client Install

Download and install DB2 Express-C. For Mac OS X the package is called db2_v101_macos_expc.tar.gz. Uncompress it anywhere you want and run the installer command:

$ tar -xzvpf db2_v101_macos_expc.tar.gz
$ cd expc/
$ ./db2_install   (can be run with sudo for a system-wide installation, but it’s not required)

Follow the installation instructions; they’re intuitive.

pyDB2 install

After the installation is completed, download and install pyDB2. You need to modify the setup.py script to point to the location where you installed the DB2 client:

$ tar -xzvpf PyDB2_1.1.1-1.tar.gz
$ cd PyDB2_1.1.1/
$ vi setup.py

Look for the DB2_ROOT variable and point it to your DB2 installation. In my case, using the installation defaults and no sudo, it’s under a folder called sqllib in the user home directory.

DB2_ROOT = "/Users/markito/sqllib/"

Save and quit the setup.py file and perform the installation (requires root/sudo):

$ sudo python setup.py install

TwitterAPI install

Using easy_install simply do:

$ sudo easy_install TwitterAPI
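With the dependencies in place, an optional sanity check is to confirm that both modules import cleanly under the interpreter you plan to use:

# Optional sanity check: both imports should succeed without errors when
# run with the same Python interpreter used in the rest of this article.
import DB2
import TwitterAPI

print 'DB2 and TwitterAPI imported successfully'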

Configuration and startup

Starting Gemfire XD

We’re going to use two data nodes and one locator with default settings, changing only the listening ports since we’re running everything on a single host.

$ cd ~
$ mkdir locator server1 server2
$ sqlf locator start -jmx-manager-start=true -dir=locator
....Starting SQLFire Locator...
$ sqlf server start -locators=localhost[10334] -client-port=1528 -dir=server1
....Starting SQLFire Server...
$ sqlf server start -locators=localhost[10334] -client-port=1529 -dir=server2
....Starting SQLFire Server...

Note that we have started the locator with -jmx-manager-start=true, which starts Pulse, the embedded monitoring tool for GemFire XD.

Cataloging GemFire XD under DB2 client

In order to connect from pyDB2 to GemFire XD we need to “catalog” our database, which is a way of registering the database with the DB2 client under an alias. This is required even for pure DB2 installations when, for example, your database server is hosted on a different machine than your client.

$ cd $DB2_ROOT (~/sqllib/)
$ cd bin
$ ./db2 catalog tcpip node GFXDNODE remote localhost server 1528
DB20000I  The CATALOG TCPIP NODE command completed successfully.
DB21056W  Directory changes may not be effective until the directory cache is
refreshed.
$ ./db2 catalog database GFXD at node GFXDNODE authentication server
DB20000I  The CATALOG DATABASE command completed successfully.
DB21056W  Directory changes may not be effective until the directory cache is
refreshed.

Note: If you messed up during this process or want to rename your catalog node, use the uncatalog option.

Creating the application and testing

I’m going to use STS (Spring Tool Suite) with the PyDev plugin for Python development, but this can easily be done with Vim or any other text editor. The reason for PyDev is code completion, which is handy for occasional Python developers (like myself), and STS also offers a built-in SQL interface that can be used to browse GemFire XD tables and run queries, which could also be done through the sqlf command-line utility.

Creating the TWEETS table

Read this article on how to set up GemFire XD connectivity with STS (Eclipse) for database development. With the setup in place, open a SQL Scrapbook and create a table:

CREATE TABLE TWEETS
(
id BIGINT,
created_at TIMESTAMP,
favorited VARCHAR(5),
lang VARCHAR(3),
retweet_count INT,
retweeted VARCHAR(5),
source VARCHAR(256),
text VARCHAR(150),
user_id BIGINT,
CONSTRAINT FAVORITED_CONSTRAINT CHECK (favorited in ('True','False')),
CONSTRAINT RETWEETED_CONSTRAINT CHECK (retweeted in ('True','False')),
PRIMARY KEY (id)

) PARTITION BY COLUMN (user_id);

We’re creating a simple table mapping the most common tweet fields, partitioned by user_id, which should scale well if we later decide to co-locate data about the users with their own tweets. Depending on your needs, you may want to partition by tweet location or some other criteria. Other than that, it’s the usual CREATE TABLE SQL syntax.

Creating the Python client

  1. On STS create a new PyDev project and name it PyTweetXD
  2. Under the project create a new PyDev module, name it GfxdClient.py, and write a hello-world query:
import DB2

conn = DB2.connect('GFXD', 'app','app')
cursor = conn.cursor()
cursor.execute('select * from sys.systables')

print cursor.fetchone()

if __name__ == '__main__':
    pass

Run this Python script and you should see the query results printed to stdout.

('2faf0139-0142-bae2-7db6-000007d2f9a8', 'SYSXPLAIN_SCAN_PROPS', 'T', 'c013800d-00fb-2644-07ec-000000134f30', 'SYSSTAT', 'R', '', 'NORMAL', None, None, None, None, None, None, None, None, None, 0, None)
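The raw tuple is not very readable. Since pyDB2 follows DB-API 2.0, the cursor should also expose column metadata through cursor.description, so a small variation of the hello-world script (a sketch, assuming that attribute is populated as the spec prescribes) can print each value next to its column name:

# Variation on the hello-world query: pair each value with its column
# name using the DB-API 2.0 cursor.description attribute.
import DB2

conn = DB2.connect('GFXD', 'app', 'app')
cursor = conn.cursor()
cursor.execute('select * from sys.systables')

row = cursor.fetchone()
column_names = [d[0] for d in cursor.description]
for name, value in zip(column_names, row):
    print '%s = %s' % (name, value)

conn.close()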

Now let’s write a GfxdClient class that will be reused to persist our tweets from the Twitter client.

  1. First create a file named config and put our connection details there. Create a section in this file so we can reuse it later for the Twitter client settings as well.
    [GemFireXD]
    database=GFXD
    user=app
    password=app
    
  2. With the help of the ConfigParser Python module we’re going to create a simple class that uses this information to connect to GemFire XD:
    # -*- coding: utf-8 -*-
    import DB2
    
    from ConfigParser import ConfigParser
    
    class GfxdClient():
        _cursor = None
        _conn = None
        _config = ConfigParser()
    
        def __init__(self):
            self._config.read('config')
    
        def readConfig(self, name):
            return self._config.get('GemFireXD',name)
    
        def connect(self):
            self._conn = DB2.connect(self.readConfig('database'), self.readConfig('user'), self.readConfig('password'))
    
        def getCursor(self):
            # lazily create and cache the cursor on the instance
            if (self._cursor == None):
                if (self._conn != None):
                    self._cursor = self._conn.cursor()
                else:
                    raise Exception("No active connection!")
            return self._cursor
    
        def execute(self, sql, params = None):
            # only pass params when provided (optional in DB-API 2.0)
            if (params == None):
                self.getCursor().execute(sql)
            else:
                self.getCursor().execute(sql, params)
    
        def select(self, sql, params = None):
            return self.execute(sql, params)
    
        def insert(self, sql, params):
            return self.execute(sql, params)
    

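Before wiring the class to Twitter, a quick smoke test helps confirm that the config file and the connection work. This is just an illustrative snippet reusing the earlier hello-world query:

# Quick smoke test for GfxdClient: read the config file, connect,
# run the hello-world query and print the first row.
from GfxdClient import GfxdClient

gfxd = GfxdClient()
gfxd.connect()
gfxd.select('select * from sys.systables')
print gfxd.getCursor().fetchone()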
Finally, we are going to write a Twitter client that monitors public tweets for a given term using the Twitter streaming API. Before proceeding, make sure you have registered an application at https://dev.twitter.com/apps and generated the OAuth keys, which are required. To keep it simple, put the OAuth keys in the config file under a Twitter section. Your config file may look like this:

[Twitter]
consumer_key=*************
consumer_secret=***************************
access_token_key=***************************
access_token_secret=*************************

[GemFireXD]
database=GFXD
user=app
password=app

Create a new Python module named TwitterClient.py, a script that collects public real-time tweets and persists them into GemFire XD. Here is the code:

# -*- coding: utf-8 -*-
from TwitterAPI import TwitterAPI
from GfxdClient import GfxdClient
from ConfigParser import ConfigParser
import time
import signal
import sys

successCount = 0
errorCount = 0

'''
Print some stats after script is stopped
'''
def signal_handler(signal, frame):
        print '\n# Stats'
        print 'Saved tweets: %d' % successCount
        print 'Errors: %d' % errorCount
        sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)

### Read config file
config = ConfigParser()
config.read('config')

# get twitter keys
consumer_key        = config.get('Twitter', 'consumer_key')
consumer_secret     = config.get('Twitter', 'consumer_secret')
access_token_key    = config.get('Twitter', 'access_token_key')
access_token_secret = config.get('Twitter', 'access_token_secret')

# the subject you want to track on twitter
subject = '#NBA'

# twitter streaming api
api = TwitterAPI(consumer_key,consumer_secret,access_token_key,access_token_secret)
stream = api.request('statuses/filter', {'track': subject})

# Connect to GemFireXD
gfxd = GfxdClient()
gfxd.connect()

print "Reading Twitter stream for subject: %s (hit ctrl+c to stop)" % subject

# read streaming data and persist into GemFireXD
for tweet in stream.get_iterator():
    sql = "insert into tweets values (?,?,?,?,?,?,?,?,?)"
    try:
        # convert Twitter's created_at format into a SQL timestamp and
        # map the remaining tweet fields to the TWEETS table columns
        created_at = time.strftime('%Y-%m-%d %H:%M:%S',
                                   time.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y'))
        params = (tweet['id'], created_at, str(tweet['favorited']), str(tweet['lang']),
                  tweet['retweet_count'], str(tweet['retweeted']), str(tweet['source']),
                  unicode(tweet['text']), tweet['user']['id'])

        gfxd.insert(sql, params)
        successCount += 1
    except Exception as e:
        errorCount += 1
        print e

if __name__ == '__main__':
    pass

We’re basically tracking every tweet that contains the #NBA term and inserting it into GemFire XD. If any error occurs we’ll just print the exception, count it and keep processing. Running the script as a test on a Friday night with 10 games scheduled, I was able to collect approximately 80k tweets, which may not be a huge amount of data (at GemFire XD scale) but is already enough to analyze and good enough for some examples.

Monitoring using Pulse

Now, while the script is running, let’s open a browser and check Pulse, the GemFire XD built-in monitoring tool, to see what’s happening with the servers. The default URL for Pulse is http://localhost:7070/pulse

After logging in (admin/admin), the first page gives you a clear and beautiful overview of the system, with the total heap available (from both servers), current memory usage and the last 15 minutes of throughput (put/get operation rates).

[Image: Pulse main page overview]

We can double-click one of the servers and check that the load is being balanced between both nodes.

[Image: Pulse server detail page]

Let’s get back to STS and perform some basic data analysis by running a few SQL queries. One of the joys of GemFire XD is that it parallelizes queries across the members, so even as I scale up or down, the locator takes care of load balancing queries between members and keeping the data properly distributed.

Data analysis

Sources Rank (Query)

Most Twitter clients share their name through a field named ‘source’ on every tweet. So if you post a tweet from an iPhone, the tweet’s source field looks like this:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>

But if you post a tweet from an Android device it may look like this: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
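Because the source field is stored as an HTML anchor, you may want to strip the markup before grouping or displaying it. The helper below is a hypothetical addition (not part of the client above) that pulls out just the client name with a regular expression:

# Hypothetical helper: extract the readable client name from the HTML
# anchor stored in a tweet's source field.
import re

def source_name(source):
    match = re.search(r'>(.*?)</a>', source)
    return match.group(1) if match else source

print source_name('<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>')
# prints: Twitter for iPhone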

Back in STS, in the SQL Scrapbook, we can run the following query to see which source produced the most tweets among those we collected:

-- rank of tweet sources
  select ROW_NUMBER() OVER() as RANK,
         source, count(*) total
    from tweets
group by source
order by total desc

The first 20 results of this query returned the following:

[Image: sources rank query results]

Based on these results, we can see that iPhone users posted roughly double the number of messages that Android users did.
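The same ranking can also be pulled from Python by reusing the GfxdClient class. Here is a minimal sketch; the query mirrors the one above, minus the ROW_NUMBER() column:

# Hypothetical: run the sources rank from Python instead of STS and
# print the 20 most common clients with their tweet counts.
from GfxdClient import GfxdClient

gfxd = GfxdClient()
gfxd.connect()
gfxd.select('select source, count(*) as total from tweets '
            'group by source order by total desc')

for source, total in gfxd.getCursor().fetchmany(20):
    print '%8d  %s' % (total, source)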

Languages Rank (Query)

As another quick example: we know the NBA is pretty much a global league, with 90+ international players representing more than 30 different countries. But which languages, besides English of course, talk about the NBA on Twitter? Run the following query against the data to get an idea:

-- languages most tweeted
  select ROW_NUMBER() OVER() as RANK,
         lang, count(*) total
    from tweets
group by lang
order by total desc

And the top 20 results of this query are:

[Image: languages rank query results]

According to this data, English tops the list of course, followed by Spanish, then French, Japanese and German.

Word Frequency Analysis (Python script)

Our final example uses another Python script to parse the text of all tweets and return the most frequently used words across all messages, leveraging the Counter class from Python’s collections package. The script looks like this:

import DB2
from ConfigParser import ConfigParser
from collections import Counter

config = ConfigParser()
config.read('config')

conn = DB2.connect(config.get('GemFireXD', 'database'), config.get('GemFireXD', 'user'), config.get('GemFireXD', 'password'))

SQL = 'SELECT text FROM tweets'
cursor = conn.cursor()
cursor.execute(SQL)

BATCH_SIZE = 10000
result = cursor.fetchmany(BATCH_SIZE)
wordCounter = Counter()

while result:
    # token list with every word of 3 or more characters from every tweet
    words = [w
             for t in result
             for w in t[0].split()
             if len(w) >= 3]
    # add this batch's token counts to the running total
    wordCounter = wordCounter + Counter(words)

    try:
        result = cursor.fetchmany(BATCH_SIZE)
    except Exception as e:
        result = None
        print e

print wordCounter.most_common()[:10] # top 10 print

if __name__ == '__main__':
    pass

We’re going to call this script from a shell and evaluate how long it takes to complete. This is the output:

-bash-4.1$ time python BatchWordAnalyzer.py
[('#NBA', 16678), ('Kyle', 11120), ('Korver', 10916), ('#LosAngeles', 8163), ('made', 7304), ('@NBA:', 7275), ('record.', 6840), ('games', 6804), ('straight', 5892), ('Happy', 5759)]

real	0m5.809s
user	0m5.523s
sys	0m0.031s

So in about 6 seconds we were able to parse 80,000+ tweets, count every word and its frequency, and generate the top 10 words across all these messages. Based on this result it’s clear that everybody was talking about Kyle Korver’s 3-point record. He had hit a shot from behind the arc in 90 consecutive games, an impressive mark, by the way.
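One caveat visible in this output is that punctuation stays attached to some tokens ('@NBA:', 'record.'). A small, hypothetical normalization helper could be applied to each word inside the list comprehension of BatchWordAnalyzer.py, assuming simple lowercasing and punctuation stripping are enough for your analysis:

# Hypothetical token clean-up: lowercase each word and strip surrounding
# punctuation so "record." and "record" are counted as the same token.
import string

def normalize(word):
    return word.lower().strip(string.punctuation)

print normalize('record.')  # record
print normalize('@NBA:')    # nba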

Conclusion

This article demonstrated an alternative to ODBC for Python clients that want to use GemFire XD. It also showed how easy it can be to create an application that stores and analyzes data from tweets about a specific topic using Twitter’s streaming API. A tweet contains a very limited number of characters, which is why we didn’t even have to enable disk persistence or any of the more advanced features of GemFire XD, at least for this load. But for anyone who wants to keep this data persisted for historical analysis, it can easily be done.

Note that we’re using a very naive approach in the Python code, without parallelism for example, which would definitely improve our processing time; and by adding data locality and processing the data inside the GemFire XD nodes rather than in the client, it would be a lot faster still.
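As an illustration of the client-side parallelism idea, here is a hedged sketch using Python’s multiprocessing module. It is not part of the original script: the batches would still be fetched over a single connection (for example with repeated cursor.fetchmany(BATCH_SIZE) calls as above), and only the tokenizing and counting step is spread across worker processes.

# Hypothetical sketch: parallelize the word counting across processes.
# Each batch is a list of (text,) rows as returned by cursor.fetchmany().
from collections import Counter
from multiprocessing import Pool

def count_batch(rows):
    # tokenize one batch and count words with 3 or more characters
    return Counter(w for (text,) in rows for w in text.split() if len(w) >= 3)

def parallel_count(batches, processes=4):
    pool = Pool(processes)
    try:
        counters = pool.map(count_batch, batches)
    finally:
        pool.close()
        pool.join()
    # merge the per-batch counters into a single frequency table
    return sum(counters, Counter())

Calling parallel_count(batches).most_common()[:10] on the collected batches would then produce the same kind of top-10 list.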

What’s next ?

If you are using GemFire XD inside the Pivotal HD VM and want to play with the GemFire XD and PHD integration, here is the extra work needed for Hadoop persistence with GemFire XD:

Create a HDFS persistence store:

sqlf> CREATE HDFSSTORE tweetStore
NAMENODE 'hdfs://pivhdsne:8020'
HOMEDIR '/tweetStore'
BATCHTIMEINTERVAL 5000;
0 rows inserted/updated/deleted

And modify the table DDL as follows:

CREATE TABLE TWEETS
(
id BIGINT,
created_at TIMESTAMP,
favorited VARCHAR(5),
lang VARCHAR(3),
retweet_count INT,
retweeted VARCHAR(5),
source VARCHAR(256),
text VARCHAR(150),
user_id BIGINT,
CONSTRAINT FAVORITED_CONSTRAINT CHECK (favorited in ('True','False')),
CONSTRAINT RETWEETED_CONSTRAINT CHECK (retweeted in ('True','False')),
PRIMARY KEY (id)

) PARTITION BY COLUMN (user_id)
HDFSSTORE (tweetStore);

That’s it! Now your tweets will be persisted into HDFS! Simple, huh?

In a future post I’m going to demonstrate how to create the same kind of analysis using Spring XD and GemFire XD, but looking for trending topics in the hope of generating more data.

References

