HBase Storage and PIG


We’ve been using PIG for analytics and for processing data for use in our site for some time now. PIG is a high level language for building data analysis programs that can run across a distributed Hadoop cluster. It has allowed us to scale up our data processing while decreasing the amount of time it takes to run jobs.

When it came time to update our runtime data storage for the site, it was natural for us to consider using HBase to achieve horizontal scalability. HBase is a distributed, versioned, column-oriented store based on Hadoop. One of the great advantages of using HBase is the ability to integrate it with our existing PIG data processing. In this post I will introduce you to the basics of working with HBase from your PIG scripts.

Getting Started

Before getting into the details of using HBaseStorage there are a couple of environment variables you will need to make sure are set so that HBaseStorage can work correctly.

export HBASE_HOME=/usr/lib/hbase
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

First, you will need to let HBaseStorage know where to find the HBase configuration, hence the HBASE_HOME environment variable. Second, the PIG_CLASSPATH needs to be extended to include the classpath for loading HBase. If you are using PIG 0.8.x there is a slight variation:

export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"

Hello World
Let’s write a simple script to load some data from a file and write it out to an HBase table. To begin, use the shell to create your table:

jhoover@jhoover2:~$ hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.3-cdh3u1, r, Mon Jul 18 08:23:50 PDT 2011

hbase(main):002:0> create 'sample_names', 'info'
0 row(s) in 0.5580 seconds

Next, we’ll put some simple data in a file ‘sample_data.csv’:

1, John, Smith
2, Jane, Doe
3, George, Washington
4, Ben, Franklin

Then we’ll write a simple script to extract this data and write it into fixed columns in HBase:

raw_data = LOAD 'sample_data.csv' USING PigStorage(',') AS (
    listing_id: chararray,
    fname: chararray,
    lname: chararray );

STORE raw_data INTO 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname');
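Reading the data back out is symmetric: HBaseStorage also works as a loader, and passing ‘-loadKey true’ as a second argument prepends the HBase row key to each tuple. A minimal sketch, reusing the table and columns from the example above:

```pig
-- Load rows back from the sample_names table; -loadKey true makes
-- the row key the first field of each tuple.
names = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:fname info:lname', '-loadKey true')
    AS (listing_id: chararray, fname: chararray, lname: chararray);

DUMP names;
```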

Then run the pig script locally:

jhoover@jhoover2:~/hbase_sample$ pig -x local hbase_sample.pig

Success!

Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 raw_data MAP_ONLY hbase://sample_names,

Input(s):
Successfully read records from: “file:///autohome/jhoover/hbase_sample/sample_data.csv”

Output(s):
Successfully stored records in: “hbase://sample_names”

Job DAG:
job_local_0001

You can then see the results of your script in the hbase shell:

hbase(main):001:0> scan 'sample_names'
ROW COLUMN+CELL
1 column=info:fname, timestamp=1356134399789, value= John
1 column=info:lname, timestamp=1356134399789, value= Smith
2 column=info:fname, timestamp=1356134399789, value= Jane
2 column=info:lname, timestamp=1356134399789, value= Doe
3 column=info:fname, timestamp=1356134399789, value= George
3 column=info:lname, timestamp=1356134399789, value= Washington
4 column=info:fname, timestamp=1356134399789, value= Ben
4 column=info:lname, timestamp=1356134399789, value= Franklin
4 row(s) in 0.4850 seconds

Sample Code
You can download the sample code from this blog post here.

Next: Column Families
In PIG 0.9.0 we get new functionality for treating entire column families as maps. I’ll post some examples, along with some UDFs we wrote to support that, next.
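To give a flavor of what that looks like: in PIG 0.9.0, asking HBaseStorage for ‘family:*’ returns the whole column family as a Pig map keyed by column qualifier. A hedged sketch, assuming the sample_names table from the example above:

```pig
-- Load the entire info column family as a map; each tuple gets
-- the row key plus a map of qualifier -> value.
people = LOAD 'hbase://sample_names' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:*', '-loadKey true')
    AS (listing_id: chararray, info: map[]);

-- Pull individual columns out of the map with the # operator.
first_names = FOREACH people GENERATE listing_id, info#'fname';
```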

Have any questions or tips of your own? Let me know here, or follow me on Twitter at @sublogical or check out my personal blog!

by Jay Hoover
