Monday, December 9, 2013

Analyzing unstructured data with Apache Hive, Hadoop and RegexSerDe


Download IMDB movies list with countries to your favorite folder within HDFS on your Hadoop cluster. If you observe records in the file, the movie name is enclosed in quotes and is the first item on the record. The year follows enclosed in parenthesis. Following that there is optional information such as TV episode title, season, episode etc. within curly braces. The last item on the record is the country name.

"'Da Kink in My Hair" (2007) {Empty Bag Can't Stand Up (#1.4)}    Canada

These items must be parsed into fields to enable analysis, such as count of movies made in the year 2000 in the USA. One way to do it is by first creating a table within Hive with the entire record as one field, and then using Create Table as Select (CTAS) where the select uses regexp_extract function to parse fields. The shortcomings of this approach are (1) the CTAS table cannot be external and (2) you have just doubled the storage space required if you intend to keep the original external table.

A better approach is to create the table with a Regular Expressions Serializer/Derserializer class (RegexSerDe) that will parse fields from the record with regular expressions at the time of initial table create.

The RegexSerDe class is contained within the hive-contrib-*.jar file. On my Cloudera Hadoop (CDH) cluster this file is located in the /usr/lib/hive/lib/ folder. Before using the RegexSerDe class, you will need to tell Hive about this library by running the following command on hive prompt:

add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.4.0.jar;

This will add the jar file as a resource for that session. To make this jar available automatically each time you start a hive session, you can edit the /etc/hive/conf/hive-site.xml file to add the following:
<property>
        <name>hive.aux.jars.path</name>
        <value>file:///usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.4.0.jar</value>
</property>
The <value> can take multiple files separated by comma in case you are using other jars.

Now that the RegexSerDe class is available for use within Hive, we can create the table using RegexSerDe like so:
CREATE EXTERNAL TABLE (title string, year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '"([^"]+)"\\s+\\(([^)]+).+'
)
LOCATION '/user/smith/imdb';
The above assumes that the countries.list file is located within the /user/smith/imdb folder on HDFS. We are only parsing out the movie title and date. I will update the post in the future with code that parses all fields. Notice that RegexSerDe requires the "input.regex" pattern to match the entire record, thus the .+ at the end. Also note that Regex shorthand and escape such as \s is written as \\s because Hive's shell escape character is also \. Each field to be parsed is a capture group enclosed by parenthesis.

No comments:

Post a Comment