Tuesday, June 24, 2014

Cleansing data with Pig

Hadoop is a great solution for cleansing and matching as a part of the ETL pipeline. The Map in Map-Reduce is about data transformation.

Consider the following two rows of CSV data:

01-15-2014 12:04:05, row1
2014/02/20, row2

Row 1 has the desired timestamp format MM-DD-YYYY HH:MM:SS. Row 2 timestamp needs to be transformed to match. Here are three ways you can accomplish this with Pig.

Method 1 (Preferred)

a = load 'testdata.csv' USING PigStorage(',') as (date:chararray, row:chararray);
b = foreach a generate (date matches '\\d+/.+'
        ?
            CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(SUBSTRING(date,6,10), '-'), SUBSTRING(date,0,2)), '-'), SUBSTRING(date,3,5)), ' 00:00:00')
        :
            date) as date, row;
dump b;

Method 2

a = load 'testdata.csv' USING PigStorage(',') as (date:chararray, row:chararray);
goodDate = filter a by date matches '\\d+-.+';
parseBadDate = foreach (filter a by date matches '\\d+/.+') generate REGEX_EXTRACT_ALL(date,'(\\d+)/(\\d+)/(\\d+).*') as datenew:(month, day, year), row;

fixedBadDate = foreach parseBadDate generate CONCAT(CONCAT(CONCAT(CONCAT(CONCAT((chararray)datenew.year, '-'), (chararray)datenew.month), '-'), (chararray)datenew.day), ' 00:00:00') as date, row;

together = union goodDate, fixedBadDate;
dump together;

Method 3

a = load 'testdata.csv' USING PigStorage(',') as (date:chararray, row:chararray);
b = foreach a generate REGEX_EXTRACT_ALL(date,'(\\d+)-(\\d+)-(\\d+).*') as date:(year, month, day), row;
c = foreach b generate date.day as day, date.month as month, date.year as year, row;
d = foreach a generate REGEX_EXTRACT_ALL(date,'(\\d+)/(\\d+)/(\\d+).*') as date:(month, day, year), row;
e = foreach d generate date.day as day, date.month as month, date.year as year, row;
f = union c,e;
g = filter f by day is not null;
dump g;

1 comment:

  1. The article is well-written, well-framed and very much informative. A lot more can be learned from this write-up. Pegasi Media Group is the world’s leading database vendor that provides you the email list that you want. It is a well-reputed database vendor in the market or industry. The organizations all over the world prefer to work with Pegasi Media Group to expand their businesses. The organizations that are working with us are very much delighted with our services. We aim at rendering best services to our clients. For further information, or to give your suggestions, please do not hesitate to contact us.Data Cleansing Services

    ReplyDelete