Posted to user@pig.apache.org by Eric Lubow <er...@gmail.com> on 2011/02/17 17:53:46 UTC
JSON Loading on EMR

Hello,

   I'll preface this by saying that I know very little Java and am just
learning Pig.

   My situation is that I am aggregating logs with Flume into a single
logfile.  All my logs are in JSON format and then gzip'd before being added
to S3.  I have 3 types of log lines in each file (b, i, c).  Since I can't
seem to get anything to work, I have pulled a few logfiles down to the local
machine and am running Pig in local mode on the decompressed log files.

   What I am trying to do is write a Pig script to parse the JSON and then
run queries against it.  Since there are 3 types of lines in the same file,
when I do an illustrate of a regex (that I know works because I have tested
it against multiple regex matching programs) it only shows me the first
line, not the first matching line.  The JSON log line that is of type 'b' is
a nested JSON, so I am staying away from that for now (mostly because I
can't figure out how to get the Java in this Gist to build:
https://gist.github.com/601331).  Log lines 'i' and 'c' are single level
JSON (not nested) so a simple regex should work if I understand everything
correctly.
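
   As a sanity check outside Pig, here is a minimal Python sketch of the
"simple regex on single-level JSON" idea, run against the type 'i' sample
line shown further down.  It is not the piggybank EXTRACT call and Python's
regex dialect is not identical to Java's, so treat it only as a way to
confirm the pattern logic.  The names FIELDS, PATTERN and extract are
hypothetical, and the pattern uses [^"]* for quoted values and escapes the
literal braces, which is an assumption about what the intended match is.

```python
import json
import re

# Sample type 'i' line, copied from the example in this message.
LINE = (
    '{"exchange_id":"4cc877b81badf422af000010",'
    '"exchange_user_id":"MTY4Mjk2NTk2eDAuODA2IDEyOTc4MDI5NTh4MTI2NDc5NjY2MA",'
    '"bid_id":"00cc4341-facb-4ec1-a403-d5309472d70e",'
    '"bid_amount":"2.05","win_amount":1.369999968133322,'
    '"ad_ids":"4d237a731badf45c8200011a,4d237ac81badf45c85000006,'
    '4d4c64c0e32b132113000013,4d23807a1badf45c85000299",'
    '"wv":"2","logged_at":"2011-02-15T23:36:31.386Z"}'
)

# Field names in the order they appear in the log line (hypothetical helper).
FIELDS = [
    "exchange_id", "exchange_user_id", "bid_id", "bid_amount",
    "win_amount", "ad_ids", "wv", "logged_at",
]

# Assumption: quoted values never contain an embedded double quote, so
# [^"]* is enough. The literal braces are escaped.
PATTERN = re.compile(
    r'\{"exchange_id":"([^"]*)","exchange_user_id":"([^"]*)",'
    r'"bid_id":"([^"]*)","bid_amount":"([^"]*)","win_amount":([^,]*),'
    r'"ad_ids":"([^"]*)","wv":"([^"]*)","logged_at":"([^"]*)"\}'
)


def extract(line):
    """Return a dict of captured fields, or None if the line is not type 'i'."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


record = extract(LINE)       # fields captured by the regex, all as strings
parsed = json.loads(LINE)    # reference parse with a real JSON parser
```

   Lines of the other two types return None from extract, so they can be
dropped before any grouping or counting is done.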

   More specifics are in this StackOverflow question I posted as well (
http://stackoverflow.com/questions/5013003/how-do-i-parse-json-in-pig).
 Feel free to answer it for the points if we answer the question here.

   The version of Hadoop is 0.20 and Pig is 0.6 because that is what is on
the EMR (Elastic Map Reduce) instances.

   Here is where I am at:
----
Example log line type 'i':
{"exchange_id":"4cc877b81badf422af000010","exchange_user_id":"MTY4Mjk2NTk2eDAuODA2IDEyOTc4MDI5NTh4MTI2NDc5NjY2MA","bid_id":"00cc4341-facb-4ec1-a403-d5309472d70e","bid_amount":"2.05","win_amount":1.369999968133322,"ad_ids":"4d237a731badf45c8200011a,4d237ac81badf45c85000006,4d4c64c0e32b132113000013,4d23807a1badf45c85000299","wv":"2","logged_at":"2011-02-15T23:36:31.386Z"}

Pig Script Attempt:
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
RAW_LOGS = LOAD 'file:/home/hadoop/logs/adserver.log' USING TextLoader AS
(line:chararray);
LOGS_BASE= foreach RAW_LOGS generate
FLATTEN(EXTRACT(line,'{"exchange_id":"(.*[^"])","exchange_user_id":"(.*[^"])","bid_id":"(.*[^"])","bid_amount":"(.*[^"])","win_amount":(.*),"ad_ids":"(.*[^"])","wv":"(.*[^"])","logged_at":"(.*[^"])"}'))
AS
(exchange_id:chararray,exchange_user_id:chararray,bid_id:chararray,bid_amount:float,win_amount:float,ad_ids:chararray,wv:int,logged_at:chararray);
WIDGET_VERSION_ONLY = FOREACH LOGS_BASE GENERATE wv;
WIDGET_VERSION_COUNT = FOREACH (GROUP WIDGET_VERSION_ONLY BY $0) GENERATE
$0, COUNT($1) as num;
WIDGET_VERSION_SORTED_COUNT = LIMIT(ORDER WIDGET_VERSION_COUNT BY num DESC)
5;
----

  Any help that would push me in the right direction would be greatly
appreciated.

-e
--
Eric Lubow
e: eric.lubow@gmail.com
w: eric.lubow.org