Posted to user@pig.apache.org by Dan Brickley <da...@danbri.org> on 2012/07/05 22:21:09 UTC

Simple .py custom loader for slightly-nested input?

Cutting this over from #hadoop-pig IRC:

hi Pig people. I have some TV viewing logs in a text format - example
http://pastebin.com/raw.php?i=HS4zy2pP - ... unfortunately it has some
nesting/list structure, so I can't see a way to read it with an 'out
of the box' Pig loader. Is the conventional practice to write a custom
loader? (Python? Java? anything?). The actual parsing is quite trivial
but I'm unsure how to hook into Pig infrastructure. Ideally it would
be a simple linked .py file, not messing around with complex java
builds etc.

I found e.g. http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html
(for a Java loader). I hate to sound ungrateful but this is looking a
bit heavy, compared to the simplicity of the task. Would a Python
loader be simpler? (i.e. just a second .py script alongside my .pig
script). I was surprised that I wasn't able to find an example of
someone having done this.

Here's the target format, below. Each row is a TV-viewing session,
with a channel and total time, followed by a space-separated list of
item:minute pairs for a sequence of consecutive viewed items on that
channel making up that total.

Thanks for any pointers. I don't mind coding, I just want to find the
right framework to plug into...

cheers,

Dan

2012-03-01T00:00:29Z 1360015279 mychannela 0 asdfasdf:0
2012-03-01T00:04:23Z 0728509428 mychannelb 6 bsdf92c1:6
2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1

(fields: timestamp userid channelid total_duration ... then a sequence
of {itemid}:{mins} for each item viewed in that session of viewing the
channel. These will sum to the total_duration.)
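
For reference, the parsing itself might look like the following in plain Python. This is only a sketch of the format described above, not part of any Pig API: the names parse_session and durations_consistent are made up for illustration, and userid is kept as a string on the assumption that leading zeros matter.

```python
def parse_session(line):
    """Split one log line into its fixed fields plus a list of (itemid, minutes)."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    timestamp, userid, channelid = fields[:3]
    total_duration = int(fields[3])
    items = []
    for pair in fields[4:]:
        # rpartition splits on the last colon, in case an item id contains one
        itemid, _, mins = pair.rpartition(":")
        items.append((itemid, int(mins)))
    return timestamp, userid, channelid, total_duration, items


def durations_consistent(session):
    """True when the per-item minutes add up to total_duration."""
    total_duration, items = session[3], session[4]
    return sum(mins for _, mins in items) == total_duration


sample = "2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1"
session = parse_session(sample)
print(session)
print(durations_consistent(session))
```

The open question is only how to hand each line to a function like this from inside Pig.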

Re: Simple .py custom loader for slightly-nested input?

Posted by Dan Brickley <da...@danbri.org>.
On 6 July 2012 15:53, Duckworth, Will <wd...@comscore.com> wrote:
> Not sure of your desired "final output", but below is pseudocode showing how I solved a similar problem with Pig and Python.
>
> Use PigStorage with new-line as the delimiter (or whatever you are using to denote a new line) in order to throw PIG a "fakie" and have it load the whole line as the tuple.
>
> tv_in = load '$tv_in_path' using PigStorage('\n') as (line:chararray);
>
> Pass each line to a python UDF
>
> tv_in2 = foreach tv_in generate udf.explode_tv(line);
>
> That gets the whole line into the python UDF so that you can do your custom parsing.
>
> Since you don't know the total number of item:minute pairs you are going to have to decide what you want to return.
>
> You could do a bag of item:minute pairs, something like: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemids:bag{iT:tuple(itemid, minutes)})}, or you could create a tuple for each item:minute pair: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemid, minutes)}.
>
> Hope this helps.

Very much so. Thanks, Will!

Dan

RE: Simple .py custom loader for slightly-nested input?

Posted by "Duckworth, Will" <wd...@comscore.com>.
Not sure of your desired "final output", but below is pseudocode showing how I solved a similar problem with Pig and Python.

Use PigStorage with new-line as the delimiter (or whatever you are using to denote a new line) in order to throw PIG a "fakie" and have it load the whole line as the tuple.

tv_in = load '$tv_in_path' using PigStorage('\n') as (line:chararray);

Pass each line to a python UDF

tv_in2 = foreach tv_in generate udf.explode_tv(line);

That gets the whole line into the python UDF so that you can do your custom parsing.

Since you don't know the total number of item:minute pairs you are going to have to decide what you want to return.

You could do a bag of item:minute pairs, something like: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemids:bag{iT:tuple(itemid, minutes)})}, or you could create a tuple for each item:minute pair: R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemid, minutes)}.
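
A rough sketch of what the two return shapes could look like in a udf.py file, assuming it is registered with something like "register 'udf.py' using jython as udf;". The function names explode_tv_nested and explode_tv_flat, the helper _split and the exact schema strings are illustrative assumptions rather than the code I actually used. The first function returns a single tuple holding a nested bag instead of wrapping it in an outer bag. The try/except fallback is only there so the file can also be imported under plain Python for testing, since the outputSchema decorator is normally supplied by Pig.

```python
# Pig's Jython engine provides the outputSchema decorator to registered
# scripts; this no-op fallback (an assumption for local testing) lets the
# file be imported under plain Python as well.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            return func
        return decorate


def _split(line):
    """Hypothetical helper: fixed fields plus a list of (itemid, minutes)."""
    fields = line.split()
    timestamp, userid, channelid = fields[:3]
    total = int(fields[3])
    pairs = []
    for pair in fields[4:]:
        itemid, _, mins = pair.rpartition(":")
        pairs.append((itemid, int(mins)))
    return timestamp, userid, channelid, total, pairs


# Option 1: one tuple per session, with the item:minute pairs as a nested bag.
# A Python list of tuples is returned to Pig as a bag.
@outputSchema("session:tuple(timestamp:chararray, userid:chararray, "
              "channelid:chararray, total_duration:int, "
              "items:bag{t:tuple(itemid:chararray, minutes:int)})")
def explode_tv_nested(line):
    timestamp, userid, channelid, total, pairs = _split(line)
    return (timestamp, userid, channelid, total, pairs)


# Option 2: a bag with one flat tuple per item:minute pair; the session
# fields are repeated on every tuple.
@outputSchema("rows:bag{t:tuple(timestamp:chararray, userid:chararray, "
              "channelid:chararray, total_duration:int, "
              "itemid:chararray, minutes:int)}")
def explode_tv_flat(line):
    timestamp, userid, channelid, total, pairs = _split(line)
    return [(timestamp, userid, channelid, total, itemid, mins)
            for itemid, mins in pairs]
```

With the second shape you would typically follow the foreach with a flatten so that each item:minute pair becomes its own row; with the first, the pairs stay grouped under their session.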

Hope this helps.



Will Duckworth  Senior Vice President, Software Engineering  | comScore, Inc.(NASDAQ:SCOR)
o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:wduckworth@comscore.com