Posted to user@chukwa.apache.org by Matt Davies <ma...@tynt.com> on 2010/10/04 23:28:41 UTC

Chukwa Pig Data Passthrough

Hey all-

Trying to do some operations utilizing Chukwa and Pig.  Would like to basically

1. Read in the data from HDFS
2. Do some SPLIT operations
3. Write the various files out with all the fields as seen during the loading phase.


So, my question is this - is there a way to utilize the org.apache.hadoop.chukwa.ChukwaStorage(); engine to load in and then store out all the various fields without having to individually define fields in the ChukwaStorage constructor?
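
For reference, here is a rough sketch of the flow being described. This is a hypothetical illustration only: the paths, the field names ('user', 'status'), and the assumption that ChukwaStorage can both load and store this way are all made up, and with the current API the output fields still have to be listed in the ChukwaStorage constructor, which is exactly the step being asked about.

```pig
-- Sketch only; paths and field names are hypothetical.
raw = LOAD '/chukwa/repos/demo'
      USING org.apache.hadoop.chukwa.ChukwaStorage() AS (ts: long, fields);

-- Step 2: route records into separate relations.
SPLIT raw INTO errors IF (chararray)fields#'status' == 'ERROR',
               others IF (chararray)fields#'status' != 'ERROR';

-- Step 3: today the fields must be enumerated in the constructor;
-- the question is whether this enumeration can be avoided.
STORE errors INTO '/out/errors'
      USING org.apache.hadoop.chukwa.ChukwaStorage('c_timestamp', 'user', 'status');
STORE others INTO '/out/others'
      USING org.apache.hadoop.chukwa.ChukwaStorage('c_timestamp', 'user', 'status');
```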


Thanks,
Matt


Re: Chukwa Pig Data Passthrough

Posted by Bill Graham <bi...@gmail.com>.
Do you want to split on the chukwa payload fields or the fields in the
record body?

I have scripts that do similar things with the body using FILTER and a
custom TOKENIZE udf I wrote to tokenize the body content. I'm using
the latest ChukwaLoader for Pig 0.7.0, but the previous one should
work the same way.

define chukwaLoader org.apache.hadoop.chukwa.pig.ChukwaLoader();
define tokenize     my.udfs.TOKENIZE();

raw = LOAD '/your/path' USING chukwaLoader AS (ts: long, fields);

-- timePeriod here is another custom UDF that buckets the timestamp into a period
bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') AS tokens,
         timePeriod(ts) AS time;

bodies_this_period = FILTER bodies BY ((chararray)time == '[some timestamp]');

STORE bodies_this_period INTO '/some/output/path';
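
Since the original question mentions SPLIT, the FILTER above could also be written as a SPLIT that routes both partitions to separate outputs. A sketch, reusing the same relations (the second output path is hypothetical):

```pig
-- Same effect as the FILTER, but keeping the non-matching records too.
SPLIT bodies INTO bodies_this_period IF (chararray)time == '[some timestamp]',
                  bodies_other       IF (chararray)time != '[some timestamp]';

STORE bodies_this_period INTO '/some/output/path';
STORE bodies_other       INTO '/some/other/path';
```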


From bodies_this_period you can access the different tokens using
$0.token0, $0.token1, etc...

I wrote TOKENIZE to return an ordered tuple of the values found, since
Pig's TOKENIZE returns an unordered bag, which isn't that useful in
this case.
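
To illustrate the difference (the tuple schema here is an assumption about the custom UDF, not its actual output):

```pig
-- Built-in TOKENIZE returns a bag, e.g. {(a),(b),(c)}: element order is not
-- guaranteed, so there is no stable notion of "the first token".
-- A custom TOKENIZE returning an ordered tuple, e.g. (a, b, c), keeps
-- positions stable, so positional projection is meaningful:
first_tokens = FOREACH bodies GENERATE tokens.$0 AS first_token, time;
```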

HTH,
Bill

On Mon, Oct 4, 2010 at 2:35 PM, Jerome Boulon <jb...@netflix.com> wrote:
> Hi Matt,
> When I designed this, the schema was NOT available in Pig. I’m not sure if
> this has changed or not.
> So I’m using the constructor as a way to get around the lack of schema
> definition but if you can get it now from the query & the storage handler
> then it should be a pretty easy thing to do.
> So do you know if the sql schema is now available in Pig?
>
> /Jerome.

Re: Chukwa Pig Data Passthrough

Posted by Jerome Boulon <jb...@netflix.com>.
Hi Matt,
When I designed this, the schema was NOT available in Pig. I'm not sure if this has changed or not.
So I'm using the constructor as a way to get around the lack of schema definition, but if you can get it now from the query & the storage handler then it should be a pretty easy thing to do.
So do you know if the sql schema is now available in Pig?

/Jerome.
