Posted to user@pig.apache.org by Alex McLintock <al...@gmail.com> on 2011/02/06 23:37:40 UTC

Repetitive pig scripts...

I'm trying to understand the best way of setting up repeated processing of
continuously generated data - like logs.

I can copy files from the local filesystem to HDFS by hand and kick off
Pig scripts, but ideally I want something automatic - preferably every hour,
or possibly more often. I also want to process a day's or a month's worth of
data rather than just the most recent file.

Is there a best practice way of doing this documented anywhere? I believe
that I should be looking at Flume for transferring files into HDFS and Oozie
for some kind of workflow of pig jobs. Is that right? Any example setups?
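Just to make it concrete, the kind of hourly driver I'd otherwise hack
together from cron looks roughly like the sketch below. All of the paths,
the local log location and the Pig script name are made up - it just copies
the latest files into an hourly HDFS bucket and runs a parameterised Pig
script over it.

#!/usr/bin/env python
# Rough sketch of an hourly cron driver - paths and script names are hypothetical.
import datetime
import glob
import subprocess

now = datetime.datetime.now()
hdfs_dir = now.strftime('/logs/raw/%Y/%m/%d/%H')    # hourly bucket in HDFS
local_logs = glob.glob('/var/log/myapp/*.log')      # whatever the app writes locally

# Copy the new files into the hourly bucket (mkdir may fail harmlessly if it exists).
subprocess.call(['hadoop', 'fs', '-mkdir', hdfs_dir])
for f in local_logs:
    subprocess.check_call(['hadoop', 'fs', '-put', f, hdfs_dir])

# Kick off a parameterised Pig script over that bucket.
subprocess.check_call([
    'pig',
    '-param', 'INPUT=' + hdfs_dir,
    '-param', 'OUTPUT=' + now.strftime('/reports/hourly/%Y/%m/%d/%H'),
    'hourly_report.pig',
])

That works, but it doesn't deal with retries, late-arriving files, or rolling
hours up into days and months, which is why I'm hoping Flume and Oozie are the
better answer.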

Cheers

Alex

Re: Repetitive pig scripts...

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Also take a look at http://wiki.apache.org/pig/TuringCompletePig. You 
can embed Pig in a Python script. This feature is already checked into 
trunk and will be available in 0.9.
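Roughly, an embedded script looks something like the sketch below (untested,
and the load/store paths and field names are just placeholders - the wiki page
has the real details). You run it with "pig script.py" and it goes through
Jython:

#!/usr/bin/python
# Sketch of Pig embedded in Python (Pig 0.9 / trunk); paths and schema are placeholders.
from org.apache.pig.scripting import Pig

P = Pig.compile("""
raw  = LOAD '$input' USING PigStorage('\\t') AS (ts:chararray, url:chararray);
grpd = GROUP raw BY url;
cnts = FOREACH grpd GENERATE group, COUNT(raw);
STORE cnts INTO '$output';
""")

# Bind the parameters for one run (e.g. one day's bucket) and launch it.
result = P.bind({'input': '/logs/2011/02/06',
                 'output': '/reports/2011/02/06'}).runSingle()
if not result.isSuccessful():
    raise RuntimeError('pig job failed')

Because it is plain Python around the Pig Latin, you can loop over dates,
re-bind the parameters, and add whatever retry logic you need.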

Daniel

Alex McLintock wrote:
> I'm trying to understand the best way of setting up repeated processing of
> continuously generated data - like logs.
>
> I can copy files from the local filesystem to HDFS by hand and kick off
> Pig scripts, but ideally I want something automatic - preferably every hour,
> or possibly more often. I also want to process a day's or a month's worth of
> data rather than just the most recent file.
>
> Is there a best practice way of doing this documented anywhere? I believe
> that I should be looking at Flume for transferring files into HDFS and Oozie
> for some kind of workflow of pig jobs. Is that right? Any example setups?
>
> Cheers
>
> Alex
>