Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2012/06/16 04:36:17 UTC

Resume failed pig script

In production I use short Pig scripts and schedule them with Azkaban
with dependencies set up, so that I can use Azkaban to restart long
data pipelines at the point of failure. I edit the failing Pig script,
usually towards the end of the data pipeline, and restart the Azkaban
job. This saves hours and hours of repeated processing.

I wish Pig could do this itself: resume at its point of failure when
re-run from the command line. Is this feasible?
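
As a rough illustration of the pattern (all paths, aliases, and the
schema here are made up), the boundary between two Azkaban jobs is just
a store in one short script and a load in the next:

-- stage_1.pig: the expensive upstream stage
events  = LOAD '/data/events' USING PigStorage('\t')
          AS (user_id:chararray, url:chararray, ts:long);
by_user = GROUP events BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(events) AS n;
STORE counts INTO '/tmp/pipeline/stage_1' USING PigStorage('\t');

-- stage_2.pig: the downstream stage Azkaban can rerun on its own
counts  = LOAD '/tmp/pipeline/stage_1' USING PigStorage('\t')
          AS (user_id:chararray, n:long);
top     = ORDER counts BY n DESC;
STORE top INTO '/data/reports/top_users' USING PigStorage('\t');

If stage_2 fails, I fix stage_2.pig and have Azkaban rerun only that
job; the stage_1 output is already sitting in HDFS.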

Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

Re: Resume failed pig script

Posted by Russell Jurney <ru...@gmail.com>.
What's Nectar?

I'd like this feature because Pig is easier to read than Oozie XML or
Azkaban YAML/JSON, where one must manually specify dependencies.
Lipstick is a good example of using Pig this way?

Russell Jurney
twitter.com/rjurney
russell.jurney@gmail.com
datasyndrome.com

On Jun 16, 2012, at 8:27 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Could save some metadata like CRCs of all the jars... And maybe a hash of the subplan associated with each stored intermediate output... But really we should just do Nectar since it solves all this and more :)

Re: Resume failed pig script

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Could save some metadata like CRCs of all the jars... And maybe a hash of the subplan associated with each stored intermediate output... But really we should just do Nectar since it solves all this and more :)

On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> Well, you can do this physically by adding load/store boundaries to your
> code. Thinking out loud, such a thing could be possible...
> 
> At any M/R boundary, you store the intermediate in HDFS, and Pig is aware
> of this and doesn't automatically delete it (that part alone is not
> trivial -- what manages the garbage collection? perhaps that could be
> part of the configuration of such a feature). Then, when you rerun a job,
> it checks whether the intermediates it would have saved (it knows these at
> compile time) already exist, and reuses them instead of recomputing.
> 
> There are some tricky caveats here... what if your code changes affect
> intermediate data? You could save the logical plan as well, but what if you
> make a change to a UDF? I am not sure whether the benefit of automating
> this in the language, compared to developing a workflow like yours external
> to Pig, is worth the complexity.
> 
> But it is intriguing, and is a subset of data caching that we have thought
> a lot about here.

Re: Resume failed pig script

Posted by Jonathan Coveney <jc...@gmail.com>.
Well, you can do this physically by adding load/store boundaries to your
code. Thinking out loud, such a thing could be possible...

At any M/R boundary, you store the intermediate in HDFS, and Pig is aware
of this and doesn't automatically delete it (that part alone is not
trivial -- what manages the garbage collection? perhaps that could be
part of the configuration of such a feature). Then, when you rerun a job,
it checks whether the intermediates it would have saved (it knows these at
compile time) already exist, and reuses them instead of recomputing.

There are some tricky caveats here... what if your code changes affect
intermediate data? You could save the logical plan as well, but what if you
make a change to a UDF? I am not sure whether the benefit of automating
this in the language, compared to developing a workflow like yours external
to Pig, is worth the complexity.

But it is intriguing, and is a subset of data caching that we have thought
a lot about here.
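
Concretely, the manual version looks something like this (paths,
aliases, and the threshold are made up for illustration):

-- first run: persist the intermediate at the M/R boundary
raw     = LOAD '/data/logs' USING PigStorage('\t')
          AS (user_id:chararray, bytes:long);
grouped = GROUP raw BY user_id;
totals  = FOREACH grouped GENERATE group AS user_id, SUM(raw.bytes) AS total_bytes;
STORE totals INTO '/tmp/checkpoints/totals' USING PigStorage('\t');

-- on a rerun, comment out the lines above and resume from the checkpoint:
-- totals = LOAD '/tmp/checkpoints/totals' USING PigStorage('\t')
--          AS (user_id:chararray, total_bytes:long);

heavy   = FILTER totals BY total_bytes > 1000000L;
STORE heavy INTO '/data/reports/heavy_users' USING PigStorage('\t');

The feature being discussed would amount to Pig doing that store/load
swap for you, keyed on which intermediate outputs already exist.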

2012/6/15 Russell Jurney <ru...@gmail.com>

> In production I use short Pig scripts and schedule them with Azkaban
> with dependencies set up, so that I can use Azkaban to restart long
> data pipelines at the point of failure. I edit the failing Pig script,
> usually towards the end of the data pipeline, and restart the Azkaban
> job. This saves hours and hours of repeated processing.
>
> I wish Pig could do this itself: resume at its point of failure when
> re-run from the command line. Is this feasible?
>
> Russell Jurney
> twitter.com/rjurney
> russell.jurney@gmail.com
> datasyndrome.com
>