Posted to dev@pig.apache.org by Abhishek Agarwal <ab...@gmail.com> on 2015/10/12 13:30:13 UTC

Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
-----------------------------------------------------------

Review request for pig and Rohini Palaniswamy.


Repository: pig-git


Description
-------

Pig scripts can contain multiple ETL jobs in a DAG that may take hours to finish. If a transient error occurs, the job fails, and on rerun all the nodes in the job graph run again, including those that had already completed successfully. These redundant runs waste cluster capacity and delay pipelines.

In case of failure, we can persist the graph state. In the next run, only the failed nodes and their successors will rerun. This is of course subject to preconditions such as:
          The Pig script has not changed
          Input locations have not changed
          Output data from the previous run is intact
          The configuration has not changed
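The resume logic described above can be sketched as a small graph computation; the class and method names below are illustrative, not Pig's actual API. Given the job DAG and the set of nodes that failed in the previous run, the nodes that must rerun are the failed nodes plus all of their transitive successors, since every downstream job consumes upstream output:

```java
import java.util.*;

// Hypothetical sketch (not Pig's actual code): compute the set of DAG nodes
// that must rerun on resume -- failed nodes plus their transitive successors.
public class ResumePlan {
    public static Set<String> nodesToRerun(Map<String, List<String>> successors,
                                           Set<String> failed) {
        Set<String> rerun = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(failed);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (rerun.add(node)) {
                // every downstream job consumes this node's output, so it must rerun too
                for (String succ : successors.getOrDefault(node, Collections.<String>emptyList())) {
                    stack.push(succ);
                }
            }
        }
        return rerun;
    }
}
```

Any node that completed successfully and is not in the returned set can be skipped on the rerun.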


Diffs
-----

  src/org/apache/pig/PigConfiguration.java 03b36a5 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 595e68c 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java 4b62112 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java PRE-CREATION 
  src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 
  src/org/apache/pig/tools/grunt/GruntParser.java 439d087 
  src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1 

Diff: https://reviews.apache.org/r/39226/diff/


Testing
-------


Thanks,

Abhishek Agarwal


Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Abhishek Agarwal <ab...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
-----------------------------------------------------------

(Updated Nov. 27, 2015, 3:40 p.m.)


Review request for pig and Rohini Palaniswamy.


Repository: pig-git


Description
-------

Pig scripts can contain multiple ETL jobs in a DAG that may take hours to finish. If a transient error occurs, the job fails, and on rerun all the nodes in the job graph run again, including those that had already completed successfully. These redundant runs waste cluster capacity and delay pipelines.

In case of failure, we can persist the graph state. In the next run, only the failed nodes and their successors will rerun. This is of course subject to preconditions such as:
         > The Pig script has not changed
         > Input locations have not changed
         > Output data from the previous run is intact
         > The configuration has not changed


Diffs (updated)
-----

  src/org/apache/pig/PigConfiguration.java 54959fe 
  src/org/apache/pig/PigServer.java ee52472 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 595e68c 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java 4b62112 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java PRE-CREATION 
  src/org/apache/pig/impl/PigImplConstants.java 050a243 
  src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 
  src/org/apache/pig/tools/grunt/GruntParser.java 439d087 

Diff: https://reviews.apache.org/r/39226/diff/


Testing
-------


Thanks,

Abhishek Agarwal


Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Abhishek Agarwal <ab...@gmail.com>.

> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java, line 491
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line491>
> >
> >     Even if you skip deleting intermediate files here, it will delete in finally block of Main.java

When job checkpointing is enabled, temporary output is written inside the staging directory. The necessary transformation happens in the mr.transform() call. In the finally block of Main, only the temporary directory is deleted.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java, line 507
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line507>
> >
> >     Just storing the current plan? how about what part of it has succeeded?

Success or failure is inferred from the commit file in the output path of an intermediate job. If a _SUCCESS file is present in the output path, the job is assumed to have succeeded.
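The marker-file check described here can be sketched as follows. This is an illustration only: on a real cluster the check would go through Hadoop's FileSystem API against HDFS, while here java.nio.file stands in for it, and the class name is hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: a job is assumed successful iff the output committer
// wrote a _SUCCESS marker into its output directory.
public class CommitCheck {
    public static boolean jobSucceeded(Path outputDir) {
        return Files.exists(outputDir.resolve("_SUCCESS"));
    }
}
```

This mirrors the convention of Hadoop's FileOutputCommitter, which creates the _SUCCESS marker on successful job commit.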


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java, line 532
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line532>
> >
> >     Creating files that user is not expecting in output directories will be a problem.

That is understandable. Another approach could be to store the completion state of the job along with the output path. We can opt not to rerun the job if the last state was successful and the directory is still present. Should we also note the timestamp of the directory?


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 103
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line103>
> >
> >     This might have to do more checks to skip some settings that usually change between runs but do not affect recovering. For eg: Running through Oozie, for a rerun you will get a different launcher job id in the config.

I probably missed this configuration. Hard-coding a list of settings to skip is not feasible. Users could be given an option to skip some configurations, but that would make things complex for the user. I am now inclined toward an approach similar to Oozie's: for a rerun, the user explicitly specifies a rerun option, and Pig simply uses the new configuration and recovers the job. At least then the behavior is easier to explain.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 119
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line119>
> >
> >     This might not be as simple as removing the operators. You might also have to traverse the plan and remove any predecessors and other corner cases.

Since we are walking in dependency order, the predecessor should already have been removed. If not, the current node will not be recovered.
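The invariant described above can be sketched as a walk in topological order; the names are illustrative, not Pig's actual code. A node is marked recovered (and pruned from the rerun plan) only if all of its predecessors were already recovered and its own output was committed:

```java
import java.util.*;

// Hedged sketch: walking the plan in dependency (topological) order, a node
// can be recovered only if every predecessor was recovered and the node
// itself has committed output from the previous run.
public class RecoveryWalk {
    public static Set<String> recoverable(List<String> topoOrder,
                                          Map<String, List<String>> predecessors,
                                          Set<String> hasCommittedOutput) {
        Set<String> recovered = new HashSet<>();
        for (String node : topoOrder) {
            boolean predsOk = true;
            for (String pred : predecessors.getOrDefault(node, Collections.<String>emptyList())) {
                if (!recovered.contains(pred)) { predsOk = false; break; }
            }
            if (predsOk && hasCommittedOutput.contains(node)) {
                recovered.add(node);
            }
        }
        return recovered;
    }
}
```

Note that a node with committed output still reruns if any ancestor failed, which is exactly the "predecessor already removed" condition in the reply above.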


- Abhishek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------




Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Abhishek Agarwal <ab...@gmail.com>.

> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > Overall the approach looks good and viable. There might be more condition checks to add and corner cases to handle as you go along and test the implementation. I have commented on whatever gaps I could think of at the moment.

Major changes:
- Use logical signature instead of script hash
- Code style changes
- Storing the success/failure state of MRJob nodes

I have replied to the comments not falling into the above categories. Please check.


- Abhishek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------




Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Abhishek Agarwal <ab...@gmail.com>.

> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 313
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line313>
> >
> >     The staging directory should be something that can be used by all scripts and by different runs of the script with different parameters.
> >     
> >     Have a sub-directory (logical plan hash) to store the checkpoint files of a specific script so that you can look under it instead of the whole staging directory. The directory and/or the checkpoint files should be deleted once the script completes successfully, or when the job cannot be recovered because input files/jars or config have changed and the previous state cannot be used.

The checkpoint directory is supposed to be unique across different runs so that a rerun of a failed script does not clash with a fresh run of the same script.
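The naming scheme being discussed can be sketched as follows; the class, method names, and path layout are hypothetical illustrations, not Pig's actual implementation. The checkpoint path embeds the script's logical-plan signature (so a rerun can locate the matching state) plus a per-run id (so a fresh run of the same script never clashes with the recovery of a failed one):

```java
import java.util.UUID;

// Hypothetical sketch of a checkpoint path layout:
//   <staging>/<logical-plan-signature>/<run-id>
public class CheckpointDir {
    public static String forRun(String stagingDir, String planSignature, String runId) {
        return stagingDir + "/" + planSignature + "/" + runId;
    }

    // A fresh id per run keeps concurrent runs of the same script apart.
    public static String newRunId() {
        return UUID.randomUUID().toString();
    }
}
```

Under this layout, cleanup after a successful run (or after recovery is ruled out) only needs to delete the signature sub-directory rather than scan the whole staging area.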


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 107
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line107>
> >
> >     Logical plan signature instead of script hash to take into account param changes, change in imported macro functions, etc

done


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 103
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line103>
> >
> >     This might have to do more checks to skip some settings that usually change between runs but do not affect recovering. For eg: Running through Oozie, for a rerun you will get a different launcher job id in the config.
> 
> Abhishek Agarwal wrote:
>     I probably missed this configuration. Hard-coding a list of settings to skip is not feasible. Users could be given an option to skip some configurations, but that would make things complex for the user. I am now inclined toward an approach similar to Oozie's: for a rerun, the user explicitly specifies a rerun option, and Pig simply uses the new configuration and recovers the job. At least then the behavior is easier to explain.

I have removed the check for configuration. Pig will ignore configuration changes, as they are very hard to track.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java, line 532
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line532>
> >
> >     Creating files that user is not expecting in output directories will be a problem.
> 
> Abhishek Agarwal wrote:
>     That is understandable. Another approach could be to store the completion state of the job along with the output path. We can opt not to rerun the job if the last state was successful and the directory is still present. Should we also note the timestamp of the directory?

On further thought, temporary directories are not visible to the user, so it should be fine to create the _SUCCESS file in them.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 163
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line163>
> >
> >     Use LoadStoreFinder like before. Reducing the number of times plan is traversed will help reduce compile time for large scripts.

Done.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, line 184
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line184>
> >
> >     Just the existence of the temporary store directory does not guarantee that the job ran successfully and has full output in it. You need to check whether the mapreduce job that wrote to the temp store location was also successful. Not sure if that is what you intend to do in isFailedMROper(previousMR). Currently it is not implemented and has no TODO comments.
> >     
> >     Also there might be cases where data is written to side files in UDFs and temp stores are not used. You will have to check the code. Daniel might know off the top of his head.

Can you give me some examples in the code?


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/tools/pigstats/ScriptState.java, line 302
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095357#file1095357line302>
> >
> >     You should go with pig.logical.plan.signature instead of creating a hash of the script. Most scripts are parameterized, and a hash of the script is not useful to differentiate runs with different parameters.

Done


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/impl/io/FileLocalizer.java, line 63
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095355#file1095355line63>
> >
> >     Can you avoid wildcard imports? Also set your package import order to - java, javax, org, com - if that is not the default in your IDE. That is the standard used in Pig and would avoid bringing in unnecessary changes to all classes when you are contributing.

Done


- Abhishek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------




Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Abhishek Agarwal <ab...@gmail.com>.

> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java, line 507
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line507>
> >
> >     Just storing the current plan? how about what part of it has succeeded?
> 
> Abhishek Agarwal wrote:
>     Success or failure is inferred from the commit file in the output path of an intermediate job. If a _SUCCESS file is present in the output path, the job is assumed to have succeeded.

Successful operators are now stored in the state as well.


- Abhishek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------




Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Rohini Palaniswamy <ro...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------


Overall the approach looks good and viable. There might be more condition checks to add and corner cases to handle as you go along and test the implementation. I have commented on whatever gaps I could think of at the moment.


src/org/apache/pig/PigConfiguration.java (line 233)
<https://reviews.apache.org/r/39226/#comment161406>

    Internal configs should go in PigImplConstants.java



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java (line 454)
<https://reviews.apache.org/r/39226/#comment161418>

    Even if you skip deleting intermediate files here, it will delete in finally block of Main.java



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java (line 470)
<https://reviews.apache.org/r/39226/#comment161434>

    Just storing the current plan? how about what part of it has succeeded?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java (line 495)
<https://reviews.apache.org/r/39226/#comment161407>

    Creating files that user is not expecting in output directories will be a problem.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 103)
<https://reviews.apache.org/r/39226/#comment161420>

    This might have to do more checks to skip some settings that usually change between runs but do not affect recovering. For eg: Running through Oozie, for a rerun you will get a different launcher job id in the config.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 107)
<https://reviews.apache.org/r/39226/#comment161422>

    Logical plan signature instead of script hash to take into account param changes, change in imported macro functions, etc



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 112)
<https://reviews.apache.org/r/39226/#comment161424>

    You will also have to check the size and checksum of jars and cache files.
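The size-and-checksum check suggested here could look something like the sketch below; the class and method names are illustrative, not Pig's actual API. Before reusing a checkpoint, each jar or cache file is compared against the size and content hash recorded in the previous run:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Hypothetical sketch: verify that a dependency (jar/cache file) is unchanged
// since the previous run by comparing its size and SHA-256 digest.
public class ArtifactCheck {
    public static String sha256(Path file) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static boolean unchanged(Path file, long expectedSize, String expectedSha256) {
        try {
            // cheap size check first; hash only if the size still matches
            return Files.size(file) == expectedSize && sha256(file).equals(expectedSha256);
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Checking the size first keeps the common "file replaced" case cheap, since hashing large jars is the expensive part.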



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 119)
<https://reviews.apache.org/r/39226/#comment161439>

    This might not be as simple as removing the operators. You might also have to traverse the plan and remove any predecessors and other corner cases.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 163)
<https://reviews.apache.org/r/39226/#comment161427>

    Use LoadStoreFinder like before. Reducing the number of times plan is traversed will help reduce compile time for large scripts.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 184)
<https://reviews.apache.org/r/39226/#comment161436>

    Just the existence of the temporary store directory does not guarantee that the job ran successfully and has full output in it. You need to check whether the mapreduce job that wrote to the temp store location was also successful. Not sure if that is what you intend to do in isFailedMROper(previousMR). Currently it is not implemented and has no TODO comments.
    
    Also there might be cases where data is written to side files in UDFs and temp stores are not used. You will have to check the code. Daniel might know off the top of his head.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java (line 313)
<https://reviews.apache.org/r/39226/#comment161429>

    The staging directory should be something that can be used by all scripts and by different runs of the script with different parameters.
    
    Have a sub-directory (logical plan hash) to store the checkpoint files of a specific script so that you can look under it instead of the whole staging directory. The directory and/or the checkpoint files should be deleted once the script completes successfully, or when the job cannot be recovered because input files/jars or config have changed and the previous state cannot be used.



src/org/apache/pig/impl/io/FileLocalizer.java (line 45)
<https://reviews.apache.org/r/39226/#comment161408>

    Can you avoid wildcard imports? Also set your package import order to - java, javax, org, com - if that is not the default in your IDE. That is the standard used in Pig and would avoid bringing in unnecessary changes to all classes when you are contributing.



src/org/apache/pig/tools/pigstats/ScriptState.java (line 259)
<https://reviews.apache.org/r/39226/#comment161410>

    You should go with pig.logical.plan.signature instead of creating a hash of the script. Most scripts are parameterized, and a hash of the script is not useful to differentiate runs with different parameters.


- Rohini Palaniswamy




Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Posted by Abhishek Agarwal <ab...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/
-----------------------------------------------------------

(Updated Oct. 12, 2015, 11:30 a.m.)


Review request for pig and Rohini Palaniswamy.


Repository: pig-git


Description (updated)
-------

Pig scripts can contain multiple ETL jobs in a DAG that may take hours to finish. If a transient error occurs, the job fails, and on rerun all the nodes in the job graph run again, including those that had already completed successfully. These redundant runs waste cluster capacity and delay pipelines.

In case of failure, we can persist the graph state. In the next run, only the failed nodes and their successors will rerun. This is of course subject to preconditions such as:
         > The Pig script has not changed
         > Input locations have not changed
         > Output data from the previous run is intact
         > The configuration has not changed


Diffs
-----

  src/org/apache/pig/PigConfiguration.java 03b36a5 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 595e68c 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java 4b62112 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java PRE-CREATION 
  src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 
  src/org/apache/pig/tools/grunt/GruntParser.java 439d087 
  src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1 

Diff: https://reviews.apache.org/r/39226/diff/


Testing
-------


Thanks,

Abhishek Agarwal