Posted to user@pig.apache.org by "David A Boyuka, II" <da...@ncsu.edu> on 2011/06/01 00:23:12 UTC

Questions about node failures and data replication

Hello everyone. I’m currently doing some research involving Hadoop and Pig, evaluating the cost of data replication vs. the penalty of node failure with respect to job completion time. I’m currently modifying a MapReduce simulator called “MRPerf” to accommodate a sequence of MapReduce jobs such as those generated by Pig. I have some questions about how failures are handled in a Pig job, so I can reflect that in my implementation. I’m fairly new to Pig, so if some of these questions reflect a misunderstanding, please let me know:

1) How does Pig/Hadoop handle failures which would require re-execution of part or all of some MapReduce job that was already completed earlier in the sequence? The only source of information I could find on this is the paper “Making cloud intermediate data fault-tolerant” by Ko et al. (http://portal.acm.org/citation.cfm?id=1807160), but I’m not sure whether it is accurate.

As an example:

Suppose a sequence of three MapReduce jobs has been generated, J1, J2 and J3, with a replication factor of 1 for the output of J1 and J2 (i.e. one instance of each HDFS block). Suppose jobs J1 and J2 complete, and J3 is almost done. Then a failure occurs on some node, and the Map and Reduce task outputs stored on that node are lost for all three jobs. It seems to me that it would be necessary to re-execute some, but not all, tasks from all three jobs to complete the overall job. How does Pig/Hadoop handle such backtracking in this situation?

(This is based on my understanding that when Pig produces a series of MapReduce jobs, the replication factor for the intermediate data produced by each Reduce phase can be set individually. Correct me if I’m wrong here.)
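
For concreteness, here is a rough driver-level sketch of the kind of setup I have in mind (my own illustration, not Pig-generated code; I'm assuming the per-job dfs.replication setting is what controls how each job's output files are replicated, and the job names and paths are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobsSketch {
        public static void main(String[] args) throws Exception {
            // J1: write its output with a replication factor of 1
            // (assumption: dfs.replication from the job conf applies to the
            // files this job writes to HDFS).
            Configuration conf1 = new Configuration();
            conf1.setInt("dfs.replication", 1);
            Job j1 = new Job(conf1, "J1");
            FileInputFormat.addInputPath(j1, new Path("/input"));       // hypothetical paths
            FileOutputFormat.setOutputPath(j1, new Path("/tmp/j1-out"));
            j1.waitForCompletion(true);

            // J2: reads J1's singly-replicated output, also writes with replication 1.
            Configuration conf2 = new Configuration();
            conf2.setInt("dfs.replication", 1);
            Job j2 = new Job(conf2, "J2");
            FileInputFormat.addInputPath(j2, new Path("/tmp/j1-out"));
            FileOutputFormat.setOutputPath(j2, new Path("/tmp/j2-out"));
            j2.waitForCompletion(true);

            // J3: final job, writes the end result at the cluster's default replication.
            Configuration conf3 = new Configuration();
            Job j3 = new Job(conf3, "J3");
            FileInputFormat.addInputPath(j3, new Path("/tmp/j2-out"));
            FileOutputFormat.setOutputPath(j3, new Path("/output"));
            j3.waitForCompletion(true);
        }
    }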

2) On a related note, is the intermediate data produced by the Map tasks in an individual MapReduce job deleted after that job completes and the next in the sequence starts, or is it preserved until the whole Pig job finishes?

Any information or guidance would be greatly appreciated.

Thanks,
Drew

Re: Questions about node failures and data replication

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Drew,
Pig relies completely on Hadoop's handling of fault tolerance, and
does not do anything by itself (so, for example, if a job completely
fails, Pig won't try to rerun it).
The way Hadoop deals with failure is pretty well documented in the
O'Reilly Hadoop book as well as online -- essentially, mappers and
reducers can be rerun as needed (and thus one needs to be careful
about side effects and non-determinism in mapper and reducer tasks, as
they may run multiple times!).
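
To make the side-effect caveat concrete, here's a rough sketch (my own
illustration, not code Pig generates) of the kind of thing to watch for
in a mapper:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Risky: writing to an external store (a database, a side file,
            // a counter service) right here would happen again if this task
            // attempt is killed after a node failure, or run twice by
            // speculative execution.
            //
            // Safe: only emit through the framework. Output from failed or
            // duplicate attempts is never consumed downstream, so re-running
            // this task is harmless.
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }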

Data replication is somewhat orthogonal to the problem of failure of
a node that's processing data.  If HDFS is reading from server A, and
server A dies, HDFS will just switch to reading from a replica on
server B, and kick off a background copy to replace the missing
replica on A. Tasks won't fail.
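
If you want to play with the replication side of the tradeoff directly,
you can also change the replication factor of files already sitting in
HDFS -- a minimal sketch using the standard FileSystem API (the path is
just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Raise (or lower) the replication factor of an already-written
            // file; the namenode schedules the extra copies (or deletions) in
            // the background -- the same mechanism that re-replicates blocks
            // after a node dies. setReplication works per file, not per
            // directory, so point it at a part file.
            Path intermediate = new Path("/tmp/j1-out/part-r-00000"); // example path
            fs.setReplication(intermediate, (short) 3);
        }
    }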

D
