Posted to user@giraph.apache.org by Vincentius Martin <vi...@gmail.com> on 2014/10/16 19:37:10 UTC

Failover Mechanism in Giraph?

Hi,

Recently, I tried to learn Giraph by running RandomMessageBenchmark.

Under normal conditions it works fine. However, when I ran it with a slow
node in the cluster, the job never finished. The progress dropped after
the map tasks reached 100%. After that, the log showed errors like this:

INFO mapred.JobClient: Task Id : attempt_201410101016_0003_m_000004_0, Status : FAILED
Task attempt_201410101016_0003_m_000004_0 failed to report status for 600 seconds. Killing!

So I'm curious how the failover mechanism works in Giraph. I believe it
uses checkpointing, but I don't know the details.

Also, I read the source of GiraphJob.java. It states that Giraph doesn't
use speculative execution, so what happens when a node in the cluster is
problematic? Does Hadoop redistribute the task to other workers?

Thanks!

Regards,
Vincentius Martin