Posted to user@giraph.apache.org by Denis Dudinski <de...@gmail.com> on 2018/11/28 12:27:21 UTC

Automatic restart from checkpoint after worker failure on YARN

Hi!

I am running Giraph on YARN with checkpointing enabled. However, when a worker fails, the master node outputs:

18/11/28 12:52:31 INFO master.MasterThread: masterThread: Coordination of superstep 3 took 0.094 seconds ended with state WORKER_FAILURE and is now on superstep 3
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: {"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2} on superstep 2
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: {"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2}
18/11/28 12:52:31 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ONLY checkWorkers: Only found 0 responses of 2 needed to start superstep 2
After a while the job fails because the timeout expires and no workers are present.

Is it possible to resume automatically from a checkpoint without falling back from YARN to the MR driver?

Best Regards,
Denis Dudinski