Posted to user@giraph.apache.org by Vincentius Martin <vi...@gmail.com> on 2014/11/21 05:01:28 UTC

Master seems to have no role when an agent is killed by Jobtracker

Hi,

I have a question. I'm using Giraph 1.0.0 with hadoop-0.20.203.0.

I ran into a case where a worker could not respond to the master because of
a slow connection. It happened while the aggregation was being sent.
The master waited for a while, and then the worker was suddenly killed by
the JobTracker. Here is the log:

2014-10-21 10:25:31,708 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201410210948_0001_m_000006_0: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
2014-11-18 10:25:34,723 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201410210948_0001_m_000006_0'

What confuses me more is that I didn't see the master go through the
checkpoint process here. Instead, the superstep just failed and the master
was also killed by the JobTracker:

2014-10-21 10:32:54,151 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 2054.184 seconds ended with state WORKER_FAILURE and is now on superstep 1
2014-10-21 10:32:54,929 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201410210948_0001/_edgeInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1179)
    ... 1 more

Sometimes the JobTracker does manage to reassign the task to another worker,
but in my case that does not always succeed.

My question is: does the master play any role in this case? I didn't see
any recovery (checkpoint) by the master here.

Thanks

Regards,
Vincentius Martin