You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@giraph.apache.org by "Zhiwei Gu (Created) (JIRA)" <ji...@apache.org> on 2011/10/24 20:58:32 UTC

[jira] [Created] (GIRAPH-61) Worker's early failure will cause the whole system fail

Worker's early failure will cause the whole system fail
-------------------------------------------------------

                 Key: GIRAPH-61
                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
             Project: Giraph
          Issue Type: Bug
          Components: bsp
    Affects Versions: 0.70.0
            Reporter: Zhiwei Gu
            Priority: Critical


When there's early failure happens to a worker, the whole system will fail.

Observed failed worker:
   State: Creating RPC threads failed
   Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.

[Failed worker logs:]

2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:597)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
	at org.apache.hadoop.util.Shell.run(Shell.java:182)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)


[Failed master logs:]

2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
	... 4 more

2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
java.lang.NullPointerException
	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
Caused by: java.lang.NullPointerException
	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150668#comment-13150668 ] 

Avery Ching commented on GIRAPH-61:
-----------------------------------

Currently, the master knows whenever an active worker fails, but doesn't handle the case when a worker fails and there is no checkpoint available.  The master could correctly handle this situation.  One possible way would be similar to the way that worker failure is handled, create a new attempt and restart the input superstep.  Of course, if you a failure early on from not having enough memory, this is likely to only prolong your agony.  On the other hand, if someone unplugged a machine you were using, you'll be able to start over without any operator intervention.
                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134716#comment-13134716 ] 

Avery Ching commented on GIRAPH-61:
-----------------------------------

If you actually have enough resources (memory), then the issue is less likely, but could still potentially happen.  In fact, there is currently no recovery if the failure happens prior to the first checkpoint (other than restart).
                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Juan F. Codagnone (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239439#comment-13239439 ] 

Juan F. Codagnone commented on GIRAPH-61:
-----------------------------------------

I think that I had the same problem.

I had a task running on zombie02.it.zauber.com.ar that was running ok
Last log:
       2012-03-27 08:55:38,208 INFO org.apache.giraph.graph.BspServiceWorker: 
       storeCheckpoint: Finished metadata (_bsp/_checkpoints/job_201203270847
       _0002/0.zombie02.it.zauber.com.ar_3.metadata) and vertices 
       (_bsp/_checkpoints/job_201203270847_0002/0.zombie02.it.zauber.com.ar_3.vertices).
       2012-03-27 08:56:56,985 WARN some app specific log
but suddenly failed 
	java.lang.Throwable: Child Error
		at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
	Caused by: java.io.IOException: Task process exit with nonzero status of 137.
		at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
because it was killed by the OOM killer (dmesg)
   [151346.666614] Out of memory: Kill process 10383 (java) score 883 or sacrifice child)

The problem is that the task was attempted to ran in another host: (tomas.it.zauber.com.ar)
But all others tasks, even the re-attempt are trying to connect back to zombie02.

2012-03-27 09:06:38,719 WARN org.apache.giraph.comm.BasicRPCCommunications: connectAllRPCProxys: Failed on attempt 4 of 5 to connect to (id=23,cur=Worker(hostname=zombie02.it.zauber.com.ar, MRpartition=3, port=30003),prev=null,ckpt_file=null)
java.net.ConnectException: Call to zombie02.it.zauber.com.ar/192.168.160.77:30003 failed on connection exception: java.net.ConnectException: Connection refused

2012-03-27 09:16:48,632 ERROR org.apache.giraph.comm.BasicRPCCommunications: java.net.ConnectException: Call to zombie02.it.zauber.com.ar/192.168.160.77:30003 failed on connection exception: java.net.ConnectException: Connection refused


Zookeeper and the main tasks are running in another hosts.

                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.1.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Zhiwei Gu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134536#comment-13134536 ] 

Zhiwei Gu commented on GIRAPH-61:
---------------------------------

Even with the solution to RPC thread issue, it might crash with early failure. I think we need some strategy to deal with such kind of failures.

                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Owen O'Malley (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150652#comment-13150652 ] 

Owen O'Malley commented on GIRAPH-61:
-------------------------------------

I don't think this is the right direction. Adding a second job is really expensive to get working right. (And worse, it will make us more sensitive to security vs. non-security.) It seems like a much better fix is to be able to use the initial state as a checkpoint.

It would also be good if we could handle worker failure more gracefully, but that is a much bigger project.
                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Arun Suresh (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150646#comment-13150646 ] 

Arun Suresh commented on GIRAPH-61:
-----------------------------------

Since a Giraph Job is a SINGLE Hadoop Job, and since the MasterThread has no other option but to Fail the current job if all workers dont respond (I cannot spawn new workers / restart existing workers) there might not be a way around this in the current scheme.

But consider what if the Giraph Job were composed of two Hadoop Jobs.. Basically, the first Job would start just the Master Map Task which takes care of Job initialization, starting Zookeeper etc. Finally, it will kick off a second Hadoop Job. The Second Hadoop Job would be similar to the current Giraph Job and will spawn only Worker Tasks. The Master Task from the first job stays alive until the algorithm completes.

Pig apparently does something similar actually. It starts a single Map task first which in-turn spawns multiple subtasks. 

Pros:
* The whole Giraph Job need not fail if a worker fails at startup as above. Under the new scheme, the Master will detect that all workers have not responded at the start of the superstep.. and can opt to completely restart JUST the Second job. 
* The GiraphMapper class can be split into a MasterMapper and WorkerMapper. Might possibly make it a bit cleaner (Basically there are a couple of places where an "if (mapFunctions == MapFunctions.MASTER_ONLY)" is required to differentiate the Task from being a Master/Worker. This can cleanly be refactored into two class each with a specific responsibility).
* At some point of time, we will possibly move to YARN where Giraph would be userland code decoupled from MapReduce itself. In which case, the Master would map to the ApplicationMaster. The refactoring would allow Giraph to more easily retrofitted to YARN.
                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134395#comment-13134395 ] 

Jakob Homan commented on GIRAPH-61:
-----------------------------------

Yeah, we need to be much better at failure resolution.  The particular problem you're running into is the RPC thread overhead issue, being addressed in GIRAPH-12 and GIRAPH-37.
                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already recorded and reserved these splits to this worker (identified by InetAddress), thus although hadoop reschedule a mapper for this worker, the master still waiting for the old worker's response, finally, the master will fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:597)
> 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
> 	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
> 	at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
> 	at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher 
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
> 	at org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
> 	at org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
> 	at org.apache.giraph.graph.BspService.process(BspService.java:1100)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
> 	at org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
> 	... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 399 of 400 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: masterThread: Master algorithm failed: 
> java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
> 	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira