Posted to dev@giraph.apache.org by Arghya Kusum Das <ar...@gmail.com> on 2014/11/16 06:53:51 UTC

Giraph job is failing on 128 node cluster. Seems only one worker failure is causing the entire job failure

Hi,

My Giraph job works fine on a smaller number of nodes.
But when I try to run it on a 128-node cluster, I get the following
error.
It seems that the failure of a single worker is causing the entire job to fail.
I have included the error messages from the master and failed worker logs below.
Any help is appreciated.


[MASTER LOG]
2014-11-15 23:01:45,305 INFO org.apache.giraph.worker.BspServiceWorker: finishSuperstep: (waiting for rest of workers) ALL_EXCEPT_ZOOKEEPER - Attempt=0, Superstep=59
2014-11-15 23:01:46,169 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = unable to create new native thread, exiting...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:691)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:943)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1336)
at java.lang.UNIXProcess.initStreams(UNIXProcess.java:172)
at java.lang.UNIXProcess$2.run(UNIXProcess.java:145)
at java.lang.UNIXProcess$2.run(UNIXProcess.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:143)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
at java.lang.Runtime.exec(Runtime.java:615)
at java.lang.Runtime.exec(Runtime.java:448)
at java.lang.Runtime.exec(Runtime.java:345)
at pga.MasterVertex.compute(MasterVertex.java:242)
at org.apache.giraph.master.BspServiceMaster.doMasterCompute(BspServiceMaster.java:1691)
at org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1627)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)

[FAILED WORKER LOG]
2014-11-15 23:11:46,281 WARN org.apache.giraph.comm.netty.NettyServer: start: Likely failed to bind on attempt 0 to port 30007
org.jboss.netty.channel.ChannelException: Failed to bind to: qb114/208.100.93.114:30007
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
at org.apache.giraph.comm.netty.NettyServer.start(NettyServer.java:326)
at org.apache.giraph.comm.netty.NettyMasterServer.<init>(NettyMasterServer.java:49)
at org.apache.giraph.master.BspServiceMaster.becomeMaster(BspServiceMaster.java:877)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:98)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:344)
at sun.nio.ch.Net.bind(Net.java:336)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.bind(NioServerSocketPipelineSink.java:138)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleServerSocket(NioServerSocketPipelineSink.java:90)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:64)
at org.jboss.netty.channel.Channels.bind(Channels.java:569)
at org.jboss.netty.channel.AbstractChannel.bind(AbstractChannel.java:187)
at org.jboss.netty.bootstrap.ServerBootstrap$Binder.channelOpen(ServerBootstrap.java:343)
at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:80)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:158)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:86)
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:277)
... 4 more
2014-11-15 23:11:46,305 INFO org.apache.giraph.comm.netty.NettyServer: start: Started server communication server: qb114/208.100.93.114:31007 with up to 16 threads on bind attempt 1 with sendBufferSize = 32768 receiveBufferSize = 524288 backlog = 874
2014-11-15 23:11:46,325 INFO org.apache.giraph.comm.netty.NettyClient: NettyClient: Using execution handler with 8 threads after requestEncoder.
2014-11-15 23:11:46,325 INFO org.apache.giraph.master.BspServiceMaster: becomeMaster: I am now the master!
2014-11-15 23:11:46,326 INFO org.apache.giraph.master.BspServiceMaster: /_hadoopBsp/job_201411152123_0003/_vertexInputSplitDir already exists, no need to create
2014-11-15 23:11:46,326 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with NullPointerException
java.lang.NullPointerException
at java.lang.String.<init>(String.java:505)
at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:600)
at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:696)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)
2014-11-15 23:11:46,327 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.NullPointerException, exiting...
java.lang.IllegalStateException: java.lang.NullPointerException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
Caused by: java.lang.NullPointerException
at java.lang.String.<init>(String.java:505)
at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:600)
at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:696)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)


-- 
Thanks and regards,
Arghya Kusum Das
(225-362-4031)