Posted to user@giraph.apache.org by Hai Lan <la...@gmail.com> on 2015/05/22 12:25:09 UTC
Graph job self-killed after superstep 0 with large input
Hello,
I’m trying to run a Giraph job with 180,092,160 vertices on an 18-node cluster with 440 GB of memory. I used 144 workers with default partitioning. However, my job is always killed after superstep 0 with the following error:
2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
java.io.FileNotFoundException: File hdfs://bespinrm.umiacs.umd.edu:8020/user/hlan/_bsp/_checkpoints/job_1432262104001_0015 does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:658)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:106)
at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:148)
This job runs fine with customized partitioning using 144 workers, with each worker partitioned into 144/72/180 by vertex id.
A job with a 100,051,200-vertex input also runs fine with default partitioning.
Could anyone help?
Many thanks
Best wishes
Hai
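[Editor's note: the figures in the question allow a quick back-of-envelope check of the per-worker memory budget. This sketch uses only the numbers stated above (180,092,160 vertices, 144 workers, 440 GB across 18 nodes); the edge count is not given, so it only bounds the vertex side, and it optimistically assumes all cluster RAM goes to the workers.]

```python
# Back-of-envelope memory budget for the failing job.
# Numbers come from the post; edge count is unknown, so this is a bound
# on the vertex side only.
total_vertices = 180_092_160
workers = 144
total_mem_gb = 440  # cluster-wide, across 18 nodes

vertices_per_worker = total_vertices // workers
# Upper bound: assumes every byte of cluster RAM is available to workers.
mem_per_worker_gb = total_mem_gb / workers
bytes_per_vertex = mem_per_worker_gb * 2**30 / vertices_per_worker

print(f"vertices per worker: {vertices_per_worker}")
print(f"memory per worker:   {mem_per_worker_gb:.2f} GB")
print(f"budget per vertex:   {bytes_per_vertex:.0f} bytes")
```

With roughly 2.6 KB of heap per vertex before accounting for edges, messages, Netty buffers, and JVM overhead, a heavy edge load or skewed default hash partitioning could plausibly push individual workers into GC trouble, which matches the failure below.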
Re: Graph job self-killed after superstep 0 with large input
Posted by Hai Lan <la...@gmail.com>.
Hi Lukas
Thanks for the quick response. It seems I found the problem.
On workers 2, 6, and 14, the logs show:
2015-05-22 05:20:57,606 ERROR [netty-client-worker-1]
org.apache.giraph.comm.netty.NettyClient: Request failed
java.nio.channels.ClosedChannelException
[the same three-line ERROR entry repeats many times between 05:20:57,606 and 05:20:57,607]
2015-05-22 05:20:57,607 WARN [netty-client-worker-1]
org.apache.giraph.comm.netty.handler.ResponseClientHandler:
exceptionCaught: Channel failed with remote address
bespin03c.umiacs.umd.edu/192.168.74.113:30005
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:871)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
at java.lang.Thread.run(Thread.java:745)
I checked the log on bespin03c.umiacs.umd.edu (192.168.74.113:30005) and it shows:
2015-05-22 05:20:50,028 ERROR [main]
org.apache.giraph.graph.GraphMapper: Caught an unrecoverable exception
waitFor: ExecutionException occurred while waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@7328027c
java.lang.IllegalStateException: waitFor: ExecutionException occurred
while waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@7328027c
at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:193)
at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:151)
at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:136)
at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:99)
at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:233)
at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:756)
at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:335)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:93)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.concurrent.ExecutionException:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.getResult(ProgressableUtils.java:327)
at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:187)
... 14 more
So if I keep the default hash partitioning, can the problem only be
solved by expanding the cluster's memory?
Thanks
Hai
Hai Lan, PhD student
hlan@umd.edu
Department of Geographical Science
University of Maryland, College Park
1104 LeFrak Hall
College Park, MD 20742, USA
On Fri, May 22, 2015 at 6:32 AM, Lukas Nalezenec <
lukas.nalezenec@firma.seznam.cz> wrote:
> On 22.5.2015 12:25, Hai Lan wrote:
>
> Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
>
>
> Hi,
> See in logs what happened on the missing workers.
> Lukas
>
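[Editor's note: before resizing the cluster, Giraph's out-of-core support is worth trying, since it spills partitions and messages to local disk instead of exhausting the heap. The following is a sketch only: option names assume Giraph 1.1, and the computation class, input/output formats, and HDFS paths are placeholders, not the poster's actual job.]

```shell
# Hypothetical rerun of the job with out-of-core graph and message stores
# enabled (Giraph 1.1 option names; class and path names are placeholders).
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
    my.app.MyComputation \
    -vif my.app.MyVertexInputFormat -vip /user/hlan/input \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/hlan/output \
    -w 144 \
    -ca giraph.useOutOfCoreGraph=true \
    -ca giraph.maxPartitionsInMemory=10 \
    -ca giraph.useOutOfCoreMessages=true \
    -ca giraph.maxMessagesInMemory=1000000
```

Raising the per-container heap (e.g. via mapreduce.map.memory.mb and the map task's JVM options) is the other lever short of adding nodes; out-of-core trades memory pressure for disk I/O, so supersteps will run slower.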
Re: Graph job self-killed after superstep 0 with large input
Posted by Lukas Nalezenec <lu...@firma.seznam.cz>.
On 22.5.2015 12:25, Hai Lan wrote:
> Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
Hi,
See in logs what happened on the missing workers.
Lukas