Posted to user@giraph.apache.org by Hai Lan <la...@gmail.com> on 2015/05/22 12:25:09 UTC

Graph job self-killed after superstep 0 with large input

Hello,

I’m trying to run a Giraph job with 180092160 vertices on an 18-node cluster with 440 GB of memory. I use 144 workers with the default partitioning. However, the job is always killed after superstep 0 with the following error:

2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
java.io.FileNotFoundException: File hdfs://bespinrm.umiacs.umd.edu:8020/user/hlan/_bsp/_checkpoints/job_1432262104001_0015 does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:658)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
	at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:106)
	at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
	at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:148)
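
For context, the job is submitted from a small launcher along the lines of the sketch below. The computation and I/O format classes are stand-ins taken from giraph-examples rather than the ones I actually use, and the checkpoint line (option name as I understand it from the configuration page) is only there because, as the log above shows, the master has no checkpoint to fall back on when workers go missing:

import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.examples.SimpleShortestPathsComputation;
import org.apache.giraph.io.formats.GiraphFileInputFormat;
import org.apache.giraph.io.formats.IdWithValueTextOutputFormat;
import org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat;
import org.apache.giraph.job.GiraphJob;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LargeGraphLauncher {
  public static void main(String[] args) throws Exception {
    GiraphConfiguration conf = new GiraphConfiguration();
    // Stand-in classes from giraph-examples; the real job plugs in its own.
    conf.setComputationClass(SimpleShortestPathsComputation.class);
    conf.setVertexInputFormatClass(JsonLongDoubleFloatDoubleVertexInputFormat.class);
    conf.setVertexOutputFormatClass(IdWithValueTextOutputFormat.class);
    // 144 workers, all of which must check in before a superstep runs.
    conf.setWorkerConfiguration(144, 144, 100.0f);
    // Checkpointing is disabled by default, so a single lost worker leaves
    // the master with "No last good checkpoints" and the job is killed.
    conf.setInt("giraph.checkpointFrequency", 2);

    GiraphJob job = new GiraphJob(conf, "giraph-large-input");
    job.getInternalJob().setJarByClass(LargeGraphLauncher.class);
    GiraphFileInputFormat.addVertexInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(job.getInternalJob(), new Path(args[1]));
    System.exit(job.run(true) ? 0 : -1);
  }
}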

This job works fine with customized partitioning: 144 workers, with each worker partitioned into 144/72/180 partitions by vertex id.

Also, with the default partitioning, a job with a 100051200-vertex input works fine too.
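
(One knob related to what I call customized partitioning above is the explicit partition count. A minimal sketch, assuming giraph.userPartitionCount works the way the configuration docs describe; the helper and its numbers are only illustrative, and the by-vertex-id part of my setup is not shown here:)

import org.apache.giraph.conf.GiraphConfiguration;

public final class PartitioningSketch {
  private PartitioningSketch() { }

  /**
   * Request an explicit partition count so each partition stays small
   * enough to fit in a worker's heap, e.g. 144 workers x 72 partitions each.
   */
  public static void configurePartitionCount(GiraphConfiguration conf,
                                             int workers,
                                             int partitionsPerWorker) {
    conf.setInt("giraph.userPartitionCount", workers * partitionsPerWorker);
  }
}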

Could anyone help?

Many thanks

Best wishes

Hai

Re: Graph job self-killed after superstep 0 with large input

Posted by Hai Lan <la...@gmail.com>.
Hi Lukas,

Thanks for the quick response. It seems I've found the problem.

On workers 2, 6, and 14, the errors show:

org.apache.giraph.comm.netty.NettyClient: Request failed
java.nio.channels.ClosedChannelException
2015-05-22 05:20:57,606 ERROR [netty-client-worker-1] org.apache.giraph.comm.netty.NettyClient: Request failed
java.nio.channels.ClosedChannelException
2015-05-22 05:20:57,606 ERROR [netty-client-worker-1] org.apache.giraph.comm.netty.NettyClient: Request failed
java.nio.channels.ClosedChannelException
[... the same ERROR and ClosedChannelException repeat many more times ...]
2015-05-22 05:20:57,607 WARN [netty-client-worker-1] org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address bespin03c.umiacs.umd.edu/192.168.74.113:30005
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
	at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:871)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
	at java.lang.Thread.run(Thread.java:745)


I checked bespin03c.umiacs.umd.edu/192.168.74.113:30005 and its log shows:


2015-05-22 05:20:50,028 ERROR [main] org.apache.giraph.graph.GraphMapper: Caught an unrecoverable exception waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@7328027c
java.lang.IllegalStateException: waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@7328027c
	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:193)
	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:151)
	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:136)
	at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:99)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:233)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:756)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:335)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:93)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
	at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.getResult(ProgressableUtils.java:327)
	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:187)
	... 14 more



So if I stick with the default hash partitioning, can the problem only be solved by expanding the cluster's memory?
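
Or, before adding hardware, would settings along these lines be a reasonable direction? This is only a sketch based on my reading of the configuration docs; the out-of-core option names and the heap sizes are assumptions on my part, not something I have verified on this cluster:

import org.apache.giraph.conf.GiraphConfiguration;

public final class MemoryTuningSketch {
  private MemoryTuningSketch() { }

  /** Knobs to try before expanding cluster memory (values are illustrative). */
  public static void applyMemorySettings(GiraphConfiguration conf) {
    // Spill graph partitions to local disk instead of holding them all on heap.
    conf.setBoolean("giraph.useOutOfCoreGraph", true);
    conf.setInt("giraph.maxPartitionsInMemory", 10);
    // Spill incoming messages to disk as well.
    conf.setBoolean("giraph.useOutOfCoreMessages", true);
    // Give each map task (one Giraph worker) a larger YARN container and heap.
    conf.setInt("mapreduce.map.memory.mb", 4096);
    conf.set("mapreduce.map.java.opts", "-Xmx3584m");
  }
}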


Thanks


Hai




Hai Lan, PhD student
hlan@umd.edu
Department of Geographical Science
University of Maryland, College Park
1104 LeFrak Hall
College Park, MD 20742, USA

On Fri, May 22, 2015 at 6:32 AM, Lukas Nalezenec <lukas.nalezenec@firma.seznam.cz> wrote:

>  On 22.5.2015 12:25, Hai Lan wrote:
>
> Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
>
>
> Hi,
> Look in the logs to see what happened on the missing workers.
> Lukas
>

Re: Graph job self-killed after superstep 0 with large input

Posted by Lukas Nalezenec <lu...@firma.seznam.cz>.
On 22.5.2015 12:25, Hai Lan wrote:
> Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0

Hi,
Look in the logs to see what happened on the missing workers.
Lukas