Posted to user@giraph.apache.org by José Luis Larroque <la...@gmail.com> on 2016/08/15 15:15:16 UTC

Container gets killed on Worker, when doing flush, after completing superstep, and the entire application hangs

Hi all !!

I know that everyone is very busy with Giraph 1.2.0, but I hope that someone
can help me :)

I'm running a **Giraph** application on **EMR**.

I'm using a cluster of 1 master and 10 slaves, all **m3.2xlarge** machines.

The application consists, basically, of a BFS through the Spanish version of
Wikipedia (I adapted the Wikipedia data so that it fits Giraph's input formats).
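
To give an idea of the computation, here is a simplified sketch (not my actual
class; the id/value types and the source page id are just placeholders):

    import java.io.IOException;

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    // Simplified BFS over pages identified by Text ids; the vertex value here is
    // just the BFS depth (my real value type is more complex).
    public class BfsSketch
        extends BasicComputation<Text, LongWritable, NullWritable, LongWritable> {

      private static final String SOURCE_ID = "Portada";  // placeholder source page

      @Override
      public void compute(Vertex<Text, LongWritable, NullWritable> vertex,
          Iterable<LongWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
          vertex.setValue(new LongWritable(Long.MAX_VALUE));  // "not reached yet"
        }
        long minDepth = vertex.getId().toString().equals(SOURCE_ID)
            ? 0 : Long.MAX_VALUE;
        for (LongWritable msg : messages) {
          minDepth = Math.min(minDepth, msg.get());
        }
        if (minDepth < vertex.getValue().get()) {
          vertex.setValue(new LongWritable(minDepth));
          // These messages are what gets flushed at the end of each superstep.
          sendMessageToAllEdges(vertex, new LongWritable(minDepth + 1));
        }
        vertex.voteToHalt();
      }
    }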

I execute the application in the following way:

    /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
ar.edu.info.unlp.tesina.lectura.grafo.algoritmos.masivos.BusquedaDeCaminosNavegacionalesWikiquotesMasivo
/tmp/vertices.txt 4 -@- 1
ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
-vif
ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
-vip /user/hduser/input/grafo-wikipedia.txt -vof
ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
-op /user/hduser/output/caminosNavegacionales -w 10 -yh 11500 -ca
giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4
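
Just to show how those -ca properties map onto a configuration object, the
equivalent would look roughly like this (a sketch, not code I actually run;
GiraphConfiguration extends Hadoop's Configuration, so the generic setters work):

    import org.apache.giraph.conf.GiraphConfiguration;

    public class ConfSketch {
      static GiraphConfiguration buildConf() {
        GiraphConfiguration conf = new GiraphConfiguration();
        // Same properties as the -ca list in the command above.
        conf.setBoolean("giraph.metrics.enable", true);
        conf.setBoolean("giraph.useOutOfCoreMessages", true);
        conf.setBoolean("giraph.isStaticGraph", true);
        conf.setInt("giraph.numInputThreads", 4);
        conf.setInt("giraph.numOutputThreads", 4);
        conf.setWorkerConfiguration(10, 10, 100.0f);  // -w 10
        return conf;
      }
    }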

I can successfully run the application with 3 supersteps, but if I try to do
4 supersteps the application fails: a container gets killed, and the rest
die along with it.

Looking at the Giraph Application Master log, I see this:

    16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening
proxy : ip-172-31-0-147.sa-east-1.compute.internal:9103
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: START_CONTAINER for Container
container_1471231949464_0001_01_000005
    16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening
proxy : ip-172-31-0-145.sa-east-1.compute.internal:9103
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000009
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000011
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000004
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000010
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000006
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000007
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000008
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000005
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000002
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000012
    16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event
EventType: QUERY_CONTAINER for Container
container_1471231949464_0001_01_000003
    16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got response from
RM for container ask, completedCnt=1
    16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got container
status for containerID=container_1471231949464_0001_01_000008,
state=COMPLETE, exitStatus=143, diagnostics=Container
[pid=4455,containerID=container_1471231949464_0001_01_000008] is running
beyond physical memory limits. Current usage: 11.4 GB of 11.3 GB physical
memory used; 12.6 GB of 56.3 GB virtual memory used. Killing container.
    Dump of the process-tree for container_1471231949464_0001_01_000008 :
            |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
            |- 4459 4455 4455 4455 (java) 13568 5567 13419675648 2982187
java -Xmx11500M -Xms11500M -cp
.:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*
org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1
            |- 4455 2706 4455 4455 (bash) 0 0 115875840 807 /bin/bash -c
java -Xmx11500M -Xms11500M -cp
.:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*
org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1
1>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stdout.log
2>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stderr.log


    Container killed on request. Exit code is 143
    Container exited with a non-zero exit code 143

    16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: After completion
of one conatiner. current status is: completedCount :1 containersToLaunch
:11 successfulCount :0 failedCount :1
    16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got response from
RM for container ask, completedCnt=7
    16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container
status for containerID=container_1471231949464_0001_01_000002,
state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
    org.apache.hadoop.util.Shell$ExitCodeException:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
            at org.apache.hadoop.util.Shell.run(Shell.java:418)
            at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
            at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
            at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
            at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)


    Container exited with a non-zero exit code 1

    16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container
status for containerID=container_1471231949464_0001_01_000012,
state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
    org.apache.hadoop.util.Shell$ExitCodeException:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
            at org.apache.hadoop.util.Shell.run(Shell.java:418)
            at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
            at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
            at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
            at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)


    Container exited with a non-zero exit code 1

    16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container
status for containerID=container_1471231949464_0001_01_000006,
state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
    org.apache.hadoop.util.Shell$ExitCodeException:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
            at org.apache.hadoop.util.Shell.run(Shell.java:418)
            at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
            at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
            at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
            at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)


    Container exited with a non-zero exit code 1

    16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container
status for containerID=container_1471231949464_0001_01_000007,
state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
    org.apache.hadoop.util.Shell$ExitCodeException:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
            at org.apache.hadoop.util.Shell.run(Shell.java:418)
            at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
            at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)

So, it appears to be a memory problem on container 8, but here are the
last log lines of container 8 (remember, it's the container that gets
killed):

    16/08/15 03:46:52 INFO graph.ComputeCallable: call: Computation took
23.90834 secs for 10 partitions on superstep 3.  Flushing started
    16/08/15 03:46:52 INFO worker.BspServiceWorker: finishSuperstep:
Waiting on all requests, superstep 3 Memory (free/total/max) = 4516.47M /
10619.50M / 10619.50M
    16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests:
Waiting interval of 15000 msecs, 1307 open requests, waiting for it to be
<= 0, MBytes/sec received = 0.0029, MBytesReceived = 0.0678, ave received
req MBytes = 0, secs waited = 23.332
    MBytes/sec sent = 143.2912, MBytesSent = 3343.4141, ave sent req MBytes
= 0.4999, secs waited = 23.332
    16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: 548
requests for taskId=10, 504 requests for taskId=0, 251 requests for
taskId=5, 1 requests for taskId=4, 1 requests for taskId=7, 1 requests for
taskId=8,

So, if I understand this correctly, the container has 4516.47M available
before doing the **flush**, and while doing it, it consumes all of those
4516.47M, and when it asks for more it gets killed (which the Giraph AM
then reports)?
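
Just to put rough numbers on that (back-of-the-envelope arithmetic only, using
the values from the command line and the container 8 log above, and assuming
the "11.3 GB" in the kill message means GiB):

    public class MemoryMath {
      public static void main(String[] args) {
        // Values copied from the command line and the container 8 log above.
        double heapMb = 11500;                               // -Xmx11500M (-yh 11500)
        double containerLimitMb = 11.3 * 1024;               // "11.3 GB physical memory" limit
        double offHeapHeadroomMb = containerLimitMb - heapMb;   // ~71 MB for everything off-heap

        double openRequests = 1307;                          // open requests reported while flushing
        double avgRequestMb = 0.4999;                        // "ave sent req MBytes"
        double pendingFlushMb = openRequests * avgRequestMb; // ~653 MB of messages still in flight

        System.out.printf("off-heap headroom ~%.0f MB, pending flush payload ~%.0f MB%n",
            offHeapHeadroomMb, pendingFlushMb);
      }
    }

If I'm reading the units right, the heap alone leaves very little room under
the container limit, but maybe I'm misinterpreting these numbers.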

I don't understand why it needs so much memory while doing the flush; it is
basically saving the results to disk for the next superstep, right? So
theoretically it shouldn't need much extra memory at all.

Any help will be greatly appreciated.

Thanks!
Jose