Posted to user@giraph.apache.org by Matthew Cornell <ma...@matthewcornell.org> on 2014/09/22 17:50:12 UTC

understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

Hi Folks,

I've spent the last two months learning, installing, coding, and
analyzing the performance of our Giraph app, and I'm able to run on
small inputs on our tiny cluster (yay!). I am now stuck trying to
figure out why larger inputs fail, why only some compute nodes are
being used, and generally how to make sure I've configured hadoop and
giraph to use all available CPUs and RAM. I feel that I'm "this
close," and I could really use some pointers.

Below I share our app, configuration, results and log messages, some
questions, and counter output for the successful run. My post here is
long (I've broken it into sections delimited with '===='), but I hope
I've provided good enough information to get help on. I'm happy to add
to it.

Thanks!


==== application ====

Our application is a kind of path search where all nodes have a type
and source database ID (e.g., "movie 99"), and searches are expressed
as type paths, such as "movie, acted_in, actor", which would start
with movies and then find all actors in each movie, for all movies in
the database. The program does a kind of filtering by keeping track of
previously-processed initial IDs.
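
For concreteness, here is a minimal sketch of what one step of such a
type-path search can look like against the Giraph 1.0 Vertex API. This is
not our actual RelPathVertex: the Text types, the hard-coded "movie" and
"rates" matching, and the omitted filtering of already-seen start IDs are
all placeholders.

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.Text;

// Hypothetical, simplified stand-in for RelPathVertex. Vertex id, vertex
// value (the node's type, e.g. "movie"), edge value (the edge's type,
// e.g. "rates") and message (the search's starting vertex id) are all Text.
public class TypePathVertex extends Vertex<Text, Text, Text, Text> {
  @Override
  public void compute(Iterable<Text> messages) {
    if (getSuperstep() == 0) {
      // Vertices whose type matches the first path element seed the search
      // by sending their own id along every edge.
      if ("movie".equals(getValue().toString())) {
        for (Edge<Text, Text> edge : getEdges()) {
          sendMessage(edge.getTargetVertexId(), getId());
        }
      }
    } else {
      // Forward each incoming start id along edges whose type matches the
      // next path element (matching against the configured path, and the
      // filtering of previously-processed start ids, are elided here).
      for (Text startId : messages) {
        for (Edge<Text, Text> edge : getEdges()) {
          if ("rates".equals(edge.getValue().toString())) {
            sendMessage(edge.getTargetVertexId(), new Text(startId));
          }
        }
      }
    }
    voteToHalt();
  }
}

This fan-out is also where the message explosion below comes from: every
message received is forwarded again over every matching edge.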

Our database is a small movie one with 2K movies, 6K users (people who
rate movies), and 80K ratings of movies by users. Though small, we've
found this kind of search can result in a massive explosion of
messages, as was well put by Rob Vesse (
http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3CCEC4A409.2D7AD%25rvesse@dotnetrdf.org%3E
):

> even with this relatively small graph you get a massive explosion of messages by the later super steps which exhausts memory (in my graph the theoretical maximum messages by the last super step was ~3 billion)


==== job failure and error messages ====

Currently I have a four-step path that completes in ~20 seconds
("rates, movie, rates, user" - counter output shown at bottom) but a
five-step one ("rates, movie, rates, user, rates") fails after a few
minutes. I've looked carefully at the task logs, but I find it a
little difficult to discern what the actual failure was. However,
looking at system information (e.g., top and ganglia) during the run
indicates hosts are running out of memory. There are no
OutOfMemoryErrors in the logs, and only this one stands out:

> ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4

NB: So far I've been ignoring these other types of messages:

> FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.

> java.io.FileNotFoundException: File _bsp/_checkpoints/job_201409191450_0003 does not exist.

> WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished, type=NodeDeleted, state=SyncConnected)

> ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1 on superstep 4

The counter statistics are minimal after the run fails, but during it
I see something like this when refreshing the Job Tracker Web UI:

> Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> Counters > Giraph Stats > Sent messages: ~181M


==== hadoop/giraph command ====

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=xx.xx.xx.edu:2181 \
-libjars ${LIBJARS} \
relpath.RelPathVertex \
-wc relpath.RelPathWorkerContext \
-mc relpath.RelPathMasterCompute \
-vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
-vip $REL_PATH_INPUT \
-of relpath.RelPathVertexValueTextOutputFormat \
-op $REL_PATH_OUTPUT \
-ca RelPathVertex.path=$REL_PATH_PATH \
-w 8


==== cluster, versions, and configuration ====

We have a five-node cluster with a head node and four compute nodes. The
head has 2 CPUs with 16 cores each and 64 GB RAM. Each compute node has 1
CPU with 4 cores and 16 GB RAM, for a cluster total of 128 GB of RAM and
48 cores.

Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
on the head node, and task trackers on all five nodes.

Hadoop configuration (a mix of mapred-site.xml and the CDH interface -
sorry) - I'm not sure I've listed every setting of interest:

> mapreduce.job.counters.max: 120
> mapred.output.compress: false
> mapred.reduce.tasks: 12
> mapred.child.java.opts: -Xmx2147483648
> mapred.job.reuse.jvm.num.tasks: 1
> MapReduce Child Java Maximum Heap Size: 2 GB
> I/O Sort Memory Buffer (io.sort.mb): 512 MB
> Client Java Heap Size in Bytes: 256 MB
> Java Heap Size of Jobtracker in Bytes: 1 GB
> Java Heap Size of TaskTracker in Bytes: 1 GB

Cluster summary from the Job Tracker Web UI:

> Heap Size is 972.69 MB/989.88 MB
> Map Task Capacity: 48
> Reduce Task Capacity: 24
> Avg. Tasks/Node: 14.40

Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
Release 1.0.0 - 2013-04-15


==== questions ====

o How can I verify that the failure is actually one of memory? I've
looked fairly carefully at the logs.

o I noticed that not all hosts are being used. I did three runs, two
with 8 workers and one with 12, and I pulled the following from the
task logs ('h' = head node, 0-3 = compute nodes):

> run #1: 0, 2, 3, h, h, h, h, h, h
> run #2: 2, 1, 3, h, h, h, h, h, h
> run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1

Note that there's at least one compute node that isn't listed for each run.

o What's a good # of workers to use?

o What Hadoop parameters should I tweak?
> mapred.job.map.memory.mb=xx
> mapred.map.child.java.opts=xx
> mapred.{map|reduce}.child.ulimit
> mapred.task.profile
> # map slots for each TaskTracker
> number of partitions you keep in memory


o What Giraph parameters should I tweak? I'm currently using defaults
for all, but I found these possibilities (a sketch of how a few of them
could be set follows the list):
> giraph.maxPartitionsInMemory
> giraph.useOutOfCoreGraph=true
> giraph.maxPartitionsInMemory=N (default: 10)
> giraph.isStaticGraph=true
> giraph.useOutOfCoreMessages=true (default: disabled)
> giraph.maxMessagesInMemory=N (default: 1000000)
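
For reference, a sketch (class name and values are illustrative only) of
how a few of these could be set on a GiraphConfiguration, assuming Giraph
1.0's org.apache.giraph.conf.GiraphConfiguration; with GiraphRunner the
same keys can also be passed on the command line (as with giraph.zkList
above) or as "-ca key=value" pairs:

import org.apache.giraph.conf.GiraphConfiguration;

public class OutOfCoreSettingsSketch {
  // Returns a configuration with the out-of-core candidates enabled.
  // GiraphConfiguration is a Hadoop Configuration, so the generic
  // setBoolean/setInt setters work for any giraph.* key.
  public static GiraphConfiguration outOfCoreConf() {
    GiraphConfiguration conf = new GiraphConfiguration();
    conf.setBoolean("giraph.useOutOfCoreGraph", true);
    conf.setInt("giraph.maxPartitionsInMemory", 10);     // default per the list above
    conf.setBoolean("giraph.useOutOfCoreMessages", true);
    conf.setInt("giraph.maxMessagesInMemory", 1000000);  // default per the list above
    conf.setBoolean("giraph.isStaticGraph", true);
    return conf;
  }
}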

o How can I get a feel for how much more processing and memory might
be needed to finish the job, beyond that it's on the last superstep?
For example, of the ~181M sent messages I see during the run, how many
more might be left?

o Why is the Heap Size from the Cluster summary above (972.69
MB/989.88 MB) so low?

Thanks again!


==== counters from successful four-step run ====

INFO mapred.JobClient: Job complete: job_201409191450_0001
INFO mapred.JobClient: Counters: 39
INFO mapred.JobClient:   File System Counters
INFO mapred.JobClient:     FILE: Number of bytes read=0
INFO mapred.JobClient:     FILE: Number of bytes written=1694975
INFO mapred.JobClient:     FILE: Number of read operations=0
INFO mapred.JobClient:     FILE: Number of large read operations=0
INFO mapred.JobClient:     FILE: Number of write operations=0
INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
INFO mapred.JobClient:     HDFS: Number of read operations=12
INFO mapred.JobClient:     HDFS: Number of large read operations=0
INFO mapred.JobClient:     HDFS: Number of write operations=9
INFO mapred.JobClient:   Job Counters
INFO mapred.JobClient:     Launched map tasks=9
INFO mapred.JobClient:     Total time spent by all maps in occupied
slots (ms)=206659
INFO mapred.JobClient:     Total time spent by all reduces in occupied
slots (ms)=0
INFO mapred.JobClient:     Total time spent by all maps waiting after
reserving slots (ms)=0
INFO mapred.JobClient:     Total time spent by all reduces waiting
after reserving slots (ms)=0
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map input records=9
INFO mapred.JobClient:     Map output records=0
INFO mapred.JobClient:     Input split bytes=396
INFO mapred.JobClient:     Spilled Records=0
INFO mapred.JobClient:     CPU time spent (ms)=243280
INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
INFO mapred.JobClient:   Giraph Stats
INFO mapred.JobClient:     Aggregate edges=402428
INFO mapred.JobClient:     Aggregate finished vertices=119141
INFO mapred.JobClient:     Aggregate vertices=119141
INFO mapred.JobClient:     Current master task partition=0
INFO mapred.JobClient:     Current workers=8
INFO mapred.JobClient:     Last checkpointed superstep=0
INFO mapred.JobClient:     Sent messages=0
INFO mapred.JobClient:     Superstep=4
INFO mapred.JobClient:   Giraph Timers
INFO mapred.JobClient:     Input superstep (milliseconds)=1689
INFO mapred.JobClient:     Setup (milliseconds)=3977
INFO mapred.JobClient:     Shutdown (milliseconds)=1177
INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
INFO mapred.JobClient:     Total (milliseconds)=20322

==== EOF ====


-- 
Matthew Cornell | matt@matthewcornell.org | 413-626-3621 | 34
Dickinson Street, Amherst MA 01002 | matthewcornell.org

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

Posted by Matthew Cornell <ma...@matthewcornell.org>.
On Mon, Sep 22, 2014 at 2:10 PM, Matthew Saltz <sa...@gmail.com> wrote:
> In the logs for the workers, do you have a line that looks like:
> 2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker:
> finishSuperstep: Waiting on all requests, superstep 93 Memory
> (free/total/max) = 21951.08M / 36456.50M / 43691.00M
>
> Looking at the memory usage in the worker that fails at the end of
> superstep before failure could give you a clue.

Yes, all four workers when I use "-w 4" have those lines:

Task Logs: 'attempt_201409191450_0016_m_000001_0': compute-0-1:
2014-09-25 09:28:13,425 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 242.41M / 438.06M / 1820.50M
2014-09-25 09:28:13,817 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 194.77M / 438.06M / 1820.50M
2014-09-25 09:28:14,936 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 383.74M / 600.38M / 1820.50M
2014-09-25 09:28:17,820 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 362.14M / 1007.50M / 1820.50M
2014-09-25 09:28:31,680 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 203.33M / 1661.50M / 1820.50M

Task Logs: 'attempt_201409191450_0016_m_000002_0': compute-0-1:
2014-09-25 09:28:13,458 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 887.74M / 964.50M / 1820.50M
2014-09-25 09:28:14,381 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 830.14M / 964.50M / 1820.50M
2014-09-25 09:28:15,337 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 785.66M / 1217.00M / 1820.50M
2014-09-25 09:28:18,114 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 661.72M / 1113.50M / 1820.50M
2014-09-25 09:28:52,451 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 285.90M / 1831.00M / 1831.00M

Task Logs: 'attempt_201409191450_0016_m_000003_0': wright:
2014-09-25 09:28:13,456 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 886.23M / 964.50M / 1820.50M
2014-09-25 09:28:14,399 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 826.36M / 964.50M / 1820.50M
2014-09-25 09:28:15,556 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 662.50M / 1217.00M / 1820.50M
2014-09-25 09:28:18,170 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 581.14M / 1115.00M / 1820.50M
2014-09-25 09:29:31,673 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 299.61M / 1834.00M / 1834.00M

Task Logs: 'attempt_201409191450_0016_m_000004_0': wright:
2014-09-25 09:28:13,473 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 887.10M / 964.50M / 1820.50M
2014-09-25 09:28:14,374 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 826.65M / 964.50M / 1820.50M
2014-09-25 09:28:15,755 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 980.33M / 1217.00M / 1820.50M
2014-09-25 09:28:18,254 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 517.13M / 1128.50M / 1820.50M
2014-09-25 09:29:34,392 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 271.52M / 1858.50M / 1858.50M


I'm still not clear on a couple of things:

   1. Each compute node has 16GB of memory, but each task has a max of
   ~1820M (<2GB). In Cloudera's web UI, I set "MapReduce Child Java Maximum
   Heap Size" to 2GB (default is 1GB). I will try upping it to 8GB.
   2. I still don't understand why only two of my five possible nodes are
   being used.

Thank you.



-- 
Matthew Cornell | matt@matthewcornell.org | 413-626-3621 | 34 Dickinson
Street, Amherst MA 01002 | matthewcornell.org

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

Posted by Matthew Cornell <ma...@matthewcornell.org>.
I cannot thank you enough, Matthew. You've given me a lot to
experiment with. -- matt

On Mon, Sep 22, 2014 at 2:10 PM, Matthew Saltz <sa...@gmail.com> wrote:
> Hi Matthew,
>
> I answered a few of your questions in-line (unfortunately they might not
> help the larger problem, but hopefully it'll help a bit).
>
> Best,
> Matthew
>
>
> On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <ma...@matthewcornell.org>
> wrote:
>>
>> Hi Folks,
>>
>> I've spent the last two months learning, installing, coding, and
>> analyzing the performance of our Giraph app, and I'm able to run on
>> small inputs on our tiny cluster (yay!) I am now stuck trying to
>> figure out why larger inputs fail, why only some compute nodes are
>> being used, and generally how to make sure I've configured hadoop and
>> giraph to use all available CPUs and RAM. I feel that I'm "this
>> close," and I could really use some pointers.
>>
>> Below I share our app, configuration, results and log messages, some
>> questions, and counter output for the successful run. My post here is
>> long (I've broken it into sections delimited with '===='), but I hope
>> I've provided good enough information to get help on. I'm happy to add
>> to it.
>>
>> Thanks!
>>
>>
>> ==== application ====
>>
>> Our application is a kind of path search where all nodes have a type
>> and source database ID (e.g., "movie 99"), and searches are expressed
>> as type paths, such as "movie, acted_in, actor", which would start
>> with movies and then find all actors in each movie, for all movies in
>> the database. The program does a kind of filtering by keeping track of
>> previously-processed initial IDs.
>>
>> Our database is a small movie one with 2K movies, 6K users (people who
>> rate movies), and 80K ratings of movies by users. Though small, we've
>> found this kind of search can result in a massive explosion of
>> messages, as was well put by Rob Vesse (
>>
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3CCEC4A409.2D7AD%25rvesse@dotnetrdf.org%3E
>> ):
>>
>> > even with this relatively small graph you get a massive explosion of
>> > messages by the later super steps which exhausts memory (in my graph the
>> > theoretical maximum messages by the last super step was ~3 billion)
>>
>>
>> ==== job failure and error messages ====
>>
>> Currently I have a four-step path that completes in ~20 seconds
>> ("rates, movie, rates, user" - counter output shown at bottom) but a
>> five-step one ("rates, movie, rates, user, rates") fails after a few
>> minutes. I've looked carefully at the task logs, but I find it a
>> little difficult to discern what the actual failure was. However,
>> looking at system information (e.g., top and ganglia) during the run
>> indicates hosts are running out of memory. There are no
>> OutOfMemoryErrors in the logs, and only this one stands out:
>>
>> > ERROR org.apache.giraph.master.BspServiceMaster:
>> > superstepChosenWorkerAlive: Missing chosen worker
>> > Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>>
>> NB: So far I've been ignoring these other types of messages:
>>
>> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
>> > No last good checkpoints can be found, killing the job.
>>
>> > java.io.FileNotFoundException: File
>> > _bsp/_checkpoints/job_201409191450_0003 does not exist.
>>
>> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
>> > event
>> > (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>> > type=NodeDeleted, state=SyncConnected)
>>
>> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
>> > failure, unregistering health on
>> > /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
>> > on superstep 4
>>
>> The counter statistics are minimal after the run fails, but during it
>> I see something like this when refreshing the Job Tracker Web UI:
>>
>> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot:
>> > ~28GB
>> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
>> > Counters > Giraph Stats > Sent messages: ~181M
>>
>>
>> ==== hadoop/giraph command ====
>>
>> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
>> -Dgiraph.zkList=xx.xx.xx.edu:2181 \
>> -libjars ${LIBJARS} \
>> relpath.RelPathVertex \
>> -wc relpath.RelPathWorkerContext \
>> -mc relpath.RelPathMasterCompute \
>> -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
>> -vip $REL_PATH_INPUT \
>> -of relpath.RelPathVertexValueTextOutputFormat \
>> -op $REL_PATH_OUTPUT \
>> -ca RelPathVertex.path=$REL_PATH_PATH \
>> -w 8
>>
>>
>> ==== cluster, versions, and configuration ====
>>
>> We have a five-node cluster with a head and four compute nodes. The
>> head has 2 CPUs, 16 cores each, and 64 GB RAM. Each compute has 1 CPU,
>> 4 cores each, and 16 GB RAM, making a total cluster of 128 GB of RAM
>> and 48 cores.
>>
>> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
>> on the head node, and task trackers on all five nodes.
>>
>> Hadoop configuration (mapred-site.xml and CDH interface - sorry for
>> the mix) - not sure I'm listing all of them of interest:
>>
>> > mapreduce.job.counters.max: 120
>> > mapred.output.compress: false
>> > mapred.reduce.tasks: 12
>> > mapred.child.java.opts: -Xmx2147483648
>> > mapred.job.reuse.jvm.num.tasks: 1
>> > MapReduce Child Java Maximum Heap Size: 2 GB
>> > I/O Sort Memory Buffer (io.sort.mb): 512 MB
>> > Client Java Heap Size in Bytes: 256 MB
>> > Java Heap Size of Jobtracker in Bytes 1 GB
>> > Java Heap Size of TaskTracker in Bytes: 1 GB
>>
>> Cluster summary from the Job Tracker Web UI:
>>
>> > Heap Size is 972.69 MB/989.88 MB
>> > Map Task Capacity: 48
>> > Reduce Task Capacity: 24
>> > Avg. Tasks/Node: 14.40
>>
>> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
>> Release 1.0.0 - 2013-04-15
>>
>>
>> ==== questions ====
>>
>> o How can I verify that the failure is actually one of memory? I've
>> looked fairly carefully at the logs.
>
>
> In the logs for the workers, do you have a line that looks like:
> 2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker:
> finishSuperstep: Waiting on all requests, superstep 93 Memory
> (free/total/max) = 21951.08M / 36456.50M / 43691.00M
>
> Looking at the memory usage in the worker that fails at the end of superstep
> before failure could give you a clue. If you don't have that line then you
> can directly use org.apache.utils.MemoryUtils.getRuntimeMemoryStats() in the
> preSuperstep() method of your computation class (or anywhere that will only
> run once on the worker) to print out that info.
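
A minimal sketch of that suggestion, using the WorkerContext the job
already supplies with -wc (the class below is a made-up example, not the
actual RelPathWorkerContext, and note the package is
org.apache.giraph.utils, per the correction later in this thread):

import org.apache.giraph.utils.MemoryUtils;
import org.apache.giraph.worker.WorkerContext;
import org.apache.log4j.Logger;

// Hypothetical example: log worker heap usage at the start of every
// superstep, mirroring BspServiceWorker's "Memory (free/total/max)" line.
public class MemoryLoggingWorkerContext extends WorkerContext {
  private static final Logger LOG =
      Logger.getLogger(MemoryLoggingWorkerContext.class);

  @Override
  public void preSuperstep() {
    LOG.info("preSuperstep: superstep " + getSuperstep() + " "
        + MemoryUtils.getRuntimeMemoryStats());
  }

  @Override
  public void preApplication()
      throws InstantiationException, IllegalAccessException {
    // Nothing to set up for this sketch.
  }

  @Override
  public void postApplication() {
  }

  @Override
  public void postSuperstep() {
  }
}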
>>
>>
>> o I noticed that not all hosts are being used. I did three runs, two
>> with 8 workers and one with 12, and I pulled the following from the
>> task logs ('h' = head node, 0-3 = compute nodes):
>>
>> > run #1: 0, 2, 3, h, h, h, h, h, h
>> > run #2: 2, 1, 3, h, h, h, h, h, h
>> > run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>>
>> Note that there's at least one compute node that isn't listed for each
>> run.
>>
>> o What's a good # of workers to use?
>
>
> You want # of workers to be equal to the # of worker machines. So with 5
> machines you'd want -w 4.  See this thread. Quote: "Basically, better usage
> of resources: one single JVM, no duplication of
>
> core data structures, less netty threads and communication points, more
> locality (less messages over the network), less actors accessing zookeeper
> etc." Also from that thread, you should set
> mapred.tasktracker.map.tasks.maximum=1, since each worker is a map task and
> you don't want to run multiple workers on the same machine.
>
>
>
>>
>> o What Hadoop parameters should I tweak?
>> > mapred.job.map.memory.mb=xx
>> > mapred.map.child.java.opts=xx
>> > mapred.{map|reduce}.child.ulimit
>> > mapred.task.profile
>> > # map slots for each TaskTracker
>> > number of partitions you keep in memory
>>
>>
>> o What Giraph parameters should I tweak? I'm currently using defaults
>> for all, but I found these possibilities:
>> > giraph.maxPartitionsInMemory
>> > giraph.useOutOfCoreGraph=true
>> > giraph.maxPartitionsInMemory=N (default: 10)
>> > giraph.isStaticGraph=true
>> > giraph.useOutOfCoreMessages=true (default: disabled)
>> > giraph.maxMessagesInMemory=N (default: 1000000)
>>
> giraph.numComputeThreads, giraph.numInputThreads, and
> giraph.numOutputThreads should be set to the number of threads you have
> available, or potentially n - 1 or something since as Claudio mentions in
> that email chain I linked to, Giraph is also doing some additional work in
> the background. By default these are set to 1, so you should change them to
> increase the use of parallelism.
>
> Depending on what you're doing with your edges (adding and removing a lot of
> edges or leaving them alone) you can check out giraph.outEdgesClass and set
> it to one of the subclasses here. For example, if you're adding and removing
> a lot of edges, you could try one of the HashMap/Set backed classes, though
> these take a bit more memory (I'm basing that on the class descriptions).
>
> giraph.metrics.enable=true prints metrics output in the stderr of each
> worker/the master like time spent in communication vs computation, message
> bytes sent, etc.
>
> giraph.oneToAllMessageSending=true : If you're doing sendMessageToAllEdges
> this (supposedly) activates some sort of optimization for this. I read that
> somewhere on the listserv and I feel like I looked at it in the code but take
> this advice with a grain of salt.
>
> giraph.userPartitionCount : Chooses the number of partitions. Default is #
> workers squared. In your case if you do 4 workers then that gives you 16
> partitions, 4 per worker, which means that if you do numComputeThreads=4
> you'll have one partition to process per thread. You might want to consider
> using a higher number of partitions (maybe 3 per thread, or higher) since
> partitions are processed in parallel in vertex computation. That way, when
> partitions are finer-grained, if you have one partition eating a bunch of
> time the others can keep getting processed by the remaining threads.
>
>>
>> o How can I get a feel for how much more processing and memory might
>> be needed to finish the job, beyond that it's on the last superstep?
>> For example, of the ~181M sent messages I see during the run, how many
>> more might be left?
>>
>> o Why is the Heap Size from the Cluster summary above (972.69
>> MB/989.88 MB) so low?
>>
>> Thanks again!
>>
>>
>> ==== counters from successful four-step run ====
>>
>> INFO mapred.JobClient: Job complete: job_201409191450_0001
>> INFO mapred.JobClient: Counters: 39
>> INFO mapred.JobClient:   File System Counters
>> INFO mapred.JobClient:     FILE: Number of bytes read=0
>> INFO mapred.JobClient:     FILE: Number of bytes written=1694975
>> INFO mapred.JobClient:     FILE: Number of read operations=0
>> INFO mapred.JobClient:     FILE: Number of large read operations=0
>> INFO mapred.JobClient:     FILE: Number of write operations=0
>> INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
>> INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
>> INFO mapred.JobClient:     HDFS: Number of read operations=12
>> INFO mapred.JobClient:     HDFS: Number of large read operations=0
>> INFO mapred.JobClient:     HDFS: Number of write operations=9
>> INFO mapred.JobClient:   Job Counters
>> INFO mapred.JobClient:     Launched map tasks=9
>> INFO mapred.JobClient:     Total time spent by all maps in occupied
>> slots (ms)=206659
>> INFO mapred.JobClient:     Total time spent by all reduces in occupied
>> slots (ms)=0
>> INFO mapred.JobClient:     Total time spent by all maps waiting after
>> reserving slots (ms)=0
>> INFO mapred.JobClient:     Total time spent by all reduces waiting
>> after reserving slots (ms)=0
>> INFO mapred.JobClient:   Map-Reduce Framework
>> INFO mapred.JobClient:     Map input records=9
>> INFO mapred.JobClient:     Map output records=0
>> INFO mapred.JobClient:     Input split bytes=396
>> INFO mapred.JobClient:     Spilled Records=0
>> INFO mapred.JobClient:     CPU time spent (ms)=243280
>> INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
>> INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
>> INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
>> INFO mapred.JobClient:   Giraph Stats
>> INFO mapred.JobClient:     Aggregate edges=402428
>> INFO mapred.JobClient:     Aggregate finished vertices=119141
>> INFO mapred.JobClient:     Aggregate vertices=119141
>> INFO mapred.JobClient:     Current master task partition=0
>> INFO mapred.JobClient:     Current workers=8
>> INFO mapred.JobClient:     Last checkpointed superstep=0
>> INFO mapred.JobClient:     Sent messages=0
>> INFO mapred.JobClient:     Superstep=4
>> INFO mapred.JobClient:   Giraph Timers
>> INFO mapred.JobClient:     Input superstep (milliseconds)=1689
>> INFO mapred.JobClient:     Setup (milliseconds)=3977
>> INFO mapred.JobClient:     Shutdown (milliseconds)=1177
>> INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
>> INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
>> INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
>> INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
>> INFO mapred.JobClient:     Total (milliseconds)=20322
>>
>> ==== EOF ====
>>
>>
>> --
>> Matthew Cornell | matt@matthewcornell.org | 413-626-3621 | 34
>> Dickinson Street, Amherst MA 01002 | matthewcornell.org
>
>



-- 
Matthew Cornell | matt@matthewcornell.org | 413-626-3621 | 34
Dickinson Street, Amherst MA 01002 | matthewcornell.org

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

Posted by Matthew Saltz <sa...@gmail.com>.
Sorry, should be
"org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats()" -
I left out the giraph.

On Mon, Sep 22, 2014 at 8:10 PM, Matthew Saltz <sa...@gmail.com> wrote:

> Hi Matthew,
>
> I answered a few of your questions in-line (unfortunately they might not
> help the larger problem, but hopefully it'll help a bit).
>
> Best,
> Matthew
>
>
> On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <ma...@matthewcornell.org>
> wrote:
>
>> Hi Folks,
>>
>> I've spent the last two months learning, installing, coding, and
>> analyzing the performance of our Giraph app, and I'm able to run on
>> small inputs on our tiny cluster (yay!) I am now stuck trying to
>> figure out why larger inputs fail, why only some compute nodes are
>> being used, and generally how to make sure I've configured hadoop and
>> giraph to use all available CPUs and RAM. I feel that I'm "this
>> close," and I could really use some pointers.
>>
>> Below I share our app, configuration, results and log messages, some
>> questions, and counter output for the successful run. My post here is
>> long (I've broken it into sections delimited with '===='), but I hope
>> I've provided good enough information to get help on. I'm happy to add
>> to it.
>>
>> Thanks!
>>
>>
>> ==== application ====
>>
>> Our application is a kind of path search where all nodes have a type
>> and source database ID (e.g., "movie 99"), and searches are expressed
>> as type paths, such as "movie, acted_in, actor", which would start
>> with movies and then find all actors in each movie, for all movies in
>> the database. The program does a kind of filtering by keeping track of
>> previously-processed initial IDs.
>>
>> Our database is a small movie one with 2K movies, 6K users (people who
>> rate movies), and 80K ratings of movies by users. Though small, we've
>> found this kind of search can result in a massive explosion of
>> messages, as was well put by Rob Vesse (
>>
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3CCEC4A409.2D7AD%25rvesse@dotnetrdf.org%3E
>> ):
>>
>> > even with this relatively small graph you get a massive explosion of
>> messages by the later super steps which exhausts memory (in my graph the
>> theoretical maximum messages by the last super step was ~3 billion)
>>
>>
>> ==== job failure and error messages ====
>>
>> Currently I have a four-step path that completes in ~20 seconds
>> ("rates, movie, rates, user" - counter output shown at bottom) but a
>> five-step one ("rates, movie, rates, user, rates") fails after a few
>> minutes. I've looked carefully at the task logs, but I find it a
>> little difficult to discern what the actual failure was. However,
>> looking at system information (e.g., top and ganglia) during the run
>> indicates hosts are running out of memory. There are no
>> OutOfMemoryErrors in the logs, and only this one stands out:
>>
>> > ERROR org.apache.giraph.master.BspServiceMaster:
>> superstepChosenWorkerAlive: Missing chosen worker
>> Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>>
>> NB: So far I've been ignoring these other types of messages:
>>
>> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
>> No last good checkpoints can be found, killing the job.
>>
>> > java.io.FileNotFoundException: File
>> _bsp/_checkpoints/job_201409191450_0003 does not exist.
>>
>> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>> type=NodeDeleted, state=SyncConnected)
>>
>> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
>> failure, unregistering health on
>> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
>> on superstep 4
>>
>> The counter statistics are minimal after the run fails, but during it
>> I see something like this when refreshing the Job Tracker Web UI:
>>
>> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot:
>> ~28GB
>> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
>> > Counters > Giraph Stats > Sent messages: ~181M
>>
>>
>> ==== hadoop/giraph command ====
>>
>> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
>> -Dgiraph.zkList=xx.xx.xx.edu:2181 \
>> -libjars ${LIBJARS} \
>> relpath.RelPathVertex \
>> -wc relpath.RelPathWorkerContext \
>> -mc relpath.RelPathMasterCompute \
>> -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
>> -vip $REL_PATH_INPUT \
>> -of relpath.RelPathVertexValueTextOutputFormat \
>> -op $REL_PATH_OUTPUT \
>> -ca RelPathVertex.path=$REL_PATH_PATH \
>> -w 8
>>
>>
>> ==== cluster, versions, and configuration ====
>>
>> We have a five-node cluster with a head and four compute nodes. The
>> head has 2 CPUs, 16 cores each, and 64 GB RAM. Each compute has 1 CPU,
>> 4 cores each, and 16 GB RAM, making a total cluster of 128 GB of RAM
>> and 48 cores.
>>
>> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
>> on the head node, and task trackers on all five nodes.
>>
>> Hadoop configuration (mapred-site.xml and CDH interface - sorry for
>> the mix) - not sure I'm listing all of them of interest:
>>
>> > mapreduce.job.counters.max: 120
>> > mapred.output.compress: false
>> > mapred.reduce.tasks: 12
>> > mapred.child.java.opts: -Xmx2147483648
>> > mapred.job.reuse.jvm.num.tasks: 1
>> > MapReduce Child Java Maximum Heap Size: 2 GB
>> > I/O Sort Memory Buffer (io.sort.mb): 512 MB
>> > Client Java Heap Size in Bytes: 256 MB
>> > Java Heap Size of Jobtracker in Bytes 1 GB
>> > Java Heap Size of TaskTracker in Bytes: 1 GB
>>
>> Cluster summary from the Job Tracker Web UI:
>>
>> > Heap Size is 972.69 MB/989.88 MB
>> > Map Task Capacity: 48
>> > Reduce Task Capacity: 24
>> > Avg. Tasks/Node: 14.40
>>
>> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
>> Release 1.0.0 - 2013-04-15
>>
>>
>> ==== questions ====
>>
>> o How can I verify that the failure is actually one of memory? I've
>> looked fairly carefully at the logs.
>>
>
> *In the logs for the workers, do you have a line that looks like:*
>
> *2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker:
> finishSuperstep: Waiting on all requests, superstep 93 Memory
> (free/total/max) = 21951.08M / 36456.50M / 43691.00M*
>
> *Looking at the memory usage in the worker that fails at the end of
> superstep before failure could give you a clue. If you don't have that line
> then you can directly use
> org.apache.utils.MemoryUtils.getRuntimeMemoryStats() in the preSuperstep()
> method of your computation class (or anywhere that will only run once on
> the worker) to print out that info.*
>
>>
>> o I noticed that not all hosts are being used. I did three runs, two
>> with 8 workers and one with 12, and I pulled the following from the
>> task logs ('h' = head node, 0-3 = compute nodes):
>>
>> > run #1: 0, 2, 3, h, h, h, h, h, h
>> > run #2: 2, 1, 3, h, h, h, h, h, h
>> > run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>>
>> Note that there's at least one compute node that isn't listed for each
>> run.
>>
>> o What's a good # of workers to use?
>>
>
> *You want # of workers to be equal to the # of worker machines. So with 5
> machines you'd want -w 4.  See this thread
> <https://www.mail-archive.com/user@giraph.apache.org/msg01570.html>. Quote:
> "*Basically, better usage of resources: one single JVM, no duplication of
>
> core data structures, less netty threads and communication points, more
> locality (less messages over the network), less actors accessing zookeeper
> etc." *Also from that thread, you should set mapred.tasktracker.map.tasks.maximum=1, since each worker is a map task and you don't want to run multiple workers on the same machine. *
>
>
>
>
>> o What Hadoop parameters should I tweak?
>> > mapred.job.map.memory.mb=xx
>> > mapred.map.child.java.opts=xx
>> > mapred.{map|reduce}.child.ulimit
>> > mapred.task.profile
>> > # map slots for each TaskTracker
>> > number of partitions you keep in memory
>>
>>
>> o What Giraph parameters should I tweak? I'm currently using defaults
>> for all, but I found these possibilities:
>> > giraph.maxPartitionsInMemory
>> > giraph.useOutOfCoreGraph=true
>> > giraph.maxPartitionsInMemory=N (default: 10)
>> > giraph.isStaticGraph=true
>> > giraph.useOutOfCoreMessages=true (default: disabled)
>> > giraph.maxMessagesInMemory=N (default: 1000000)
>>
> *giraph.numComputeThreads, giraph.numInputThreads, and
> giraph.numOutputThreads should be set to the number of threads you have
> available, or potentially n - 1 or something since as Claudio mentions in
> that email chain I linked to, Giraph is also doing some additional work in
> the background. By default these are set to 1, so you should change them to
> increase the use of parallelism.  *
>
> *Depending on what you're doing with your edges (adding and removing a lot
> of edges or leaving them alone) you can check out giraph.outEdgesClass and
> set it to one of the subclasses here
> <https://giraph.apache.org/apidocs/org/apache/giraph/edge/OutEdges.html>.
> For example, if you're adding and removing a lot of edges, you could try
> one of the HashMap/Set backed classes, though these take a bit more memory
> (I'm basing that on the class descriptions). *
>
> *giraph.metrics.enable=true prints metrics output in the stderr of each
> worker/the master like time spent in communication vs computation, message
> bytes sent, etc. *
>
> *giraph.oneToAllMessageSending=true : If you're doing
> sendMessageToAllEdges this (supposedly) activates some sort of optimization
> for this. I read that somewhere on the listserv and I feel like I looked at
> it in the code but take this advice with a grain of salt. *
>
> *giraph.userPartitionCount : Chooses the number of partitions. Default is #
> workers squared. In your case if you do 4 workers then that gives you 16
> partitions, 4 per worker, which means that if you do numComputeThreads=4
> you'll have one partition to process per thread. You might want to consider
> using a higher number of partitions (maybe 3 per thread, or higher) since
> partitions are processed in parallel in vertex computation. That way, when
> partitions are finer-grained, if you have one partition eating a bunch of
> time the others can keep getting processed by the remaining threads. *
>
>
>> o How can I get a feel for how much more processing and memory might
>> be needed to finish the job, beyond that it's on the last superstep?
>> For example, of the ~181M sent messages I see during the run, how many
>> more might be left?
>>
>> o Why is the Heap Size from the Cluster summary above (972.69
>> MB/989.88 MB) so low?
>>
>> Thanks again!
>>
>>
>> ==== counters from successful four-step run ====
>>
>> INFO mapred.JobClient: Job complete: job_201409191450_0001
>> INFO mapred.JobClient: Counters: 39
>> INFO mapred.JobClient:   File System Counters
>> INFO mapred.JobClient:     FILE: Number of bytes read=0
>> INFO mapred.JobClient:     FILE: Number of bytes written=1694975
>> INFO mapred.JobClient:     FILE: Number of read operations=0
>> INFO mapred.JobClient:     FILE: Number of large read operations=0
>> INFO mapred.JobClient:     FILE: Number of write operations=0
>> INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
>> INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
>> INFO mapred.JobClient:     HDFS: Number of read operations=12
>> INFO mapred.JobClient:     HDFS: Number of large read operations=0
>> INFO mapred.JobClient:     HDFS: Number of write operations=9
>> INFO mapred.JobClient:   Job Counters
>> INFO mapred.JobClient:     Launched map tasks=9
>> INFO mapred.JobClient:     Total time spent by all maps in occupied
>> slots (ms)=206659
>> INFO mapred.JobClient:     Total time spent by all reduces in occupied
>> slots (ms)=0
>> INFO mapred.JobClient:     Total time spent by all maps waiting after
>> reserving slots (ms)=0
>> INFO mapred.JobClient:     Total time spent by all reduces waiting
>> after reserving slots (ms)=0
>> INFO mapred.JobClient:   Map-Reduce Framework
>> INFO mapred.JobClient:     Map input records=9
>> INFO mapred.JobClient:     Map output records=0
>> INFO mapred.JobClient:     Input split bytes=396
>> INFO mapred.JobClient:     Spilled Records=0
>> INFO mapred.JobClient:     CPU time spent (ms)=243280
>> INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
>> INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
>> INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
>> INFO mapred.JobClient:   Giraph Stats
>> INFO mapred.JobClient:     Aggregate edges=402428
>> INFO mapred.JobClient:     Aggregate finished vertices=119141
>> INFO mapred.JobClient:     Aggregate vertices=119141
>> INFO mapred.JobClient:     Current master task partition=0
>> INFO mapred.JobClient:     Current workers=8
>> INFO mapred.JobClient:     Last checkpointed superstep=0
>> INFO mapred.JobClient:     Sent messages=0
>> INFO mapred.JobClient:     Superstep=4
>> INFO mapred.JobClient:   Giraph Timers
>> INFO mapred.JobClient:     Input superstep (milliseconds)=1689
>> INFO mapred.JobClient:     Setup (milliseconds)=3977
>> INFO mapred.JobClient:     Shutdown (milliseconds)=1177
>> INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
>> INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
>> INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
>> INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
>> INFO mapred.JobClient:     Total (milliseconds)=20322
>>
>> ==== EOF ====
>>
>>
>> --
>> Matthew Cornell | matt@matthewcornell.org | 413-626-3621 | 34
>> Dickinson Street, Amherst MA 01002 | matthewcornell.org
>>
>
>

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

Posted by Matthew Saltz <sa...@gmail.com>.
Hi Matthew,

I answered a few of your questions in-line (unfortunately they might not
help the larger problem, but hopefully it'll help a bit).

Best,
Matthew


On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <ma...@matthewcornell.org>
wrote:

> Hi Folks,
>
> I've spent the last two months learning, installing, coding, and
> analyzing the performance of our Giraph app, and I'm able to run on
> small inputs on our tiny cluster (yay!) I am now stuck trying to
> figure out why larger inputs fail, why only some compute nodes are
> being used, and generally how to make sure I've configured hadoop and
> giraph to use all available CPUs and RAM. I feel that I'm "this
> close," and I could really use some pointers.
>
> Below I share our app, configuration, results and log messages, some
> questions, and counter output for the successful run. My post here is
> long (I've broken it into sections delimited with '===='), but I hope
> I've provided good enough information to get help on. I'm happy to add
> to it.
>
> Thanks!
>
>
> ==== application ====
>
> Our application is a kind of path search where all nodes have a type
> and source database ID (e.g., "movie 99"), and searches are expressed
> as type paths, such as "movie, acted_in, actor", which would start
> with movies and then find all actors in each movie, for all movies in
> the database. The program does a kind of filtering by keeping track of
> previously-processed initial IDs.
>
> Our database is a small movie one with 2K movies, 6K users (people who
> rate movies), and 80K ratings of movies by users. Though small, we've
> found this kind of search can result in a massive explosion of
> messages, as was well put by Rob Vesse (
>
> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3CCEC4A409.2D7AD%25rvesse@dotnetrdf.org%3E
> ):
>
> > even with this relatively small graph you get a massive explosion of
> messages by the later super steps which exhausts memory (in my graph the
> theoretical maximum messages by the last super step was ~3 billion)
>
>
> ==== job failure and error messages ====
>
> Currently I have a four-step path that completes in ~20 seconds
> ("rates, movie, rates, user" - counter output shown at bottom) but a
> five-step one ("rates, movie, rates, user, rates") fails after a few
> minutes. I've looked carefully at the task logs, but I find it a
> little difficult to discern what the actual failure was. However,
> looking at system information (e.g., top and ganglia) during the run
> indicates hosts are running out of memory. There are no
> OutOfMemoryErrors in the logs, and only this one stands out:
>
> > ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker
> Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>
> NB: So far I've been ignoring these other types of messages:
>
> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
> No last good checkpoints can be found, killing the job.
>
> > java.io.FileNotFoundException: File
> _bsp/_checkpoints/job_201409191450_0003 does not exist.
>
> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
> event
> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
> type=NodeDeleted, state=SyncConnected)
>
> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
> on superstep 4
>
> The counter statistics are minimal after the run fails, but during it
> I see something like this when refreshing the Job Tracker Web UI:
>
> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> > Counters > Giraph Stats > Sent messages: ~181M
>
>
> ==== hadoop/giraph command ====
>
> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
> -Dgiraph.zkList=xx.xx.xx.edu:2181 \
> -libjars ${LIBJARS} \
> relpath.RelPathVertex \
> -wc relpath.RelPathWorkerContext \
> -mc relpath.RelPathMasterCompute \
> -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
> -vip $REL_PATH_INPUT \
> -of relpath.RelPathVertexValueTextOutputFormat \
> -op $REL_PATH_OUTPUT \
> -ca RelPathVertex.path=$REL_PATH_PATH \
> -w 8
>
>
> ==== cluster, versions, and configuration ====
>
> We have a five-node cluster with a head and four compute nodes. The
> head has 2 CPUs, 16 cores each, and 64 GB RAM. Each compute has 1 CPU,
> 4 cores each, and 16 GB RAM, making a total cluster of 128 GB of RAM
> and 48 cores.
>
> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
> on the head node, and task trackers on all five nodes.
>
> Hadoop configuration (mapred-site.xml and CDH interface - sorry for
> the mix) - not sure I'm listing all of them of interest:
>
> > mapreduce.job.counters.max: 120
> > mapred.output.compress: false
> > mapred.reduce.tasks: 12
> > mapred.child.java.opts: -Xmx2147483648
> > mapred.job.reuse.jvm.num.tasks: 1
> > MapReduce Child Java Maximum Heap Size: 2 GB
> > I/O Sort Memory Buffer (io.sort.mb): 512 MB
> > Client Java Heap Size in Bytes: 256 MB
> > Java Heap Size of Jobtracker in Bytes 1 GB
> > Java Heap Size of TaskTracker in Bytes: 1 GB
>
> Cluster summary from the Job Tracker Web UI:
>
> > Heap Size is 972.69 MB/989.88 MB
> > Map Task Capacity: 48
> > Reduce Task Capacity: 24
> > Avg. Tasks/Node: 14.40
>
> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
> Release 1.0.0 - 2013-04-15
>
>
> ==== questions ====
>
> o How can I verify that the failure is actually one of memory? I've
> looked fairly carefully at the logs.
>

*In the logs for the workers, do you have a line that looks like:*

*2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 93 Memory
(free/total/max) = 21951.08M / 36456.50M / 43691.00M*

*Looking at the memory usage in the worker that fails at the end of
superstep before failure could give you a clue. If you don't have that line
then you can directly use
org.apache.utils.MemoryUtils.getRuntimeMemoryStats() in the preSuperstep()
method of your computation class (or anywhere that will only run once on
the worker) to print out that info.*

>
> o I noticed that not all hosts are being used. I did three runs, two
> with 8 workers and one with 12, and I pulled the following from the
> task logs ('h' = head node, 0-3 = compute nodes):
>
> > run #1: 0, 2, 3, h, h, h, h, h, h
> > run #2: 2, 1, 3, h, h, h, h, h, h
> > run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>
> Note that there's at least one compute node that isn't listed for each run.
>
> o What's a good # of workers to use?
>

*You want # of workers to be equal to the # of worker machines. So with 5
machines you'd want -w 4.  See this thread
<https://www.mail-archive.com/user@giraph.apache.org/msg01570.html>. Quote:
"*Basically, better usage of resources: one single JVM, no duplication of

core data structures, less netty threads and communication points, more
locality (less messages over the network), less actors accessing zookeeper
etc." *Also from that thread, you should set
mapred.tasktracker.map.tasks.maximum=1, since each worker is a map
task and you don't want to run multiple workers on the same machine. *




> o What Hadoop parameters should I tweak?
> > mapred.job.map.memory.mb=xx
> > mapred.map.child.java.opts=xx
> > mapred.{map|reduce}.child.ulimit
> > mapred.task.profile
> > # map slots for each TaskTracker
> > number of partitions you keep in memory
>
>
> o What Giraph parameters should I tweak? I'm currently using defaults
> for all, but I found these possibilities:
> > giraph.maxPartitionsInMemory
> > giraph.useOutOfCoreGraph=true
> > giraph.maxPartitionsInMemory=N (default: 10)
> > giraph.isStaticGraph=true
> > giraph.useOutOfCoreMessages=true (default: disabled)
> > giraph.maxMessagesInMemory=N (default: 1000000)
>
*giraph.numComputeThreads, giraph.numInputThreads, and
giraph.numOutputThreads should be set to the number of threads you have
available, or potentially n - 1 or something since as Claudio mentions in
that email chain I linked to, Giraph is also doing some additional work in
the background. By default these are set to 1, so you should change them to
increase the use of parallelism.  *

*Depending on what you're doing with your edges (adding and removing a lot
of edges or leaving them alone) you can check out giraph.outEdgesClass and
set it to one of the subclasses here
<https://giraph.apache.org/apidocs/org/apache/giraph/edge/OutEdges.html>.
For example, if you're adding and removing a lot of edges, you could try
one of the HashMap/Set backed classes, though these take a bit more memory
(I'm basing that on the class descriptions). *

*giraph.metrics.enable=true prints metrics output in the stderr of each
worker/the master like time spent in communication vs computation, message
bytes sent, etc. *

*giraph.oneToAllMessageSending=true : If you're doing sendMessageToAllEdges
this (supposedly) activates some sort of optimization for this. I read that
somewhere on the listserv and I feel like I looked at it in the code but take
this advice with a grain of salt. *

*giraph.userPartitionCount : Chooses the number of partitions. Default is #
workers squared. In your case if you do 4 workers then that gives you 16
partitions, 4 per worker, which means that if you do numComputeThreads=4
you'll have one partition to process per thread. You might want to consider
using a higher number of partitions (maybe 3 per thread, or higher) since
partitions are processed in parallel in vertex computation. That way, when
partitions are finer-grained, if you have one partition eating a bunch of
time the others can keep getting processed by the remaining threads. *


> o How can I get a feel for how much more processing and memory might
> be needed to finish the job, beyond that it's on the last superstep?
> For example, of the ~181M sent messages I see during the run, how many
> more might be left?
>
> o Why is the Heap Size from the Cluster summary above (972.69
> MB/989.88 MB) so low?
>
> Thanks again!
>
>
> ==== counters from successful four-step run ====
>
> INFO mapred.JobClient: Job complete: job_201409191450_0001
> INFO mapred.JobClient: Counters: 39
> INFO mapred.JobClient:   File System Counters
> INFO mapred.JobClient:     FILE: Number of bytes read=0
> INFO mapred.JobClient:     FILE: Number of bytes written=1694975
> INFO mapred.JobClient:     FILE: Number of read operations=0
> INFO mapred.JobClient:     FILE: Number of large read operations=0
> INFO mapred.JobClient:     FILE: Number of write operations=0
> INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
> INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
> INFO mapred.JobClient:     HDFS: Number of read operations=12
> INFO mapred.JobClient:     HDFS: Number of large read operations=0
> INFO mapred.JobClient:     HDFS: Number of write operations=9
> INFO mapred.JobClient:   Job Counters
> INFO mapred.JobClient:     Launched map tasks=9
> INFO mapred.JobClient:     Total time spent by all maps in occupied
> slots (ms)=206659
> INFO mapred.JobClient:     Total time spent by all reduces in occupied
> slots (ms)=0
> INFO mapred.JobClient:     Total time spent by all maps waiting after
> reserving slots (ms)=0
> INFO mapred.JobClient:     Total time spent by all reduces waiting
> after reserving slots (ms)=0
> INFO mapred.JobClient:   Map-Reduce Framework
> INFO mapred.JobClient:     Map input records=9
> INFO mapred.JobClient:     Map output records=0
> INFO mapred.JobClient:     Input split bytes=396
> INFO mapred.JobClient:     Spilled Records=0
> INFO mapred.JobClient:     CPU time spent (ms)=243280
> INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
> INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
> INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
> INFO mapred.JobClient:   Giraph Stats
> INFO mapred.JobClient:     Aggregate edges=402428
> INFO mapred.JobClient:     Aggregate finished vertices=119141
> INFO mapred.JobClient:     Aggregate vertices=119141
> INFO mapred.JobClient:     Current master task partition=0
> INFO mapred.JobClient:     Current workers=8
> INFO mapred.JobClient:     Last checkpointed superstep=0
> INFO mapred.JobClient:     Sent messages=0
> INFO mapred.JobClient:     Superstep=4
> INFO mapred.JobClient:   Giraph Timers
> INFO mapred.JobClient:     Input superstep (milliseconds)=1689
> INFO mapred.JobClient:     Setup (milliseconds)=3977
> INFO mapred.JobClient:     Shutdown (milliseconds)=1177
> INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
> INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
> INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
> INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
> INFO mapred.JobClient:     Total (milliseconds)=20322
>
> ==== EOF ====
>
>
> --
> Matthew Cornell | matt@matthewcornell.org | 413-626-3621 | 34
> Dickinson Street, Amherst MA 01002 | matthewcornell.org
>