Posted to user@giraph.apache.org by André Kelpe <ef...@googlemail.com> on 2012/11/28 16:44:35 UTC

Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Hi Bence,

on older versions of Hadoop there is a hard limit on the number of
counters, which a job cannot modify. Giraph creates new counters every
superstep (that is what the MasterThread in your stack trace is doing
when it dies on superstep 99), so a long-running computation will
eventually hit any fixed limit, no matter what you set
mapreduce.job.counters.limit to. Since the counters are not crucial to
the functioning of Giraph, you can turn them off by setting
giraph.useSuperstepCounters to false in your job config.
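
For example, with the command from your mail, that would look something
like this (a sketch; the -D property should be picked up by the job
configuration the same way as your sourceId option):

./giraph -Dgiraph.useSuperstepCounters=false \
    -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar \
    org.apache.giraph.examples.SimpleShortestPathsVertex \
    -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat \
    -ip /user/hduser/insight \
    -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat \
    -op /user/hduser/insight-out -w 3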

I would also recommend looking into the GiraphConfiguration class, as
it contains all the settings you might be interested in (checkpoint
frequency, etc.):
https://github.com/apache/giraph/blob/trunk/giraph/src/main/java/org/apache/giraph/GiraphConfiguration.java
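
If you launch the job programmatically instead, here is a minimal
sketch of turning the counters off from Java (this assumes you build
the job's Hadoop Configuration yourself; the property name is the one
above):

    import org.apache.hadoop.conf.Configuration;

    // Disable Giraph's per-superstep counters so a long-running job
    // never exceeds the Hadoop counter limit.
    Configuration conf = new Configuration();
    conf.setBoolean("giraph.useSuperstepCounters", false);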

HTH

-Andre

2012/11/28 Magyar, Bence (US SSA) <be...@baesystems.com>:
> I have successfully run the shortest path example using Avery’s sample input
> data.  I am now attempting to run the shortest-path algorithm on a much
> larger data set (300,000 nodes) and I am running into errors.  I have a
> 4-node cluster and am running the following command:
>
>
> ./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar
> org.apache.giraph.examples.SimpleShortestPathsVertex -if
> org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip
> /user/hduser/insight -of
> org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op
> /user/hduser/insight-out -w 3
>
>
> It appears as though the shortest path computation “finishes”.  That is to
> say, I hit “100%”.  Then the job just hangs for about 30 seconds, decreases
> its progress to 75%, and then finally throws an exception:
>
>
>
> No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
>
> 12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
>
> 12/11/28 08:26:17 INFO mapred.JobClient:  map 0% reduce 0%
>
> 12/11/28 08:26:33 INFO mapred.JobClient:  map 25% reduce 0%
>
> 12/11/28 08:26:40 INFO mapred.JobClient:  map 50% reduce 0%
>
> 12/11/28 08:26:42 INFO mapred.JobClient:  map 75% reduce 0%
>
> 12/11/28 08:26:44 INFO mapred.JobClient:  map 100% reduce 0%
>
> 12/11/28 08:27:45 INFO mapred.JobClient:  map 75% reduce 0%
>
> 12/11/28 08:27:50 INFO mapred.JobClient: Task Id :
> attempt_201211271542_0004_m_000000_0, Status : FAILED
>
> java.lang.Throwable: Child Error
>
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
>
>
> Digging into the log files a little deeper, I noticed that the last node in
> my cluster generated more log directories than the previous three.
>
>
>
> I see:
>
>
>
> - attempt_201211280843_0001_m_000000_0 ->
>   /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0
>
> - attempt_201211280843_0001_m_000000_0.cleanup ->
>   /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0.cleanup
>
> - attempt_201211280843_0001_m_000005_0 ->
>   /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000005_0
>
> - job-acls.xml
>
>
>
> The first 3 nodes, by contrast, only contain one log folder underneath the
> job, something like “attempt_201211280843_0001_m_000003_0”.  I am assuming
> this is because something went wrong on node 4 and some “cleanup logic” was
> attempted.
>
>
>
> At any rate, when I cd into the first log folder on the bad node
> (attempt_201211280843_0001_m_000000_0) and look into “syslog”, I see the
> following error:
>
>
> 2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster:
> barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
>
> 2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster:
> collectAndProcessAggregatorValues: Processed aggregators
>
> 2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster:
> aggregateWorkerStats: Aggregation found
> (vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false)
> on superstep = 98
>
> 2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster:
> coordinateSuperstep: Cleaning up old Superstep
> /_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
>
> 2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread:
> masterThread: Coordination of superstep 98 took 0.445 seconds ended with
> state THIS_SUPERSTEP_DONE and is now on superstep 99
>
> 2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper:
> uncaughtException: OverrideExceptionHandler on thread
> org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number
> of counters - Counters=120 Limit=120, exiting...
>
> org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded
> limits on number of counters - Counters=120 Limit=120
>
>         at
> org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
>
>         at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
>
>         at
> org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
>
>         at
> org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
>
>         at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
>
>         at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
>
> 2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager:
> onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper
> process.
>
>
> What exactly is this limit on MapReduce job “counters”?  What is a MapReduce
> job “counter”?  I assume it is some variable threshold to keep things in
> check, and I know that I can modify the value in mapred-site.xml:
>
>
>
> <property>
>
>   <name>mapreduce.job.counters.limit</name>
>
>   <value>120</value>
>
>   <description>I have no idea what this does!!!</description>
>
> </property>
>
>
>
> I have tried increasing and decreasing this value, and my subsequent jobs
> pick up the change.  However, neither increasing nor decreasing it seems to
> make any difference: I always reach whatever limit I’ve set and my job
> crashes.  Besides, from outward appearances it looks like the computation
> finished before the crash.  Can anyone please give deeper insight into what
> is happening here, or tell me where I can look for more help?
>
>
>
> Thanks,
>
>
>
> Bence
>
>

RE: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Posted by "Magyar, Bence (US SSA)" <be...@baesystems.com>.
Thank you, Andre.

Setting "giraph.useSuperstepCounters" to false

solved my issue.  The job still hung at 100%, but eventually completed successfully.

-Bence
