Posted to user@giraph.apache.org by "Magyar, Bence (US SSA)" <be...@baesystems.com> on 2012/11/28 15:07:04 UTC

ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

I have successfully run the shortest path example using Avery's sample input data.  I am now attempting to run the shortest-path algorithm on a much larger data set (300,000 nodes) and I am running into errors.  I have a 4-node cluster and am running the following command:


./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip /user/hduser/insight -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op /user/hduser/insight-out -w 3


It appears as though the shortest-path computation "finishes".  That is to say, I hit "100%".  Then the job hangs for about 30 seconds, drops its progress back to 75%, and finally throws an exception:

No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
12/11/28 08:26:17 INFO mapred.JobClient:  map 0% reduce 0%
12/11/28 08:26:33 INFO mapred.JobClient:  map 25% reduce 0%
12/11/28 08:26:40 INFO mapred.JobClient:  map 50% reduce 0%
12/11/28 08:26:42 INFO mapred.JobClient:  map 75% reduce 0%
12/11/28 08:26:44 INFO mapred.JobClient:  map 100% reduce 0%
12/11/28 08:27:45 INFO mapred.JobClient:  map 75% reduce 0%
12/11/28 08:27:50 INFO mapred.JobClient: Task Id : attempt_201211271542_0004_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)


Digging into the log files a little deeper, I noticed that the last node in my cluster generated more log directories than the previous three.

I see:


*        attempt_201211280843_0001_m_000000_0 -> /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0

*        attempt_201211280843_0001_m_000000_0.cleanup -> /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000000_0.cleanup

*        attempt_201211280843_0001_m_000005_0 -> /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_000005_0

*        job-acls.xml

Whereas the first three nodes each contain only one log folder underneath the job, something like "attempt_201211280843_0001_m_000003_0".  I am assuming this is because something went wrong on node 4 and some "cleanup logic" was attempted.

At any rate, when I cd into the first log folder on the bad node (attempt_201211280843_0001_m_000000_0) and look at "syslog", I see the following error:


2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster: barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: collectAndProcessAggregatorValues: Processed aggregators
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: aggregateWorkerStats: Aggregation found (vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false) on superstep = 98
2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Cleaning up old Superstep /_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread: masterThread: Coordination of superstep 98 took 0.445 seconds ended with state THIS_SUPERSTEP_DONE and is now on superstep 99
2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number of counters - Counters=120 Limit=120, exiting...
org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded limits on number of counters - Counters=120 Limit=120
        at org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
        at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
        at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
        at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
        at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.


What exactly is this limit on MapReduce job "counters"?  What is a MapReduce job "counter"?  I assume it is some variable threshold to keep things in check, and I know that I can modify the value in mapred-site.xml:

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>120</value>
  <description>I have no idea what this does!!!</description>
</property>

I have tried increasing and decreasing this value, and my subsequent jobs pick up the change.  However, neither increasing nor decreasing it seems to make any difference: I always hit whatever limit I've set and my job crashes.  Besides, from outward appearances it looks like the computation finished before the crash.  Can anyone please give deeper insight into what is happening here, or point me to where I can look for more help?
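For context on the failure mode: a MapReduce counter is just a named long value that tasks increment and the framework aggregates (records read, bytes written, and so on), and a job is allowed only a fixed number of distinct counter names. The thread below points at per-superstep counters as the culprit; if a framework registers a fresh counter name every superstep, any fixed limit will eventually be reached, which would explain why raising the limit only delays the crash. A plain-Java simulation of that behavior (illustrative only, not the Hadoop API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative simulation of a counter group with a fixed limit on the
// number of distinct counter names, mirroring mapreduce.job.counters.limit.
public class CounterLimitDemo {
    static class Counters {
        private final int limit;
        private final Map<String, Long> counters = new HashMap<>();

        Counters(int limit) { this.limit = limit; }

        void increment(String name, long amount) {
            // Incrementing an existing counter is always fine; only a NEW
            // name counts against the limit.
            if (!counters.containsKey(name) && counters.size() >= limit) {
                throw new IllegalStateException(
                    "Exceeded limits on number of counters - Counters="
                        + counters.size() + " Limit=" + limit);
            }
            counters.merge(name, amount, Long::sum);
        }
    }

    public static void main(String[] args) {
        Counters counters = new Counters(120);
        // A fixed set of counter names survives any number of supersteps...
        for (int superstep = 0; superstep < 200; superstep++) {
            counters.increment("MessagesSent", 1000);
        }
        // ...but one NEW counter name per superstep exhausts the limit,
        // no matter how high it is set.
        try {
            for (int superstep = 0; superstep < 200; superstep++) {
                counters.increment("Superstep " + superstep + " time (ms)", 450);
            }
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```

Under this model, increasing the limit just moves the superstep at which the job dies, matching the behavior described above.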

Thanks,

Bence


Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Posted by Jonathan Bishop <jb...@gmail.com>.
Bence,

I set that value to 1000000 - I think there is a recommendation to set this
very high. Remember to reboot your cluster after making the change.

Jon



RE: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Posted by "Magyar, Bence (US SSA)" <be...@baesystems.com>.
Thank you Andre, 

Setting "giraph.useSuperstepCounters" = false 

solved my issue.  The job still hung at 100% but eventually completed successfully.
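Concretely, assuming the giraph script forwards -D properties to the job configuration the same way SimpleShortestPathsVertex.sourceId is passed in the original command, the fix amounts to one extra flag on the command line:

```shell
# Hypothetical variant of the original command; the only change is the
# added -Dgiraph.useSuperstepCounters=false (flag name from this thread).
./giraph -DSimpleShortestPathsVertex.sourceId=100 \
  -Dgiraph.useSuperstepCounters=false \
  ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex \
  -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat \
  -ip /user/hduser/insight \
  -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat \
  -op /user/hduser/insight-out -w 3
```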

-Bence

-----Original Message-----
From: André Kelpe [mailto:efeshundertelf@googlemail.com] 
Sent: Wednesday, November 28, 2012 10:45 AM
To: user@giraph.apache.org
Subject: Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Hi Bence,

On older versions of Hadoop there is a hard limit on counters which a job cannot modify. Since the counters are not crucial to the functioning of Giraph, you can turn them off by setting giraph.useSuperstepCounters to false in your job config.

I would also recommend looking into the GiraphConfiguration class, as it contains all the settings you might be interested in (checkpoint frequency, etc.):
https://github.com/apache/giraph/blob/trunk/giraph/src/main/java/org/apache/giraph/GiraphConfiguration.java

HTH

-Andre


Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

Posted by André Kelpe <ef...@googlemail.com>.
Hi Bence,

On older versions of Hadoop there is a hard limit on counters which a
job cannot modify. Since the counters are not crucial to the
functioning of Giraph, you can turn them off by setting
giraph.useSuperstepCounters to false in your job config.

I would also recommend looking into the GiraphConfiguration class, as
it contains all the settings you might be interested in (checkpoint
frequency, etc.):
https://github.com/apache/giraph/blob/trunk/giraph/src/main/java/org/apache/giraph/GiraphConfiguration.java

HTH

-Andre
