You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@giraph.apache.org by Kaushik Patnaik <ka...@gmail.com> on 2013/12/04 05:18:42 UTC

Missing chosen worker error on particular superstep

Hi,

I am getting a missing chosen worker error upon running a custom program
(triangle counting) on a graph dataset of about 7K nodes and 200K edges.
The program runs fine on the first two supersteps, but throws a missing
chosen worker error in the third superstep. In the third superstep each
node sends messages to all its neighbors.

The program runs fine and outputs results on test datasets of size 100
nodes and 1000 edges, still I have pasted it below in case I am missing
something.

I am running the job on a single cluster machine though, could that be a
problem ?? Will increasing the "mapreduce.task.timeout" help in any manner ?
I also see a Java Heap Source error in the second map task of the same run.
Are these errors connected ?

Regards
Kaushik

syslogs from attempt_201312031749_0011_m_000000_0

2013-12-03 23:10:50,661 INFO org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 1 took 100.399 seconds ended
with state THIS_SUPERSTEP_DONE and is now on superstep 2
2013-12-03 23:10:50,774 INFO org.apache.giraph.comm.netty.NettyClient:
connectAllAddresses: Successfully added 0 connections, (0 total
connected) 0 failed, 0 failures total.
2013-12-03 23:10:50,774 INFO
org.apache.giraph.partition.PartitionBalancer:
balancePartitionsAcrossWorkers: Using algorithm static
2013-12-03 23:10:50,775 INFO
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Vertices - Mean: 7115, Min: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 7115, Max: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 7115
2013-12-03 23:10:50,775 INFO
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Edges - Mean: 201524, Min: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 201524, Max: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 201524
2013-12-03 23:10:50,806 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312031749_0011/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-03 23:10:50,806 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-03 23:11:01,756 ERROR
org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
Missing chosen worker Worker(hostname=localhost, MRtaskID=1,
port=30001) on superstep 2
2013-12-03 23:11:01,757 INFO org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 2 took 11.096 seconds ended
with state WORKER_FAILURE and is now on superstep 2
2013-12-03 23:11:01,761 FATAL
org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No
last good checkpoints can be found, killing the job.

attempt_201312031749_0011_m_000001_0

java.lang.IllegalStateException: run: Caught an unrecoverable
exception waitFor: ExecutionException occurred while waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: waitFor:
ExecutionException occurred while waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:181)
	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:139)
	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:124)
	at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:87)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:221)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:736)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:284)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
	... 7 more
Caused by: java.util.concurrent.ExecutionException:
java.lang.OutOfMemoryError: Java heap space
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
	at java.util.concurrent.FutureTask.get(FutureTask.java:119)
	at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(ProgressableUtils.java:300)
	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:173)


Program -

public class TriangleCounting extends BasicComputation<
    DoubleWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  /** Class logger */
  private static final Logger LOG = Logger.getLogger(TriangleCounting.class);

  @Override
  public void compute(
      Vertex<DoubleWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
		
  /** First Superstep releases messages to vertexIds() whose value is
greater than its value. Both VertexId and Message are double **/
  if (getSuperstep() == 0) {
    for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
	  if (edge.getTargetVertexId().compareTo(vertex.getId()) == 1) {
		sendMessage(edge.getTargetVertexId(), vertex.getId());
        if (LOG.isDebugEnabled()) {
          LOG.debug("Vertex " + vertex.getId() + " sent message " +
vertex.getId() + " to vertex " + edge.getTargetVertexId());
        }
        System.out.println("Vertex " + vertex.getId() + " sent message
" + vertex.getId() + " to vertex " + edge.getTargetVertexId());
	  }
    }
  }

  /** Second superstep releases messages to message.get() <
vertex.getId() < targetVertexId() **/
  if (getSuperstep() == 1) {
	for (DoubleWritable message: messages) {
	    for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
		  if (edge.getTargetVertexId().compareTo(vertex.getId()) +
vertex.getId().compareTo(message) == 2) {
		    sendMessage(edge.getTargetVertexId(), message);
	        if (LOG.isDebugEnabled()) {
	          LOG.debug("Vertex " + vertex.getId() + " sent message " +
message + " to vertex " + edge.getTargetVertexId());
	        }
	        System.out.println("Vertex " + vertex.getId() + " sent
message " + message + " to vertex " + edge.getTargetVertexId());
		  }
	    }
    }
  }
  /** Sends messages to all its neighbours, the messages it receives **/
  if (getSuperstep() == 2) {
    for (DoubleWritable message: messages) {
		sendMessageToAllEdges(vertex, message);
	}
  }

  if (getSuperstep() == 3) {
      double Value = 0.0;
	  for (DoubleWritable message: messages) {
		  if (vertex.getId().equals(message)) {
              Value += 1.0;
              System.out.println("Vertex " + vertex.getId() + "
received message " + message);
	      }
      }
      vertex.setValue(new DoubleWritable(Value));
  }

  vertex.voteToHalt();
  }
}

Re: Missing chosen worker error on particular superstep

Posted by Kaushik Patnaik <ka...@gmail.com>.

Rob,

I implemented all your suggestions and was able to run a graph with 7K
nodes under 5mins. I believe the difference in run times maybe because of
the structure of the graph.

However, when I moved to a larger graph ~27K nodes and 352K edges, Giraph
showed the same error after 50 mins of processing. Is this a timeout issue
? I believe the standard timeout value in Hadoop is 10mins (600000 ms).
I dont know why Giraph should fail now that I am enabling message write to
the disk. Have you faced this issue ?

I will be testing it out on a much larger cluster tomorrow. I hope this
does not repeat there.

Syslogs -

2013-12-04 21:33:35,096 INFO
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Vertices - Mean: 27770, Min: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 27770, Max: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 27770
2013-12-04 21:33:35,096 INFO
org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
Edges - Mean: 704609, Min: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 704609, Max: Worker(hostname=localhost, MRtaskID=1,
port=30001) - 704609
2013-12-04 21:33:35,184 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:33:35,184 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:38:35,202 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:38:35,202 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:43:35,208 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:43:35,208 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:48:35,214 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:48:35,214 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:53:35,220 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:53:35,220 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 21:58:35,225 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 21:58:35,226 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 22:03:35,231 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 22:03:35,231 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 22:08:35,237 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out
of 1 workers finished on superstep 2 on path
/_hadoopBsp/job_201312042044_0003/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-04 22:08:35,237 INFO
org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
Waiting on [localhost_1]
2013-12-04 22:11:42,170 ERROR
org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
Missing chosen worker Worker(hostname=localhost, MRtaskID=1,
port=30001) on superstep 2
2013-12-04 22:11:42,171 INFO org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 2 took 2297.944 seconds ended
with state WORKER_FAILURE and is now on superstep 2
2013-12-04 22:11:42,222 FATAL
org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No
last good checkpoints can be found, killing the job.

On Wed, Dec 4, 2013 at 4:34 AM, Rob Vesse <rv...@dotnetrdf.org> wrote:

> Kaushik
>
> I'm also running a triangle counting program of my own at the moment and
> have encountered similar problems.
>
> Missing working appears to indicate that one of the workers failed (AFAIK)
> and you'll need to look at the logs for the other map tasks to determine
> why this is.  You appear to have provided those logs and it looks to be the
> same problem I had of out of memory.  Your algorithm appears to be roughly
> similar to mine so what you'll be hitting is that even with this relatively
> small graph you get a massive explosion of messages by the later super
> steps which exhausts memory (in my graph the theoretical maximum messages
> by the last super step was ~3 billion).
>
> In order for me to get a successful run on a single node cluster you'll
> likely have to do some/all of the following:
>
>    - Enable out of core messages by adding appropriate configuration, see
>    http://giraph.apache.org/ooc.html (I set max messages in memory to
>    10,000)
>    - Increase the heap size for the map reduce processes, this can be
>    done with the mapred.child.java.opts setting (I used –Xmx768M –Xms256M).
>     Ideally you should pick values such that the max memory times the number
>    of workers will fit into available RAM as otherwise you'll start hitting
>    swap files and grind to a halt
>    - Increase the number of workers you use (the –w argument you pass to
>    Giraph), you'll need to have also configured your Hadoop cluster to support
>    at least one more map tasks than the number of workers (if not using
>    external Zookeeper) or at least as many map tasks as the number of workers
>    if using external Zookeeper.  Bear in mind that using more workers doesn't
>    necessarily help that much on a single node cluster but it does mean you
>    can keep the heap size settings down lower.
>
> Even with all this my algorithm took 16 minutes to run on a graph with 10K
> vertices and 150K edges, if you have better luck I'd be interested to know.
>
> Hope this helps,
>
> Rob
>
> From: Kaushik Patnaik <ka...@gmail.com>
> Reply-To: <us...@giraph.apache.org>
> Date: Wednesday, 4 December 2013 04:18
> To: <us...@giraph.apache.org>
> Subject: Missing chosen worker error on particular superstep
>
> Hi,
>
> I am getting a missing chosen worker error upon running a custom program
> (triangle counting) on a graph dataset of about 7K nodes and 200K edges.
> The program runs fine on the first two supersteps, but throws a missing
> chosen worker error in the third superstep. In the third superstep each
> node sends messages to all its neighbors.
>
> The program runs fine and outputs results on test datasets of size 100
> nodes and 1000 edges, still I have pasted it below in case I am missing
> something.
>
> I am running the job on a single cluster machine though, could that be a
> problem ?? Will increasing the "mapreduce.task.timeout" help in any
> manner ?
> I also see a Java Heap Source error in the second map task of the same
> run. Are these errors connected ?
>
> Regards
> Kaushik
>
> syslogs from attempt_201312031749_0011_m_000000_0
>
> 2013-12-03 23:10:50,661 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 100.399 seconds ended with state THIS_SUPERSTEP_DONE and is now on superstep 2
> 2013-12-03 23:10:50,774 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 0 failed, 0 failures total.
> 2013-12-03 23:10:50,774 INFO org.apache.giraph.partition.PartitionBalancer: balancePartitionsAcrossWorkers: Using algorithm static
> 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 7115, Min: Worker(hostname=localhost, MRtaskID=1, port=30001) - 7115, Max: Worker(hostname=localhost, MRtaskID=1, port=30001) - 7115
> 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 201524, Min: Worker(hostname=localhost, MRtaskID=1, port=30001) - 201524, Max: Worker(hostname=localhost, MRtaskID=1, port=30001) - 201524
> 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 1 workers finished on superstep 2 on path /_hadoopBsp/job_201312031749_0011/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
> 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Waiting on [localhost_1]
> 2013-12-03 23:11:01,756 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=localhost, MRtaskID=1, port=30001) on superstep 2
> 2013-12-03 23:11:01,757 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 2 took 11.096 seconds ended with state WORKER_FAILURE and is now on superstep 2
> 2013-12-03 23:11:01,761 FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
>
> attempt_201312031749_0011_m_000001_0
>
> java.lang.IllegalStateException: run: Caught an unrecoverable exception waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
> 	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: java.lang.IllegalStateException: waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
> 	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:181)
> 	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:139)
> 	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:124)
> 	at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:87)
> 	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:221)
> 	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:736)
> 	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:284)
> 	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
> 	... 7 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:119)
> 	at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(ProgressableUtils.java:300)
> 	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:173)
>
>
> Program -
>
> public class TriangleCounting extends BasicComputation<
>     DoubleWritable, DoubleWritable, FloatWritable, DoubleWritable> {
>
>   /** Class logger */
>   private static final Logger LOG = Logger.getLogger(TriangleCounting.class);
>
>   @Override
>   public void compute(
>       Vertex<DoubleWritable, DoubleWritable, FloatWritable> vertex,
>       Iterable<DoubleWritable> messages) throws IOException {
> 		
>   /** First Superstep releases messages to vertexIds() whose value is greater than its value. Both VertexId and Message are double **/		
>   if (getSuperstep() == 0) {
>     for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
> 	  if (edge.getTargetVertexId().compareTo(vertex.getId()) == 1) {
> 		sendMessage(edge.getTargetVertexId(), vertex.getId());
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Vertex " + vertex.getId() + " sent message " + vertex.getId() + " to vertex " + edge.getTargetVertexId());
>         }
>         System.out.println("Vertex " + vertex.getId() + " sent message " + vertex.getId() + " to vertex " + edge.getTargetVertexId());
> 	  }
>     }
>   }
>
>   /** Second superstep releases messages to message.get() < vertex.getId() < targetVertexId() **/
>   if (getSuperstep() == 1) {
> 	for (DoubleWritable message: messages) {
> 	    for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
> 		  if (edge.getTargetVertexId().compareTo(vertex.getId()) + vertex.getId().compareTo(message) == 2) {
> 		    sendMessage(edge.getTargetVertexId(), message);
> 	        if (LOG.isDebugEnabled()) {
> 	          LOG.debug("Vertex " + vertex.getId() + " sent message " + message + " to vertex " + edge.getTargetVertexId());
> 	        }
> 	        System.out.println("Vertex " + vertex.getId() + " sent message " + message + " to vertex " + edge.getTargetVertexId());
> 		  }
> 	    }
>     }
>   }
>   /** Sends messages to all its neighbours, the messages it receives **/
>   if (getSuperstep() == 2) {
>     for (DoubleWritable message: messages) {
> 		sendMessageToAllEdges(vertex, message);
> 	}
>   }
>
>   if (getSuperstep() == 3) {
>       double Value = 0.0;
> 	  for (DoubleWritable message: messages) {
> 		  if (vertex.getId().equals(message)) {
>               Value += 1.0;
>               System.out.println("Vertex " + vertex.getId() + " received message " + message);
> 	      }
>       }
>       vertex.setValue(new DoubleWritable(Value));
>   }
>
>   vertex.voteToHalt();
>   }
> }
>
>

Re: Missing chosen worker error on particular superstep

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Kaushik

I'm also running a triangle counting program of my own at the moment and
have encountered similar problems.

Missing working appears to indicate that one of the workers failed (AFAIK)
and you'll need to look at the logs for the other map tasks to determine why
this is.  You appear to have provided those logs and it looks to be the same
problem I had of out of memory.  Your algorithm appears to be roughly
similar to mine so what you'll be hitting is that even with this relatively
small graph you get a massive explosion of messages by the later super steps
which exhausts memory (in my graph the theoretical maximum messages by the
last super step was ~3 billion).

In order for me to get a successful run on a single node cluster you'll
likely have to do some/all of the following:
* Enable out of core messages by adding appropriate configuration, see
http://giraph.apache.org/ooc.html (I set max messages in memory to 10,000)
* Increase the heap size for the map reduce processes, this can be done with
the mapred.child.java.opts setting (I used Xmx768M Xms256M).  Ideally you
should pick values such that the max memory times the number of workers will
fit into available RAM as otherwise you'll start hitting swap files and
grind to a halt
* Increase the number of workers you use (the w argument you pass to
Giraph), you'll need to have also configured your Hadoop cluster to support
at least one more map tasks than the number of workers (if not using
external Zookeeper) or at least as many map tasks as the number of workers
if using external Zookeeper.  Bear in mind that using more workers doesn't
necessarily help that much on a single node cluster but it does mean you can
keep the heap size settings down lower.
Even with all this my algorithm took 16 minutes to run on a graph with 10K
vertices and 150K edges, if you have better luck I'd be interested to know.

Hope this helps,

Rob

From:  Kaushik Patnaik <ka...@gmail.com>
Reply-To:  <us...@giraph.apache.org>
Date:  Wednesday, 4 December 2013 04:18
To:  <us...@giraph.apache.org>
Subject:  Missing chosen worker error on particular superstep

> Hi,
> 
> I am getting a missing chosen worker error upon running a custom program
> (triangle counting) on a graph dataset of about 7K nodes and 200K edges. The
> program runs fine on the first two supersteps, but throws a missing chosen
> worker error in the third superstep. In the third superstep each node sends
> messages to all its neighbors.
> 
> The program runs fine and outputs results on test datasets of size 100 nodes
> and 1000 edges, still I have pasted it below in case I am missing something.
> 
> I am running the job on a single cluster machine though, could that be a
> problem ?? Will increasing the "mapreduce.task.timeout" help in any manner ?
> I also see a Java Heap Source error in the second map task of the same run.
> Are these errors connected ?
> 
> Regards
> Kaushik
> 
> syslogs from attempt_201312031749_0011_m_000000_0
> 2013-12-03 23:10:50,661 INFO org.apache.giraph.master.MasterThread:
> masterThread: Coordination of superstep 1 took 100.399 seconds ended with
> state THIS_SUPERSTEP_DONE and is now on superstep 2
> 2013-12-03 23:10:50,774 INFO org.apache.giraph.comm.netty.NettyClient:
> connectAllAddresses: Successfully added 0 connections, (0 total connected) 0
> failed, 0 failures total.
> 2013-12-03 23:10:50,774 INFO org.apache.giraph.partition.PartitionBalancer:
> balancePartitionsAcrossWorkers: Using algorithm static
> 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils:
> analyzePartitionStats: Vertices - Mean: 7115, Min: Worker(hostname=localhost,
> MRtaskID=1, port=30001) - 7115, Max: Worker(hostname=localhost, MRtaskID=1,
> port=30001) - 7115
> 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils:
> analyzePartitionStats: Edges - Mean: 201524, Min: Worker(hostname=localhost,
> MRtaskID=1, port=30001) - 201524, Max: Worker(hostname=localhost, MRtaskID=1,
> port=30001) - 201524
> 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster:
> barrierOnWorkerList: 0 out of 1 workers finished on superstep 2 on path
> /_hadoopBsp/job_201312031749_0011/_applicationAttemptsDir/0/_superstepDir/2/_w
> orkerFinishedDir
> 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster:
> barrierOnWorkerList: Waiting on [localhost_1]
> 2013-12-03 23:11:01,756 ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=localhost,
> MRtaskID=1, port=30001) on superstep 2
> 2013-12-03 23:11:01,757 INFO org.apache.giraph.master.MasterThread:
> masterThread: Coordination of superstep 2 took 11.096 seconds ended with state
> WORKER_FAILURE and is now on superstep 2
> 2013-12-03 23:11:01,761 FATAL org.apache.giraph.master.BspServiceMaster:
> getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
> attempt_201312031749_0011_m_000001_0
> java.lang.IllegalStateException: run: Caught an unrecoverable exception
> waitFor: ExecutionException occurred while waiting for
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
> 	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java
> :1059)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: java.lang.IllegalStateException: waitFor: ExecutionException
> occurred while waiting for
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
> 	at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:181)
> 	at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:1
> 39)
> 	at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:1
> 24)
> 	at 
> org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.ja
> va:87)
> 	at 
> org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(Progressabl
> eUtils.java:221)
> 	at 
> org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManag
> er.java:736)
> 	at 
> org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:284)
> 	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
> 	... 7 more
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.OutOfMemoryError: Java heap space
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:119)
> 	at 
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(ProgressableU
> tils.java:300)
> 	at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:173)
> 
> Program - 
> public class TriangleCounting extends BasicComputation<
>     DoubleWritable, DoubleWritable, FloatWritable, DoubleWritable> {
>   
>   /** Class logger */
>   private static final Logger LOG = Logger.getLogger(TriangleCounting.class);
>     
>   @Override
>   public void compute(
>       Vertex<DoubleWritable, DoubleWritable, FloatWritable> vertex,
>       Iterable<DoubleWritable> messages) throws IOException {
> 		  
>   /** First Superstep releases messages to vertexIds() whose value is greater
> than its value. Both VertexId and Message are double **/		
>   if (getSuperstep() == 0) {
>     for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
> 	  if (edge.getTargetVertexId().compareTo(vertex.getId()) == 1) {
> 		sendMessage(edge.getTargetVertexId(), vertex.getId());
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Vertex " + vertex.getId() + " sent message " +
> vertex.getId() + " to vertex " + edge.getTargetVertexId());
>         }
>         System.out.println("Vertex " + vertex.getId() + " sent message " +
> vertex.getId() + " to vertex " + edge.getTargetVertexId());
> 	  }
>     }
>   }
>     
>   /** Second superstep releases messages to message.get() < vertex.getId() <
> targetVertexId() **/
>   if (getSuperstep() == 1) {
> 	for (DoubleWritable message: messages) {
> 	    for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()) {
> 		  if (edge.getTargetVertexId().compareTo(vertex.getId()) +
> vertex.getId().compareTo(message) == 2) {
> 		    sendMessage(edge.getTargetVertexId(), message);
> 	        if (LOG.isDebugEnabled()) {
> 	          LOG.debug("Vertex " + vertex.getId() + " sent message " + message +
> " to vertex " + edge.getTargetVertexId());
> 	        }
> 	        System.out.println("Vertex " + vertex.getId() + " sent message " +
> message + " to vertex " + edge.getTargetVertexId());
> 		  }
> 	    }
>     }
>   }
>   /** Sends messages to all its neighbours, the messages it receives **/
>   if (getSuperstep() == 2) {
>     for (DoubleWritable message: messages) {
> 		sendMessageToAllEdges(vertex, message);
> 	}
>   }
>   
>   if (getSuperstep() == 3) {
>       double Value = 0.0;
> 	  for (DoubleWritable message: messages) {
> 		  if (vertex.getId().equals(message)) {
>               Value += 1.0;
>               System.out.println("Vertex " + vertex.getId() + " received
> message " + message);
> 	      }
>       }
>       vertex.setValue(new DoubleWritable(Value));
>   } 
>     
>   vertex.voteToHalt();
>   }
> }