Posted to user@giraph.apache.org by Matthew Cornell <ma...@matthewcornell.org> on 2014/10/28 16:01:38 UTC

"Missing chosen worker" ERROR drills down to "end of stream exception" ("likely client has closed socket"). help!

Hi All,

I have a Giraph 1.0.0 job that failed, but I can't find any detail
about what actually happened. The master's log says:

> 2014-10-28 10:28:32,006 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=compute-0-0.wright, MRtaskID=1, port=30001) on superstep 4

OK, this seems to say compute-0-0 failed in some way, correct? The
Ganglia pages show no noticeable OS differences between the failed
node and another identical compute node. In the failed node's log I
see two WARNs:

> 2014-10-28 10:28:19,560 WARN org.apache.giraph.bsp.BspService: process: Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent state:Disconnected type:None path:null
> 2014-10-28 10:28:19,560 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem with zookeeper, got event with path null, state Disconnected, event type None

OK, I guess there was a ZooKeeper issue. In the ZooKeeper log I find:

> 2014-10-28 10:28:14,917 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 0x149529c74de0a4d, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>         at java.lang.Thread.run(Thread.java:745)

OK, so I guess the socket closure was the problem. But why did *that* happen?
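
To make the ordering of the three excerpts above easier to see, here is a minimal sketch that puts them on one timeline. It only uses the timestamps quoted in this message; the names `LOG_LINES`, `parse_timestamp` and `build_timeline` are made up for illustration, and it assumes the clocks on the hosts involved are roughly in sync, which I have not verified.

```python
from datetime import datetime

# Log excerpts copied from this message, truncated to timestamp, level and class.
# The dictionary keys are just labels for where each line came from.
LOG_LINES = {
    "master": "2014-10-28 10:28:32,006 ERROR org.apache.giraph.master.BspServiceMaster",
    "worker": "2014-10-28 10:28:19,560 WARN org.apache.giraph.bsp.BspService",
    "zookeeper": "2014-10-28 10:28:14,917 WARN org.apache.zookeeper.server.NIOServerCnxn",
}


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' stamp (23 characters)."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def build_timeline(lines):
    """Return (source, timestamp, seconds since the earliest event), oldest first."""
    events = sorted(
        ((source, parse_timestamp(text)) for source, text in lines.items()),
        key=lambda event: event[1],
    )
    start = events[0][1]
    return [(source, ts, (ts - start).total_seconds()) for source, ts in events]


if __name__ == "__main__":
    for source, ts, offset in build_timeline(LOG_LINES):
        print(f"{ts.isoformat(sep=' ', timespec='milliseconds')}  +{offset:7.3f}s  {source}")
```

Sorted this way, the ZooKeeper server's end of stream warning comes first, the worker's "Disconnected from ZooKeeper" warning follows about 4.6 seconds later, and the master's "Missing chosen worker" error comes about 12.4 seconds after that. So the master's error looks like the last symptom rather than the cause, if the clocks can be trusted.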

I could really use your help here!

Thank you,

matt


-- 
Matthew Cornell | matt@matthewcornell.org