Posted to user@spark.apache.org by Piotr Kołaczkowski <pk...@datastax.com> on 2014/05/22 10:39:11 UTC

Workers disconnected from master sometimes and never reconnect back

Hi,

Another problem we observed is that on a very heavily loaded cluster, if a
worker fails to respond to the heartbeat within 60 seconds, it gets
disconnected permanently from the master and never reconnects. It is very
easy to reproduce: just set up a Spark standalone cluster on a single
machine, suspend the machine for a while, and after it wakes up the cluster
no longer works because all workers are lost.
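
To make the failure mode concrete, here is a minimal Python sketch of the
behaviour as we observe it from the outside. It is a toy model, not Spark's
actual code: the Master class, the simulate function, and the
reregister_on_rejection flag are all hypothetical names, and the rule that
heartbeats from an already removed worker are ignored is an assumption that
merely matches what we see. Only the 60-second window comes from the
observation above.

```python
# Toy model of the observed behaviour; NOT Spark's implementation.
# Every name below is hypothetical and exists only for illustration.

WORKER_TIMEOUT_S = 60  # the 60-second heartbeat window mentioned above


class Master:
    def __init__(self):
        # worker id -> timestamp (seconds) of the last accepted heartbeat
        self.last_heartbeat = {}

    def register(self, worker_id, now):
        self.last_heartbeat[worker_id] = now

    def heartbeat(self, worker_id, now):
        # Assumption: a heartbeat from a worker that was already removed
        # is ignored, so the worker stays lost.
        if worker_id in self.last_heartbeat:
            self.last_heartbeat[worker_id] = now
            return True
        return False

    def expire_workers(self, now):
        # Drop every worker whose last heartbeat is older than the timeout.
        dead = [w for w, t in self.last_heartbeat.items()
                if now - t > WORKER_TIMEOUT_S]
        for w in dead:
            del self.last_heartbeat[w]
        return dead


def simulate(suspend_s, reregister_on_rejection):
    """Return True if the worker is still known to the master after a pause."""
    master = Master()
    master.register("worker-1", now=0)

    now = suspend_s                # machine suspended: no heartbeats sent
    master.expire_workers(now)     # master's timeout check runs on wake-up
    accepted = master.heartbeat("worker-1", now)

    # Hypothetical recovery path: register again when the heartbeat is rejected.
    if not accepted and reregister_on_rejection:
        master.register("worker-1", now)

    return "worker-1" in master.last_heartbeat
```

In this model, simulate(120, False) leaves the worker lost for good, which is
the situation described above, while simulate(120, True) shows the kind of
reconnect we would expect to happen.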

Is there any way to mitigate this?

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkolaczk@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404