Posted to user@spark.apache.org by map reduced <k3...@gmail.com> on 2017/04/05 21:01:41 UTC

Master-Worker communication on Standalone cluster issues

Hi,

I was wondering how often a Worker pings the Master to check on the Master's
liveness. Or is it the Master (resource manager) that pings the Workers to
check on their liveness and, if any workers are dead, to spawn new ones? Or
is it both?

Some info:
Standalone cluster
1 Master - 8 cores, 12 GB
32 workers - 8 cores and 8 GB each

My main problem - Here's what happened:

Master M - running with 32 workers
Workers 1 and 2 died at 03:55:00 AM - so now the cluster is down to 30 workers

Worker 1' came up at 03:55:12.000 AM - it connected to M
Worker 2' came up at 03:55:16.000 AM - it connected to M

Master M *dies* at 03:56:00 AM
New master NM comes up at 03:56:30 AM
Workers 1' and 2' *DO NOT* connect to NM
The remaining 30 workers connect to NM.

So NM now has 30 workers.

I was wondering why those two won't connect to the new master NM even though
master M is definitely dead.
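
To make the gaps in the timeline above easier to reason about, here is a
minimal Python sketch that computes the elapsed seconds between the reported
events. The event names are labels made up for this illustration; only the
timestamps come from the post.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps copied from the timeline above (all AM, same day).
# The dictionary keys are hypothetical labels, not Spark identifiers.
events = {
    "workers_1_2_died": "03:55:00",
    "worker_1p_up": "03:55:12",
    "worker_2p_up": "03:55:16",
    "master_m_died": "03:56:00",
    "master_nm_up": "03:56:30",
}


def seconds_between(start: str, end: str) -> int:
    """Return whole seconds elapsed from event `start` to event `end`."""
    t0 = datetime.strptime(events[start], FMT)
    t1 = datetime.strptime(events[end], FMT)
    return int((t1 - t0).total_seconds())


pairs = [
    ("worker_1p_up", "master_m_died"),
    ("worker_2p_up", "master_m_died"),
    ("master_m_died", "master_nm_up"),
]
for start, end in pairs:
    print(f"{start} -> {end}: {seconds_between(start, end)} s")
```

By these numbers, Workers 1' and 2' had been connected to M for 48 and 44
seconds when M died, and NM came up 30 seconds after that.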

PS: I have a load balancer (LB) set up in front of the Master, so whenever a
new master comes up, the LB starts pointing to the new one.

Thanks,
KP

Re: Master-Worker communication on Standalone cluster issues

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
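1. Between worker and master:
spark.worker.timeout (default: 60 seconds)
The number of seconds after which the standalone master considers a worker
lost if it receives no heartbeats from it.
see: http://spark.apache.org/docs/latest/spark-standalone.html

2. Between executor and driver:
spark.executor.heartbeatInterval (default: 10s)
The interval between each executor's heartbeats to the driver.
see: http://spark.apache.org/docs/latest/configuration.html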
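
As a rough illustration of how these settings relate to liveness checks, here
is a minimal Python sketch. It assumes the master declares a worker lost once
no heartbeat has arrived for longer than spark.worker.timeout. It further
assumes that a worker sends heartbeats at one quarter of that timeout; the
divisor of 4 is an assumption about Spark's internals that may differ between
versions, and the function names are hypothetical.

```python
# Defaults quoted above, in seconds.
SPARK_WORKER_TIMEOUT_S = 60             # spark.worker.timeout
SPARK_EXECUTOR_HEARTBEAT_INTERVAL_S = 10  # spark.executor.heartbeatInterval

# ASSUMPTION: the worker heartbeats every timeout / 4. This is not a
# documented setting; verify it against your Spark version.
ASSUMED_HEARTBEATS_PER_TIMEOUT = 4


def worker_heartbeat_interval_s(timeout_s: float = SPARK_WORKER_TIMEOUT_S) -> float:
    """Seconds between worker -> master heartbeats under the assumption above."""
    return timeout_s / ASSUMED_HEARTBEATS_PER_TIMEOUT


def master_considers_worker_lost(
    last_heartbeat_s: float,
    now_s: float,
    timeout_s: float = SPARK_WORKER_TIMEOUT_S,
) -> bool:
    """True once the silence since the last heartbeat exceeds the timeout."""
    return (now_s - last_heartbeat_s) > timeout_s


print(worker_heartbeat_interval_s())        # assumed worker heartbeat period
print(master_considers_worker_lost(0, 59))  # still within the timeout
print(master_considers_worker_lost(0, 61))  # past the timeout
print(SPARK_EXECUTOR_HEARTBEAT_INTERVAL_S)  # executor -> driver period
```

Note that both settings describe heartbeats flowing toward the master or
driver; neither one describes a worker checking whether the master is alive.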


Please correct me if I'm wrong.


