Posted to user@spark.apache.org by Anthony Tang <aa...@yahoo.com.INVALID> on 2016/02/02 10:48:36 UTC

Master failover results in running job marked as "WAITING"

Hi - 

I'm running Spark 1.5.2 in standalone mode with multiple masters using ZooKeeper for failover.  When the active master goes down, it fails over correctly to the standby, and running applications seem to continue to run, but in the new active master's web UI they are marked as "WAITING", and the workers have these entries in their logs: 

16/01/30 00:51:13 ERROR Worker: Connection to master failed! Waiting for master to reconnect... 
16/01/30 00:51:13 WARN Worker: Failed to connect to master XXX:7077 
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkMaster@XXX:7077/), Path(/user/Master)] 
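
For reference, the masters are configured for ZooKeeper recovery along these lines in spark-env.sh (the ZooKeeper hosts below are placeholders, not my actual ones):

    # enable ZooKeeper-based recovery on each master
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"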

Should they still be "RUNNING"? One time it looked like the job stopped functioning (this is a continuously running streaming job), but I haven't been able to reproduce it.  FWIW, the driver that started it is still marked as "RUNNING". 
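
In case it matters, the application was submitted with both masters in the master URL so the driver can register with whichever one is active (hosts, class, and jar below are placeholders, not my exact setup):

    spark-submit \
      --master spark://master1:7077,master2:7077 \
      --class com.example.MyStreamingApp \
      myapp.jar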

Thanks. 
- Anthony

Re: Master failover results in running job marked as "WAITING"

Posted by Anthony Tang <aa...@yahoo.com.INVALID>.
Yes, it's the IP address/host.
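
To double-check which master is active and how it sees the application, I've been querying the master web UI's JSON endpoint, something like this (host is a placeholder, and this assumes the UI is on the default port 8080):

    # the standalone master web UI serves cluster state as JSON
    curl http://master2:8080/json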




On Tuesday, February 2, 2016, 8:04 AM, Ted Yu <yu...@gmail.com> wrote:

bq. Failed to connect to master XXX:7077
Is the 'XXX' above the hostname for the new master?
Thanks


Re: Master failover results in running job marked as "WAITING"

Posted by Ted Yu <yu...@gmail.com>.
bq. Failed to connect to master XXX:7077

Is the 'XXX' above the hostname for the new master?

Thanks
