Posted to user@flink.apache.org by Mu Kong <ko...@gmail.com> on 2018/02/01 06:11:08 UTC

JobManager doesn't recover in HA mode

Hi all,

I have a Flink HA cluster with 2 JobManagers and a ZooKeeper quorum of 3
nodes.

After I killed one JobManager, it was never brought back up.
Here is what I did and what I observed:

1. I started the HA cluster with start-cluster.sh.
2. JobManager A got elected as leader.
3. I killed JobManager A with the kill command.
4. JobManager B got elected as leader.
5. JobManager B worked well.
6. But JobManager A never came back after that.

Am I missing something here, or is it the case that HA cannot handle this
kind of failover (the Flink instance being killed directly)?

Thanks!

Best regards,
Mu

Re: JobManager doesn't recover in HA mode

Posted by Mu Kong <ko...@gmail.com>.

Ah, I think I can just use ./bin/jobmanager.sh to start the killed JobManager
again:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html#adding-a-jobmanager

Thanks!
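
Before restarting a JobManager by hand, it can help to see which of the
configured masters is actually down. Below is a minimal Python sketch, not
part of Flink. It assumes the masters are listed one per line as
host:webui-port (the conf/masters layout from the standalone HA setup), that
a missing port means 8081, and that every JobManager, standbys included,
accepts TCP connections on its web UI port. The helper names parse_masters,
is_listening and down_masters are hypothetical.

```python
import socket


def parse_masters(text, default_port=8081):
    """Parse masters-file content into (host, port) pairs.

    Assumption: one "host[:webui-port]" entry per line; blank lines and
    lines starting with "#" are skipped.
    """
    masters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, _, port = line.partition(":")
        masters.append((host, int(port) if port else default_port))
    return masters


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def down_masters(text, probe=is_listening):
    """Return the (host, port) entries for which the probe fails."""
    return [(h, p) for h, p in parse_masters(text) if not probe(h, p)]
```

Feeding it the content of the masters file gives the entries that do not
answer; those are the candidates to start again with ./bin/jobmanager.sh on
the corresponding host.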

On Thu, Feb 1, 2018 at 4:00 PM, Mu Kong <ko...@gmail.com> wrote:

> Hi Tony,
>
> Thanks for your response!
> I will definitely check out supervisord.
>
> I wonder whether there is a way to restart the killed JM and add it back
> to the cluster using one of the scripts in *flink/bin/*?
>
>
> Thanks!
>
>
> Best regards,
> Mu

Re: JobManager doesn't recover in HA mode

Posted by Mu Kong <ko...@gmail.com>.

Hi Tony,

Thanks for your response!
I will definitely check out supervisord.

I wonder whether there is a way to restart the killed JM and add it back to
the cluster using one of the scripts in *flink/bin/*?


Thanks!


Best regards,
Mu


On Thu, Feb 1, 2018 at 3:50 PM, Tony Wei <to...@gmail.com> wrote:

> Hi Mu,
>
> AFAIK, that is the expected behavior when you launch your cluster in
> standalone mode. Flink HA only guarantees that a standby JM will take over
> the whole cluster. The illustration only says that a recovered JM becomes
> another standby; restarting the failed instance itself is not Flink HA's
> responsibility.
> One option is to launch your JM instance under supervisord [1], which
> monitors the process and restarts it automatically if it dies
> unexpectedly. Alternatively, you can run on YARN, in which case YARN is
> responsible for restarting the dead JM.
>
> Best,
> Tony Wei
>
> [1] http://supervisord.org/

Re: JobManager doesn't recover in HA mode

Posted by Tony Wei <to...@gmail.com>.

Hi Mu,

AFAIK, that is the expected behavior when you launch your cluster in
standalone mode. Flink HA only guarantees that a standby JM will take over
the whole cluster. The illustration only says that a recovered JM becomes
another standby; restarting the failed instance itself is not Flink HA's
responsibility.
One option is to launch your JM instance under supervisord [1], which
monitors the process and restarts it automatically if it dies unexpectedly.
Alternatively, you can run on YARN, in which case YARN is responsible for
restarting the dead JM.

Best,
Tony Wei

[1] http://supervisord.org/
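
To make the restart-on-failure idea concrete, here is a minimal Python
sketch of what a process supervisor does. It is not supervisord and not part
of Flink; the function name supervise and its parameters are hypothetical.
It assumes the supervised command stays in the foreground for as long as the
process is alive, and that exit code 0 means an intentional shutdown.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_seconds=1.0):
    """Run cmd in the foreground and restart it when it exits abnormally.

    Stops after a clean exit (code 0) or once max_restarts restarts have
    been used up. Returns the list of exit codes that were observed.
    """
    exit_codes = []
    while True:
        # Blocks until the child exits; a child killed by a signal
        # yields a negative return code on POSIX systems.
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown, do not restart
        if len(exit_codes) > max_restarts:
            break  # restart budget exhausted, give up
        time.sleep(delay_seconds)
    return exit_codes
```

For a JM, cmd would be whatever starts the JobManager in the foreground on
that machine; check the usage line of bin/jobmanager.sh in your Flink version
for the exact arguments. A real setup would still want supervisord or a
similar tool, since this loop itself dies with the shell that runs it.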

2018-02-01 14:11 GMT+08:00 Mu Kong <ko...@gmail.com>:

> Hi all,
>
> I have a Flink HA cluster with 2 JobManagers and a ZooKeeper quorum of 3
> nodes.
>
> After I killed one JobManager, it was never brought back up.
> Here is what I did and what I observed:
>
> 1. I started the HA cluster with start-cluster.sh.
> 2. JobManager A got elected as leader.
> 3. I killed JobManager A with the kill command.
> 4. JobManager B got elected as leader.
> 5. JobManager B worked well.
> 6. But JobManager A never came back after that.
>
> Am I missing something here, or is it the case that HA cannot handle this
> kind of failover (the Flink instance being killed directly)?
>
> Thanks!
>
> Best regards,
> Mu
>