Posted to user@flink.apache.org by Matthias Pohl via user <us...@flink.apache.org> on 2023/01/02 12:51:01 UTC

Re: The use of zookeeper in flink

And I screwed up the reply again. -.- Here's my previous response, resent to
the ML thread and not only to spoon_lz:

Hi spoon_lz,
Thanks for reaching out to the community and sharing your use case. You're
right about the fact that Flink's HA feature relies on the leader election.
The HA backend not being responsive for too long might cause problems. I'm
not sure I fully understand what you mean by the standby JobManagers
struggling with the ZK outage not affecting the running jobs. If ZK is not
responding for the standby JMs, the actual JM leader should be affected as
well, which would in turn affect job execution. But I might be
misunderstanding your post. Logs would be helpful to get a better
understanding of your post's context.

Best,
Matthias

FYI: There is also (a kind of stalled) discussion in the dev ML [1] about
recovery of too many jobs affecting Flink's performance.

[1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj

On Thu, Dec 29, 2022 at 8:55 AM spoon_lz <sp...@126.com> wrote:

> Hi All,
> We use ZooKeeper to achieve high availability of jobs. Recently, a failure
> occurred in our Flink cluster: the ZooKeeper service went down abnormally,
> and all Flink jobs using that ZooKeeper failed over. The failover and
> restart of a large number of jobs in a short period of time put too much
> pressure on the cluster, which in turn caused the cluster to crash.
> Afterwards, I checked the HA function of zk:
> 1. Leader election
> 2. Service discovery
> 3. State persistence
>
> The unavailability of the ZooKeeper service leads to failover of the Flink
> job. It seems this is because of the first point: the JM cannot confirm
> whether it is Active or Standby, while the other two points should not be
> affected. But we don't use a standby JobManager.
> So in my opinion, if no standby JobManager is used, the availability of the
> ZK service should not affect jobs that are running normally (of course, it
> is understandable that a job cannot be recovered correctly if an exception
> occurs while ZK is down), and I don't know if there is a way to achieve
> something like this.
>
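For context, the kind of ZooKeeper-based HA setup discussed in this thread is
enabled through flink-conf.yaml settings along these lines (a sketch with
illustrative hosts and paths, not the poster's actual configuration):

```yaml
# flink-conf.yaml -- illustrative values only
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Durable storage for JobManager metadata; ZK only holds pointers to it
high-availability.storageDir: hdfs:///flink/ha/
# Namespace for this cluster under the ZK root path
high-availability.cluster-id: /my-flink-cluster
```

With this in place, the JobManager participates in leader election through ZK,
so a prolonged ZK outage expires its session and costs it leadership, which is
the failover trigger discussed above.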

Re: The use of zookeeper in flink

Posted by Yang Wang <da...@gmail.com>.
The reason why the running jobs try to fail over during a ZooKeeper outage is
that the JobManager loses leadership.
Having a standby JobManager or not makes no difference.

Best,
Yang
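A related knob, if brief ZK outages are the concern: the leader only loses
leadership once its ZooKeeper client session expires, so raising the session
timeout trades slower failover detection for more tolerance of short outages.
A sketch of the relevant client options (the values here are illustrative,
not recommendations):

```yaml
# Illustrative only -- a longer session timeout delays leadership loss
high-availability.zookeeper.client.session-timeout: 120000
high-availability.zookeeper.client.connection-timeout: 30000
```

Note that this only delays the failure mode; it does not decouple running
jobs from ZK availability, which is the point made in the reply above.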

Matthias Pohl via user <us...@flink.apache.org> wrote on Mon, Jan 2, 2023 at 20:51:

> And I screwed up the reply again. -.- Here's my previous response for the
> ML thread and not only spoon_lz:
>
> Hi spoon_lz,
> Thanks for reaching out to the community and sharing your use case. You're
> right about the fact that Flink's HA feature relies on the leader election.
> The HA backend not being responsive for too long might cause problems. I'm
> not sure I understand fully what you mean by the standby JobManagers
> struggling with the ZK outage shouldn't affect the running jobs. If ZK is
> not responding for the standby JMs, the actual JM leader should be affected
> as well which, as a consequence, would affect the job execution. But I
> might misunderstand your post. Logs would be helpful to get a better
> understanding of your post's context.
>
> Best,
> Matthias
>
> FYI: There is also (a kind of stalled) discussion in the dev ML [1] about
> recovery of too many jobs affecting Flink's performance.
>
> [1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj
>
> On Thu, Dec 29, 2022 at 8:55 AM spoon_lz <sp...@126.com> wrote:
>
>> Hi All,
>> We use zookeeper to achieve high availability of jobs. Recently, a
>> failure occurred in our flink cluster. It was due to the abnormal downtime
>> of the zookeeper service that all the flink jobs using this zookeeper all
>> occurred failover. The failover startup of a large number of jobs in a
>> short period of time caused the cluster The pressure is too high, which in
>> turn causes the cluster to crash.
>> Afterwards, I checked the HA function of zk:
>> 1. Leader election
>> 2. Service discovery
>> 3.State persistence:
>>
>> The unavailability of the zookeeper service leads to failover of the
>> flink job. It seems that because of the first point, JM cannot confirm
>> whether it is Active or Standby, and the other two points should not affect
>> it. But we didn't use the Standby JobManager.
>> So in my opinion, if the JobManager of Standby is not used, whether the
>> zk service is available should not affect the jobs that are running
>> normally(of course, it is understandable that the task cannot be recovered
>> correctly if an exception occurs), and I don’t know if there is a way to
>> achieve a similar purpose
>>
>