Posted to dev@flink.apache.org by Zili Chen <wa...@gmail.com> on 2019/08/26 06:13:52 UTC

Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi Till,

I'd like to revive this thread since 1.9.0 has been released.

IMHO we have already reached a consensus on JIRA. If you can review
the pull request, we can hopefully address the issue in the next release.

Best,
tison.


Zili Chen <wa...@gmail.com> wrote on Mon, Jul 29, 2019 at 11:05 PM:

> Hi Till,
>
> Thanks for your explanation. Let's pick this thread up during 1.10 development.
>
> Best,
> tison.
>
>
> Till Rohrmann <tr...@apache.org> wrote on Mon, Jul 29, 2019 at 9:12 PM:
>
>> Hi Tison,
>>
>> I would consider this a new feature and as such it won't be possible to
>> include it in the 1.9.0 release since the feature freeze has passed.
>> We might target 1.10, though.
>>
>> Cheers,
>> Till
>>
>> On Mon, Jul 29, 2019 at 3:01 AM Zili Chen <wa...@gmail.com> wrote:
>>
>> > Hi committers,
>> >
>> > Now that there is an open PR [1] for this JIRA issue, we need a committer
>> > to push this thread forward. It would be great to see this issue fixed
>> > in 1.9.0.
>> >
>> > Best,
>> > tison.
>> >
>> > [1] https://github.com/apache/flink/pull/9158
>> >
>> >
>> > 未来阳光 <22...@qq.com> wrote on Tue, Jul 23, 2019 at 9:28 PM:
>> >
>> > > OK. If you have any suggestions, we can talk about the details under
>> > > FLINK-10052.
>> > >
>> > >
>> > > Best.
>> > >
>> > >
>> > > ------------------ Original Message ------------------
>> > > From: "Till Rohrmann"<tr...@apache.org>;
>> > > Sent: Tuesday, July 23, 2019, 9:19 PM
>> > > To: "dev"<de...@flink.apache.org>;
>> > >
>> > > Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections
>> > >
>> > >
>> > >
>> > > Hi Lamber-Ken,
>> > >
>> > > thanks for starting this discussion. I think there is a benefit in not
>> > > immediately losing leadership if the ZooKeeper connection goes into the
>> > > SUSPENDED state. In particular, if we can guarantee that there is only a
>> > > single JobMaster, it might make sense not to give up leadership too
>> > > eagerly. I would suggest continuing the technical discussion on the
>> > > JIRA issue thread since it already contains a good amount of detail.
>> > >
>> > > Cheers,
>> > > Till
>> > >
>> > > On Sat, Jul 20, 2019 at 12:55 PM QQ邮箱 <22...@qq.com> wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > > Desc
>> > > > We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and
>> > > > use ZooKeeper as the HighAvailabilityService, but we found that a Flink
>> > > > job restarts whenever the network between the JobManager and ZooKeeper
>> > > > is temporarily disconnected, so we analyzed the problem in depth. The
>> > > > Flink JobManager uses Curator's `LeaderLatch` to maintain leadership.
>> > > > When the network disconnects, the `LeaderLatch` immediately sets
>> > > > leadership to false. We think this is too harsh, because many
>> > > > long-running Flink jobs restart after a brief network glitch. Instead
>> > > > of directly revoking the leadership upon a SUSPENDED ZooKeeper
>> > > > connection, it would be better to wait until the ZooKeeper connection
>> > > > is LOST.
>> > > >
>> > > > There are two JIRA issues about this problem, FLINK-10052 and
>> > > > FLINK-13189, which are duplicates. Thanks to @Elias Levy for pointing
>> > > > this out; FLINK-13189 has been closed.
>> > > >
>> > > > Solution
>> > > > Back to the problem: there are currently two ways to solve it. One is
>> > > > to rewrite the LeaderLatch#handleStateChange method; the other is to
>> > > > upgrade to curator-4.2.0. The first is hacky but correct; the second
>> > > > requires considering compatibility. For more detail, please see
>> > > > FLINK-10052.
>> > > >
>> > > > Hope
>> > > > FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
>> > > > this problem can be fixed as soon as possible.
>> > > > By the way, thanks to @TisonKun for discussing this problem and
>> > > > reviewing the PR.
>> > > >
>> > > > Links
>> > > > FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
>> > > > FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
>> > > >
>> > > > Any suggestions are welcome. What do you think?
>> > > >
>> > > > Best, lamber-ken.
>> >
>>
>
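
For readers skimming the thread, the behaviour change proposed in the original
report (revoke leadership only when the ZooKeeper connection is LOST, not as
soon as it is SUSPENDED) can be illustrated with a small self-contained model.
This is only a sketch under stated assumptions: it does not use Curator or
Flink, the class and method names (LeadershipPolicySketch, Contender,
survives) are hypothetical, and the enum merely borrows the state names
mentioned in the thread. It is not the change made in the pull request.

```java
// Hypothetical model for illustration only; not Curator or Flink code.
final class LeadershipPolicySketch {

    // State names borrowed from the thread (SUSPENDED, LOST) plus the
    // healthy states needed to model a reconnect.
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** A contender that is assumed to already hold leadership. */
    static final class Contender {
        private final boolean revokeOnSuspended;
        private boolean leader = true;

        Contender(boolean revokeOnSuspended) {
            this.revokeOnSuspended = revokeOnSuspended;
        }

        /** Reacts to a connection state change according to the chosen policy. */
        void onStateChanged(ConnectionState newState) {
            switch (newState) {
                case SUSPENDED:
                    // Eager policy: give up leadership on any connection hiccup.
                    if (revokeOnSuspended) {
                        leader = false;
                    }
                    break;
                case LOST:
                    // Both policies give up leadership once the connection is LOST.
                    leader = false;
                    break;
                case CONNECTED:
                case RECONNECTED:
                    // Healthy again; this model does not re-run an election.
                    break;
            }
        }

        boolean isLeader() {
            return leader;
        }
    }

    /** Replays a sequence of states and reports whether leadership was kept. */
    static boolean survives(boolean revokeOnSuspended, ConnectionState... states) {
        Contender contender = new Contender(revokeOnSuspended);
        for (ConnectionState state : states) {
            contender.onStateChanged(state);
        }
        return contender.isLeader();
    }

    public static void main(String[] args) {
        ConnectionState[] blip = {
            ConnectionState.SUSPENDED, ConnectionState.RECONNECTED
        };
        ConnectionState[] outage = {
            ConnectionState.SUSPENDED, ConnectionState.LOST
        };
        System.out.println("eager, short blip:    " + survives(true, blip));
        System.out.println("tolerant, short blip: " + survives(false, blip));
        System.out.println("tolerant, outage:     " + survives(false, outage));
    }
}
```

In this model the tolerant policy keeps leadership across a
SUSPENDED/RECONNECTED sequence, which is the case where the report says jobs
currently restart, while still giving up leadership once the connection is
LOST. Whether tolerating SUSPENDED is safe in a real deployment depends on the
single-JobMaster guarantee raised earlier in the thread.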