Posted to dev@flink.apache.org by Lijie Wang <wa...@gmail.com> on 2022/05/20 04:57:31 UTC

[VOTE] FLIP-224: Blocklist Mechanism

Hi everyone,

Thanks for the feedback on FLIP-224: Blocklist Mechanism [1] in the
discussion thread [2].

I'd like to start a vote for it. The vote will last for at least 72 hours
unless there is an objection or insufficient votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blocklist+Mechanism
[2] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h

Best,
Lijie

Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.
I don't really see a strong need for the blocklist in FLIP-168.
It states that we need the block mechanism so that the speculative 
executions aren't deployed to the slow node.

I'm wondering what exactly prevents the scheduler from ensuring that 
right now, given that we already have other mechanisms that make
location-aware decisions (e.g., local recovery or locality).
It could refuse to use slots from slow nodes (and downscale the job 
accordingly), upscale the job to limit the impact of slow nodes, and we 
could also think about extending the requirement declaration to have a 
notion of "undesirable nodes" so you're getting some more slots from 
good nodes (which I /think/ is the main thing you're targeting with the 
blocklist).
In any case I think there are options to explore here. Maybe you already 
did, but there's nothing in the FLIP about rejected alternatives.
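
A minimal sketch of the "undesirable nodes" idea above, assuming a toy slot model. The names `Slot`, `pick_slots`, and `undesirable_nodes` are hypothetical and are not Flink APIs; the set is passed per call, so it can be scoped to a single job rather than the whole cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    slot_id: str
    node: str


def pick_slots(available, needed, undesirable_nodes):
    """Prefer slots on good nodes; use undesirable nodes only if short."""
    good = [s for s in available if s.node not in undesirable_nodes]
    bad = [s for s in available if s.node in undesirable_nodes]
    # Good slots come first, so undesirable ones are only a fallback.
    return (good + bad)[:needed]


slots = [
    Slot("s1", "node-a"),
    Slot("s2", "node-b"),
    Slot("s3", "node-a"),
    Slot("s4", "node-c"),
]
# Hypothetical: this job considers node-a slow.
chosen = pick_slots(slots, needed=2, undesirable_nodes={"node-a"})
print([s.slot_id for s in chosen])
```

Returning `good[:needed]` instead would model the stricter variant, where the scheduler refuses slow-node slots and the job is downscaled when there are too few good ones.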

It makes me think that the whole "block the node but keep current stuff
running" is more of a band-aid for a problem we're about to introduce 
ourselves. In particular because cluster-wide blocks seem rather strange 
in general; this should be scoped to the job because the performance is 
measured relative to other vertices in that same job. Another job might 
have drawn the short straw and only got slow slots; as far as that job 
is concerned the performance is perfectly fine.

On 07/06/2022 10:27, Zhu Zhu wrote:
> Hi Chesnay,
>
> For your information, one major goal of the blocklist mechanism is to
> support FLIP-168 (speculative execution of batch jobs). When
> speculative execution happens, it needs to keep the existing tasks
> running and launch speculative tasks on other nodes. We have heard
> requests for speculative execution from many users, who find its
> absence a blocker for running their production batch jobs on Flink.
> Multi-tenant environments are common for batch jobs, and temporary
> hotspots are a common problem. They cannot be resolved well by
> fine-grained resources (machine load is not controlled by Flink) nor by
> killing all tasks on a temporary hotspot (the job may roll back by
> hours). Therefore, even just considering this goal, I think it
> adds enough value to users.
>
> Regarding whether we should reject a proposal because it adds
> complexity to the core components: my point is that it depends on
> whether the feature adds enough value to users. It would also be
> welcome if someone has another good idea that adds less complexity.
>
> If you are still concerned about the value of this feature, I'm fine
> with opening a survey on the user mailing list to see what users think
> about it.
> What do you think?
>
> Thanks,
> Zhu

Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.
Hi Chesnay,

For your information, one major goal of the blocklist mechanism is to
support FLIP-168 (speculative execution of batch jobs). When
speculative execution happens, it needs to keep the existing tasks
running and launch speculative tasks on other nodes. We have heard
requests for speculative execution from many users, who find its
absence a blocker for running their production batch jobs on Flink.
Multi-tenant environments are common for batch jobs, and temporary
hotspots are a common problem. They cannot be resolved well by
fine-grained resources (machine load is not controlled by Flink) nor by
killing all tasks on a temporary hotspot (the job may roll back by
hours). Therefore, even just considering this goal, I think it
adds enough value to users.
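
As a rough illustration of the behaviour described above (the existing task keeps running while a speculative copy is launched elsewhere), here is a toy sketch. The function `choose_speculative_node` and the `attempts` bookkeeping are hypothetical and only model the idea; they do not reflect the FLIP-168 or FLIP-224 designs.

```python
def choose_speculative_node(current_node, candidate_nodes, blocked_nodes):
    """Return a node for a speculative attempt, or None if none qualifies."""
    for node in candidate_nodes:
        # Skip the node already running the slow attempt and any blocked node.
        if node != current_node and node not in blocked_nodes:
            return node
    return None


# The original attempt of the (hypothetical) task "map-3" runs on node-a.
attempts = {"map-3": ["node-a"]}
blocked = {"node-a"}  # node-a was identified as a temporary hotspot

target = choose_speculative_node("node-a", ["node-a", "node-b", "node-c"], blocked)
if target is not None:
    # The original attempt is kept; the speculative one is added next to it.
    attempts["map-3"].append(target)
print(attempts)
```

The point of the sketch is that blocking only affects where new attempts are placed, so nothing already running on the hotspot has to be killed.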

Regarding whether we should reject a proposal because it adds
complexity to the core components: my point is that it depends on
whether the feature adds enough value to users. It would also be
welcome if someone has another good idea that adds less complexity.

If you are still concerned about the value of this feature, I'm fine
with opening a survey on the user mailing list to see what users think
about it.
What do you think?

Thanks,
Zhu

Chesnay Schepler <ch...@apache.org> wrote on Tue, Jun 7, 2022 at 15:13:
>
> I've had some time to think about it and decided to stick with my -1.
>
> While BLOCK_WITH_QUARANTINE is easy to implement (un-register TMs and
> ignore all RPCs (the latter mostly happens automatically)) it doesn't
> add a whole lot of value as it's pretty much equivalent to shutting
> down the TM.
>
> Meanwhile, BLOCK needs an entirely different implementation that
> interacts with the slot management on both JM/RM, and tbh I'm not so
> sure about its purpose. If the node/process is overloaded because of
> the running job, well then resource profiles & fine-grained resource
> management are supposed to address that. If the overloading is externally
> induced then BLOCK only makes sense if the node is overloaded to a
> degree where the existing workload is fine (otherwise
> BLOCK_WITH_QUARANTINE would be a better choice I guess), which seems
> rather unlikely.
>
> I'm against this change because I don't believe it will be useful for
> the general user-base, and because it can't be implemented without
> pushing some complexity into core components.

Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.
I've had some time to think about it and decided to stick with my -1.

While BLOCK_WITH_QUARANTINE is easy to implement (un-register TMs and
ignore all RPCs (the latter mostly happens automatically)) it doesn't
add a whole lot of value as it's pretty much equivalent to shutting
down the TM.

Meanwhile, BLOCK needs an entirely different implementation that
interacts with the slot management on both JM/RM, and tbh I'm not so
sure about its purpose. If the node/process is overloaded because of
the running job, well then resource profiles & fine-grained resource
management are supposed to address that. If the overloading is externally
induced then BLOCK only makes sense if the node is overloaded to a
degree where the existing workload is fine (otherwise
BLOCK_WITH_QUARANTINE would be a better choice I guess), which seems
rather unlikely.
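
To make the two actions concrete, here is a toy sketch of how they differ as characterised in this thread: BLOCK keeps existing work running but stops new allocations, while BLOCK_WITH_QUARANTINE additionally behaves much like shutting the TM down. `ToyCluster` and its methods are hypothetical and do not represent the FLIP's actual interfaces.

```python
from enum import Enum


class BlockAction(Enum):
    BLOCK = "BLOCK"
    BLOCK_WITH_QUARANTINE = "BLOCK_WITH_QUARANTINE"


class ToyCluster:
    def __init__(self, running_tasks):
        # node name -> list of task names currently running there
        self.running_tasks = {n: list(t) for n, t in running_tasks.items()}
        self.blocked = {}

    def block(self, node, action):
        self.blocked[node] = action
        if action is BlockAction.BLOCK_WITH_QUARANTINE:
            # Modelled like a TM shutdown: work on the node is lost.
            self.running_tasks[node] = []

    def can_allocate_on(self, node):
        # Both actions stop new slots from being handed out on the node.
        return node not in self.blocked


cluster = ToyCluster({"node-a": ["t1", "t2"], "node-b": ["t3"]})
cluster.block("node-a", BlockAction.BLOCK)
cluster.block("node-b", BlockAction.BLOCK_WITH_QUARANTINE)
print(cluster.running_tasks)
```

In this model the only difference between the two actions is what happens to already-running work, which is the part of BLOCK whose purpose is questioned above.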

I'm against this change because I don't believe it will be useful for
the general user-base, and because it can't be implemented without
pushing some complexity into core components.

On 28/05/2022 06:48, Zhu Zhu wrote:
> Hi Chesnay,
> Would you share your thoughts in the discussion thread if there are
> still concerns?
>
> Thanks,
> Zhu


Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.
Hi Chesnay,
Would you share your thoughts in the discussion thread if there are
still concerns?

Thanks,
Zhu

Chesnay Schepler <ch...@apache.org> wrote on Fri, May 27, 2022 at 14:54:

>
> -1 to put a lid on things for now, because I'm not quite done yet with
> the discussion.

Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.
-1 to put a lid on things for now, because I'm not quite done yet with 
the discussion.

On 27/05/2022 05:25, Yangze Guo wrote:
> +1 (binding)
>
> Best,
> Yangze Guo


Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Yangze Guo <ka...@gmail.com>.
+1 (binding)

Best,
Yangze Guo

On Thu, May 26, 2022 at 3:54 PM Yun Gao <yu...@aliyun.com.invalid> wrote:
>
> Thanks Lijie and Zhu for driving the FLIP!
>
> The blocklist functionality helps reduce the complexity in maintenance
> and the current design looks good to me, thus +1 from my side (binding).
>
>
> Best,
> Yun

Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Yun Gao <yu...@aliyun.com.INVALID>.
Thanks Lijie and Zhu for driving the FLIP!

The blocklist functionality helps reduce the complexity in maintenance
and the current design looks good to me, thus +1 from my side (binding).


Best,
Yun 




------------------------------------------------------------------
From:Xintong Song <to...@gmail.com>
Send Time:2022 May 26 (Thu.) 12:51
To:dev <de...@flink.apache.org>
Subject:Re: [VOTE] FLIP-224: Blocklist Mechanism

Thanks for driving this effort, Lijie.

I think a nice addition would be to make this feature accessible directly
from webui. However, there's no reason to block this FLIP on it.

So +1 (binding) from my side.

Best,

Xintong





Re: [VOTE] FLIP-224: Blocklist Mechanism

Posted by Xintong Song <to...@gmail.com>.
Thanks for driving this effort, Lijie.

I think a nice addition would be to make this feature accessible directly
from webui. However, there's no reason to block this FLIP on it.

So +1 (binding) from my side.

Best,

Xintong



On Fri, May 20, 2022 at 12:57 PM Lijie Wang <wa...@gmail.com>
wrote:

> Hi everyone,
>
> Thanks for the feedback on FLIP-224: Blocklist Mechanism [1] in the
> discussion thread [2].
>
> I'd like to start a vote for it. The vote will last for at least 72 hours
> unless there is an objection or insufficient votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blocklist+Mechanism
> [2] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h
>
> Best,
> Lijie
>