You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Martijn Visser <ma...@apache.org> on 2022/05/02 06:49:18 UTC

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Hi everyone,

Thanks for creating this FLIP. I can understand the problem and I see value
in the automatic detection and blocklisting. I do have some concerns with
the ability to manually specify to be blocked resources. I have two
concerns;

* Most organizations explicitly have a separation of concerns, meaning that
there's a group who's responsible for managing a cluster and there's a user
group who uses that cluster. With the introduction of this mechanism, the
latter group now can influence the responsibility of the first group. So it
can be possible that someone from the user group blocks something, which
causes an outage (which could result in paging mechanism triggering etc)
which impacts the first group.
* How big is the group of people who can go through the process of manually
identifying a node that isn't behaving as it should be? I do think this
group is relatively limited. Does it then make sense to introduce such a
feature, which would only be used by a really small user group of Flink? We
still have to maintain, test and support such a feature.

I'm +1 for the autodetection features, but I'm leaning towards not exposing
this to the user group but having this available strictly for cluster
operators. They could then also set up their paging/metrics/logging system
to take this into account.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser


On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com> wrote:

> Thanks for driving this, Zhu and Lijie.
>
> +1 for the overall proposal. Just share some cents here:
>
> - Why do we need to expose
> cluster.resource-blacklist.item.timeout-check-interval to the user?
> I think the semantics of `cluster.resource-blacklist.item.timeout` is
> sufficient for the user. How to guarantee the timeout mechanism is
> Flink's internal implementation. I think it will be very confusing and
> we do not need to expose it to users.
>
> - ResourceManager can notify the exception of a task manager to
> `BlacklistHandler` as well.
> For example, the slot allocation might fail in case the target task
> manager is busy or has a network jitter. I don't mean we need to cover
> this case in this version, but we can also open a `notifyException` in
> `ResourceManagerBlacklistHandler`.
>
> - Before we sync the blocklist to ResourceManager, will the slot of a
> blocked task manager continues to be released and allocated?
>
> Best,
> Yangze Guo
>
> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wa...@gmail.com>
> wrote:
> >
> > Hi Konstantin,
> >
> > Thanks for your feedback. I will response your 4 remarks:
> >
> >
> > 1) Thanks for reminding me of the controversy. I think “BlockList” is
> good
> > enough, and I will change it in FLIP.
> >
> >
> > 2) Your suggestion for the REST API is a good idea. Based on the above, I
> > would change REST API as following:
> >
> > POST/GET <host>/blocklist/nodes
> >
> > POST/GET <host>/blocklist/taskmanagers
> >
> > DELETE <host>/blocklist/node/<identifier>
> >
> > DELETE <host>/blocklist/taskmanager/<identifier>
> >
> >
> > 3) If a node is blocking/blocklisted, it means that all task managers on
> > this node are blocklisted. All slots on these TMs are not available. This
> > is actually a bit like TM losts, but these TMs are not really lost, they
> > are in an unavailable status, and they are still registered in this flink
> > cluster. They will be available again once the corresponding blocklist
> item
> > is removed. This behavior is the same in active/non-active clusters.
> > However in the active clusters, these TMs may be released due to idle
> > timeouts.
> >
> >
> > 4) For the item timeout, I prefer to keep it. The reasons are as
> following:
> >
> > a) The timeout will not affect users adding or removing items via REST
> API,
> > and users can disable it by configuring it to Long.MAX_VALUE .
> >
> > b) Some node problems can recover after a period of time (such as machine
> > hotspots), in which case users may prefer that Flink can do this
> > automatically instead of requiring the user to do it manually.
> >
> >
> > Best,
> >
> > Lijie
> >
> > Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> >
> > > Hi Lijie,
> > >
> > > I think, this makes sense and +1 to only support manually blocking
> > > taskmanagers and nodes. Maybe the different strategies can also be
> > > maintained outside of Apache Flink.
> > >
> > > A few remarks:
> > >
> > > 1) Can we use another term than "bla.cklist" due to the controversy
> around
> > > the term? [1] There was also a Jira Ticket about this topic a while
> back
> > > and there was generally a consensus to avoid the term blacklist &
> whitelist
> > > [2]? We could use "blocklist" "denylist" or "quarantined"
> > > 2) For the REST API, I'd prefer a slightly different design as verbs
> like
> > > add/remove often considered an anti-pattern for REST APIs. POST on a
> list
> > > item is generally the standard to add items. DELETE on the individual
> > > resource is standard to remove an item.
> > >
> > > POST <host>/quarantine/items
> > > DELETE <host>/quarantine/items/<itemidentifier>
> > >
> > > We could also consider to separate taskmanagers and nodes in the REST
> API
> > > (and internal data structures). Any opinion on this?
> > >
> > > POST/GET <host>/quarantine/nodes
> > > POST/GET <host>/quarantine/taskmanager
> > > DELETE <host>/quarantine/nodes/<identifier>
> > > DELETE <host>/quarantine/taskmanager/<identifier>
> > >
> > > 3) How would blocking nodes behave with non-active resource managers,
> i.e.
> > > standalone or reactive mode?
> > >
> > > 4) To keep the implementation even more minimal, do we need the timeout
> > > behavior? If items are added/removed manually we could delegate this
> to the
> > > user easily. In my opinion the timeout behavior would better fit into
> > > specific strategies at a later point.
> > >
> > > Looking forward to your thoughts.
> > >
> > > Cheers and thank you,
> > >
> > > Konstantin
> > >
> > > [1]
> > >
> > >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > [2] https://issues.apache.org/jira/browse/FLINK-18209
> > >
> > > Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> > > wangdachui9501@gmail.com>:
> > >
> > > > Hi all,
> > > >
> > > > Flink job failures may happen due to cluster node issues
> (insufficient
> > > disk
> > > > space, bad hardware, network abnormalities). Flink will take care of
> the
> > > > failures and redeploy the tasks. However, due to data locality and
> > > limited
> > > > resources, the new tasks are very likely to be redeployed to the same
> > > > nodes, which will result in continuous task abnormalities and affect
> job
> > > > progress.
> > > >
> > > > Currently, Flink users need to manually identify the problematic
> node and
> > > > take it offline to solve this problem. But this approach has
> following
> > > > disadvantages:
> > > >
> > > > 1. Taking a node offline can be a heavy process. Users may need to
> > > contact
> > > > cluster administors to do this. The operation can even be dangerous
> and
> > > not
> > > > allowed during some important business events.
> > > >
> > > > 2. Identifying and solving this kind of problems manually would be
> slow
> > > and
> > > > a waste of human resources.
> > > >
> > > > To solve this problem, Zhu Zhu and I propose to introduce a blacklist
> > > > mechanism for Flink to filter out problematic resources.
> > > >
> > > >
> > > > You can find more details in FLIP-224[1]. Looking forward to your
> > > feedback.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > >
> > > >
> > > > Best,
> > > >
> > > > Lijie
> > > >
> > >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Thanks Chesnay & Zhu.

I will start a new vote thread soon.

Best,
Lijie

Chesnay Schepler <ch...@apache.org> 于2022年6月15日周三 15:49写道：

> To expand a bit for transparency, Zhu Zhu and I had long discussion
> (literally spanning days) about this FLIP and it's relation to
> speculative execution.
>
> The gist of it is that speculative execution doesn't strictly need the
> block list; just _some_ mechanism to select/request slots from other nodes.
> It is however at this time the easiest way to implement it because of
> the technical debt we have in the scheduler components (e.g., adaptive
> scheduler not being used for all streaming jobs or default scheduler not
> being integrated fully into declarative resource management).
> Because of that I was worried that we might expand the API now to unlock
> speculative execution, but then end up not actually requiring it down
> the line (but still being stuck with it).
> This should give us quite a bit more freedom as to how we implement it.
> Which is particularly important because we already identified some
> limitations in the current design (e.g., blocks not being scoped to
> jobs, resourceID-based blocks intefeing with hard-coded resource IDs,
> potentially blocking more slots than necessary).
>
> Shall we start a new vote thread, since the design changed quite a bit?
>
> On 15/06/2022 09:33, Zhu Zhu wrote:
> > Hi everyone,
> > Thank you all for the feedback!
> >
> > We receive concerns that the blocklist feature is not strongly
> > required except for the needs of speculative execution. So as the
> > first step, we will limit the scope of FLIP-224 to only support
> > speculative execution, and therefore, not add public interfaces for
> > users to manipulate with the blocklist directly.
> >
> > If later we receive strong requirements for blocklist from users, we
> > will have another FLIP to open it to users with well designed public
> > interfaces and web UI.
> >
> > Thanks,
> > Zhu
> >
> > Zhu Zhu <re...@gmail.com> 于2022年6月10日周五 18:45写道：
> >> 1) With the declarative slot allocation protocol, it's not easy to
> >> keep the slow slots which may have satisfy the slot request already,
> >> but ask for more slots from the resource manager.
> >> And we are also concerned to use detected slow slots, which may add
> >> load to the slow nodes and further slow down the tasks on the slow
> >> nodes. Because we found slow nodes are mostly caused by heavy loads.
> >>
> >> 2) I see your point. The blocker here is that batch jobs currently
> >> uses DefaultScheduler. The scheduler does not see slots directly and
> >> does not know which of them are from slow nodes, so it's hard for it
> >> to rescale the vertex according to this information. Besides that, it
> >> cannot help with the stage that first observed and affected the slow
> >> nodes.
> >>
> >> Thanks,
> >> Zhu
> >>
> >> Chesnay Schepler <ch...@apache.org> 于2022年6月10日周五 17:04写道：
> >>> 1)
> >>> It's true that if we handle this entirely in the scheduler we may get
> a bunch of slow slots from the RM.
> >>> My point is that this isn't necessarily a problem; it depends on how
> you handle those slots.
> >>>
> >>> We anyway need to account for the possibility that we're not getting
> any new fast slots from the RM.
> >>> With that in mind, I'd rather not categorically throw away the slow
> slots we got, but try to make use of them somehow.
> >>>
> >>> 2)
> >>> The idea was to rescale the job when the current stage finishes.
> >>> The speculative execution handles slow nodes being detected in the
> current stage,
> >>> next stage we use _some_ strategy to handle slow nodes (be it ignoring
> those, rescaling the job, ...).
> >>>
> >>> 3)
> >>> It's a fair observation that once you push this to the RM you end up
> with a de-facto blocklist :)
> >>>
> >>> On 08/06/2022 17:11, Zhu Zhu wrote:
> >>>
> >>> Thanks Chesnay for the feedback in the vote thread.
> >>> (https://lists.apache.org/thread/opc7jg3rpxnwotkb0fcn4wnm02m4397o)
> >>>
> >>> I'd like to continue the discussion in this thread so that the
> >>> discussions can be better tracked.
> >>>
> >>> Regarding your questions, here are my thoughts:
> >>> 1. The current locality mechanism does not work well to avoid
> >>> deploying tasks to slow nodes because it cannot proactively reject or
> >>> release slots from the slow nodes. And it cannot help at the resource
> >>> manager side to avoid allocating free slots or launching new
> >>> TaskManagers on slow nodes.
> >>> 2. Dynamically downscaling or upscaling a batch job is usually
> >>> unacceptable because it means to re-run the whole stage.
> >>> 3. Extending the requirement declaration to have a notion of
> >>> "undesirable nodes" is an option. And it is actually how we started.
> >>> Considering implementation details, we found we need that
> >>> - a tracker to collect all the undesirable(detected slow nodes)
> >>> - to filter out slots from undesirable nodes when allocating slots
> >>> from the SlotPool
> >>> - ask the ResourceManager for slots that are not on the undesirable
> >>> nodes. The ResourceManager may further need to ask for new
> >>> TaskManagers that are not on the undesirable nodes.
> >>> Then, with all these functionality, we found that we almost have a
> >>> blocklist mechanism. As blocklist mechanism is a common concept and is
> >>> possible to benefit users, we took this chance to propose the
> >>> blocklist mechanism.
> >>> 4. A cluster-wide shared blocklist is not a must for speculative
> >>> execution at the moment. It is mainly part of a standalone blocklist
> >>> feature to host user specified block items. To avoid sharing job
> >>> specific blocked items between jobs, one way is to add a nullable
> >>> JobID tag for the blocked item.
> >>>
> >>> Thanks,
> >>> Zhu
> >>>
> >>> Zhu Zhu <re...@gmail.com> 于2022年6月7日周二 10:58写道：
> >>>
> >>> Hi Chesnay,
> >>>
> >>> Would you please take a look at the FLIP and discussion to see if all
> >>> your concerns have been addressed?
> >>>
> >>> Thanks,
> >>> Zhu
> >>>
> >>> Zhu Zhu <re...@gmail.com> 于2022年5月28日周六 13:26写道：
> >>>
> >>> Regarding the concern of the SlotManager, my two cents:
> >>> 1. it is necessary for the SlotManager to host blocked slots, in 2
> cases:
> >>>    a. In standalone mode, a taskmanager may be temporarily added to
> >>> the blocklist. We do not want the TM to get disconnected and shut down.
> >>> So we need to keep its connection to RM and keep hosting its slots.
> >>>    b. When we want to avoid allocating slots to a slow nodes but do not
> >>> want to kill current running tasks on the nodes (MARK_BLOCKED mode).
> >>>
> >>> There is possible a way to keep the connection of a blocked task
> manager
> >>> while hide its slots from SlotManager, but I feel it may be even much
> more
> >>> complicated.
> >>>
> >>> 2. It will not complicate the SlotManager too much. The SlotManager
> will
> >>> be offered a BlockedTaskManagerChecker when created, and just need
> >>> to use it to filter out blocked slots on slot request. Therefore I
> think the
> >>> complication is acceptable.
> >>>
> >>> Thanks,
> >>> Zhu
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
> >>>
> >>> Hi everyone,
> >>>
> >>> I've updated the FLIP according to Chesnay's feedback, changes as
> follows:
> >>> 1. Change the GET result to a map.
> >>> 2. Only left *endTimestamp* in ADD operation, and change the rest (from
> >>> POST) to PUT
> >>> 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
> >>> encapsulate blocklist related functions.
> >>> 4. Remove *mainThread* from BlocklistTracker, instead provide a
> >>> *removeTimeoutItems* method to be called by outside components。
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
> >>>
> >>> Hi Chesnay,
> >>>
> >>> Thanks for feedback.
> >>>
> >>> 1. Regarding the TM/Node id. Do you mean special characters may appear
> in
> >>> the rest URL?  Actually, I don't think so. The task manager id in REST
> API
> >>> should be the *ResourceID* of taskmanager in Flink, there should be no
> >>> special characters, and some existing REST APIs are already using it,
> e.g.
> >>> GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The
> node
> >>> id should be an IP of a machine or node name in Yarn/Kubernetes, I
> think it
> >>> should also have no special characters.
> >>> 2. Regarding the GET query responses. I agree with you, it makes sense
> to
> >>> change the GET result to a map.
> >>>
> >>> 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> >>> cover everything, and the endTimestamp is a unix timestamp, there
> should be
> >>> no timezone issues. But I think PUT and DELETE are enough, no PATCH.
> The
> >>> add rest api is add or update, PUT can cover this semantics.
> >>>
> >>> 4. Regarding the slot pool/manager. I don't think the current slotpool
> >>> and slotmanager are able to support the MARK_BLOCKED(slots that are
> >>> already allocated will not be affected) action. The reasons are as
> >>> follows：
> >>>
> >>> a) for slot pool, with the MARK_BLOCKED action, when a slot state
> changes
> >>> from reserved(task assigned) to free(no task assigned), it is
> necessary to
> >>> check whether the slot should be released immediately(it should be
> released
> >>> immediately if the task manager is blocked, otherwise it may be
> allocated
> >>> to other tasks). I think it cannot be supported without being aware of
> >>> the blocklist information. Compared to the solution in FLIP, a more
> >>> appropriate/prefered way may be: Introduce a new slot pool
> >>> implementation for blocklist(may be named BlocklistSlotPool, it
> >>> extends/wrapps the original slot pool), and implement the parts that
> need
> >>> to be aware of the blocklist in this newly introduced slot pool, and
> the
> >>> original slot pool basically does not need to change.
> >>>
> >>> b) for slot manager, with the MARK_BLOCKED action, there may be free
> but
> >>> blocked slots in slot manager (the corresponding TMs cannot be
> >>> released/unregistered because there are still running tasks on them).
> >>> Therefore, we need to filter out the blocked slots when trying to
> fulfill
> >>> the slot requirements. Therefore it also needs to know the blocklist
> information.
> >>> A better way may be to abstract a resource allocation strategy, and
> make
> >>> the blocklist as a special implementation, then pass the resource
> >>> allocation strategy in when constructing the slot manager.
> Unfortunately,
> >>> the data structures in the two existing slot manager
> >>> implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*)
> are
> >>> quite different, it is not easy to abstract a common resource
> allocation
> >>> strategy, so we prefer to keep the current way(i.e. pass the blocklist
> >>> information directly into slot manager).
> >>>
> >>>
> >>> 5. Regarding the BlocklistTracker. I also agree with you, the
> BlocklistTracker
> >>> does not need to be aware of the executor, and the timeout actions can
> be
> >>> done outside.
> >>>
> >>> Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
> >>>
> >>> I have a number of concerns:
> >>>
> >>> Is the id used for deleting an item the same sent in the initial
> request
> >>> (and not one returned by Flink)?
> >>> I'm very concerned that the tm/node id can contain special characters.
> >>>
> >>> The GET query should return a map, not a list of items. This makes it
> >>> easier to work with.
> >>>
> >>> The duality of endTimestamp and duration is also concerning.
> >>> If we conclude that endTimestamps can in fact work (and aren't utterly
> >>> unusable due to timezones),
> >>> then this should be able to cover everything and rid us of some
> >>> complexity w.r.t. POSTs to the same ID.
> >>> Additions would be a PUT, changes a PATCH, deletes a DELETE.
> >>>
> >>>
> >>> I also dislike how we're pushing more functionality into the
> >>> slotpool/-manager.
> >>> These components are complex enough as-is, and instead I'd propose a
> >>> separate component that interacts with the SlotPool/-Manager instead,
> >>> for example by removing the slots from that TM.
> >>> The reason being that from the slot-pool perspective it is irrelevant
> >>> whether a slot is gone because the TM was lost, or because it was
> blocked.
> >>>
> >>>
> >>> The BlocklistTracker shouldn't be modeled as component that is aware of
> >>> the concept of main threads.
> >>> It really has no business knowing that; all it needs is an executor for
> >>> handling timeouts/periodic actions,
> >>> and a way to interact with the JM/RM (which internally can handle the
> >>> scheduling into the main thread).
> >>>
> >>>
> >>> On 20/05/2022 07:20, Lijie Wang wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> I have started a vote for this FLIP [1]. Please cast your vote there or
> >>>
> >>> ask
> >>>
> >>> additional questions here. [1]
> >>> https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> >>>
> >>> Best, Lijie
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
> >>>
> >>> Hi Konstantin,
> >>>
> >>> We found that Flink REST URL does not support the format ":merge" ,
> >>>
> >>> which
> >>>
> >>> will be recognized as a parameter in the URL(due to start with a
> >>>
> >>> colon).
> >>>
> >>> We will keep the previous way, i.e.
> >>>
> >>> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> >>> and the "id" and "merge" flag are put into the request body.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
> >>>
> >>> Hi Weihua,
> >>> thanks for feedback.
> >>>
> >>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> >>> step towards auto-detection.
> >>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> >>> exception of insufficient resources.
> >>> 3. No. This FLIP won't change the WebUI. The blocklist information
> >>>
> >>> can be
> >>>
> >>> obtained through REST API and metrics.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
> >>>
> >>> Hi,
> >>> Thanks for creating this FLIP.
> >>> We have implemented an automatic blocklist detection mechanism
> >>> internally, which is indeed very effective for handling node
> >>>
> >>> failures.
> >>>
> >>> Due to the large number of nodes, although SREs already support
> >>> automatic offline failure nodes, the detection is not 100% accurate
> >>>
> >>> and
> >>>
> >>> there is some delay.
> >>> So the blocklist mechanism can make flink job recover from failure
> >>>
> >>> much
> >>>
> >>> faster.
> >>>
> >>> Here are some of my thoughts:
> >>> 1. In this FLIP, it needs users to locate machine failure manually,
> >>> there is a certain cost of use
> >>> 2. What happens if too many nodes are blocked, resulting in
> >>>
> >>> insufficient
> >>>
> >>> resources? Will there be a special Exception for the user?
> >>> 3. Will we display the blocklist information in the WebUI? The
> >>>
> >>> blocklist
> >>>
> >>> is for cluster level, and if multiple users share a cluster, some
> >>>
> >>> users may
> >>>
> >>> be a little confused when resources are not enough, or when
> >>>
> >>> resources are
> >>>
> >>> applied for more.
> >>>
> >>> Also, Looking forward to the next FLIP on auto-detection.
> >>>
> >>> Best,
> >>> Weihua
> >>>
> >>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> >>>
> >>> Hi Konstantin,
> >>>
> >>> Maybe change it to the following:
> >>>
> >>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >>> Merge is not allowed. If the {id} already exists, return error.
> >>>
> >>> Otherwise,
> >>>
> >>> create a new item.
> >>>
> >>> 2. POST: http://
> >>>
> >>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >>>
> >>> Merge is allowed. If the {id} already exists, merge. Otherwise,
> >>>
> >>> create
> >>>
> >>> a
> >>>
> >>> new item.
> >>>
> >>> WDYT?
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> >>>
> >>> Hi Lijie,
> >>>
> >>> hm, maybe the following is more appropriate in that case
> >>>
> >>> POST: http://
> >>>
> >>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >>>
> >>> Best,
> >>>
> >>> Konstantin
> >>>
> >>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> >>> wangdachui9501@gmail.com>:
> >>>
> >>> Hi Konstantin,
> >>> thanks for your feedback.
> >>>
> >>>   From what I understand, PUT should be idempotent. However, we
> >>>
> >>> have a
> >>>
> >>> *timeout* field in the request. This means that initiating the
> >>>
> >>> same
> >>>
> >>> request
> >>>
> >>> at two different times will lead to different resource status
> >>>
> >>> (timestamps
> >>>
> >>> of the items to be removed will be different).
> >>>
> >>> Should we use PUT in this case? WDYT?
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> >>>
> >>> Hi Lijie,
> >>>
> >>> wouldn't the REST API-idiomatic way for an update/replace be a
> >>>
> >>> PUT
> >>>
> >>> on
> >>>
> >>> the
> >>>
> >>> resource?
> >>>
> >>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >>>
> >>> Best,
> >>>
> >>> Konstantin
> >>>
> >>>
> >>>
> >>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> >>> wangdachui9501@gmail.com>:
> >>>
> >>> Hi everyone,
> >>>
> >>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
> >>>
> >>> made
> >>>
> >>> the
> >>>
> >>> following changes on REST API:
> >>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
> >>>
> >>> choose
> >>>
> >>> one.
> >>>
> >>> If
> >>>
> >>> both are specified, will return error.
> >>> 2.  If the specified item is already there, the *ADD* operation
> >>>
> >>> has
> >>>
> >>> two
> >>>
> >>> behaviors:  *return error*(default value) or *merge/update*,
> >>>
> >>> and we
> >>>
> >>> add a
> >>>
> >>> flag to the request body to control it. You can find more
> >>>
> >>> details
> >>>
> >>> "Public
> >>>
> >>> Interface" section.
> >>>
> >>> If there is no more feedback, we will start the vote thread next
> >>>
> >>> week.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> >>>
> >>> Hi Becket Qin,
> >>>
> >>> Thanks for your suggestions.  I have moved the description of
> >>> configurations, metrics and REST API into "Public Interface"
> >>>
> >>> section,
> >>>
> >>> and
> >>>
> >>> made a few updates according to your suggestion.  And in this
> >>>
> >>> FLIP,
> >>>
> >>> there
> >>>
> >>> no public java Interfaces or pluggables that users need to
> >>>
> >>> implement
> >>>
> >>> by
> >>>
> >>> themselves.
> >>>
> >>> Answers for you questions:
> >>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> >>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> >>>
> >>> items
> >>>
> >>> can
> >>>
> >>> only be added through the REST API, so these 2 action are
> >>>
> >>> mentioned
> >>>
> >>> in
> >>>
> >>> the
> >>>
> >>> REST API part (The REST API part has beed moved to public
> >>>
> >>> interface
> >>>
> >>> now).
> >>>
> >>> 2. I agree with you. I have changed the "Cause" field to
> >>>
> >>> String,
> >>>
> >>> and
> >>>
> >>> allow
> >>>
> >>> users to specify it via REST API.
> >>> 3. Yes, it is useful to allow different timeouts. As mentioned
> >>>
> >>> above,
> >>>
> >>> we
> >>>
> >>> will introduce 2 fields : *timeout* and *endTimestamp* into the
> >>>
> >>> ADD
> >>>
> >>> REST
> >>>
> >>> API to specify when to remove the blocked item. These 2 fields
> >>>
> >>> are
> >>>
> >>> optional, if neither is specified, it means that the blocked
> >>>
> >>> item
> >>>
> >>> is
> >>>
> >>> permanent and will not be removed. If both are specified, the
> >>>
> >>> minimum
> >>>
> >>> of
> >>>
> >>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
> >>>
> >>> the
> >>>
> >>> time
> >>>
> >>> to
> >>>
> >>> remove the blocked item. To keep the configurations more
> >>>
> >>> minimal,
> >>>
> >>> we
> >>>
> >>> have
> >>>
> >>> removed the *cluster.resource-blocklist.item.timeout*
> >>>
> >>> configuration
> >>>
> >>> option.
> >>> 4. Yes, the block item will be overridden if the specified item
> >>>
> >>> already
> >>>
> >>> exists. The ADD operation is *ADD or UPDATE*.
> >>> 5. Yes. On JM/RM side, all the blocklist information is
> >>>
> >>> maintained
> >>>
> >>> in
> >>>
> >>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> >>>
> >>> abstracted
> >>>
> >>> to other interfaces) will be propagated to different
> >>>
> >>> components.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> >>>
> >>> Hi Lijie,
> >>>
> >>> Thanks for updating the FLIP. It looks like the public
> >>>
> >>> interface
> >>>
> >>> section
> >>>
> >>> did not fully reflect all the user sensible behavior and API.
> >>>
> >>> Can
> >>>
> >>> you
> >>>
> >>> put
> >>>
> >>> everything that users may be aware of there? That would
> >>>
> >>> include
> >>>
> >>> the
> >>>
> >>> REST
> >>>
> >>> API, metrics, configurations, public java Interfaces or
> >>>
> >>> pluggables
> >>>
> >>> that
> >>>
> >>> users may see or implement by themselves, as well as a brief
> >>>
> >>> summary
> >>>
> >>> of
> >>>
> >>> the
> >>> behavior of the public API.
> >>>
> >>> Besides that, I have a few questions:
> >>>
> >>> 1. According to the conversation in the discussion thread, it
> >>>
> >>> looks
> >>>
> >>> like
> >>>
> >>> the BlockAction will have "MARK_BLOCKLISTED" and
> >>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
> >>>
> >>> so,
> >>>
> >>> can
> >>>
> >>> you
> >>>
> >>> add
> >>> that to the public interface as well?
> >>>
> >>> 2. At this point, the "Cause" field in the BlockingItem is a
> >>>
> >>> Throwable
> >>>
> >>> and
> >>>
> >>> is not reflected in the REST API. Should that be included in
> >>>
> >>> the
> >>>
> >>> query
> >>>
> >>> response? And should we change that field to be a String so
> >>>
> >>> users
> >>>
> >>> may
> >>>
> >>> specify the cause via the REST API when they block some nodes
> >>>
> >>> /
> >>>
> >>> TMs?
> >>>
> >>> 3. Would it be useful to allow users to have different
> >>>
> >>> timeouts
> >>>
> >>> for
> >>>
> >>> different blocked items? So while there is a default timeout,
> >>>
> >>> users
> >>>
> >>> can
> >>>
> >>> also override it via the REST API when they block an entity.
> >>>
> >>> 4. Regarding the ADD operation, if the specified item is
> >>>
> >>> already
> >>>
> >>> there,
> >>>
> >>> will the block item be overridden? For example, if the user
> >>>
> >>> wants
> >>>
> >>> to
> >>>
> >>> extend
> >>> the timeout of a blocked item, can they just  issue an ADD
> >>>
> >>> command
> >>>
> >>> again?
> >>>
> >>> 5. I am not quite familiar with the details of this, but is
> >>>
> >>> there
> >>>
> >>> a
> >>>
> >>> source
> >>>
> >>> of truth for the blocked list? I think it might be good to
> >>>
> >>> have a
> >>>
> >>> single
> >>>
> >>> source of truth for the blocked list and just propagate that
> >>>
> >>> list
> >>>
> >>> to
> >>>
> >>> different components to take the action of actually blocking
> >>>
> >>> the
> >>>
> >>> resource.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> >>>
> >>> wangdachui9501@gmail.com
> >>>
> >>> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> Based on the discussion in the mailing list, I updated the
> >>>
> >>> FLIP
> >>>
> >>> doc,
> >>>
> >>> the
> >>>
> >>> changes include:
> >>> 1. Changed the description of the motivation section to more
> >>>
> >>> clearly
> >>>
> >>> describe the problem this FLIP is trying to solve.
> >>> 2. Only  *Manually* is supported.
> >>> 3. Adopted some suggestions, such as *endTimestamp*.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>>
> >>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >>>
> >>> Hi Lijie!
> >>>
> >>>
> >>>
> >>>
> >>> *a) “Probably storing inside Zookeeper/Configmap might be
> >>>
> >>> helpfulhere.”
> >>>
> >>> Can you explain it in detail? I don't fully understand that.
> >>>
> >>> In
> >>>
> >>> myopinion,
> >>>
> >>> non-active and active are the same, and no special treatment
> >>>
> >>> isrequired.*
> >>>
> >>> Sorry this was a misunderstanding from my side. I thought we
> >>>
> >>> were
> >>>
> >>> talking
> >>>
> >>> about the HA mode (but not about Active and Standalone
> >>>
> >>> ResourceManager).
> >>>
> >>> And the original question was - how to handle the
> >>>
> >>> blacklisted
> >>>
> >>> nodes
> >>>
> >>> list
> >>>
> >>> at
> >>>
> >>> the moment of leader change? Should we simply forget about
> >>>
> >>> them
> >>>
> >>> or
> >>>
> >>> try to
> >>>
> >>> pre-save that list on the remote storage?
> >>>
> >>> On Sat, 7 May 2022 at 10:51, Yang Wang <
> >>>
> >>> danrtsey.wy@gmail.com
> >>>
> >>> wrote:
> >>>
> >>> Thanks Lijie and ZhuZhu for the explanation.
> >>>
> >>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> >>>
> >>> it
> >>>
> >>> is
> >>>
> >>> indeed
> >>>
> >>> some functionalities the external tools(e.g. kubectl taint)
> >>>
> >>> could
> >>>
> >>> not
> >>>
> >>> support.
> >>>
> >>>
> >>> Best,
> >>> Yang
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
> >>>
> >>> 22:18写道：
> >>>
> >>> Thanks for your feedback, Jiangang and Martijn.
> >>>
> >>> @Jiangang
> >>>
> >>>
> >>> For auto-detecting, I wonder how to make the strategy
> >>>
> >>> and
> >>>
> >>> mark a
> >>>
> >>> node
> >>>
> >>> blocked?
> >>>
> >>> In fact, we currently plan to not support auto-detection
> >>>
> >>> in
> >>>
> >>> this
> >>>
> >>> FLIP.
> >>>
> >>> The
> >>>
> >>> part about auto-detection may be continued in a separate
> >>>
> >>> FLIP
> >>>
> >>> in
> >>>
> >>> the
> >>>
> >>> future. Some guys have the same concerns as you, and the
> >>>
> >>> correctness
> >>>
> >>> and
> >>>
> >>> necessity of auto-detection may require further discussion
> >>>
> >>> in
> >>>
> >>> the
> >>>
> >>> future.
> >>>
> >>> In session mode, multi jobs can fail on the same bad
> >>>
> >>> node
> >>>
> >>> and
> >>>
> >>> the
> >>>
> >>> node
> >>>
> >>> should be marked blocked.
> >>> By design, the blocklist information will be shared among
> >>>
> >>> all
> >>>
> >>> jobs
> >>>
> >>> in a
> >>>
> >>> cluster/session. The JM will sync blocklist information
> >>>
> >>> with
> >>>
> >>> RM.
> >>>
> >>> @Martijn
> >>>
> >>> I agree with Yang Wang on this.
> >>>
> >>> As Zhu Zhu and I mentioned above, we think the
> >>>
> >>> MARK_BLOCKLISTED(Just
> >>>
> >>> limits
> >>>
> >>> the load of the node and does not  kill all the processes
> >>>
> >>> on
> >>>
> >>> it)
> >>>
> >>> is
> >>>
> >>> also
> >>>
> >>> important, and we think that external systems (*yarn
> >>>
> >>> rmadmin
> >>>
> >>> or
> >>>
> >>> kubectl
> >>>
> >>> taint*) cannot support it. So we think it makes sense even
> >>>
> >>> only
> >>>
> >>> *manually*.
> >>>
> >>> I also agree with Chesnay that magical mechanisms are
> >>>
> >>> indeed
> >>>
> >>> super
> >>>
> >>> hard
> >>>
> >>> to get right.
> >>> Yes, as you see, Jiangang(and a few others) have the same
> >>>
> >>> concern.
> >>>
> >>> However, we currently plan to not support auto-detection
> >>>
> >>> in
> >>>
> >>> this
> >>>
> >>> FLIP,
> >>>
> >>> and
> >>>
> >>> only *manually*. In addition, I'd like to say that the
> >>>
> >>> FLIP
> >>>
> >>> provides
> >>>
> >>> a
> >>>
> >>> mechanism to support MARK_BLOCKLISTED and
> >>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> >>> the auto-detection may be done by external systems.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> >>>
> >>> 19:04写道：
> >>>
> >>> If we only support to block nodes manually, then I
> >>>
> >>> could
> >>>
> >>> not
> >>>
> >>> see
> >>>
> >>> the obvious advantages compared with current SRE's
> >>>
> >>> approach(via
> >>>
> >>> *yarn
> >>>
> >>> rmadmin or kubectl taint*).
> >>>
> >>> I agree with Yang Wang on this.
> >>>
> >>> To me this sounds yet again like one of those magical
> >>>
> >>> mechanisms
> >>>
> >>> that
> >>>
> >>> will rarely work just right.
> >>>
> >>> I also agree with Chesnay that magical mechanisms are
> >>>
> >>> indeed
> >>>
> >>> super
> >>>
> >>> hard
> >>>
> >>> to
> >>>
> >>> get right.
> >>>
> >>> Best regards,
> >>>
> >>> Martijn
> >>>
> >>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> >>>
> >>> liujiangangpeng@gmail.com
> >>>
> >>> wrote:
> >>>
> >>> Thanks for the valuable design. The auto-detecting can
> >>>
> >>> decrease
> >>>
> >>> great
> >>>
> >>> work
> >>>
> >>> for us. We have implemented the similar feature in our
> >>>
> >>> inner
> >>>
> >>> flink
> >>>
> >>> version.
> >>> Below is something that I care about:
> >>>
> >>>     1. For auto-detecting, I wonder how to make the
> >>>
> >>> strategy
> >>>
> >>> and
> >>>
> >>> mark a
> >>>
> >>> node
> >>>     blocked? Sometimes the blocked node is hard to be
> >>>
> >>> detected,
> >>>
> >>> for
> >>>
> >>> example,
> >>>     the upper node or the down node will be blocked when
> >>>
> >>> network
> >>>
> >>> unreachable.
> >>>     2. I see that the strategy is made in JobMaster
> >>>
> >>> side.
> >>>
> >>> How
> >>>
> >>> about
> >>>
> >>>     implementing the similar logic in resource manager?
> >>>
> >>> In
> >>>
> >>> session
> >>>
> >>> mode,
> >>>
> >>> multi
> >>>     jobs can fail on the same bad node and the node
> >>>
> >>> should
> >>>
> >>> be
> >>>
> >>> marked
> >>>
> >>> blocked.
> >>>     If the job makes the strategy, the node may be not
> >>>
> >>> marked
> >>>
> >>> blocked
> >>>
> >>> if
> >>>
> >>> the
> >>>     fail times don't exceed the threshold.
> >>>
> >>>
> >>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >>>
> >>> Thank you for all your feedback!
> >>>
> >>> Besides the answers from Lijie, I'd like to share
> >>>
> >>> some
> >>>
> >>> of
> >>>
> >>> my
> >>>
> >>> thoughts:
> >>>
> >>> 1. Whether to enable automatical blocklist
> >>> Generally speaking, it is not a goal of FLIP-224.
> >>> The automatical way should be something built upon
> >>>
> >>> the
> >>>
> >>> blocklist
> >>>
> >>> mechanism and well decoupled. It was designed to be a
> >>>
> >>> configurable
> >>>
> >>> blocklist strategy, but I think we can further
> >>>
> >>> decouple
> >>>
> >>> it
> >>>
> >>> by
> >>>
> >>> introducing a abnormal node detector, as Becket
> >>>
> >>> suggested,
> >>>
> >>> which
> >>>
> >>> just
> >>>
> >>> uses the blocklist mechanism once bad nodes are
> >>>
> >>> detected.
> >>>
> >>> However,
> >>>
> >>> it
> >>>
> >>> should be a separate FLIP with further dev
> >>>
> >>> discussions
> >>>
> >>> and
> >>>
> >>> feedback
> >>>
> >>> from users. I also agree with Becket that different
> >>>
> >>> users
> >>>
> >>> have
> >>>
> >>> different
> >>>
> >>> requirements, and we should listen to them.
> >>>
> >>> 2. Is it enough to just take away abnormal nodes
> >>>
> >>> externally
> >>>
> >>> My answer is no. As Lijie has mentioned, we need a
> >>>
> >>> way
> >>>
> >>> to
> >>>
> >>> avoid
> >>>
> >>> deploying tasks to temporary hot nodes. In this case,
> >>>
> >>> users
> >>>
> >>> may
> >>>
> >>> just
> >>>
> >>> want to limit the load of the node and do not want to
> >>>
> >>> kill
> >>>
> >>> all
> >>>
> >>> the
> >>>
> >>> processes on it. Another case is the speculative
> >>>
> >>> execution[1]
> >>>
> >>> which
> >>>
> >>> may also leverage this feature to avoid starting
> >>>
> >>> mirror
> >>>
> >>> tasks on
> >>>
> >>> slow
> >>>
> >>> nodes.
> >>>
> >>> Thanks,
> >>> Zhu
> >>>
> >>> [1]
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >>>
> >>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> >>>
> >>> 15:56写道：
> >>>
> >>> Hi everyone,
> >>>
> >>>
> >>> Thanks for your feedback.
> >>>
> >>>
> >>> There's one detail that I'd like to re-emphasize
> >>>
> >>> here
> >>>
> >>> because
> >>>
> >>> it
> >>>
> >>> can
> >>>
> >>> affect the value and design of the blocklist
> >>>
> >>> mechanism
> >>>
> >>> (perhaps
> >>>
> >>> I
> >>>
> >>> should
> >>>
> >>> highlight it in the FLIP). We propose two actions in
> >>>
> >>> FLIP:
> >>>
> >>> 1) MARK_BLOCKLISTED: Just mark the task manager or
> >>>
> >>> node
> >>>
> >>> as
> >>>
> >>> blocked.
> >>>
> >>> Future slots should not be allocated from the blocked
> >>>
> >>> task
> >>>
> >>> manager
> >>>
> >>> or
> >>>
> >>> node.
> >>>
> >>> But slots that are already allocated will not be
> >>>
> >>> affected.
> >>>
> >>> A
> >>>
> >>> typical
> >>>
> >>> application scenario is to mitigate machine hotspots.
> >>>
> >>> In
> >>>
> >>> this
> >>>
> >>> case,
> >>>
> >>> we
> >>>
> >>> hope
> >>>
> >>> that subsequent resource allocations will not be on
> >>>
> >>> the
> >>>
> >>> hot
> >>>
> >>> machine,
> >>>
> >>> but
> >>>
> >>> tasks currently running on it should not be affected.
> >>>
> >>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> >>>
> >>> task
> >>>
> >>> manager
> >>>
> >>> or
> >>>
> >>> node
> >>>
> >>> as
> >>>
> >>> blocked, and evacuate all tasks on it. Evacuated
> >>>
> >>> tasks
> >>>
> >>> will
> >>>
> >>> be
> >>>
> >>> restarted on
> >>>
> >>> non-blocked task managers.
> >>>
> >>> For the above 2 actions, the former may more
> >>>
> >>> highlight
> >>>
> >>> the
> >>>
> >>> meaning
> >>>
> >>> of
> >>>
> >>> this FLIP, because the external system cannot do
> >>>
> >>> that.
> >>>
> >>> Regarding *Manually* and *Automatically*, I
> >>>
> >>> basically
> >>>
> >>> agree
> >>>
> >>> with
> >>>
> >>> @Becket
> >>>
> >>> Qin: different users have different answers. Not all
> >>>
> >>> users’
> >>>
> >>> deployment
> >>>
> >>> environments have a special external system that can
> >>>
> >>> perform
> >>>
> >>> the
> >>>
> >>> anomaly
> >>>
> >>> detection. In addition, adding pluggable/optional
> >>>
> >>> auto-detection
> >>>
> >>> doesn't
> >>>
> >>> require much extra work on top of manual
> >>>
> >>> specification.
> >>>
> >>> I will answer your other questions one by one.
> >>>
> >>>
> >>> @Yangze
> >>>
> >>> a) I think you are right, we do not need to expose
> >>>
> >>> the
> >>> `cluster.resource-blocklist.item.timeout-check-interval`
> >>>
> >>> to
> >>>
> >>> users.
> >>>
> >>> b) We can abstract the `notifyException` to a
> >>>
> >>> separate
> >>>
> >>> interface
> >>>
> >>> (maybe
> >>>
> >>> BlocklistExceptionListener), and the
> >>>
> >>> ResourceManagerBlocklistHandler
> >>>
> >>> can
> >>>
> >>> implement it in the future.
> >>>
> >>> @Martijn
> >>>
> >>> a) I also think the manual blocking should be done
> >>>
> >>> by
> >>>
> >>> cluster
> >>>
> >>> operators.
> >>>
> >>> b) I think manual blocking makes sense, because
> >>>
> >>> according
> >>>
> >>> to
> >>>
> >>> my
> >>>
> >>> experience, users are often the first to perceive the
> >>>
> >>> machine
> >>>
> >>> problems
> >>>
> >>> (because of job failover or delay), and they will
> >>>
> >>> contact
> >>>
> >>> cluster
> >>>
> >>> operators
> >>>
> >>> to solve it, or even tell the cluster operators which
> >>>
> >>> machine is
> >>>
> >>> problematic. From this point of view, I think the
> >>>
> >>> people
> >>>
> >>> who
> >>>
> >>> really
> >>>
> >>> need
> >>>
> >>> the manual blocking are the users, and it’s just
> >>>
> >>> performed
> >>>
> >>> by
> >>>
> >>> the
> >>>
> >>> cluster
> >>>
> >>> operator, so I think the manual blocking makes sense.
> >>>
> >>> @Chesnay
> >>>
> >>> We need to touch the logic of JM/SlotPool, because
> >>>
> >>> for
> >>>
> >>> MARK_BLOCKLISTED
> >>>
> >>> , we need to know whether the slot is blocklisted
> >>>
> >>> when
> >>>
> >>> the
> >>>
> >>> task
> >>>
> >>> is
> >>>
> >>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> >>>
> >>> release
> >>>
> >>> the
> >>>
> >>> slot
> >>>
> >>> directly to avoid assigning other tasks (of this job)
> >>>
> >>> on
> >>>
> >>> it.
> >>>
> >>> If
> >>>
> >>> we
> >>>
> >>> only
> >>>
> >>> maintain the blocklist information on the RM, JM
> >>>
> >>> needs
> >>>
> >>> to
> >>>
> >>> retrieve
> >>>
> >>> it
> >>>
> >>> by
> >>>
> >>> RPC. I think the performance overhead of that is
> >>>
> >>> relatively
> >>>
> >>> large,
> >>>
> >>> so
> >>>
> >>> I
> >>>
> >>> think it's worth maintaining the blocklist
> >>>
> >>> information
> >>>
> >>> on
> >>>
> >>> the JM
> >>>
> >>> side
> >>>
> >>> and
> >>>
> >>> syncing them.
> >>>
> >>> @Роман
> >>>
> >>>      a) “Probably storing inside Zookeeper/Configmap
> >>>
> >>> might
> >>>
> >>> be
> >>>
> >>> helpful
> >>>
> >>> here.”  Can you explain it in detail? I don't fully
> >>>
> >>> understand
> >>>
> >>> that.
> >>>
> >>> In
> >>>
> >>> my
> >>>
> >>> opinion, non-active and active are the same, and no
> >>>
> >>> special
> >>>
> >>> treatment
> >>>
> >>> is
> >>>
> >>> required.
> >>>
> >>> b) I agree with you, the `endTimestamp` makes
> >>>
> >>> sense,
> >>>
> >>> I
> >>>
> >>> will
> >>>
> >>> add
> >>>
> >>> it
> >>>
> >>> to
> >>>
> >>> FLIP.
> >>>
> >>> @Yang
> >>>
> >>> As mentioned above, AFAK, the external system
> >>>
> >>> cannot
> >>>
> >>> support
> >>>
> >>> the
> >>>
> >>> MARK_BLOCKLISTED action.
> >>>
> >>> Looking forward to your further feedback.
> >>>
> >>>
> >>> Best,
> >>>
> >>> Lijie
> >>>
> >>>
> >>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
> >>>
> >>> 21:09写道：
> >>>
> >>> Thanks Lijie and Zhu for creating the proposal.
> >>>
> >>> I want to share some thoughts about Flink cluster
> >>>
> >>> operations.
> >>>
> >>> In the production environment, the SRE(aka Site
> >>>
> >>> Reliability
> >>>
> >>> Engineer)
> >>>
> >>> already has many tools to detect the unstable
> >>>
> >>> nodes,
> >>>
> >>> which
> >>>
> >>> could
> >>>
> >>> take
> >>>
> >>> the
> >>>
> >>> system logs/metrics into consideration.
> >>> Then they use graceful-decomission in YARN and
> >>>
> >>> taint
> >>>
> >>> in
> >>>
> >>> K8s
> >>>
> >>> to
> >>>
> >>> prevent
> >>>
> >>> new
> >>>
> >>> allocations on these unstable nodes.
> >>> At last, they will evict all the containers and
> >>>
> >>> pods
> >>>
> >>> running
> >>>
> >>> on
> >>>
> >>> these
> >>>
> >>> nodes.
> >>>
> >>> This mechanism also works for planned maintenance.
> >>>
> >>> So
> >>>
> >>> I
> >>>
> >>> am
> >>>
> >>> afraid
> >>>
> >>> this
> >>>
> >>> is
> >>>
> >>> not the typical use case for FLIP-224.
> >>>
> >>> If we only support to block nodes manually, then I
> >>>
> >>> could
> >>>
> >>> not
> >>>
> >>> see
> >>>
> >>> the obvious advantages compared with current SRE's
> >>>
> >>> approach(via
> >>>
> >>> *yarn
> >>>
> >>> rmadmin or kubectl taint*).
> >>> At least, we need to have a pluggable component
> >>>
> >>> which
> >>>
> >>> could
> >>>
> >>> expose
> >>>
> >>> the
> >>>
> >>> potential unstable nodes automatically and block
> >>>
> >>> them
> >>>
> >>> if
> >>>
> >>> enabled
> >>>
> >>> explicitly.
> >>>
> >>> Best,
> >>> Yang
> >>>
> >>>
> >>>
> >>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> >>>
> >>> 16:36写道：
> >>>
> >>> Thanks for the proposal, Lijie.
> >>>
> >>> This is an interesting feature and discussion,
> >>>
> >>> and
> >>>
> >>> somewhat
> >>>
> >>> related
> >>>
> >>> to the
> >>>
> >>> design principle about how people should operate
> >>>
> >>> Flink.
> >>>
> >>> I think there are three things involved in this
> >>>
> >>> FLIP.
> >>>
> >>>       a) Detect and report the unstable node.
> >>>       b) Collect the information of the unstable
> >>>
> >>> node
> >>>
> >>> and
> >>>
> >>> form a
> >>>
> >>> blocklist.
> >>>
> >>>       c) Take the action to block nodes.
> >>>
> >>> My two cents:
> >>>
> >>> 1. It looks like people all agree that Flink
> >>>
> >>> should
> >>>
> >>> have
> >>>
> >>> c).
> >>>
> >>> It
> >>>
> >>> is
> >>>
> >>> not only
> >>>
> >>> useful for cases of node failures, but also
> >>>
> >>> handy
> >>>
> >>> for
> >>>
> >>> some
> >>>
> >>> planned
> >>>
> >>> maintenance.
> >>>
> >>> 2. People have different opinions on b), i.e.
> >>>
> >>> who
> >>>
> >>> should be
> >>>
> >>> the
> >>>
> >>> brain
> >>>
> >>> to
> >>>
> >>> make the decision to block a node. I think this
> >>>
> >>> largely
> >>>
> >>> depends
> >>>
> >>> on
> >>>
> >>> who we
> >>>
> >>> talk to. Different users would probably give
> >>>
> >>> different
> >>>
> >>> answers.
> >>>
> >>> For
> >>>
> >>> people
> >>>
> >>> who do have a centralized node health management
> >>>
> >>> service,
> >>>
> >>> let
> >>>
> >>> Flink
> >>>
> >>> do just
> >>>
> >>> do a) and c) would be preferred. So essentially
> >>>
> >>> Flink
> >>>
> >>> would
> >>>
> >>> be
> >>>
> >>> one
> >>>
> >>> of
> >>>
> >>> the
> >>>
> >>> sources that may detect unstable nodes, report
> >>>
> >>> it
> >>>
> >>> to
> >>>
> >>> that
> >>>
> >>> service,
> >>>
> >>> and then
> >>>
> >>> take the command from that service to block the
> >>>
> >>> problematic
> >>>
> >>> nodes.
> >>>
> >>> On
> >>>
> >>> the
> >>>
> >>> other hand, for users who do not have such a
> >>>
> >>> service,
> >>>
> >>> simply
> >>>
> >>> letting
> >>>
> >>> Flink
> >>>
> >>> be clever by itself to block the suspicious
> >>>
> >>> nodes
> >>>
> >>> might
> >>>
> >>> be
> >>>
> >>> desired
> >>>
> >>> to
> >>>
> >>> ensure the jobs are running smoothly.
> >>>
> >>> So that indicates a) and b) here should be
> >>>
> >>> pluggable /
> >>>
> >>> optional.
> >>>
> >>> In light of this, maybe it would make sense to
> >>>
> >>> have
> >>>
> >>> something
> >>>
> >>> pluggable
> >>>
> >>> like a UnstableNodeReporter which exposes
> >>>
> >>> unstable
> >>>
> >>> nodes
> >>>
> >>> actively.
> >>>
> >>> (A
> >>>
> >>> more
> >>>
> >>> general interface should be JobInfoReporter<T>
> >>>
> >>> which
> >>>
> >>> can be
> >>>
> >>> used
> >>>
> >>> to
> >>>
> >>> report
> >>>
> >>> any information of type <T>. But I'll just keep
> >>>
> >>> the
> >>>
> >>> scope
> >>>
> >>> relevant
> >>>
> >>> to
> >>>
> >>> this
> >>>
> >>> FLIP here). Personally speaking, I think it is
> >>>
> >>> OK
> >>>
> >>> to
> >>>
> >>> have a
> >>>
> >>> default
> >>>
> >>> implementation of a reporter which just tells
> >>>
> >>> Flink
> >>>
> >>> to
> >>>
> >>> take
> >>>
> >>> action
> >>>
> >>> to
> >>>
> >>> block
> >>>
> >>> problematic nodes and also unblocks them after
> >>>
> >>> timeout.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>>
> >>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> >>>
> >>> ro.v.boyko@gmail.com
> >>>
> >>> wrote:
> >>>
> >>> Thanks for good initiative, Lijie and Zhu!
> >>>
> >>> If it's possible I'd like to participate in
> >>>
> >>> development.
> >>>
> >>> I agree with 3rd point of Konstantin's reply -
> >>>
> >>> we
> >>>
> >>> should
> >>>
> >>> consider
> >>>
> >>> to move
> >>>
> >>> somehow the information of blocklisted
> >>>
> >>> nodes/TMs
> >>>
> >>> from
> >>>
> >>> active
> >>>
> >>> ResourceManager to non-active ones. Probably
> >>>
> >>> storing
> >>>
> >>> inside
> >>>
> >>> Zookeeper/Configmap might be helpful here.
> >>>
> >>> And I agree with Martijn that a lot of
> >>>
> >>> organizations
> >>>
> >>> don't
> >>>
> >>> want
> >>>
> >>> to
> >>>
> >>> expose
> >>>
> >>> such API for a cluster user group. But I think
> >>>
> >>> it's
> >>>
> >>> necessary
> >>>
> >>> to
> >>>
> >>> have the
> >>>
> >>> mechanism for unblocking the nodes/TMs anyway
> >>>
> >>> for
> >>>
> >>> avoiding
> >>>
> >>> incorrect
> >>>
> >>> automatic behaviour.
> >>>
> >>> And another one small suggestion - I think it
> >>>
> >>> would
> >>>
> >>> be
> >>>
> >>> better
> >>>
> >>> to
> >>>
> >>> extend
> >>>
> >>> the
> >>>
> >>> *BlocklistedItem* class with the
> >>>
> >>> *endTimestamp*
> >>>
> >>> field
> >>>
> >>> and
> >>>
> >>> fill
> >>>
> >>> it
> >>>
> >>> at the
> >>>
> >>> item creation. This simple addition will allow
> >>>
> >>> to:
> >>>
> >>>     -
> >>>
> >>>     Provide the ability to users to setup the
> >>>
> >>> exact
> >>>
> >>> time
> >>>
> >>> of
> >>>
> >>> blocklist end
> >>>
> >>>     through RestAPI
> >>>     -
> >>>
> >>>     Not being tied to a single value of
> >>>     *cluster.resource-blacklist.item.timeout*
> >>>
> >>>
> >>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> >>>
> >>> <
> >>>
> >>> chesnay@apache.org>
> >>>
> >>> wrote:
> >>>
> >>> I do share the concern between blurring the
> >>>
> >>> lines
> >>>
> >>> a
> >>>
> >>> bit.
> >>>
> >>> That said, I'd prefer to not have any
> >>>
> >>> auto-detection
> >>>
> >>> and
> >>>
> >>> only
> >>>
> >>> have an
> >>>
> >>> opt-in mechanism
> >>> to manually block processes/nodes. To me
> >>>
> >>> this
> >>>
> >>> sounds
> >>>
> >>> yet
> >>>
> >>> again
> >>>
> >>> like one
> >>>
> >>> of those
> >>> magical mechanisms that will rarely work
> >>>
> >>> just
> >>>
> >>> right.
> >>>
> >>> An external system can leverage way more
> >>>
> >>> information
> >>>
> >>> after
> >>>
> >>> all.
> >>>
> >>> Moreover, I'm quite concerned about the
> >>>
> >>> complexity
> >>>
> >>> of
> >>>
> >>> this
> >>>
> >>> proposal.
> >>>
> >>> Tracking on both the RM/JM side; syncing
> >>>
> >>> between
> >>>
> >>> components;
> >>>
> >>> adjustments
> >>>
> >>> to the
> >>> slot and resource protocol.
> >>>
> >>> In a way it seems overly complicated.
> >>>
> >>> If we look at it purely from an active
> >>>
> >>> resource
> >>>
> >>> management
> >>>
> >>> perspective,
> >>>
> >>> then there
> >>> isn't really a need to touch the slot
> >>>
> >>> protocol
> >>>
> >>> at
> >>>
> >>> all
> >>>
> >>> (or
> >>>
> >>> in
> >>>
> >>> fact
> >>>
> >>> to
> >>>
> >>> anything in the JobMaster),
> >>> because there isn't any point in keeping
> >>>
> >>> around
> >>>
> >>> blocked
> >>>
> >>> TMs
> >>>
> >>> in
> >>>
> >>> the
> >>>
> >>> first
> >>>
> >>> place.
> >>> They'd just be idling, potentially shutting
> >>>
> >>> down
> >>>
> >>> after
> >>>
> >>> a
> >>>
> >>> while
> >>>
> >>> by
> >>>
> >>> the
> >>>
> >>> RM
> >>>
> >>> because of
> >>> it (unless we _also_ touch that logic).
> >>> Here the blocking of a process (be it by
> >>>
> >>> blocking
> >>>
> >>> the
> >>>
> >>> process
> >>>
> >>> or
> >>>
> >>> node)
> >>>
> >>> is
> >>>
> >>> equivalent with shutting down the blocked
> >>>
> >>> process(es).
> >>>
> >>> Once the block is lifted we can just spin it
> >>>
> >>> back
> >>>
> >>> up.
> >>>
> >>> And I do wonder whether we couldn't apply
> >>>
> >>> the
> >>>
> >>> same
> >>>
> >>> line
> >>>
> >>> of
> >>>
> >>> thinking to
> >>>
> >>> standalone resource management.
> >>> Here being able to stop/restart a
> >>>
> >>> process/node
> >>>
> >>> manually
> >>>
> >>> should
> >>>
> >>> be
> >>>
> >>> a
> >>>
> >>> core
> >>>
> >>> requirement for a Flink deployment anyway.
> >>>
> >>>
> >>> On 02/05/2022 08:49, Martijn Visser wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> Thanks for creating this FLIP. I can
> >>>
> >>> understand
> >>>
> >>> the
> >>>
> >>> problem
> >>>
> >>> and
> >>>
> >>> I see
> >>>
> >>> value
> >>>
> >>> in the automatic detection and
> >>>
> >>> blocklisting. I
> >>>
> >>> do
> >>>
> >>> have
> >>>
> >>> some
> >>>
> >>> concerns
> >>>
> >>> with
> >>>
> >>> the ability to manually specify to be
> >>>
> >>> blocked
> >>>
> >>> resources.
> >>>
> >>> I
> >>>
> >>> have
> >>>
> >>> two
> >>>
> >>> concerns;
> >>>
> >>> * Most organizations explicitly have a
> >>>
> >>> separation
> >>>
> >>> of
> >>>
> >>> concerns,
> >>>
> >>> meaning
> >>>
> >>> that
> >>>
> >>> there's a group who's responsible for
> >>>
> >>> managing a
> >>>
> >>> cluster
> >>>
> >>> and
> >>>
> >>> there's
> >>>
> >>> a
> >>>
> >>> user
> >>>
> >>> group who uses that cluster. With the
> >>>
> >>> introduction of
> >>>
> >>> this
> >>>
> >>> mechanism,
> >>>
> >>> the
> >>>
> >>> latter group now can influence the
> >>>
> >>> responsibility
> >>>
> >>> of
> >>>
> >>> the
> >>>
> >>> first
> >>>
> >>> group.
> >>>
> >>> So
> >>>
> >>> it
> >>>
> >>> can be possible that someone from the user
> >>>
> >>> group
> >>>
> >>> blocks
> >>>
> >>> something,
> >>>
> >>> which
> >>>
> >>> causes an outage (which could result in
> >>>
> >>> paging
> >>>
> >>> mechanism
> >>>
> >>> triggering
> >>>
> >>> etc)
> >>>
> >>> which impacts the first group.
> >>> * How big is the group of people who can
> >>>
> >>> go
> >>>
> >>> through
> >>>
> >>> the
> >>>
> >>> process
> >>>
> >>> of
> >>>
> >>> manually
> >>>
> >>> identifying a node that isn't behaving as
> >>>
> >>> it
> >>>
> >>> should
> >>>
> >>> be? I
> >>>
> >>> do
> >>>
> >>> think
> >>>
> >>> this
> >>>
> >>> group is relatively limited. Does it then
> >>>
> >>> make
> >>>
> >>> sense
> >>>
> >>> to
> >>>
> >>> introduce
> >>>
> >>> such
> >>>
> >>> a
> >>>
> >>> feature, which would only be used by a
> >>>
> >>> really
> >>>
> >>> small
> >>>
> >>> user
> >>>
> >>> group
> >>>
> >>> of
> >>>
> >>> Flink?
> >>>
> >>> We
> >>>
> >>> still have to maintain, test and support
> >>>
> >>> such
> >>>
> >>> a
> >>>
> >>> feature.
> >>>
> >>> I'm +1 for the autodetection features, but
> >>>
> >>> I'm
> >>>
> >>> leaning
> >>>
> >>> towards
> >>>
> >>> not
> >>>
> >>> exposing
> >>>
> >>> this to the user group but having this
> >>>
> >>> available
> >>>
> >>> strictly
> >>>
> >>> for
> >>>
> >>> cluster
> >>>
> >>> operators. They could then also set up
> >>>
> >>> their
> >>>
> >>> paging/metrics/logging
> >>>
> >>> system
> >>>
> >>> to take this into account.
> >>>
> >>> Best regards,
> >>>
> >>> Martijn Visser
> >>> https://twitter.com/MartijnVisser82
> >>> https://github.com/MartijnVisser
> >>>
> >>>
> >>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> >>>
> >>> karmagyz@gmail.com
> >>>
> >>> wrote:
> >>>
> >>> Thanks for driving this, Zhu and Lijie.
> >>>
> >>> +1 for the overall proposal. Just share
> >>>
> >>> some
> >>>
> >>> cents
> >>>
> >>> here:
> >>>
> >>> - Why do we need to expose
> >>>
> >>> cluster.resource-blacklist.item.timeout-check-interval
> >>>
> >>> to
> >>>
> >>> the
> >>>
> >>> user?
> >>>
> >>> I think the semantics of
> >>>
> >>> `cluster.resource-blacklist.item.timeout`
> >>>
> >>> is
> >>>
> >>> sufficient for the user. How to guarantee
> >>>
> >>> the
> >>>
> >>> timeout
> >>>
> >>> mechanism is
> >>>
> >>> Flink's internal implementation. I think
> >>>
> >>> it
> >>>
> >>> will
> >>>
> >>> be
> >>>
> >>> very
> >>>
> >>> confusing
> >>>
> >>> and
> >>>
> >>> we do not need to expose it to users.
> >>>
> >>> - ResourceManager can notify the
> >>>
> >>> exception
> >>>
> >>> of a
> >>>
> >>> task
> >>>
> >>> manager to
> >>>
> >>> `BlacklistHandler` as well.
> >>> For example, the slot allocation might
> >>>
> >>> fail
> >>>
> >>> in
> >>>
> >>> case
> >>>
> >>> the
> >>>
> >>> target
> >>>
> >>> task
> >>>
> >>> manager is busy or has a network jitter.
> >>>
> >>> I
> >>>
> >>> don't
> >>>
> >>> mean
> >>>
> >>> we
> >>>
> >>> need
> >>>
> >>> to
> >>>
> >>> cover
> >>>
> >>> this case in this version, but we can
> >>>
> >>> also
> >>>
> >>> open a
> >>>
> >>> `notifyException`
> >>>
> >>> in
> >>>
> >>> `ResourceManagerBlacklistHandler`.
> >>>
> >>> - Before we sync the blocklist to
> >>>
> >>> ResourceManager,
> >>>
> >>> will
> >>>
> >>> the
> >>>
> >>> slot of
> >>>
> >>> a
> >>>
> >>> blocked task manager continues to be
> >>>
> >>> released
> >>>
> >>> and
> >>>
> >>> allocated?
> >>>
> >>> Best,
> >>> Yangze Guo
> >>>
> >>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> >>>
> >>> Wang
> >>>
> >>> <
> >>>
> >>> wangdachui9501@gmail.com>
> >>>
> >>> wrote:
> >>>
> >>> Hi Konstantin,
> >>>
> >>> Thanks for your feedback. I will
> >>>
> >>> response
> >>>
> >>> your 4
> >>>
> >>> remarks:
> >>>
> >>> 1) Thanks for reminding me of the
> >>>
> >>> controversy. I
> >>>
> >>> think
> >>>
> >>> “BlockList”
> >>>
> >>> is
> >>>
> >>> good
> >>>
> >>> enough, and I will change it in FLIP.
> >>>
> >>>
> >>> 2) Your suggestion for the REST API is a
> >>>
> >>> good
> >>>
> >>> idea.
> >>>
> >>> Based
> >>>
> >>> on
> >>>
> >>> the
> >>>
> >>> above, I
> >>>
> >>> would change REST API as following:
> >>>
> >>> POST/GET <host>/blocklist/nodes
> >>>
> >>> POST/GET <host>/blocklist/taskmanagers
> >>>
> >>> DELETE
> >>>
> >>> <host>/blocklist/node/<identifier>
> >>>
> >>> DELETE
> >>>
> >>> <host>/blocklist/taskmanager/<identifier>
> >>>
> >>> 3) If a node is blocking/blocklisted, it
> >>>
> >>> means
> >>>
> >>> that
> >>>
> >>> all
> >>>
> >>> task
> >>>
> >>> managers
> >>>
> >>> on
> >>>
> >>> this node are blocklisted. All slots on
> >>>
> >>> these
> >>>
> >>> TMs
> >>>
> >>> are
> >>>
> >>> not
> >>>
> >>> available.
> >>>
> >>> This
> >>>
> >>> is actually a bit like TM losts, but
> >>>
> >>> these
> >>>
> >>> TMs
> >>>
> >>> are
> >>>
> >>> not
> >>>
> >>> really
> >>>
> >>> lost,
> >>>
> >>> they
> >>>
> >>> are in an unavailable status, and they
> >>>
> >>> are
> >>>
> >>> still
> >>>
> >>> registered
> >>>
> >>> in this
> >>>
> >>> flink
> >>>
> >>> cluster. They will be available again
> >>>
> >>> once
> >>>
> >>> the
> >>>
> >>> corresponding
> >>>
> >>> blocklist
> >>>
> >>> item
> >>>
> >>> is removed. This behavior is the same in
> >>>
> >>> active/non-active
> >>>
> >>> clusters.
> >>>
> >>> However in the active clusters, these
> >>>
> >>> TMs
> >>>
> >>> may
> >>>
> >>> be
> >>>
> >>> released
> >>>
> >>> due
> >>>
> >>> to
> >>>
> >>> idle
> >>>
> >>> timeouts.
> >>>
> >>>
> >>> 4) For the item timeout, I prefer to
> >>>
> >>> keep
> >>>
> >>> it.
> >>>
> >>> The
> >>>
> >>> reasons
> >>>
> >>> are
> >>>
> >>> as
> >>>
> >>> following:
> >>>
> >>> a) The timeout will not affect users
> >>>
> >>> adding
> >>>
> >>> or
> >>>
> >>> removing
> >>>
> >>> items
> >>>
> >>> via
> >>>
> >>> REST
> >>>
> >>> API,
> >>>
> >>> and users can disable it by configuring
> >>>
> >>> it
> >>>
> >>> to
> >>>
> >>> Long.MAX_VALUE .
> >>>
> >>> b) Some node problems can recover after
> >>>
> >>> a
> >>>
> >>> period of
> >>>
> >>> time
> >>>
> >>> (such as
> >>>
> >>> machine
> >>>
> >>> hotspots), in which case users may
> >>>
> >>> prefer
> >>>
> >>> that
> >>>
> >>> Flink
> >>>
> >>> can
> >>>
> >>> do
> >>>
> >>> this
> >>>
> >>> automatically instead of requiring the
> >>>
> >>> user
> >>>
> >>> to
> >>>
> >>> do
> >>>
> >>> it
> >>>
> >>> manually.
> >>>
> >>> Best,
> >>>
> >>> Lijie
> >>>
> >>> Konstantin Knauf <kn...@apache.org>
> >>>
> >>> 于2022年4月27日周三
> >>>
> >>> 19:23写道：
> >>>
> >>> Hi Lijie,
> >>>
> >>> I think, this makes sense and +1 to
> >>>
> >>> only
> >>>
> >>> support
> >>>
> >>> manually
> >>>
> >>> blocking
> >>>
> >>> taskmanagers and nodes. Maybe the
> >>>
> >>> different
> >>>
> >>> strategies
> >>>
> >>> can
> >>>
> >>> also be
> >>>
> >>> maintained outside of Apache Flink.
> >>>
> >>> A few remarks:
> >>>
> >>> 1) Can we use another term than
> >>>
> >>> "bla.cklist"
> >>>
> >>> due
> >>>
> >>> to
> >>>
> >>> the
> >>>
> >>> controversy
> >>>
> >>> around
> >>>
> >>> the term? [1] There was also a Jira
> >>>
> >>> Ticket
> >>>
> >>> about
> >>>
> >>> this
> >>>
> >>> topic a
> >>>
> >>> while
> >>>
> >>> back
> >>>
> >>> and there was generally a consensus to
> >>>
> >>> avoid
> >>>
> >>> the
> >>>
> >>> term
> >>>
> >>> blacklist &
> >>>
> >>> whitelist
> >>>
> >>> [2]? We could use "blocklist"
> >>>
> >>> "denylist"
> >>>
> >>> or
> >>>
> >>> "quarantined"
> >>>
> >>> 2) For the REST API, I'd prefer a
> >>>
> >>> slightly
> >>>
> >>> different
> >>>
> >>> design
> >>>
> >>> as
> >>>
> >>> verbs
> >>>
> >>> like
> >>>
> >>> add/remove often considered an
> >>>
> >>> anti-pattern
> >>>
> >>> for
> >>>
> >>> REST
> >>>
> >>> APIs.
> >>>
> >>> POST
> >>>
> >>> on a
> >>>
> >>> list
> >>>
> >>> item is generally the standard to add
> >>>
> >>> items.
> >>>
> >>> DELETE
> >>>
> >>> on
> >>>
> >>> the
> >>>
> >>> individual
> >>>
> >>> resource is standard to remove an item.
> >>>
> >>> POST <host>/quarantine/items
> >>> DELETE
> >>>
> >>> <host>/quarantine/items/<itemidentifier>
> >>>
> >>> We could also consider to separate
> >>>
> >>> taskmanagers
> >>>
> >>> and
> >>>
> >>> nodes
> >>>
> >>> in
> >>>
> >>> the
> >>>
> >>> REST
> >>>
> >>> API
> >>>
> >>> (and internal data structures). Any
> >>>
> >>> opinion
> >>>
> >>> on
> >>>
> >>> this?
> >>>
> >>> POST/GET <host>/quarantine/nodes
> >>> POST/GET <host>/quarantine/taskmanager
> >>> DELETE
> >>>
> >>> <host>/quarantine/nodes/<identifier>
> >>>
> >>> DELETE
> >>>
> >>> <host>/quarantine/taskmanager/<identifier>
> >>>
> >>> 3) How would blocking nodes behave with
> >>>
> >>> non-active
> >>>
> >>> resource
> >>>
> >>> managers,
> >>>
> >>> i.e.
> >>>
> >>> standalone or reactive mode?
> >>>
> >>> 4) To keep the implementation even more
> >>>
> >>> minimal,
> >>>
> >>> do
> >>>
> >>> we
> >>>
> >>> need
> >>>
> >>> the
> >>>
> >>> timeout
> >>>
> >>> behavior? If items are added/removed
> >>>
> >>> manually
> >>>
> >>> we
> >>>
> >>> could
> >>>
> >>> delegate
> >>>
> >>> this
> >>>
> >>> to the
> >>>
> >>> user easily. In my opinion the timeout
> >>>
> >>> behavior
> >>>
> >>> would
> >>>
> >>> better
> >>>
> >>> fit
> >>>
> >>> into
> >>>
> >>> specific strategies at a later point.
> >>>
> >>> Looking forward to your thoughts.
> >>>
> >>> Cheers and thank you,
> >>>
> >>> Konstantin
> >>>
> >>> [1]
> >>>
> >>>
> >>>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >>>
> >>> [2]
> >>>
> >>> https://issues.apache.org/jira/browse/FLINK-18209
> >>>
> >>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> >>>
> >>> schrieb
> >>>
> >>> Lijie
> >>>
> >>> Wang
> >>>
> >>> <
> >>>
> >>> wangdachui9501@gmail.com>:
> >>>
> >>> Hi all,
> >>>
> >>> Flink job failures may happen due to
> >>>
> >>> cluster
> >>>
> >>> node
> >>>
> >>> issues
> >>>
> >>> (insufficient
> >>>
> >>> disk
> >>>
> >>> space, bad hardware, network
> >>>
> >>> abnormalities).
> >>>
> >>> Flink
> >>>
> >>> will
> >>>
> >>> take care
> >>>
> >>> of
> >>>
> >>> the
> >>>
> >>> failures and redeploy the tasks.
> >>>
> >>> However,
> >>>
> >>> due
> >>>
> >>> to
> >>>
> >>> data
> >>>
> >>> locality
> >>>
> >>> and
> >>>
> >>> limited
> >>>
> >>> resources, the new tasks are very
> >>>
> >>> likely
> >>>
> >>> to
> >>>
> >>> be
> >>>
> >>> redeployed
> >>>
> >>> to the
> >>>
> >>> same
> >>>
> >>> nodes, which will result in continuous
> >>>
> >>> task
> >>>
> >>> abnormalities
> >>>
> >>> and
> >>>
> >>> affect
> >>>
> >>> job
> >>>
> >>> progress.
> >>>
> >>> Currently, Flink users need to
> >>>
> >>> manually
> >>>
> >>> identify
> >>>
> >>> the
> >>>
> >>> problematic
> >>>
> >>> node and
> >>>
> >>> take it offline to solve this problem.
> >>>
> >>> But
> >>>
> >>> this
> >>>
> >>> approach
> >>>
> >>> has
> >>>
> >>> following
> >>>
> >>> disadvantages:
> >>>
> >>> 1. Taking a node offline can be a
> >>>
> >>> heavy
> >>>
> >>> process.
> >>>
> >>> Users
> >>>
> >>> may
> >>>
> >>> need
> >>>
> >>> to
> >>>
> >>> contact
> >>>
> >>> cluster administors to do this. The
> >>>
> >>> operation
> >>>
> >>> can
> >>>
> >>> even
> >>>
> >>> be
> >>>
> >>> dangerous
> >>>
> >>> and
> >>>
> >>> not
> >>>
> >>> allowed during some important business
> >>>
> >>> events.
> >>>
> >>> 2. Identifying and solving this kind
> >>>
> >>> of
> >>>
> >>> problems
> >>>
> >>> manually
> >>>
> >>> would
> >>>
> >>> be
> >>>
> >>> slow
> >>>
> >>> and
> >>>
> >>> a waste of human resources.
> >>>
> >>> To solve this problem, Zhu Zhu and I
> >>>
> >>> propose
> >>>
> >>> to
> >>>
> >>> introduce a
> >>>
> >>> blacklist
> >>>
> >>> mechanism for Flink to filter out
> >>>
> >>> problematic
> >>>
> >>> resources.
> >>>
> >>> You can find more details in
> >>>
> >>> FLIP-224[1].
> >>>
> >>> Looking
> >>>
> >>> forward
> >>>
> >>> to your
> >>>
> >>> feedback.
> >>>
> >>> [1]
> >>>
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >>>
> >>> Best,
> >>>
> >>> Lijie
> >>>
> >>> --
> >>> Best regards,
> >>> Roman Boyko
> >>> e.: ro.v.boyko@gmail.com
> >>>
> >>> --
> >>> https://twitter.com/snntrable
> >>> https://github.com/knaufk
> >>>
> >>> --
> >>> https://twitter.com/snntrable
> >>> https://github.com/knaufk
> >>>
> >>>
> >>>
>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.

To expand a bit for transparency, Zhu Zhu and I had long discussion 
(literally spanning days) about this FLIP and it's relation to 
speculative execution.

The gist of it is that speculative execution doesn't strictly need the 
block list; just _some_ mechanism to select/request slots from other nodes.
It is however at this time the easiest way to implement it because of 
the technical debt we have in the scheduler components (e.g., adaptive 
scheduler not being used for all streaming jobs or default scheduler not 
being integrated fully into declarative resource management).
Because of that I was worried that we might expand the API now to unlock 
speculative execution, but then end up not actually requiring it down 
the line (but still being stuck with it).
This should give us quite a bit more freedom as to how we implement it.
Which is particularly important because we already identified some 
limitations in the current design (e.g., blocks not being scoped to 
jobs, resourceID-based blocks intefeing with hard-coded resource IDs, 
potentially blocking more slots than necessary).

Shall we start a new vote thread, since the design changed quite a bit?

On 15/06/2022 09:33, Zhu Zhu wrote:
> Hi everyone,
> Thank you all for the feedback!
>
> We receive concerns that the blocklist feature is not strongly
> required except for the needs of speculative execution. So as the
> first step, we will limit the scope of FLIP-224 to only support
> speculative execution, and therefore, not add public interfaces for
> users to manipulate with the blocklist directly.
>
> If later we receive strong requirements for blocklist from users, we
> will have another FLIP to open it to users with well designed public
> interfaces and web UI.
>
> Thanks,
> Zhu
>
> Zhu Zhu <re...@gmail.com> 于2022年6月10日周五 18:45写道：
>> 1) With the declarative slot allocation protocol, it's not easy to
>> keep the slow slots which may have satisfy the slot request already,
>> but ask for more slots from the resource manager.
>> And we are also concerned to use detected slow slots, which may add
>> load to the slow nodes and further slow down the tasks on the slow
>> nodes. Because we found slow nodes are mostly caused by heavy loads.
>>
>> 2) I see your point. The blocker here is that batch jobs currently
>> uses DefaultScheduler. The scheduler does not see slots directly and
>> does not know which of them are from slow nodes, so it's hard for it
>> to rescale the vertex according to this information. Besides that, it
>> cannot help with the stage that first observed and affected the slow
>> nodes.
>>
>> Thanks,
>> Zhu
>>
>> Chesnay Schepler <ch...@apache.org> 于2022年6月10日周五 17:04写道：
>>> 1)
>>> It's true that if we handle this entirely in the scheduler we may get a bunch of slow slots from the RM.
>>> My point is that this isn't necessarily a problem; it depends on how you handle those slots.
>>>
>>> We anyway need to account for the possibility that we're not getting any new fast slots from the RM.
>>> With that in mind, I'd rather not categorically throw away the slow slots we got, but try to make use of them somehow.
>>>
>>> 2)
>>> The idea was to rescale the job when the current stage finishes.
>>> The speculative execution handles slow nodes being detected in the current stage,
>>> next stage we use _some_ strategy to handle slow nodes (be it ignoring those, rescaling the job, ...).
>>>
>>> 3)
>>> It's a fair observation that once you push this to the RM you end up with a de-facto blocklist :)
>>>
>>> On 08/06/2022 17:11, Zhu Zhu wrote:
>>>
>>> Thanks Chesnay for the feedback in the vote thread.
>>> (https://lists.apache.org/thread/opc7jg3rpxnwotkb0fcn4wnm02m4397o)
>>>
>>> I'd like to continue the discussion in this thread so that the
>>> discussions can be better tracked.
>>>
>>> Regarding your questions, here are my thoughts:
>>> 1. The current locality mechanism does not work well to avoid
>>> deploying tasks to slow nodes because it cannot proactively reject or
>>> release slots from the slow nodes. And it cannot help at the resource
>>> manager side to avoid allocating free slots or launching new
>>> TaskManagers on slow nodes.
>>> 2. Dynamically downscaling or upscaling a batch job is usually
>>> unacceptable because it means to re-run the whole stage.
>>> 3. Extending the requirement declaration to have a notion of
>>> "undesirable nodes" is an option. And it is actually how we started.
>>> Considering implementation details, we found we need that
>>> - a tracker to collect all the undesirable(detected slow nodes)
>>> - to filter out slots from undesirable nodes when allocating slots
>>> from the SlotPool
>>> - ask the ResourceManager for slots that are not on the undesirable
>>> nodes. The ResourceManager may further need to ask for new
>>> TaskManagers that are not on the undesirable nodes.
>>> Then, with all these functionality, we found that we almost have a
>>> blocklist mechanism. As blocklist mechanism is a common concept and is
>>> possible to benefit users, we took this chance to propose the
>>> blocklist mechanism.
>>> 4. A cluster-wide shared blocklist is not a must for speculative
>>> execution at the moment. It is mainly part of a standalone blocklist
>>> feature to host user specified block items. To avoid sharing job
>>> specific blocked items between jobs, one way is to add a nullable
>>> JobID tag for the blocked item.
>>>
>>> Thanks,
>>> Zhu
>>>
>>> Zhu Zhu <re...@gmail.com> 于2022年6月7日周二 10:58写道：
>>>
>>> Hi Chesnay,
>>>
>>> Would you please take a look at the FLIP and discussion to see if all
>>> your concerns have been addressed?
>>>
>>> Thanks,
>>> Zhu
>>>
>>> Zhu Zhu <re...@gmail.com> 于2022年5月28日周六 13:26写道：
>>>
>>> Regarding the concern of the SlotManager, my two cents:
>>> 1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
>>>    a. In standalone mode, a taskmanager may be temporarily added to
>>> the blocklist. We do not want the TM to get disconnected and shut down.
>>> So we need to keep its connection to RM and keep hosting its slots.
>>>    b. When we want to avoid allocating slots to a slow nodes but do not
>>> want to kill current running tasks on the nodes (MARK_BLOCKED mode).
>>>
>>> There is possible a way to keep the connection of a blocked task manager
>>> while hide its slots from SlotManager, but I feel it may be even much more
>>> complicated.
>>>
>>> 2. It will not complicate the SlotManager too much. The SlotManager will
>>> be offered a BlockedTaskManagerChecker when created, and just need
>>> to use it to filter out blocked slots on slot request. Therefore I think the
>>> complication is acceptable.
>>>
>>> Thanks,
>>> Zhu
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
>>>
>>> Hi everyone,
>>>
>>> I've updated the FLIP according to Chesnay's feedback, changes as follows:
>>> 1. Change the GET result to a map.
>>> 2. Only left *endTimestamp* in ADD operation, and change the rest (from
>>> POST) to PUT
>>> 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
>>> encapsulate blocklist related functions.
>>> 4. Remove *mainThread* from BlocklistTracker, instead provide a
>>> *removeTimeoutItems* method to be called by outside components。
>>>
>>> Best,
>>> Lijie
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
>>>
>>> Hi Chesnay,
>>>
>>> Thanks for feedback.
>>>
>>> 1. Regarding the TM/Node id. Do you mean special characters may appear in
>>> the rest URL?  Actually, I don't think so. The task manager id in REST API
>>> should be the *ResourceID* of taskmanager in Flink, there should be no
>>> special characters, and some existing REST APIs are already using it, e.g.
>>> GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
>>> id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
>>> should also have no special characters.
>>> 2. Regarding the GET query responses. I agree with you, it makes sense to
>>> change the GET result to a map.
>>>
>>> 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
>>> cover everything, and the endTimestamp is a unix timestamp, there should be
>>> no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
>>> add rest api is add or update, PUT can cover this semantics.
>>>
>>> 4. Regarding the slot pool/manager. I don't think the current slotpool
>>> and slotmanager are able to support the MARK_BLOCKED(slots that are
>>> already allocated will not be affected) action. The reasons are as
>>> follows：
>>>
>>> a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
>>> from reserved(task assigned) to free(no task assigned), it is necessary to
>>> check whether the slot should be released immediately(it should be released
>>> immediately if the task manager is blocked, otherwise it may be allocated
>>> to other tasks). I think it cannot be supported without being aware of
>>> the blocklist information. Compared to the solution in FLIP, a more
>>> appropriate/prefered way may be: Introduce a new slot pool
>>> implementation for blocklist(may be named BlocklistSlotPool, it
>>> extends/wrapps the original slot pool), and implement the parts that need
>>> to be aware of the blocklist in this newly introduced slot pool, and the
>>> original slot pool basically does not need to change.
>>>
>>> b) for slot manager, with the MARK_BLOCKED action, there may be free but
>>> blocked slots in slot manager (the corresponding TMs cannot be
>>> released/unregistered because there are still running tasks on them).
>>> Therefore, we need to filter out the blocked slots when trying to fulfill
>>> the slot requirements. Therefore it also needs to know the blocklist information.
>>> A better way may be to abstract a resource allocation strategy, and make
>>> the blocklist as a special implementation, then pass the resource
>>> allocation strategy in when constructing the slot manager. Unfortunately,
>>> the data structures in the two existing slot manager
>>> implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
>>> quite different, it is not easy to abstract a common resource allocation
>>> strategy, so we prefer to keep the current way(i.e. pass the blocklist
>>> information directly into slot manager).
>>>
>>>
>>> 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
>>> does not need to be aware of the executor, and the timeout actions can be
>>> done outside.
>>>
>>> Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
>>>
>>> I have a number of concerns:
>>>
>>> Is the id used for deleting an item the same sent in the initial request
>>> (and not one returned by Flink)?
>>> I'm very concerned that the tm/node id can contain special characters.
>>>
>>> The GET query should return a map, not a list of items. This makes it
>>> easier to work with.
>>>
>>> The duality of endTimestamp and duration is also concerning.
>>> If we conclude that endTimestamps can in fact work (and aren't utterly
>>> unusable due to timezones),
>>> then this should be able to cover everything and rid us of some
>>> complexity w.r.t. POSTs to the same ID.
>>> Additions would be a PUT, changes a PATCH, deletes a DELETE.
>>>
>>>
>>> I also dislike how we're pushing more functionality into the
>>> slotpool/-manager.
>>> These components are complex enough as-is, and instead I'd propose a
>>> separate component that interacts with the SlotPool/-Manager instead,
>>> for example by removing the slots from that TM.
>>> The reason being that from the slot-pool perspective it is irrelevant
>>> whether a slot is gone because the TM was lost, or because it was blocked.
>>>
>>>
>>> The BlocklistTracker shouldn't be modeled as component that is aware of
>>> the concept of main threads.
>>> It really has no business knowing that; all it needs is an executor for
>>> handling timeouts/periodic actions,
>>> and a way to interact with the JM/RM (which internally can handle the
>>> scheduling into the main thread).
>>>
>>>
>>> On 20/05/2022 07:20, Lijie Wang wrote:
>>>
>>> Hi everyone,
>>>
>>> I have started a vote for this FLIP [1]. Please cast your vote there or
>>>
>>> ask
>>>
>>> additional questions here. [1]
>>> https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
>>>
>>> Best, Lijie
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
>>>
>>> Hi Konstantin,
>>>
>>> We found that Flink REST URL does not support the format ":merge" ,
>>>
>>> which
>>>
>>> will be recognized as a parameter in the URL(due to start with a
>>>
>>> colon).
>>>
>>> We will keep the previous way, i.e.
>>>
>>> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
>>> and the "id" and "merge" flag are put into the request body.
>>>
>>> Best,
>>> Lijie
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
>>>
>>> Hi Weihua,
>>> thanks for feedback.
>>>
>>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
>>> step towards auto-detection.
>>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
>>> exception of insufficient resources.
>>> 3. No. This FLIP won't change the WebUI. The blocklist information
>>>
>>> can be
>>>
>>> obtained through REST API and metrics.
>>>
>>> Best,
>>> Lijie
>>>
>>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
>>>
>>> Hi,
>>> Thanks for creating this FLIP.
>>> We have implemented an automatic blocklist detection mechanism
>>> internally, which is indeed very effective for handling node
>>>
>>> failures.
>>>
>>> Due to the large number of nodes, although SREs already support
>>> automatic offline failure nodes, the detection is not 100% accurate
>>>
>>> and
>>>
>>> there is some delay.
>>> So the blocklist mechanism can make flink job recover from failure
>>>
>>> much
>>>
>>> faster.
>>>
>>> Here are some of my thoughts:
>>> 1. In this FLIP, it needs users to locate machine failure manually,
>>> there is a certain cost of use
>>> 2. What happens if too many nodes are blocked, resulting in
>>>
>>> insufficient
>>>
>>> resources? Will there be a special Exception for the user?
>>> 3. Will we display the blocklist information in the WebUI? The
>>>
>>> blocklist
>>>
>>> is for cluster level, and if multiple users share a cluster, some
>>>
>>> users may
>>>
>>> be a little confused when resources are not enough, or when
>>>
>>> resources are
>>>
>>> applied for more.
>>>
>>> Also, Looking forward to the next FLIP on auto-detection.
>>>
>>> Best,
>>> Weihua
>>>
>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
>>>
>>> Hi Konstantin,
>>>
>>> Maybe change it to the following:
>>>
>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>> Merge is not allowed. If the {id} already exists, return error.
>>>
>>> Otherwise,
>>>
>>> create a new item.
>>>
>>> 2. POST: http://
>>>
>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>>
>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
>>>
>>> create
>>>
>>> a
>>>
>>> new item.
>>>
>>> WDYT?
>>>
>>> Best,
>>> Lijie
>>>
>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
>>>
>>> Hi Lijie,
>>>
>>> hm, maybe the following is more appropriate in that case
>>>
>>> POST: http://
>>>
>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>>
>>> Best,
>>>
>>> Konstantin
>>>
>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>>> wangdachui9501@gmail.com>:
>>>
>>> Hi Konstantin,
>>> thanks for your feedback.
>>>
>>>   From what I understand, PUT should be idempotent. However, we
>>>
>>> have a
>>>
>>> *timeout* field in the request. This means that initiating the
>>>
>>> same
>>>
>>> request
>>>
>>> at two different times will lead to different resource status
>>>
>>> (timestamps
>>>
>>> of the items to be removed will be different).
>>>
>>> Should we use PUT in this case? WDYT?
>>>
>>> Best,
>>> Lijie
>>>
>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>>>
>>> Hi Lijie,
>>>
>>> wouldn't the REST API-idiomatic way for an update/replace be a
>>>
>>> PUT
>>>
>>> on
>>>
>>> the
>>>
>>> resource?
>>>
>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>>
>>> Best,
>>>
>>> Konstantin
>>>
>>>
>>>
>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>>> wangdachui9501@gmail.com>:
>>>
>>> Hi everyone,
>>>
>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
>>>
>>> made
>>>
>>> the
>>>
>>> following changes on REST API:
>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
>>>
>>> choose
>>>
>>> one.
>>>
>>> If
>>>
>>> both are specified, will return error.
>>> 2.  If the specified item is already there, the *ADD* operation
>>>
>>> has
>>>
>>> two
>>>
>>> behaviors:  *return error*(default value) or *merge/update*,
>>>
>>> and we
>>>
>>> add a
>>>
>>> flag to the request body to control it. You can find more
>>>
>>> details
>>>
>>> "Public
>>>
>>> Interface" section.
>>>
>>> If there is no more feedback, we will start the vote thread next
>>>
>>> week.
>>>
>>> Best,
>>> Lijie
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>>>
>>> Hi Becket Qin,
>>>
>>> Thanks for your suggestions.  I have moved the description of
>>> configurations, metrics and REST API into "Public Interface"
>>>
>>> section,
>>>
>>> and
>>>
>>> made a few updates according to your suggestion.  And in this
>>>
>>> FLIP,
>>>
>>> there
>>>
>>> no public java Interfaces or pluggables that users need to
>>>
>>> implement
>>>
>>> by
>>>
>>> themselves.
>>>
>>> Answers for you questions:
>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>>>
>>> items
>>>
>>> can
>>>
>>> only be added through the REST API, so these 2 action are
>>>
>>> mentioned
>>>
>>> in
>>>
>>> the
>>>
>>> REST API part (The REST API part has beed moved to public
>>>
>>> interface
>>>
>>> now).
>>>
>>> 2. I agree with you. I have changed the "Cause" field to
>>>
>>> String,
>>>
>>> and
>>>
>>> allow
>>>
>>> users to specify it via REST API.
>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>>>
>>> above,
>>>
>>> we
>>>
>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
>>>
>>> ADD
>>>
>>> REST
>>>
>>> API to specify when to remove the blocked item. These 2 fields
>>>
>>> are
>>>
>>> optional, if neither is specified, it means that the blocked
>>>
>>> item
>>>
>>> is
>>>
>>> permanent and will not be removed. If both are specified, the
>>>
>>> minimum
>>>
>>> of
>>>
>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
>>>
>>> the
>>>
>>> time
>>>
>>> to
>>>
>>> remove the blocked item. To keep the configurations more
>>>
>>> minimal,
>>>
>>> we
>>>
>>> have
>>>
>>> removed the *cluster.resource-blocklist.item.timeout*
>>>
>>> configuration
>>>
>>> option.
>>> 4. Yes, the block item will be overridden if the specified item
>>>
>>> already
>>>
>>> exists. The ADD operation is *ADD or UPDATE*.
>>> 5. Yes. On JM/RM side, all the blocklist information is
>>>
>>> maintained
>>>
>>> in
>>>
>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>>>
>>> abstracted
>>>
>>> to other interfaces) will be propagated to different
>>>
>>> components.
>>>
>>> Best,
>>> Lijie
>>>
>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>>>
>>> Hi Lijie,
>>>
>>> Thanks for updating the FLIP. It looks like the public
>>>
>>> interface
>>>
>>> section
>>>
>>> did not fully reflect all the user sensible behavior and API.
>>>
>>> Can
>>>
>>> you
>>>
>>> put
>>>
>>> everything that users may be aware of there? That would
>>>
>>> include
>>>
>>> the
>>>
>>> REST
>>>
>>> API, metrics, configurations, public java Interfaces or
>>>
>>> pluggables
>>>
>>> that
>>>
>>> users may see or implement by themselves, as well as a brief
>>>
>>> summary
>>>
>>> of
>>>
>>> the
>>> behavior of the public API.
>>>
>>> Besides that, I have a few questions:
>>>
>>> 1. According to the conversation in the discussion thread, it
>>>
>>> looks
>>>
>>> like
>>>
>>> the BlockAction will have "MARK_BLOCKLISTED" and
>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
>>>
>>> so,
>>>
>>> can
>>>
>>> you
>>>
>>> add
>>> that to the public interface as well?
>>>
>>> 2. At this point, the "Cause" field in the BlockingItem is a
>>>
>>> Throwable
>>>
>>> and
>>>
>>> is not reflected in the REST API. Should that be included in
>>>
>>> the
>>>
>>> query
>>>
>>> response? And should we change that field to be a String so
>>>
>>> users
>>>
>>> may
>>>
>>> specify the cause via the REST API when they block some nodes
>>>
>>> /
>>>
>>> TMs?
>>>
>>> 3. Would it be useful to allow users to have different
>>>
>>> timeouts
>>>
>>> for
>>>
>>> different blocked items? So while there is a default timeout,
>>>
>>> users
>>>
>>> can
>>>
>>> also override it via the REST API when they block an entity.
>>>
>>> 4. Regarding the ADD operation, if the specified item is
>>>
>>> already
>>>
>>> there,
>>>
>>> will the block item be overridden? For example, if the user
>>>
>>> wants
>>>
>>> to
>>>
>>> extend
>>> the timeout of a blocked item, can they just  issue an ADD
>>>
>>> command
>>>
>>> again?
>>>
>>> 5. I am not quite familiar with the details of this, but is
>>>
>>> there
>>>
>>> a
>>>
>>> source
>>>
>>> of truth for the blocked list? I think it might be good to
>>>
>>> have a
>>>
>>> single
>>>
>>> source of truth for the blocked list and just propagate that
>>>
>>> list
>>>
>>> to
>>>
>>> different components to take the action of actually blocking
>>>
>>> the
>>>
>>> resource.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>>>
>>> wangdachui9501@gmail.com
>>>
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> Based on the discussion in the mailing list, I updated the
>>>
>>> FLIP
>>>
>>> doc,
>>>
>>> the
>>>
>>> changes include:
>>> 1. Changed the description of the motivation section to more
>>>
>>> clearly
>>>
>>> describe the problem this FLIP is trying to solve.
>>> 2. Only  *Manually* is supported.
>>> 3. Adopted some suggestions, such as *endTimestamp*.
>>>
>>> Best,
>>> Lijie
>>>
>>>
>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>>>
>>> Hi Lijie!
>>>
>>>
>>>
>>>
>>> *a) “Probably storing inside Zookeeper/Configmap might be
>>>
>>> helpfulhere.”
>>>
>>> Can you explain it in detail? I don't fully understand that.
>>>
>>> In
>>>
>>> myopinion,
>>>
>>> non-active and active are the same, and no special treatment
>>>
>>> isrequired.*
>>>
>>> Sorry this was a misunderstanding from my side. I thought we
>>>
>>> were
>>>
>>> talking
>>>
>>> about the HA mode (but not about Active and Standalone
>>>
>>> ResourceManager).
>>>
>>> And the original question was - how to handle the
>>>
>>> blacklisted
>>>
>>> nodes
>>>
>>> list
>>>
>>> at
>>>
>>> the moment of leader change? Should we simply forget about
>>>
>>> them
>>>
>>> or
>>>
>>> try to
>>>
>>> pre-save that list on the remote storage?
>>>
>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
>>>
>>> danrtsey.wy@gmail.com
>>>
>>> wrote:
>>>
>>> Thanks Lijie and ZhuZhu for the explanation.
>>>
>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>>>
>>> it
>>>
>>> is
>>>
>>> indeed
>>>
>>> some functionalities the external tools(e.g. kubectl taint)
>>>
>>> could
>>>
>>> not
>>>
>>> support.
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
>>>
>>> 22:18写道：
>>>
>>> Thanks for your feedback, Jiangang and Martijn.
>>>
>>> @Jiangang
>>>
>>>
>>> For auto-detecting, I wonder how to make the strategy
>>>
>>> and
>>>
>>> mark a
>>>
>>> node
>>>
>>> blocked?
>>>
>>> In fact, we currently plan to not support auto-detection
>>>
>>> in
>>>
>>> this
>>>
>>> FLIP.
>>>
>>> The
>>>
>>> part about auto-detection may be continued in a separate
>>>
>>> FLIP
>>>
>>> in
>>>
>>> the
>>>
>>> future. Some guys have the same concerns as you, and the
>>>
>>> correctness
>>>
>>> and
>>>
>>> necessity of auto-detection may require further discussion
>>>
>>> in
>>>
>>> the
>>>
>>> future.
>>>
>>> In session mode, multi jobs can fail on the same bad
>>>
>>> node
>>>
>>> and
>>>
>>> the
>>>
>>> node
>>>
>>> should be marked blocked.
>>> By design, the blocklist information will be shared among
>>>
>>> all
>>>
>>> jobs
>>>
>>> in a
>>>
>>> cluster/session. The JM will sync blocklist information
>>>
>>> with
>>>
>>> RM.
>>>
>>> @Martijn
>>>
>>> I agree with Yang Wang on this.
>>>
>>> As Zhu Zhu and I mentioned above, we think the
>>>
>>> MARK_BLOCKLISTED(Just
>>>
>>> limits
>>>
>>> the load of the node and does not  kill all the processes
>>>
>>> on
>>>
>>> it)
>>>
>>> is
>>>
>>> also
>>>
>>> important, and we think that external systems (*yarn
>>>
>>> rmadmin
>>>
>>> or
>>>
>>> kubectl
>>>
>>> taint*) cannot support it. So we think it makes sense even
>>>
>>> only
>>>
>>> *manually*.
>>>
>>> I also agree with Chesnay that magical mechanisms are
>>>
>>> indeed
>>>
>>> super
>>>
>>> hard
>>>
>>> to get right.
>>> Yes, as you see, Jiangang(and a few others) have the same
>>>
>>> concern.
>>>
>>> However, we currently plan to not support auto-detection
>>>
>>> in
>>>
>>> this
>>>
>>> FLIP,
>>>
>>> and
>>>
>>> only *manually*. In addition, I'd like to say that the
>>>
>>> FLIP
>>>
>>> provides
>>>
>>> a
>>>
>>> mechanism to support MARK_BLOCKLISTED and
>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>>> the auto-detection may be done by external systems.
>>>
>>> Best,
>>> Lijie
>>>
>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>>>
>>> 19:04写道：
>>>
>>> If we only support to block nodes manually, then I
>>>
>>> could
>>>
>>> not
>>>
>>> see
>>>
>>> the obvious advantages compared with current SRE's
>>>
>>> approach(via
>>>
>>> *yarn
>>>
>>> rmadmin or kubectl taint*).
>>>
>>> I agree with Yang Wang on this.
>>>
>>> To me this sounds yet again like one of those magical
>>>
>>> mechanisms
>>>
>>> that
>>>
>>> will rarely work just right.
>>>
>>> I also agree with Chesnay that magical mechanisms are
>>>
>>> indeed
>>>
>>> super
>>>
>>> hard
>>>
>>> to
>>>
>>> get right.
>>>
>>> Best regards,
>>>
>>> Martijn
>>>
>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>>>
>>> liujiangangpeng@gmail.com
>>>
>>> wrote:
>>>
>>> Thanks for the valuable design. The auto-detecting can
>>>
>>> decrease
>>>
>>> great
>>>
>>> work
>>>
>>> for us. We have implemented the similar feature in our
>>>
>>> inner
>>>
>>> flink
>>>
>>> version.
>>> Below is something that I care about:
>>>
>>>     1. For auto-detecting, I wonder how to make the
>>>
>>> strategy
>>>
>>> and
>>>
>>> mark a
>>>
>>> node
>>>     blocked? Sometimes the blocked node is hard to be
>>>
>>> detected,
>>>
>>> for
>>>
>>> example,
>>>     the upper node or the down node will be blocked when
>>>
>>> network
>>>
>>> unreachable.
>>>     2. I see that the strategy is made in JobMaster
>>>
>>> side.
>>>
>>> How
>>>
>>> about
>>>
>>>     implementing the similar logic in resource manager?
>>>
>>> In
>>>
>>> session
>>>
>>> mode,
>>>
>>> multi
>>>     jobs can fail on the same bad node and the node
>>>
>>> should
>>>
>>> be
>>>
>>> marked
>>>
>>> blocked.
>>>     If the job makes the strategy, the node may be not
>>>
>>> marked
>>>
>>> blocked
>>>
>>> if
>>>
>>> the
>>>     fail times don't exceed the threshold.
>>>
>>>
>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>>>
>>> Thank you for all your feedback!
>>>
>>> Besides the answers from Lijie, I'd like to share
>>>
>>> some
>>>
>>> of
>>>
>>> my
>>>
>>> thoughts:
>>>
>>> 1. Whether to enable automatical blocklist
>>> Generally speaking, it is not a goal of FLIP-224.
>>> The automatical way should be something built upon
>>>
>>> the
>>>
>>> blocklist
>>>
>>> mechanism and well decoupled. It was designed to be a
>>>
>>> configurable
>>>
>>> blocklist strategy, but I think we can further
>>>
>>> decouple
>>>
>>> it
>>>
>>> by
>>>
>>> introducing a abnormal node detector, as Becket
>>>
>>> suggested,
>>>
>>> which
>>>
>>> just
>>>
>>> uses the blocklist mechanism once bad nodes are
>>>
>>> detected.
>>>
>>> However,
>>>
>>> it
>>>
>>> should be a separate FLIP with further dev
>>>
>>> discussions
>>>
>>> and
>>>
>>> feedback
>>>
>>> from users. I also agree with Becket that different
>>>
>>> users
>>>
>>> have
>>>
>>> different
>>>
>>> requirements, and we should listen to them.
>>>
>>> 2. Is it enough to just take away abnormal nodes
>>>
>>> externally
>>>
>>> My answer is no. As Lijie has mentioned, we need a
>>>
>>> way
>>>
>>> to
>>>
>>> avoid
>>>
>>> deploying tasks to temporary hot nodes. In this case,
>>>
>>> users
>>>
>>> may
>>>
>>> just
>>>
>>> want to limit the load of the node and do not want to
>>>
>>> kill
>>>
>>> all
>>>
>>> the
>>>
>>> processes on it. Another case is the speculative
>>>
>>> execution[1]
>>>
>>> which
>>>
>>> may also leverage this feature to avoid starting
>>>
>>> mirror
>>>
>>> tasks on
>>>
>>> slow
>>>
>>> nodes.
>>>
>>> Thanks,
>>> Zhu
>>>
>>> [1]
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>>
>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>>>
>>> 15:56写道：
>>>
>>> Hi everyone,
>>>
>>>
>>> Thanks for your feedback.
>>>
>>>
>>> There's one detail that I'd like to re-emphasize
>>>
>>> here
>>>
>>> because
>>>
>>> it
>>>
>>> can
>>>
>>> affect the value and design of the blocklist
>>>
>>> mechanism
>>>
>>> (perhaps
>>>
>>> I
>>>
>>> should
>>>
>>> highlight it in the FLIP). We propose two actions in
>>>
>>> FLIP:
>>>
>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>>>
>>> node
>>>
>>> as
>>>
>>> blocked.
>>>
>>> Future slots should not be allocated from the blocked
>>>
>>> task
>>>
>>> manager
>>>
>>> or
>>>
>>> node.
>>>
>>> But slots that are already allocated will not be
>>>
>>> affected.
>>>
>>> A
>>>
>>> typical
>>>
>>> application scenario is to mitigate machine hotspots.
>>>
>>> In
>>>
>>> this
>>>
>>> case,
>>>
>>> we
>>>
>>> hope
>>>
>>> that subsequent resource allocations will not be on
>>>
>>> the
>>>
>>> hot
>>>
>>> machine,
>>>
>>> but
>>>
>>> tasks currently running on it should not be affected.
>>>
>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>>>
>>> task
>>>
>>> manager
>>>
>>> or
>>>
>>> node
>>>
>>> as
>>>
>>> blocked, and evacuate all tasks on it. Evacuated
>>>
>>> tasks
>>>
>>> will
>>>
>>> be
>>>
>>> restarted on
>>>
>>> non-blocked task managers.
>>>
>>> For the above 2 actions, the former may more
>>>
>>> highlight
>>>
>>> the
>>>
>>> meaning
>>>
>>> of
>>>
>>> this FLIP, because the external system cannot do
>>>
>>> that.
>>>
>>> Regarding *Manually* and *Automatically*, I
>>>
>>> basically
>>>
>>> agree
>>>
>>> with
>>>
>>> @Becket
>>>
>>> Qin: different users have different answers. Not all
>>>
>>> users’
>>>
>>> deployment
>>>
>>> environments have a special external system that can
>>>
>>> perform
>>>
>>> the
>>>
>>> anomaly
>>>
>>> detection. In addition, adding pluggable/optional
>>>
>>> auto-detection
>>>
>>> doesn't
>>>
>>> require much extra work on top of manual
>>>
>>> specification.
>>>
>>> I will answer your other questions one by one.
>>>
>>>
>>> @Yangze
>>>
>>> a) I think you are right, we do not need to expose
>>>
>>> the
>>> `cluster.resource-blocklist.item.timeout-check-interval`
>>>
>>> to
>>>
>>> users.
>>>
>>> b) We can abstract the `notifyException` to a
>>>
>>> separate
>>>
>>> interface
>>>
>>> (maybe
>>>
>>> BlocklistExceptionListener), and the
>>>
>>> ResourceManagerBlocklistHandler
>>>
>>> can
>>>
>>> implement it in the future.
>>>
>>> @Martijn
>>>
>>> a) I also think the manual blocking should be done
>>>
>>> by
>>>
>>> cluster
>>>
>>> operators.
>>>
>>> b) I think manual blocking makes sense, because
>>>
>>> according
>>>
>>> to
>>>
>>> my
>>>
>>> experience, users are often the first to perceive the
>>>
>>> machine
>>>
>>> problems
>>>
>>> (because of job failover or delay), and they will
>>>
>>> contact
>>>
>>> cluster
>>>
>>> operators
>>>
>>> to solve it, or even tell the cluster operators which
>>>
>>> machine is
>>>
>>> problematic. From this point of view, I think the
>>>
>>> people
>>>
>>> who
>>>
>>> really
>>>
>>> need
>>>
>>> the manual blocking are the users, and it’s just
>>>
>>> performed
>>>
>>> by
>>>
>>> the
>>>
>>> cluster
>>>
>>> operator, so I think the manual blocking makes sense.
>>>
>>> @Chesnay
>>>
>>> We need to touch the logic of JM/SlotPool, because
>>>
>>> for
>>>
>>> MARK_BLOCKLISTED
>>>
>>> , we need to know whether the slot is blocklisted
>>>
>>> when
>>>
>>> the
>>>
>>> task
>>>
>>> is
>>>
>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>>>
>>> release
>>>
>>> the
>>>
>>> slot
>>>
>>> directly to avoid assigning other tasks (of this job)
>>>
>>> on
>>>
>>> it.
>>>
>>> If
>>>
>>> we
>>>
>>> only
>>>
>>> maintain the blocklist information on the RM, JM
>>>
>>> needs
>>>
>>> to
>>>
>>> retrieve
>>>
>>> it
>>>
>>> by
>>>
>>> RPC. I think the performance overhead of that is
>>>
>>> relatively
>>>
>>> large,
>>>
>>> so
>>>
>>> I
>>>
>>> think it's worth maintaining the blocklist
>>>
>>> information
>>>
>>> on
>>>
>>> the JM
>>>
>>> side
>>>
>>> and
>>>
>>> syncing them.
>>>
>>> @Роман
>>>
>>>      a) “Probably storing inside Zookeeper/Configmap
>>>
>>> might
>>>
>>> be
>>>
>>> helpful
>>>
>>> here.”  Can you explain it in detail? I don't fully
>>>
>>> understand
>>>
>>> that.
>>>
>>> In
>>>
>>> my
>>>
>>> opinion, non-active and active are the same, and no
>>>
>>> special
>>>
>>> treatment
>>>
>>> is
>>>
>>> required.
>>>
>>> b) I agree with you, the `endTimestamp` makes
>>>
>>> sense,
>>>
>>> I
>>>
>>> will
>>>
>>> add
>>>
>>> it
>>>
>>> to
>>>
>>> FLIP.
>>>
>>> @Yang
>>>
>>> As mentioned above, AFAK, the external system
>>>
>>> cannot
>>>
>>> support
>>>
>>> the
>>>
>>> MARK_BLOCKLISTED action.
>>>
>>> Looking forward to your further feedback.
>>>
>>>
>>> Best,
>>>
>>> Lijie
>>>
>>>
>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>>>
>>> 21:09写道：
>>>
>>> Thanks Lijie and Zhu for creating the proposal.
>>>
>>> I want to share some thoughts about Flink cluster
>>>
>>> operations.
>>>
>>> In the production environment, the SRE(aka Site
>>>
>>> Reliability
>>>
>>> Engineer)
>>>
>>> already has many tools to detect the unstable
>>>
>>> nodes,
>>>
>>> which
>>>
>>> could
>>>
>>> take
>>>
>>> the
>>>
>>> system logs/metrics into consideration.
>>> Then they use graceful-decomission in YARN and
>>>
>>> taint
>>>
>>> in
>>>
>>> K8s
>>>
>>> to
>>>
>>> prevent
>>>
>>> new
>>>
>>> allocations on these unstable nodes.
>>> At last, they will evict all the containers and
>>>
>>> pods
>>>
>>> running
>>>
>>> on
>>>
>>> these
>>>
>>> nodes.
>>>
>>> This mechanism also works for planned maintenance.
>>>
>>> So
>>>
>>> I
>>>
>>> am
>>>
>>> afraid
>>>
>>> this
>>>
>>> is
>>>
>>> not the typical use case for FLIP-224.
>>>
>>> If we only support to block nodes manually, then I
>>>
>>> could
>>>
>>> not
>>>
>>> see
>>>
>>> the obvious advantages compared with current SRE's
>>>
>>> approach(via
>>>
>>> *yarn
>>>
>>> rmadmin or kubectl taint*).
>>> At least, we need to have a pluggable component
>>>
>>> which
>>>
>>> could
>>>
>>> expose
>>>
>>> the
>>>
>>> potential unstable nodes automatically and block
>>>
>>> them
>>>
>>> if
>>>
>>> enabled
>>>
>>> explicitly.
>>>
>>> Best,
>>> Yang
>>>
>>>
>>>
>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>>>
>>> 16:36写道：
>>>
>>> Thanks for the proposal, Lijie.
>>>
>>> This is an interesting feature and discussion,
>>>
>>> and
>>>
>>> somewhat
>>>
>>> related
>>>
>>> to the
>>>
>>> design principle about how people should operate
>>>
>>> Flink.
>>>
>>> I think there are three things involved in this
>>>
>>> FLIP.
>>>
>>>       a) Detect and report the unstable node.
>>>       b) Collect the information of the unstable
>>>
>>> node
>>>
>>> and
>>>
>>> form a
>>>
>>> blocklist.
>>>
>>>       c) Take the action to block nodes.
>>>
>>> My two cents:
>>>
>>> 1. It looks like people all agree that Flink
>>>
>>> should
>>>
>>> have
>>>
>>> c).
>>>
>>> It
>>>
>>> is
>>>
>>> not only
>>>
>>> useful for cases of node failures, but also
>>>
>>> handy
>>>
>>> for
>>>
>>> some
>>>
>>> planned
>>>
>>> maintenance.
>>>
>>> 2. People have different opinions on b), i.e.
>>>
>>> who
>>>
>>> should be
>>>
>>> the
>>>
>>> brain
>>>
>>> to
>>>
>>> make the decision to block a node. I think this
>>>
>>> largely
>>>
>>> depends
>>>
>>> on
>>>
>>> who we
>>>
>>> talk to. Different users would probably give
>>>
>>> different
>>>
>>> answers.
>>>
>>> For
>>>
>>> people
>>>
>>> who do have a centralized node health management
>>>
>>> service,
>>>
>>> let
>>>
>>> Flink
>>>
>>> do just
>>>
>>> do a) and c) would be preferred. So essentially
>>>
>>> Flink
>>>
>>> would
>>>
>>> be
>>>
>>> one
>>>
>>> of
>>>
>>> the
>>>
>>> sources that may detect unstable nodes, report
>>>
>>> it
>>>
>>> to
>>>
>>> that
>>>
>>> service,
>>>
>>> and then
>>>
>>> take the command from that service to block the
>>>
>>> problematic
>>>
>>> nodes.
>>>
>>> On
>>>
>>> the
>>>
>>> other hand, for users who do not have such a
>>>
>>> service,
>>>
>>> simply
>>>
>>> letting
>>>
>>> Flink
>>>
>>> be clever by itself to block the suspicious
>>>
>>> nodes
>>>
>>> might
>>>
>>> be
>>>
>>> desired
>>>
>>> to
>>>
>>> ensure the jobs are running smoothly.
>>>
>>> So that indicates a) and b) here should be
>>>
>>> pluggable /
>>>
>>> optional.
>>>
>>> In light of this, maybe it would make sense to
>>>
>>> have
>>>
>>> something
>>>
>>> pluggable
>>>
>>> like a UnstableNodeReporter which exposes
>>>
>>> unstable
>>>
>>> nodes
>>>
>>> actively.
>>>
>>> (A
>>>
>>> more
>>>
>>> general interface should be JobInfoReporter<T>
>>>
>>> which
>>>
>>> can be
>>>
>>> used
>>>
>>> to
>>>
>>> report
>>>
>>> any information of type <T>. But I'll just keep
>>>
>>> the
>>>
>>> scope
>>>
>>> relevant
>>>
>>> to
>>>
>>> this
>>>
>>> FLIP here). Personally speaking, I think it is
>>>
>>> OK
>>>
>>> to
>>>
>>> have a
>>>
>>> default
>>>
>>> implementation of a reporter which just tells
>>>
>>> Flink
>>>
>>> to
>>>
>>> take
>>>
>>> action
>>>
>>> to
>>>
>>> block
>>>
>>> problematic nodes and also unblocks them after
>>>
>>> timeout.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>>
>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>>>
>>> ro.v.boyko@gmail.com
>>>
>>> wrote:
>>>
>>> Thanks for good initiative, Lijie and Zhu!
>>>
>>> If it's possible I'd like to participate in
>>>
>>> development.
>>>
>>> I agree with 3rd point of Konstantin's reply -
>>>
>>> we
>>>
>>> should
>>>
>>> consider
>>>
>>> to move
>>>
>>> somehow the information of blocklisted
>>>
>>> nodes/TMs
>>>
>>> from
>>>
>>> active
>>>
>>> ResourceManager to non-active ones. Probably
>>>
>>> storing
>>>
>>> inside
>>>
>>> Zookeeper/Configmap might be helpful here.
>>>
>>> And I agree with Martijn that a lot of
>>>
>>> organizations
>>>
>>> don't
>>>
>>> want
>>>
>>> to
>>>
>>> expose
>>>
>>> such API for a cluster user group. But I think
>>>
>>> it's
>>>
>>> necessary
>>>
>>> to
>>>
>>> have the
>>>
>>> mechanism for unblocking the nodes/TMs anyway
>>>
>>> for
>>>
>>> avoiding
>>>
>>> incorrect
>>>
>>> automatic behaviour.
>>>
>>> And another one small suggestion - I think it
>>>
>>> would
>>>
>>> be
>>>
>>> better
>>>
>>> to
>>>
>>> extend
>>>
>>> the
>>>
>>> *BlocklistedItem* class with the
>>>
>>> *endTimestamp*
>>>
>>> field
>>>
>>> and
>>>
>>> fill
>>>
>>> it
>>>
>>> at the
>>>
>>> item creation. This simple addition will allow
>>>
>>> to:
>>>
>>>     -
>>>
>>>     Provide the ability to users to setup the
>>>
>>> exact
>>>
>>> time
>>>
>>> of
>>>
>>> blocklist end
>>>
>>>     through RestAPI
>>>     -
>>>
>>>     Not being tied to a single value of
>>>     *cluster.resource-blacklist.item.timeout*
>>>
>>>
>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>>>
>>> <
>>>
>>> chesnay@apache.org>
>>>
>>> wrote:
>>>
>>> I do share the concern between blurring the
>>>
>>> lines
>>>
>>> a
>>>
>>> bit.
>>>
>>> That said, I'd prefer to not have any
>>>
>>> auto-detection
>>>
>>> and
>>>
>>> only
>>>
>>> have an
>>>
>>> opt-in mechanism
>>> to manually block processes/nodes. To me
>>>
>>> this
>>>
>>> sounds
>>>
>>> yet
>>>
>>> again
>>>
>>> like one
>>>
>>> of those
>>> magical mechanisms that will rarely work
>>>
>>> just
>>>
>>> right.
>>>
>>> An external system can leverage way more
>>>
>>> information
>>>
>>> after
>>>
>>> all.
>>>
>>> Moreover, I'm quite concerned about the
>>>
>>> complexity
>>>
>>> of
>>>
>>> this
>>>
>>> proposal.
>>>
>>> Tracking on both the RM/JM side; syncing
>>>
>>> between
>>>
>>> components;
>>>
>>> adjustments
>>>
>>> to the
>>> slot and resource protocol.
>>>
>>> In a way it seems overly complicated.
>>>
>>> If we look at it purely from an active
>>>
>>> resource
>>>
>>> management
>>>
>>> perspective,
>>>
>>> then there
>>> isn't really a need to touch the slot
>>>
>>> protocol
>>>
>>> at
>>>
>>> all
>>>
>>> (or
>>>
>>> in
>>>
>>> fact
>>>
>>> to
>>>
>>> anything in the JobMaster),
>>> because there isn't any point in keeping
>>>
>>> around
>>>
>>> blocked
>>>
>>> TMs
>>>
>>> in
>>>
>>> the
>>>
>>> first
>>>
>>> place.
>>> They'd just be idling, potentially shutting
>>>
>>> down
>>>
>>> after
>>>
>>> a
>>>
>>> while
>>>
>>> by
>>>
>>> the
>>>
>>> RM
>>>
>>> because of
>>> it (unless we _also_ touch that logic).
>>> Here the blocking of a process (be it by
>>>
>>> blocking
>>>
>>> the
>>>
>>> process
>>>
>>> or
>>>
>>> node)
>>>
>>> is
>>>
>>> equivalent with shutting down the blocked
>>>
>>> process(es).
>>>
>>> Once the block is lifted we can just spin it
>>>
>>> back
>>>
>>> up.
>>>
>>> And I do wonder whether we couldn't apply
>>>
>>> the
>>>
>>> same
>>>
>>> line
>>>
>>> of
>>>
>>> thinking to
>>>
>>> standalone resource management.
>>> Here being able to stop/restart a
>>>
>>> process/node
>>>
>>> manually
>>>
>>> should
>>>
>>> be
>>>
>>> a
>>>
>>> core
>>>
>>> requirement for a Flink deployment anyway.
>>>
>>>
>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>>
>>> Hi everyone,
>>>
>>> Thanks for creating this FLIP. I can
>>>
>>> understand
>>>
>>> the
>>>
>>> problem
>>>
>>> and
>>>
>>> I see
>>>
>>> value
>>>
>>> in the automatic detection and
>>>
>>> blocklisting. I
>>>
>>> do
>>>
>>> have
>>>
>>> some
>>>
>>> concerns
>>>
>>> with
>>>
>>> the ability to manually specify to be
>>>
>>> blocked
>>>
>>> resources.
>>>
>>> I
>>>
>>> have
>>>
>>> two
>>>
>>> concerns;
>>>
>>> * Most organizations explicitly have a
>>>
>>> separation
>>>
>>> of
>>>
>>> concerns,
>>>
>>> meaning
>>>
>>> that
>>>
>>> there's a group who's responsible for
>>>
>>> managing a
>>>
>>> cluster
>>>
>>> and
>>>
>>> there's
>>>
>>> a
>>>
>>> user
>>>
>>> group who uses that cluster. With the
>>>
>>> introduction of
>>>
>>> this
>>>
>>> mechanism,
>>>
>>> the
>>>
>>> latter group now can influence the
>>>
>>> responsibility
>>>
>>> of
>>>
>>> the
>>>
>>> first
>>>
>>> group.
>>>
>>> So
>>>
>>> it
>>>
>>> can be possible that someone from the user
>>>
>>> group
>>>
>>> blocks
>>>
>>> something,
>>>
>>> which
>>>
>>> causes an outage (which could result in
>>>
>>> paging
>>>
>>> mechanism
>>>
>>> triggering
>>>
>>> etc)
>>>
>>> which impacts the first group.
>>> * How big is the group of people who can
>>>
>>> go
>>>
>>> through
>>>
>>> the
>>>
>>> process
>>>
>>> of
>>>
>>> manually
>>>
>>> identifying a node that isn't behaving as
>>>
>>> it
>>>
>>> should
>>>
>>> be? I
>>>
>>> do
>>>
>>> think
>>>
>>> this
>>>
>>> group is relatively limited. Does it then
>>>
>>> make
>>>
>>> sense
>>>
>>> to
>>>
>>> introduce
>>>
>>> such
>>>
>>> a
>>>
>>> feature, which would only be used by a
>>>
>>> really
>>>
>>> small
>>>
>>> user
>>>
>>> group
>>>
>>> of
>>>
>>> Flink?
>>>
>>> We
>>>
>>> still have to maintain, test and support
>>>
>>> such
>>>
>>> a
>>>
>>> feature.
>>>
>>> I'm +1 for the autodetection features, but
>>>
>>> I'm
>>>
>>> leaning
>>>
>>> towards
>>>
>>> not
>>>
>>> exposing
>>>
>>> this to the user group but having this
>>>
>>> available
>>>
>>> strictly
>>>
>>> for
>>>
>>> cluster
>>>
>>> operators. They could then also set up
>>>
>>> their
>>>
>>> paging/metrics/logging
>>>
>>> system
>>>
>>> to take this into account.
>>>
>>> Best regards,
>>>
>>> Martijn Visser
>>> https://twitter.com/MartijnVisser82
>>> https://github.com/MartijnVisser
>>>
>>>
>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>>>
>>> karmagyz@gmail.com
>>>
>>> wrote:
>>>
>>> Thanks for driving this, Zhu and Lijie.
>>>
>>> +1 for the overall proposal. Just share
>>>
>>> some
>>>
>>> cents
>>>
>>> here:
>>>
>>> - Why do we need to expose
>>>
>>> cluster.resource-blacklist.item.timeout-check-interval
>>>
>>> to
>>>
>>> the
>>>
>>> user?
>>>
>>> I think the semantics of
>>>
>>> `cluster.resource-blacklist.item.timeout`
>>>
>>> is
>>>
>>> sufficient for the user. How to guarantee
>>>
>>> the
>>>
>>> timeout
>>>
>>> mechanism is
>>>
>>> Flink's internal implementation. I think
>>>
>>> it
>>>
>>> will
>>>
>>> be
>>>
>>> very
>>>
>>> confusing
>>>
>>> and
>>>
>>> we do not need to expose it to users.
>>>
>>> - ResourceManager can notify the
>>>
>>> exception
>>>
>>> of a
>>>
>>> task
>>>
>>> manager to
>>>
>>> `BlacklistHandler` as well.
>>> For example, the slot allocation might
>>>
>>> fail
>>>
>>> in
>>>
>>> case
>>>
>>> the
>>>
>>> target
>>>
>>> task
>>>
>>> manager is busy or has a network jitter.
>>>
>>> I
>>>
>>> don't
>>>
>>> mean
>>>
>>> we
>>>
>>> need
>>>
>>> to
>>>
>>> cover
>>>
>>> this case in this version, but we can
>>>
>>> also
>>>
>>> open a
>>>
>>> `notifyException`
>>>
>>> in
>>>
>>> `ResourceManagerBlacklistHandler`.
>>>
>>> - Before we sync the blocklist to
>>>
>>> ResourceManager,
>>>
>>> will
>>>
>>> the
>>>
>>> slot of
>>>
>>> a
>>>
>>> blocked task manager continues to be
>>>
>>> released
>>>
>>> and
>>>
>>> allocated?
>>>
>>> Best,
>>> Yangze Guo
>>>
>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>>>
>>> Wang
>>>
>>> <
>>>
>>> wangdachui9501@gmail.com>
>>>
>>> wrote:
>>>
>>> Hi Konstantin,
>>>
>>> Thanks for your feedback. I will
>>>
>>> response
>>>
>>> your 4
>>>
>>> remarks:
>>>
>>> 1) Thanks for reminding me of the
>>>
>>> controversy. I
>>>
>>> think
>>>
>>> “BlockList”
>>>
>>> is
>>>
>>> good
>>>
>>> enough, and I will change it in FLIP.
>>>
>>>
>>> 2) Your suggestion for the REST API is a
>>>
>>> good
>>>
>>> idea.
>>>
>>> Based
>>>
>>> on
>>>
>>> the
>>>
>>> above, I
>>>
>>> would change REST API as following:
>>>
>>> POST/GET <host>/blocklist/nodes
>>>
>>> POST/GET <host>/blocklist/taskmanagers
>>>
>>> DELETE
>>>
>>> <host>/blocklist/node/<identifier>
>>>
>>> DELETE
>>>
>>> <host>/blocklist/taskmanager/<identifier>
>>>
>>> 3) If a node is blocking/blocklisted, it
>>>
>>> means
>>>
>>> that
>>>
>>> all
>>>
>>> task
>>>
>>> managers
>>>
>>> on
>>>
>>> this node are blocklisted. All slots on
>>>
>>> these
>>>
>>> TMs
>>>
>>> are
>>>
>>> not
>>>
>>> available.
>>>
>>> This
>>>
>>> is actually a bit like TM losts, but
>>>
>>> these
>>>
>>> TMs
>>>
>>> are
>>>
>>> not
>>>
>>> really
>>>
>>> lost,
>>>
>>> they
>>>
>>> are in an unavailable status, and they
>>>
>>> are
>>>
>>> still
>>>
>>> registered
>>>
>>> in this
>>>
>>> flink
>>>
>>> cluster. They will be available again
>>>
>>> once
>>>
>>> the
>>>
>>> corresponding
>>>
>>> blocklist
>>>
>>> item
>>>
>>> is removed. This behavior is the same in
>>>
>>> active/non-active
>>>
>>> clusters.
>>>
>>> However in the active clusters, these
>>>
>>> TMs
>>>
>>> may
>>>
>>> be
>>>
>>> released
>>>
>>> due
>>>
>>> to
>>>
>>> idle
>>>
>>> timeouts.
>>>
>>>
>>> 4) For the item timeout, I prefer to
>>>
>>> keep
>>>
>>> it.
>>>
>>> The
>>>
>>> reasons
>>>
>>> are
>>>
>>> as
>>>
>>> following:
>>>
>>> a) The timeout will not affect users
>>>
>>> adding
>>>
>>> or
>>>
>>> removing
>>>
>>> items
>>>
>>> via
>>>
>>> REST
>>>
>>> API,
>>>
>>> and users can disable it by configuring
>>>
>>> it
>>>
>>> to
>>>
>>> Long.MAX_VALUE .
>>>
>>> b) Some node problems can recover after
>>>
>>> a
>>>
>>> period of
>>>
>>> time
>>>
>>> (such as
>>>
>>> machine
>>>
>>> hotspots), in which case users may
>>>
>>> prefer
>>>
>>> that
>>>
>>> Flink
>>>
>>> can
>>>
>>> do
>>>
>>> this
>>>
>>> automatically instead of requiring the
>>>
>>> user
>>>
>>> to
>>>
>>> do
>>>
>>> it
>>>
>>> manually.
>>>
>>> Best,
>>>
>>> Lijie
>>>
>>> Konstantin Knauf <kn...@apache.org>
>>>
>>> 于2022年4月27日周三
>>>
>>> 19:23写道：
>>>
>>> Hi Lijie,
>>>
>>> I think, this makes sense and +1 to
>>>
>>> only
>>>
>>> support
>>>
>>> manually
>>>
>>> blocking
>>>
>>> taskmanagers and nodes. Maybe the
>>>
>>> different
>>>
>>> strategies
>>>
>>> can
>>>
>>> also be
>>>
>>> maintained outside of Apache Flink.
>>>
>>> A few remarks:
>>>
>>> 1) Can we use another term than
>>>
>>> "bla.cklist"
>>>
>>> due
>>>
>>> to
>>>
>>> the
>>>
>>> controversy
>>>
>>> around
>>>
>>> the term? [1] There was also a Jira
>>>
>>> Ticket
>>>
>>> about
>>>
>>> this
>>>
>>> topic a
>>>
>>> while
>>>
>>> back
>>>
>>> and there was generally a consensus to
>>>
>>> avoid
>>>
>>> the
>>>
>>> term
>>>
>>> blacklist &
>>>
>>> whitelist
>>>
>>> [2]? We could use "blocklist"
>>>
>>> "denylist"
>>>
>>> or
>>>
>>> "quarantined"
>>>
>>> 2) For the REST API, I'd prefer a
>>>
>>> slightly
>>>
>>> different
>>>
>>> design
>>>
>>> as
>>>
>>> verbs
>>>
>>> like
>>>
>>> add/remove often considered an
>>>
>>> anti-pattern
>>>
>>> for
>>>
>>> REST
>>>
>>> APIs.
>>>
>>> POST
>>>
>>> on a
>>>
>>> list
>>>
>>> item is generally the standard to add
>>>
>>> items.
>>>
>>> DELETE
>>>
>>> on
>>>
>>> the
>>>
>>> individual
>>>
>>> resource is standard to remove an item.
>>>
>>> POST <host>/quarantine/items
>>> DELETE
>>>
>>> <host>/quarantine/items/<itemidentifier>
>>>
>>> We could also consider to separate
>>>
>>> taskmanagers
>>>
>>> and
>>>
>>> nodes
>>>
>>> in
>>>
>>> the
>>>
>>> REST
>>>
>>> API
>>>
>>> (and internal data structures). Any
>>>
>>> opinion
>>>
>>> on
>>>
>>> this?
>>>
>>> POST/GET <host>/quarantine/nodes
>>> POST/GET <host>/quarantine/taskmanager
>>> DELETE
>>>
>>> <host>/quarantine/nodes/<identifier>
>>>
>>> DELETE
>>>
>>> <host>/quarantine/taskmanager/<identifier>
>>>
>>> 3) How would blocking nodes behave with
>>>
>>> non-active
>>>
>>> resource
>>>
>>> managers,
>>>
>>> i.e.
>>>
>>> standalone or reactive mode?
>>>
>>> 4) To keep the implementation even more
>>>
>>> minimal,
>>>
>>> do
>>>
>>> we
>>>
>>> need
>>>
>>> the
>>>
>>> timeout
>>>
>>> behavior? If items are added/removed
>>>
>>> manually
>>>
>>> we
>>>
>>> could
>>>
>>> delegate
>>>
>>> this
>>>
>>> to the
>>>
>>> user easily. In my opinion the timeout
>>>
>>> behavior
>>>
>>> would
>>>
>>> better
>>>
>>> fit
>>>
>>> into
>>>
>>> specific strategies at a later point.
>>>
>>> Looking forward to your thoughts.
>>>
>>> Cheers and thank you,
>>>
>>> Konstantin
>>>
>>> [1]
>>>
>>>
>>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>>
>>> [2]
>>>
>>> https://issues.apache.org/jira/browse/FLINK-18209
>>>
>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>>>
>>> schrieb
>>>
>>> Lijie
>>>
>>> Wang
>>>
>>> <
>>>
>>> wangdachui9501@gmail.com>:
>>>
>>> Hi all,
>>>
>>> Flink job failures may happen due to
>>>
>>> cluster
>>>
>>> node
>>>
>>> issues
>>>
>>> (insufficient
>>>
>>> disk
>>>
>>> space, bad hardware, network
>>>
>>> abnormalities).
>>>
>>> Flink
>>>
>>> will
>>>
>>> take care
>>>
>>> of
>>>
>>> the
>>>
>>> failures and redeploy the tasks.
>>>
>>> However,
>>>
>>> due
>>>
>>> to
>>>
>>> data
>>>
>>> locality
>>>
>>> and
>>>
>>> limited
>>>
>>> resources, the new tasks are very
>>>
>>> likely
>>>
>>> to
>>>
>>> be
>>>
>>> redeployed
>>>
>>> to the
>>>
>>> same
>>>
>>> nodes, which will result in continuous
>>>
>>> task
>>>
>>> abnormalities
>>>
>>> and
>>>
>>> affect
>>>
>>> job
>>>
>>> progress.
>>>
>>> Currently, Flink users need to
>>>
>>> manually
>>>
>>> identify
>>>
>>> the
>>>
>>> problematic
>>>
>>> node and
>>>
>>> take it offline to solve this problem.
>>>
>>> But
>>>
>>> this
>>>
>>> approach
>>>
>>> has
>>>
>>> following
>>>
>>> disadvantages:
>>>
>>> 1. Taking a node offline can be a
>>>
>>> heavy
>>>
>>> process.
>>>
>>> Users
>>>
>>> may
>>>
>>> need
>>>
>>> to
>>>
>>> contact
>>>
>>> cluster administors to do this. The
>>>
>>> operation
>>>
>>> can
>>>
>>> even
>>>
>>> be
>>>
>>> dangerous
>>>
>>> and
>>>
>>> not
>>>
>>> allowed during some important business
>>>
>>> events.
>>>
>>> 2. Identifying and solving this kind
>>>
>>> of
>>>
>>> problems
>>>
>>> manually
>>>
>>> would
>>>
>>> be
>>>
>>> slow
>>>
>>> and
>>>
>>> a waste of human resources.
>>>
>>> To solve this problem, Zhu Zhu and I
>>>
>>> propose
>>>
>>> to
>>>
>>> introduce a
>>>
>>> blacklist
>>>
>>> mechanism for Flink to filter out
>>>
>>> problematic
>>>
>>> resources.
>>>
>>> You can find more details in
>>>
>>> FLIP-224[1].
>>>
>>> Looking
>>>
>>> forward
>>>
>>> to your
>>>
>>> feedback.
>>>
>>> [1]
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>>
>>> Best,
>>>
>>> Lijie
>>>
>>> --
>>> Best regards,
>>> Roman Boyko
>>> e.: ro.v.boyko@gmail.com
>>>
>>> --
>>> https://twitter.com/snntrable
>>> https://github.com/knaufk
>>>
>>> --
>>> https://twitter.com/snntrable
>>> https://github.com/knaufk
>>>
>>>
>>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.

Hi everyone,
Thank you all for the feedback!

We receive concerns that the blocklist feature is not strongly
required except for the needs of speculative execution. So as the
first step, we will limit the scope of FLIP-224 to only support
speculative execution, and therefore, not add public interfaces for
users to manipulate with the blocklist directly.

If later we receive strong requirements for blocklist from users, we
will have another FLIP to open it to users with well designed public
interfaces and web UI.

Thanks,
Zhu

Zhu Zhu <re...@gmail.com> 于2022年6月10日周五 18:45写道：
>
> 1) With the declarative slot allocation protocol, it's not easy to
> keep the slow slots which may have satisfy the slot request already,
> but ask for more slots from the resource manager.
> And we are also concerned to use detected slow slots, which may add
> load to the slow nodes and further slow down the tasks on the slow
> nodes. Because we found slow nodes are mostly caused by heavy loads.
>
> 2) I see your point. The blocker here is that batch jobs currently
> uses DefaultScheduler. The scheduler does not see slots directly and
> does not know which of them are from slow nodes, so it's hard for it
> to rescale the vertex according to this information. Besides that, it
> cannot help with the stage that first observed and affected the slow
> nodes.
>
> Thanks,
> Zhu
>
> Chesnay Schepler <ch...@apache.org> 于2022年6月10日周五 17:04写道：
> >
> > 1)
> > It's true that if we handle this entirely in the scheduler we may get a bunch of slow slots from the RM.
> > My point is that this isn't necessarily a problem; it depends on how you handle those slots.
> >
> > We anyway need to account for the possibility that we're not getting any new fast slots from the RM.
> > With that in mind, I'd rather not categorically throw away the slow slots we got, but try to make use of them somehow.
> >
> > 2)
> > The idea was to rescale the job when the current stage finishes.
> > The speculative execution handles slow nodes being detected in the current stage,
> > next stage we use _some_ strategy to handle slow nodes (be it ignoring those, rescaling the job, ...).
> >
> > 3)
> > It's a fair observation that once you push this to the RM you end up with a de-facto blocklist :)
> >
> > On 08/06/2022 17:11, Zhu Zhu wrote:
> >
> > Thanks Chesnay for the feedback in the vote thread.
> > (https://lists.apache.org/thread/opc7jg3rpxnwotkb0fcn4wnm02m4397o)
> >
> > I'd like to continue the discussion in this thread so that the
> > discussions can be better tracked.
> >
> > Regarding your questions, here are my thoughts:
> > 1. The current locality mechanism does not work well to avoid
> > deploying tasks to slow nodes because it cannot proactively reject or
> > release slots from the slow nodes. And it cannot help at the resource
> > manager side to avoid allocating free slots or launching new
> > TaskManagers on slow nodes.
> > 2. Dynamically downscaling or upscaling a batch job is usually
> > unacceptable because it means to re-run the whole stage.
> > 3. Extending the requirement declaration to have a notion of
> > "undesirable nodes" is an option. And it is actually how we started.
> > Considering implementation details, we found we need that
> > - a tracker to collect all the undesirable(detected slow nodes)
> > - to filter out slots from undesirable nodes when allocating slots
> > from the SlotPool
> > - ask the ResourceManager for slots that are not on the undesirable
> > nodes. The ResourceManager may further need to ask for new
> > TaskManagers that are not on the undesirable nodes.
> > Then, with all these functionality, we found that we almost have a
> > blocklist mechanism. As blocklist mechanism is a common concept and is
> > possible to benefit users, we took this chance to propose the
> > blocklist mechanism.
> > 4. A cluster-wide shared blocklist is not a must for speculative
> > execution at the moment. It is mainly part of a standalone blocklist
> > feature to host user specified block items. To avoid sharing job
> > specific blocked items between jobs, one way is to add a nullable
> > JobID tag for the blocked item.
> >
> > Thanks,
> > Zhu
> >
> > Zhu Zhu <re...@gmail.com> 于2022年6月7日周二 10:58写道：
> >
> > Hi Chesnay,
> >
> > Would you please take a look at the FLIP and discussion to see if all
> > your concerns have been addressed?
> >
> > Thanks,
> > Zhu
> >
> > Zhu Zhu <re...@gmail.com> 于2022年5月28日周六 13:26写道：
> >
> > Regarding the concern of the SlotManager, my two cents:
> > 1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
> >   a. In standalone mode, a taskmanager may be temporarily added to
> > the blocklist. We do not want the TM to get disconnected and shut down.
> > So we need to keep its connection to RM and keep hosting its slots.
> >   b. When we want to avoid allocating slots to a slow nodes but do not
> > want to kill current running tasks on the nodes (MARK_BLOCKED mode).
> >
> > There is possible a way to keep the connection of a blocked task manager
> > while hide its slots from SlotManager, but I feel it may be even much more
> > complicated.
> >
> > 2. It will not complicate the SlotManager too much. The SlotManager will
> > be offered a BlockedTaskManagerChecker when created, and just need
> > to use it to filter out blocked slots on slot request. Therefore I think the
> > complication is acceptable.
> >
> > Thanks,
> > Zhu
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
> >
> > Hi everyone,
> >
> > I've updated the FLIP according to Chesnay's feedback, changes as follows:
> > 1. Change the GET result to a map.
> > 2. Only left *endTimestamp* in ADD operation, and change the rest (from
> > POST) to PUT
> > 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
> > encapsulate blocklist related functions.
> > 4. Remove *mainThread* from BlocklistTracker, instead provide a
> > *removeTimeoutItems* method to be called by outside components。
> >
> > Best,
> > Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
> >
> > Hi Chesnay,
> >
> > Thanks for feedback.
> >
> > 1. Regarding the TM/Node id. Do you mean special characters may appear in
> > the rest URL?  Actually, I don't think so. The task manager id in REST API
> > should be the *ResourceID* of taskmanager in Flink, there should be no
> > special characters, and some existing REST APIs are already using it, e.g.
> > GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
> > id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
> > should also have no special characters.
> > 2. Regarding the GET query responses. I agree with you, it makes sense to
> > change the GET result to a map.
> >
> > 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> > cover everything, and the endTimestamp is a unix timestamp, there should be
> > no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
> > add rest api is add or update, PUT can cover this semantics.
> >
> > 4. Regarding the slot pool/manager. I don't think the current slotpool
> > and slotmanager are able to support the MARK_BLOCKED(slots that are
> > already allocated will not be affected) action. The reasons are as
> > follows：
> >
> > a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
> > from reserved(task assigned) to free(no task assigned), it is necessary to
> > check whether the slot should be released immediately(it should be released
> > immediately if the task manager is blocked, otherwise it may be allocated
> > to other tasks). I think it cannot be supported without being aware of
> > the blocklist information. Compared to the solution in FLIP, a more
> > appropriate/prefered way may be: Introduce a new slot pool
> > implementation for blocklist(may be named BlocklistSlotPool, it
> > extends/wrapps the original slot pool), and implement the parts that need
> > to be aware of the blocklist in this newly introduced slot pool, and the
> > original slot pool basically does not need to change.
> >
> > b) for slot manager, with the MARK_BLOCKED action, there may be free but
> > blocked slots in slot manager (the corresponding TMs cannot be
> > released/unregistered because there are still running tasks on them).
> > Therefore, we need to filter out the blocked slots when trying to fulfill
> > the slot requirements. Therefore it also needs to know the blocklist information.
> > A better way may be to abstract a resource allocation strategy, and make
> > the blocklist as a special implementation, then pass the resource
> > allocation strategy in when constructing the slot manager. Unfortunately,
> > the data structures in the two existing slot manager
> > implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
> > quite different, it is not easy to abstract a common resource allocation
> > strategy, so we prefer to keep the current way(i.e. pass the blocklist
> > information directly into slot manager).
> >
> >
> > 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
> > does not need to be aware of the executor, and the timeout actions can be
> > done outside.
> >
> > Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
> >
> > I have a number of concerns:
> >
> > Is the id used for deleting an item the same sent in the initial request
> > (and not one returned by Flink)?
> > I'm very concerned that the tm/node id can contain special characters.
> >
> > The GET query should return a map, not a list of items. This makes it
> > easier to work with.
> >
> > The duality of endTimestamp and duration is also concerning.
> > If we conclude that endTimestamps can in fact work (and aren't utterly
> > unusable due to timezones),
> > then this should be able to cover everything and rid us of some
> > complexity w.r.t. POSTs to the same ID.
> > Additions would be a PUT, changes a PATCH, deletes a DELETE.
> >
> >
> > I also dislike how we're pushing more functionality into the
> > slotpool/-manager.
> > These components are complex enough as-is, and instead I'd propose a
> > separate component that interacts with the SlotPool/-Manager instead,
> > for example by removing the slots from that TM.
> > The reason being that from the slot-pool perspective it is irrelevant
> > whether a slot is gone because the TM was lost, or because it was blocked.
> >
> >
> > The BlocklistTracker shouldn't be modeled as component that is aware of
> > the concept of main threads.
> > It really has no business knowing that; all it needs is an executor for
> > handling timeouts/periodic actions,
> > and a way to interact with the JM/RM (which internally can handle the
> > scheduling into the main thread).
> >
> >
> > On 20/05/2022 07:20, Lijie Wang wrote:
> >
> > Hi everyone,
> >
> > I have started a vote for this FLIP [1]. Please cast your vote there or
> >
> > ask
> >
> > additional questions here. [1]
> > https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> >
> > Best, Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
> >
> > Hi Konstantin,
> >
> > We found that Flink REST URL does not support the format ":merge" ,
> >
> > which
> >
> > will be recognized as a parameter in the URL(due to start with a
> >
> > colon).
> >
> > We will keep the previous way, i.e.
> >
> > POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> > and the "id" and "merge" flag are put into the request body.
> >
> > Best,
> > Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
> >
> > Hi Weihua,
> > thanks for feedback.
> >
> > 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> > step towards auto-detection.
> > 2. We wii print the blocked nodes in logs. Maybe also put it into the
> > exception of insufficient resources.
> > 3. No. This FLIP won't change the WebUI. The blocklist information
> >
> > can be
> >
> > obtained through REST API and metrics.
> >
> > Best,
> > Lijie
> >
> > Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
> >
> > Hi,
> > Thanks for creating this FLIP.
> > We have implemented an automatic blocklist detection mechanism
> > internally, which is indeed very effective for handling node
> >
> > failures.
> >
> > Due to the large number of nodes, although SREs already support
> > automatic offline failure nodes, the detection is not 100% accurate
> >
> > and
> >
> > there is some delay.
> > So the blocklist mechanism can make flink job recover from failure
> >
> > much
> >
> > faster.
> >
> > Here are some of my thoughts:
> > 1. In this FLIP, it needs users to locate machine failure manually,
> > there is a certain cost of use
> > 2. What happens if too many nodes are blocked, resulting in
> >
> > insufficient
> >
> > resources? Will there be a special Exception for the user?
> > 3. Will we display the blocklist information in the WebUI? The
> >
> > blocklist
> >
> > is for cluster level, and if multiple users share a cluster, some
> >
> > users may
> >
> > be a little confused when resources are not enough, or when
> >
> > resources are
> >
> > applied for more.
> >
> > Also, Looking forward to the next FLIP on auto-detection.
> >
> > Best,
> > Weihua
> >
> > 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> >
> > Hi Konstantin,
> >
> > Maybe change it to the following:
> >
> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > Merge is not allowed. If the {id} already exists, return error.
> >
> > Otherwise,
> >
> > create a new item.
> >
> > 2. POST: http://
> >
> > {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >
> > Merge is allowed. If the {id} already exists, merge. Otherwise,
> >
> > create
> >
> > a
> >
> > new item.
> >
> > WDYT?
> >
> > Best,
> > Lijie
> >
> > Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> >
> > Hi Lijie,
> >
> > hm, maybe the following is more appropriate in that case
> >
> > POST: http://
> >
> > {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >
> > Best,
> >
> > Konstantin
> >
> > Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> > wangdachui9501@gmail.com>:
> >
> > Hi Konstantin,
> > thanks for your feedback.
> >
> >  From what I understand, PUT should be idempotent. However, we
> >
> > have a
> >
> > *timeout* field in the request. This means that initiating the
> >
> > same
> >
> > request
> >
> > at two different times will lead to different resource status
> >
> > (timestamps
> >
> > of the items to be removed will be different).
> >
> > Should we use PUT in this case? WDYT?
> >
> > Best,
> > Lijie
> >
> > Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> >
> > Hi Lijie,
> >
> > wouldn't the REST API-idiomatic way for an update/replace be a
> >
> > PUT
> >
> > on
> >
> > the
> >
> > resource?
> >
> > PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >
> > Best,
> >
> > Konstantin
> >
> >
> >
> > Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> > wangdachui9501@gmail.com>:
> >
> > Hi everyone,
> >
> > I've had an offline discussion with Becket Qin and Zhu Zhu, and
> >
> > made
> >
> > the
> >
> > following changes on REST API:
> > 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
> >
> > choose
> >
> > one.
> >
> > If
> >
> > both are specified, will return error.
> > 2.  If the specified item is already there, the *ADD* operation
> >
> > has
> >
> > two
> >
> > behaviors:  *return error*(default value) or *merge/update*,
> >
> > and we
> >
> > add a
> >
> > flag to the request body to control it. You can find more
> >
> > details
> >
> > "Public
> >
> > Interface" section.
> >
> > If there is no more feedback, we will start the vote thread next
> >
> > week.
> >
> > Best,
> > Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> >
> > Hi Becket Qin,
> >
> > Thanks for your suggestions.  I have moved the description of
> > configurations, metrics and REST API into "Public Interface"
> >
> > section,
> >
> > and
> >
> > made a few updates according to your suggestion.  And in this
> >
> > FLIP,
> >
> > there
> >
> > no public java Interfaces or pluggables that users need to
> >
> > implement
> >
> > by
> >
> > themselves.
> >
> > Answers for you questions:
> > 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> >
> > items
> >
> > can
> >
> > only be added through the REST API, so these 2 action are
> >
> > mentioned
> >
> > in
> >
> > the
> >
> > REST API part (The REST API part has beed moved to public
> >
> > interface
> >
> > now).
> >
> > 2. I agree with you. I have changed the "Cause" field to
> >
> > String,
> >
> > and
> >
> > allow
> >
> > users to specify it via REST API.
> > 3. Yes, it is useful to allow different timeouts. As mentioned
> >
> > above,
> >
> > we
> >
> > will introduce 2 fields : *timeout* and *endTimestamp* into the
> >
> > ADD
> >
> > REST
> >
> > API to specify when to remove the blocked item. These 2 fields
> >
> > are
> >
> > optional, if neither is specified, it means that the blocked
> >
> > item
> >
> > is
> >
> > permanent and will not be removed. If both are specified, the
> >
> > minimum
> >
> > of
> >
> > *currentTimestamp+tiemout *and* endTimestamp* will be used as
> >
> > the
> >
> > time
> >
> > to
> >
> > remove the blocked item. To keep the configurations more
> >
> > minimal,
> >
> > we
> >
> > have
> >
> > removed the *cluster.resource-blocklist.item.timeout*
> >
> > configuration
> >
> > option.
> > 4. Yes, the block item will be overridden if the specified item
> >
> > already
> >
> > exists. The ADD operation is *ADD or UPDATE*.
> > 5. Yes. On JM/RM side, all the blocklist information is
> >
> > maintained
> >
> > in
> >
> > JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> >
> > abstracted
> >
> > to other interfaces) will be propagated to different
> >
> > components.
> >
> > Best,
> > Lijie
> >
> > Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> >
> > Hi Lijie,
> >
> > Thanks for updating the FLIP. It looks like the public
> >
> > interface
> >
> > section
> >
> > did not fully reflect all the user sensible behavior and API.
> >
> > Can
> >
> > you
> >
> > put
> >
> > everything that users may be aware of there? That would
> >
> > include
> >
> > the
> >
> > REST
> >
> > API, metrics, configurations, public java Interfaces or
> >
> > pluggables
> >
> > that
> >
> > users may see or implement by themselves, as well as a brief
> >
> > summary
> >
> > of
> >
> > the
> > behavior of the public API.
> >
> > Besides that, I have a few questions:
> >
> > 1. According to the conversation in the discussion thread, it
> >
> > looks
> >
> > like
> >
> > the BlockAction will have "MARK_BLOCKLISTED" and
> > "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
> >
> > so,
> >
> > can
> >
> > you
> >
> > add
> > that to the public interface as well?
> >
> > 2. At this point, the "Cause" field in the BlockingItem is a
> >
> > Throwable
> >
> > and
> >
> > is not reflected in the REST API. Should that be included in
> >
> > the
> >
> > query
> >
> > response? And should we change that field to be a String so
> >
> > users
> >
> > may
> >
> > specify the cause via the REST API when they block some nodes
> >
> > /
> >
> > TMs?
> >
> > 3. Would it be useful to allow users to have different
> >
> > timeouts
> >
> > for
> >
> > different blocked items? So while there is a default timeout,
> >
> > users
> >
> > can
> >
> > also override it via the REST API when they block an entity.
> >
> > 4. Regarding the ADD operation, if the specified item is
> >
> > already
> >
> > there,
> >
> > will the block item be overridden? For example, if the user
> >
> > wants
> >
> > to
> >
> > extend
> > the timeout of a blocked item, can they just  issue an ADD
> >
> > command
> >
> > again?
> >
> > 5. I am not quite familiar with the details of this, but is
> >
> > there
> >
> > a
> >
> > source
> >
> > of truth for the blocked list? I think it might be good to
> >
> > have a
> >
> > single
> >
> > source of truth for the blocked list and just propagate that
> >
> > list
> >
> > to
> >
> > different components to take the action of actually blocking
> >
> > the
> >
> > resource.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> >
> > wangdachui9501@gmail.com
> >
> > wrote:
> >
> > Hi everyone,
> >
> > Based on the discussion in the mailing list, I updated the
> >
> > FLIP
> >
> > doc,
> >
> > the
> >
> > changes include:
> > 1. Changed the description of the motivation section to more
> >
> > clearly
> >
> > describe the problem this FLIP is trying to solve.
> > 2. Only  *Manually* is supported.
> > 3. Adopted some suggestions, such as *endTimestamp*.
> >
> > Best,
> > Lijie
> >
> >
> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >
> > Hi Lijie!
> >
> >
> >
> >
> > *a) “Probably storing inside Zookeeper/Configmap might be
> >
> > helpfulhere.”
> >
> > Can you explain it in detail? I don't fully understand that.
> >
> > In
> >
> > myopinion,
> >
> > non-active and active are the same, and no special treatment
> >
> > isrequired.*
> >
> > Sorry this was a misunderstanding from my side. I thought we
> >
> > were
> >
> > talking
> >
> > about the HA mode (but not about Active and Standalone
> >
> > ResourceManager).
> >
> > And the original question was - how to handle the
> >
> > blacklisted
> >
> > nodes
> >
> > list
> >
> > at
> >
> > the moment of leader change? Should we simply forget about
> >
> > them
> >
> > or
> >
> > try to
> >
> > pre-save that list on the remote storage?
> >
> > On Sat, 7 May 2022 at 10:51, Yang Wang <
> >
> > danrtsey.wy@gmail.com
> >
> > wrote:
> >
> > Thanks Lijie and ZhuZhu for the explanation.
> >
> > I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> >
> > it
> >
> > is
> >
> > indeed
> >
> > some functionalities the external tools(e.g. kubectl taint)
> >
> > could
> >
> > not
> >
> > support.
> >
> >
> > Best,
> > Yang
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
> >
> > 22:18写道：
> >
> > Thanks for your feedback, Jiangang and Martijn.
> >
> > @Jiangang
> >
> >
> > For auto-detecting, I wonder how to make the strategy
> >
> > and
> >
> > mark a
> >
> > node
> >
> > blocked?
> >
> > In fact, we currently plan to not support auto-detection
> >
> > in
> >
> > this
> >
> > FLIP.
> >
> > The
> >
> > part about auto-detection may be continued in a separate
> >
> > FLIP
> >
> > in
> >
> > the
> >
> > future. Some guys have the same concerns as you, and the
> >
> > correctness
> >
> > and
> >
> > necessity of auto-detection may require further discussion
> >
> > in
> >
> > the
> >
> > future.
> >
> > In session mode, multi jobs can fail on the same bad
> >
> > node
> >
> > and
> >
> > the
> >
> > node
> >
> > should be marked blocked.
> > By design, the blocklist information will be shared among
> >
> > all
> >
> > jobs
> >
> > in a
> >
> > cluster/session. The JM will sync blocklist information
> >
> > with
> >
> > RM.
> >
> > @Martijn
> >
> > I agree with Yang Wang on this.
> >
> > As Zhu Zhu and I mentioned above, we think the
> >
> > MARK_BLOCKLISTED(Just
> >
> > limits
> >
> > the load of the node and does not  kill all the processes
> >
> > on
> >
> > it)
> >
> > is
> >
> > also
> >
> > important, and we think that external systems (*yarn
> >
> > rmadmin
> >
> > or
> >
> > kubectl
> >
> > taint*) cannot support it. So we think it makes sense even
> >
> > only
> >
> > *manually*.
> >
> > I also agree with Chesnay that magical mechanisms are
> >
> > indeed
> >
> > super
> >
> > hard
> >
> > to get right.
> > Yes, as you see, Jiangang(and a few others) have the same
> >
> > concern.
> >
> > However, we currently plan to not support auto-detection
> >
> > in
> >
> > this
> >
> > FLIP,
> >
> > and
> >
> > only *manually*. In addition, I'd like to say that the
> >
> > FLIP
> >
> > provides
> >
> > a
> >
> > mechanism to support MARK_BLOCKLISTED and
> > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > the auto-detection may be done by external systems.
> >
> > Best,
> > Lijie
> >
> > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> >
> > 19:04写道：
> >
> > If we only support to block nodes manually, then I
> >
> > could
> >
> > not
> >
> > see
> >
> > the obvious advantages compared with current SRE's
> >
> > approach(via
> >
> > *yarn
> >
> > rmadmin or kubectl taint*).
> >
> > I agree with Yang Wang on this.
> >
> > To me this sounds yet again like one of those magical
> >
> > mechanisms
> >
> > that
> >
> > will rarely work just right.
> >
> > I also agree with Chesnay that magical mechanisms are
> >
> > indeed
> >
> > super
> >
> > hard
> >
> > to
> >
> > get right.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> >
> > liujiangangpeng@gmail.com
> >
> > wrote:
> >
> > Thanks for the valuable design. The auto-detecting can
> >
> > decrease
> >
> > great
> >
> > work
> >
> > for us. We have implemented the similar feature in our
> >
> > inner
> >
> > flink
> >
> > version.
> > Below is something that I care about:
> >
> >    1. For auto-detecting, I wonder how to make the
> >
> > strategy
> >
> > and
> >
> > mark a
> >
> > node
> >    blocked? Sometimes the blocked node is hard to be
> >
> > detected,
> >
> > for
> >
> > example,
> >    the upper node or the down node will be blocked when
> >
> > network
> >
> > unreachable.
> >    2. I see that the strategy is made in JobMaster
> >
> > side.
> >
> > How
> >
> > about
> >
> >    implementing the similar logic in resource manager?
> >
> > In
> >
> > session
> >
> > mode,
> >
> > multi
> >    jobs can fail on the same bad node and the node
> >
> > should
> >
> > be
> >
> > marked
> >
> > blocked.
> >    If the job makes the strategy, the node may be not
> >
> > marked
> >
> > blocked
> >
> > if
> >
> > the
> >    fail times don't exceed the threshold.
> >
> >
> > Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >
> > Thank you for all your feedback!
> >
> > Besides the answers from Lijie, I'd like to share
> >
> > some
> >
> > of
> >
> > my
> >
> > thoughts:
> >
> > 1. Whether to enable automatical blocklist
> > Generally speaking, it is not a goal of FLIP-224.
> > The automatical way should be something built upon
> >
> > the
> >
> > blocklist
> >
> > mechanism and well decoupled. It was designed to be a
> >
> > configurable
> >
> > blocklist strategy, but I think we can further
> >
> > decouple
> >
> > it
> >
> > by
> >
> > introducing a abnormal node detector, as Becket
> >
> > suggested,
> >
> > which
> >
> > just
> >
> > uses the blocklist mechanism once bad nodes are
> >
> > detected.
> >
> > However,
> >
> > it
> >
> > should be a separate FLIP with further dev
> >
> > discussions
> >
> > and
> >
> > feedback
> >
> > from users. I also agree with Becket that different
> >
> > users
> >
> > have
> >
> > different
> >
> > requirements, and we should listen to them.
> >
> > 2. Is it enough to just take away abnormal nodes
> >
> > externally
> >
> > My answer is no. As Lijie has mentioned, we need a
> >
> > way
> >
> > to
> >
> > avoid
> >
> > deploying tasks to temporary hot nodes. In this case,
> >
> > users
> >
> > may
> >
> > just
> >
> > want to limit the load of the node and do not want to
> >
> > kill
> >
> > all
> >
> > the
> >
> > processes on it. Another case is the speculative
> >
> > execution[1]
> >
> > which
> >
> > may also leverage this feature to avoid starting
> >
> > mirror
> >
> > tasks on
> >
> > slow
> >
> > nodes.
> >
> > Thanks,
> > Zhu
> >
> > [1]
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> >
> > 15:56写道：
> >
> > Hi everyone,
> >
> >
> > Thanks for your feedback.
> >
> >
> > There's one detail that I'd like to re-emphasize
> >
> > here
> >
> > because
> >
> > it
> >
> > can
> >
> > affect the value and design of the blocklist
> >
> > mechanism
> >
> > (perhaps
> >
> > I
> >
> > should
> >
> > highlight it in the FLIP). We propose two actions in
> >
> > FLIP:
> >
> > 1) MARK_BLOCKLISTED: Just mark the task manager or
> >
> > node
> >
> > as
> >
> > blocked.
> >
> > Future slots should not be allocated from the blocked
> >
> > task
> >
> > manager
> >
> > or
> >
> > node.
> >
> > But slots that are already allocated will not be
> >
> > affected.
> >
> > A
> >
> > typical
> >
> > application scenario is to mitigate machine hotspots.
> >
> > In
> >
> > this
> >
> > case,
> >
> > we
> >
> > hope
> >
> > that subsequent resource allocations will not be on
> >
> > the
> >
> > hot
> >
> > machine,
> >
> > but
> >
> > tasks currently running on it should not be affected.
> >
> > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> >
> > task
> >
> > manager
> >
> > or
> >
> > node
> >
> > as
> >
> > blocked, and evacuate all tasks on it. Evacuated
> >
> > tasks
> >
> > will
> >
> > be
> >
> > restarted on
> >
> > non-blocked task managers.
> >
> > For the above 2 actions, the former may more
> >
> > highlight
> >
> > the
> >
> > meaning
> >
> > of
> >
> > this FLIP, because the external system cannot do
> >
> > that.
> >
> > Regarding *Manually* and *Automatically*, I
> >
> > basically
> >
> > agree
> >
> > with
> >
> > @Becket
> >
> > Qin: different users have different answers. Not all
> >
> > users’
> >
> > deployment
> >
> > environments have a special external system that can
> >
> > perform
> >
> > the
> >
> > anomaly
> >
> > detection. In addition, adding pluggable/optional
> >
> > auto-detection
> >
> > doesn't
> >
> > require much extra work on top of manual
> >
> > specification.
> >
> > I will answer your other questions one by one.
> >
> >
> > @Yangze
> >
> > a) I think you are right, we do not need to expose
> >
> > the
> > `cluster.resource-blocklist.item.timeout-check-interval`
> >
> > to
> >
> > users.
> >
> > b) We can abstract the `notifyException` to a
> >
> > separate
> >
> > interface
> >
> > (maybe
> >
> > BlocklistExceptionListener), and the
> >
> > ResourceManagerBlocklistHandler
> >
> > can
> >
> > implement it in the future.
> >
> > @Martijn
> >
> > a) I also think the manual blocking should be done
> >
> > by
> >
> > cluster
> >
> > operators.
> >
> > b) I think manual blocking makes sense, because
> >
> > according
> >
> > to
> >
> > my
> >
> > experience, users are often the first to perceive the
> >
> > machine
> >
> > problems
> >
> > (because of job failover or delay), and they will
> >
> > contact
> >
> > cluster
> >
> > operators
> >
> > to solve it, or even tell the cluster operators which
> >
> > machine is
> >
> > problematic. From this point of view, I think the
> >
> > people
> >
> > who
> >
> > really
> >
> > need
> >
> > the manual blocking are the users, and it’s just
> >
> > performed
> >
> > by
> >
> > the
> >
> > cluster
> >
> > operator, so I think the manual blocking makes sense.
> >
> > @Chesnay
> >
> > We need to touch the logic of JM/SlotPool, because
> >
> > for
> >
> > MARK_BLOCKLISTED
> >
> > , we need to know whether the slot is blocklisted
> >
> > when
> >
> > the
> >
> > task
> >
> > is
> >
> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> >
> > release
> >
> > the
> >
> > slot
> >
> > directly to avoid assigning other tasks (of this job)
> >
> > on
> >
> > it.
> >
> > If
> >
> > we
> >
> > only
> >
> > maintain the blocklist information on the RM, JM
> >
> > needs
> >
> > to
> >
> > retrieve
> >
> > it
> >
> > by
> >
> > RPC. I think the performance overhead of that is
> >
> > relatively
> >
> > large,
> >
> > so
> >
> > I
> >
> > think it's worth maintaining the blocklist
> >
> > information
> >
> > on
> >
> > the JM
> >
> > side
> >
> > and
> >
> > syncing them.
> >
> > @Роман
> >
> >     a) “Probably storing inside Zookeeper/Configmap
> >
> > might
> >
> > be
> >
> > helpful
> >
> > here.”  Can you explain it in detail? I don't fully
> >
> > understand
> >
> > that.
> >
> > In
> >
> > my
> >
> > opinion, non-active and active are the same, and no
> >
> > special
> >
> > treatment
> >
> > is
> >
> > required.
> >
> > b) I agree with you, the `endTimestamp` makes
> >
> > sense,
> >
> > I
> >
> > will
> >
> > add
> >
> > it
> >
> > to
> >
> > FLIP.
> >
> > @Yang
> >
> > As mentioned above, AFAK, the external system
> >
> > cannot
> >
> > support
> >
> > the
> >
> > MARK_BLOCKLISTED action.
> >
> > Looking forward to your further feedback.
> >
> >
> > Best,
> >
> > Lijie
> >
> >
> > Yang Wang <da...@gmail.com> 于2022年5月3日周二
> >
> > 21:09写道：
> >
> > Thanks Lijie and Zhu for creating the proposal.
> >
> > I want to share some thoughts about Flink cluster
> >
> > operations.
> >
> > In the production environment, the SRE(aka Site
> >
> > Reliability
> >
> > Engineer)
> >
> > already has many tools to detect the unstable
> >
> > nodes,
> >
> > which
> >
> > could
> >
> > take
> >
> > the
> >
> > system logs/metrics into consideration.
> > Then they use graceful-decomission in YARN and
> >
> > taint
> >
> > in
> >
> > K8s
> >
> > to
> >
> > prevent
> >
> > new
> >
> > allocations on these unstable nodes.
> > At last, they will evict all the containers and
> >
> > pods
> >
> > running
> >
> > on
> >
> > these
> >
> > nodes.
> >
> > This mechanism also works for planned maintenance.
> >
> > So
> >
> > I
> >
> > am
> >
> > afraid
> >
> > this
> >
> > is
> >
> > not the typical use case for FLIP-224.
> >
> > If we only support to block nodes manually, then I
> >
> > could
> >
> > not
> >
> > see
> >
> > the obvious advantages compared with current SRE's
> >
> > approach(via
> >
> > *yarn
> >
> > rmadmin or kubectl taint*).
> > At least, we need to have a pluggable component
> >
> > which
> >
> > could
> >
> > expose
> >
> > the
> >
> > potential unstable nodes automatically and block
> >
> > them
> >
> > if
> >
> > enabled
> >
> > explicitly.
> >
> > Best,
> > Yang
> >
> >
> >
> > Becket Qin <be...@gmail.com> 于2022年5月2日周一
> >
> > 16:36写道：
> >
> > Thanks for the proposal, Lijie.
> >
> > This is an interesting feature and discussion,
> >
> > and
> >
> > somewhat
> >
> > related
> >
> > to the
> >
> > design principle about how people should operate
> >
> > Flink.
> >
> > I think there are three things involved in this
> >
> > FLIP.
> >
> >      a) Detect and report the unstable node.
> >      b) Collect the information of the unstable
> >
> > node
> >
> > and
> >
> > form a
> >
> > blocklist.
> >
> >      c) Take the action to block nodes.
> >
> > My two cents:
> >
> > 1. It looks like people all agree that Flink
> >
> > should
> >
> > have
> >
> > c).
> >
> > It
> >
> > is
> >
> > not only
> >
> > useful for cases of node failures, but also
> >
> > handy
> >
> > for
> >
> > some
> >
> > planned
> >
> > maintenance.
> >
> > 2. People have different opinions on b), i.e.
> >
> > who
> >
> > should be
> >
> > the
> >
> > brain
> >
> > to
> >
> > make the decision to block a node. I think this
> >
> > largely
> >
> > depends
> >
> > on
> >
> > who we
> >
> > talk to. Different users would probably give
> >
> > different
> >
> > answers.
> >
> > For
> >
> > people
> >
> > who do have a centralized node health management
> >
> > service,
> >
> > let
> >
> > Flink
> >
> > do just
> >
> > do a) and c) would be preferred. So essentially
> >
> > Flink
> >
> > would
> >
> > be
> >
> > one
> >
> > of
> >
> > the
> >
> > sources that may detect unstable nodes, report
> >
> > it
> >
> > to
> >
> > that
> >
> > service,
> >
> > and then
> >
> > take the command from that service to block the
> >
> > problematic
> >
> > nodes.
> >
> > On
> >
> > the
> >
> > other hand, for users who do not have such a
> >
> > service,
> >
> > simply
> >
> > letting
> >
> > Flink
> >
> > be clever by itself to block the suspicious
> >
> > nodes
> >
> > might
> >
> > be
> >
> > desired
> >
> > to
> >
> > ensure the jobs are running smoothly.
> >
> > So that indicates a) and b) here should be
> >
> > pluggable /
> >
> > optional.
> >
> > In light of this, maybe it would make sense to
> >
> > have
> >
> > something
> >
> > pluggable
> >
> > like a UnstableNodeReporter which exposes
> >
> > unstable
> >
> > nodes
> >
> > actively.
> >
> > (A
> >
> > more
> >
> > general interface should be JobInfoReporter<T>
> >
> > which
> >
> > can be
> >
> > used
> >
> > to
> >
> > report
> >
> > any information of type <T>. But I'll just keep
> >
> > the
> >
> > scope
> >
> > relevant
> >
> > to
> >
> > this
> >
> > FLIP here). Personally speaking, I think it is
> >
> > OK
> >
> > to
> >
> > have a
> >
> > default
> >
> > implementation of a reporter which just tells
> >
> > Flink
> >
> > to
> >
> > take
> >
> > action
> >
> > to
> >
> > block
> >
> > problematic nodes and also unblocks them after
> >
> > timeout.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> >
> > ro.v.boyko@gmail.com
> >
> > wrote:
> >
> > Thanks for good initiative, Lijie and Zhu!
> >
> > If it's possible I'd like to participate in
> >
> > development.
> >
> > I agree with 3rd point of Konstantin's reply -
> >
> > we
> >
> > should
> >
> > consider
> >
> > to move
> >
> > somehow the information of blocklisted
> >
> > nodes/TMs
> >
> > from
> >
> > active
> >
> > ResourceManager to non-active ones. Probably
> >
> > storing
> >
> > inside
> >
> > Zookeeper/Configmap might be helpful here.
> >
> > And I agree with Martijn that a lot of
> >
> > organizations
> >
> > don't
> >
> > want
> >
> > to
> >
> > expose
> >
> > such API for a cluster user group. But I think
> >
> > it's
> >
> > necessary
> >
> > to
> >
> > have the
> >
> > mechanism for unblocking the nodes/TMs anyway
> >
> > for
> >
> > avoiding
> >
> > incorrect
> >
> > automatic behaviour.
> >
> > And another one small suggestion - I think it
> >
> > would
> >
> > be
> >
> > better
> >
> > to
> >
> > extend
> >
> > the
> >
> > *BlocklistedItem* class with the
> >
> > *endTimestamp*
> >
> > field
> >
> > and
> >
> > fill
> >
> > it
> >
> > at the
> >
> > item creation. This simple addition will allow
> >
> > to:
> >
> >    -
> >
> >    Provide the ability to users to setup the
> >
> > exact
> >
> > time
> >
> > of
> >
> > blocklist end
> >
> >    through RestAPI
> >    -
> >
> >    Not being tied to a single value of
> >    *cluster.resource-blacklist.item.timeout*
> >
> >
> > On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> >
> > <
> >
> > chesnay@apache.org>
> >
> > wrote:
> >
> > I do share the concern between blurring the
> >
> > lines
> >
> > a
> >
> > bit.
> >
> > That said, I'd prefer to not have any
> >
> > auto-detection
> >
> > and
> >
> > only
> >
> > have an
> >
> > opt-in mechanism
> > to manually block processes/nodes. To me
> >
> > this
> >
> > sounds
> >
> > yet
> >
> > again
> >
> > like one
> >
> > of those
> > magical mechanisms that will rarely work
> >
> > just
> >
> > right.
> >
> > An external system can leverage way more
> >
> > information
> >
> > after
> >
> > all.
> >
> > Moreover, I'm quite concerned about the
> >
> > complexity
> >
> > of
> >
> > this
> >
> > proposal.
> >
> > Tracking on both the RM/JM side; syncing
> >
> > between
> >
> > components;
> >
> > adjustments
> >
> > to the
> > slot and resource protocol.
> >
> > In a way it seems overly complicated.
> >
> > If we look at it purely from an active
> >
> > resource
> >
> > management
> >
> > perspective,
> >
> > then there
> > isn't really a need to touch the slot
> >
> > protocol
> >
> > at
> >
> > all
> >
> > (or
> >
> > in
> >
> > fact
> >
> > to
> >
> > anything in the JobMaster),
> > because there isn't any point in keeping
> >
> > around
> >
> > blocked
> >
> > TMs
> >
> > in
> >
> > the
> >
> > first
> >
> > place.
> > They'd just be idling, potentially shutting
> >
> > down
> >
> > after
> >
> > a
> >
> > while
> >
> > by
> >
> > the
> >
> > RM
> >
> > because of
> > it (unless we _also_ touch that logic).
> > Here the blocking of a process (be it by
> >
> > blocking
> >
> > the
> >
> > process
> >
> > or
> >
> > node)
> >
> > is
> >
> > equivalent with shutting down the blocked
> >
> > process(es).
> >
> > Once the block is lifted we can just spin it
> >
> > back
> >
> > up.
> >
> > And I do wonder whether we couldn't apply
> >
> > the
> >
> > same
> >
> > line
> >
> > of
> >
> > thinking to
> >
> > standalone resource management.
> > Here being able to stop/restart a
> >
> > process/node
> >
> > manually
> >
> > should
> >
> > be
> >
> > a
> >
> > core
> >
> > requirement for a Flink deployment anyway.
> >
> >
> > On 02/05/2022 08:49, Martijn Visser wrote:
> >
> > Hi everyone,
> >
> > Thanks for creating this FLIP. I can
> >
> > understand
> >
> > the
> >
> > problem
> >
> > and
> >
> > I see
> >
> > value
> >
> > in the automatic detection and
> >
> > blocklisting. I
> >
> > do
> >
> > have
> >
> > some
> >
> > concerns
> >
> > with
> >
> > the ability to manually specify to be
> >
> > blocked
> >
> > resources.
> >
> > I
> >
> > have
> >
> > two
> >
> > concerns;
> >
> > * Most organizations explicitly have a
> >
> > separation
> >
> > of
> >
> > concerns,
> >
> > meaning
> >
> > that
> >
> > there's a group who's responsible for
> >
> > managing a
> >
> > cluster
> >
> > and
> >
> > there's
> >
> > a
> >
> > user
> >
> > group who uses that cluster. With the
> >
> > introduction of
> >
> > this
> >
> > mechanism,
> >
> > the
> >
> > latter group now can influence the
> >
> > responsibility
> >
> > of
> >
> > the
> >
> > first
> >
> > group.
> >
> > So
> >
> > it
> >
> > can be possible that someone from the user
> >
> > group
> >
> > blocks
> >
> > something,
> >
> > which
> >
> > causes an outage (which could result in
> >
> > paging
> >
> > mechanism
> >
> > triggering
> >
> > etc)
> >
> > which impacts the first group.
> > * How big is the group of people who can
> >
> > go
> >
> > through
> >
> > the
> >
> > process
> >
> > of
> >
> > manually
> >
> > identifying a node that isn't behaving as
> >
> > it
> >
> > should
> >
> > be? I
> >
> > do
> >
> > think
> >
> > this
> >
> > group is relatively limited. Does it then
> >
> > make
> >
> > sense
> >
> > to
> >
> > introduce
> >
> > such
> >
> > a
> >
> > feature, which would only be used by a
> >
> > really
> >
> > small
> >
> > user
> >
> > group
> >
> > of
> >
> > Flink?
> >
> > We
> >
> > still have to maintain, test and support
> >
> > such
> >
> > a
> >
> > feature.
> >
> > I'm +1 for the autodetection features, but
> >
> > I'm
> >
> > leaning
> >
> > towards
> >
> > not
> >
> > exposing
> >
> > this to the user group but having this
> >
> > available
> >
> > strictly
> >
> > for
> >
> > cluster
> >
> > operators. They could then also set up
> >
> > their
> >
> > paging/metrics/logging
> >
> > system
> >
> > to take this into account.
> >
> > Best regards,
> >
> > Martijn Visser
> > https://twitter.com/MartijnVisser82
> > https://github.com/MartijnVisser
> >
> >
> > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> >
> > karmagyz@gmail.com
> >
> > wrote:
> >
> > Thanks for driving this, Zhu and Lijie.
> >
> > +1 for the overall proposal. Just share
> >
> > some
> >
> > cents
> >
> > here:
> >
> > - Why do we need to expose
> >
> > cluster.resource-blacklist.item.timeout-check-interval
> >
> > to
> >
> > the
> >
> > user?
> >
> > I think the semantics of
> >
> > `cluster.resource-blacklist.item.timeout`
> >
> > is
> >
> > sufficient for the user. How to guarantee
> >
> > the
> >
> > timeout
> >
> > mechanism is
> >
> > Flink's internal implementation. I think
> >
> > it
> >
> > will
> >
> > be
> >
> > very
> >
> > confusing
> >
> > and
> >
> > we do not need to expose it to users.
> >
> > - ResourceManager can notify the
> >
> > exception
> >
> > of a
> >
> > task
> >
> > manager to
> >
> > `BlacklistHandler` as well.
> > For example, the slot allocation might
> >
> > fail
> >
> > in
> >
> > case
> >
> > the
> >
> > target
> >
> > task
> >
> > manager is busy or has a network jitter.
> >
> > I
> >
> > don't
> >
> > mean
> >
> > we
> >
> > need
> >
> > to
> >
> > cover
> >
> > this case in this version, but we can
> >
> > also
> >
> > open a
> >
> > `notifyException`
> >
> > in
> >
> > `ResourceManagerBlacklistHandler`.
> >
> > - Before we sync the blocklist to
> >
> > ResourceManager,
> >
> > will
> >
> > the
> >
> > slot of
> >
> > a
> >
> > blocked task manager continues to be
> >
> > released
> >
> > and
> >
> > allocated?
> >
> > Best,
> > Yangze Guo
> >
> > On Thu, Apr 28, 2022 at 3:11 PM Lijie
> >
> > Wang
> >
> > <
> >
> > wangdachui9501@gmail.com>
> >
> > wrote:
> >
> > Hi Konstantin,
> >
> > Thanks for your feedback. I will
> >
> > response
> >
> > your 4
> >
> > remarks:
> >
> > 1) Thanks for reminding me of the
> >
> > controversy. I
> >
> > think
> >
> > “BlockList”
> >
> > is
> >
> > good
> >
> > enough, and I will change it in FLIP.
> >
> >
> > 2) Your suggestion for the REST API is a
> >
> > good
> >
> > idea.
> >
> > Based
> >
> > on
> >
> > the
> >
> > above, I
> >
> > would change REST API as following:
> >
> > POST/GET <host>/blocklist/nodes
> >
> > POST/GET <host>/blocklist/taskmanagers
> >
> > DELETE
> >
> > <host>/blocklist/node/<identifier>
> >
> > DELETE
> >
> > <host>/blocklist/taskmanager/<identifier>
> >
> > 3) If a node is blocking/blocklisted, it
> >
> > means
> >
> > that
> >
> > all
> >
> > task
> >
> > managers
> >
> > on
> >
> > this node are blocklisted. All slots on
> >
> > these
> >
> > TMs
> >
> > are
> >
> > not
> >
> > available.
> >
> > This
> >
> > is actually a bit like TM losts, but
> >
> > these
> >
> > TMs
> >
> > are
> >
> > not
> >
> > really
> >
> > lost,
> >
> > they
> >
> > are in an unavailable status, and they
> >
> > are
> >
> > still
> >
> > registered
> >
> > in this
> >
> > flink
> >
> > cluster. They will be available again
> >
> > once
> >
> > the
> >
> > corresponding
> >
> > blocklist
> >
> > item
> >
> > is removed. This behavior is the same in
> >
> > active/non-active
> >
> > clusters.
> >
> > However in the active clusters, these
> >
> > TMs
> >
> > may
> >
> > be
> >
> > released
> >
> > due
> >
> > to
> >
> > idle
> >
> > timeouts.
> >
> >
> > 4) For the item timeout, I prefer to
> >
> > keep
> >
> > it.
> >
> > The
> >
> > reasons
> >
> > are
> >
> > as
> >
> > following:
> >
> > a) The timeout will not affect users
> >
> > adding
> >
> > or
> >
> > removing
> >
> > items
> >
> > via
> >
> > REST
> >
> > API,
> >
> > and users can disable it by configuring
> >
> > it
> >
> > to
> >
> > Long.MAX_VALUE .
> >
> > b) Some node problems can recover after
> >
> > a
> >
> > period of
> >
> > time
> >
> > (such as
> >
> > machine
> >
> > hotspots), in which case users may
> >
> > prefer
> >
> > that
> >
> > Flink
> >
> > can
> >
> > do
> >
> > this
> >
> > automatically instead of requiring the
> >
> > user
> >
> > to
> >
> > do
> >
> > it
> >
> > manually.
> >
> > Best,
> >
> > Lijie
> >
> > Konstantin Knauf <kn...@apache.org>
> >
> > 于2022年4月27日周三
> >
> > 19:23写道：
> >
> > Hi Lijie,
> >
> > I think, this makes sense and +1 to
> >
> > only
> >
> > support
> >
> > manually
> >
> > blocking
> >
> > taskmanagers and nodes. Maybe the
> >
> > different
> >
> > strategies
> >
> > can
> >
> > also be
> >
> > maintained outside of Apache Flink.
> >
> > A few remarks:
> >
> > 1) Can we use another term than
> >
> > "bla.cklist"
> >
> > due
> >
> > to
> >
> > the
> >
> > controversy
> >
> > around
> >
> > the term? [1] There was also a Jira
> >
> > Ticket
> >
> > about
> >
> > this
> >
> > topic a
> >
> > while
> >
> > back
> >
> > and there was generally a consensus to
> >
> > avoid
> >
> > the
> >
> > term
> >
> > blacklist &
> >
> > whitelist
> >
> > [2]? We could use "blocklist"
> >
> > "denylist"
> >
> > or
> >
> > "quarantined"
> >
> > 2) For the REST API, I'd prefer a
> >
> > slightly
> >
> > different
> >
> > design
> >
> > as
> >
> > verbs
> >
> > like
> >
> > add/remove often considered an
> >
> > anti-pattern
> >
> > for
> >
> > REST
> >
> > APIs.
> >
> > POST
> >
> > on a
> >
> > list
> >
> > item is generally the standard to add
> >
> > items.
> >
> > DELETE
> >
> > on
> >
> > the
> >
> > individual
> >
> > resource is standard to remove an item.
> >
> > POST <host>/quarantine/items
> > DELETE
> >
> > <host>/quarantine/items/<itemidentifier>
> >
> > We could also consider to separate
> >
> > taskmanagers
> >
> > and
> >
> > nodes
> >
> > in
> >
> > the
> >
> > REST
> >
> > API
> >
> > (and internal data structures). Any
> >
> > opinion
> >
> > on
> >
> > this?
> >
> > POST/GET <host>/quarantine/nodes
> > POST/GET <host>/quarantine/taskmanager
> > DELETE
> >
> > <host>/quarantine/nodes/<identifier>
> >
> > DELETE
> >
> > <host>/quarantine/taskmanager/<identifier>
> >
> > 3) How would blocking nodes behave with
> >
> > non-active
> >
> > resource
> >
> > managers,
> >
> > i.e.
> >
> > standalone or reactive mode?
> >
> > 4) To keep the implementation even more
> >
> > minimal,
> >
> > do
> >
> > we
> >
> > need
> >
> > the
> >
> > timeout
> >
> > behavior? If items are added/removed
> >
> > manually
> >
> > we
> >
> > could
> >
> > delegate
> >
> > this
> >
> > to the
> >
> > user easily. In my opinion the timeout
> >
> > behavior
> >
> > would
> >
> > better
> >
> > fit
> >
> > into
> >
> > specific strategies at a later point.
> >
> > Looking forward to your thoughts.
> >
> > Cheers and thank you,
> >
> > Konstantin
> >
> > [1]
> >
> >
> > https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >
> > [2]
> >
> > https://issues.apache.org/jira/browse/FLINK-18209
> >
> > Am Mi., 27. Apr. 2022 um 04:04 Uhr
> >
> > schrieb
> >
> > Lijie
> >
> > Wang
> >
> > <
> >
> > wangdachui9501@gmail.com>:
> >
> > Hi all,
> >
> > Flink job failures may happen due to
> >
> > cluster
> >
> > node
> >
> > issues
> >
> > (insufficient
> >
> > disk
> >
> > space, bad hardware, network
> >
> > abnormalities).
> >
> > Flink
> >
> > will
> >
> > take care
> >
> > of
> >
> > the
> >
> > failures and redeploy the tasks.
> >
> > However,
> >
> > due
> >
> > to
> >
> > data
> >
> > locality
> >
> > and
> >
> > limited
> >
> > resources, the new tasks are very
> >
> > likely
> >
> > to
> >
> > be
> >
> > redeployed
> >
> > to the
> >
> > same
> >
> > nodes, which will result in continuous
> >
> > task
> >
> > abnormalities
> >
> > and
> >
> > affect
> >
> > job
> >
> > progress.
> >
> > Currently, Flink users need to
> >
> > manually
> >
> > identify
> >
> > the
> >
> > problematic
> >
> > node and
> >
> > take it offline to solve this problem.
> >
> > But
> >
> > this
> >
> > approach
> >
> > has
> >
> > following
> >
> > disadvantages:
> >
> > 1. Taking a node offline can be a
> >
> > heavy
> >
> > process.
> >
> > Users
> >
> > may
> >
> > need
> >
> > to
> >
> > contact
> >
> > cluster administors to do this. The
> >
> > operation
> >
> > can
> >
> > even
> >
> > be
> >
> > dangerous
> >
> > and
> >
> > not
> >
> > allowed during some important business
> >
> > events.
> >
> > 2. Identifying and solving this kind
> >
> > of
> >
> > problems
> >
> > manually
> >
> > would
> >
> > be
> >
> > slow
> >
> > and
> >
> > a waste of human resources.
> >
> > To solve this problem, Zhu Zhu and I
> >
> > propose
> >
> > to
> >
> > introduce a
> >
> > blacklist
> >
> > mechanism for Flink to filter out
> >
> > problematic
> >
> > resources.
> >
> > You can find more details in
> >
> > FLIP-224[1].
> >
> > Looking
> >
> > forward
> >
> > to your
> >
> > feedback.
> >
> > [1]
> >
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >
> > Best,
> >
> > Lijie
> >
> > --
> > Best regards,
> > Roman Boyko
> > e.: ro.v.boyko@gmail.com
> >
> > --
> > https://twitter.com/snntrable
> > https://github.com/knaufk
> >
> > --
> > https://twitter.com/snntrable
> > https://github.com/knaufk
> >
> >
> >

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.

1) With the declarative slot allocation protocol, it's not easy to
keep the slow slots which may have satisfy the slot request already,
but ask for more slots from the resource manager.
And we are also concerned to use detected slow slots, which may add
load to the slow nodes and further slow down the tasks on the slow
nodes. Because we found slow nodes are mostly caused by heavy loads.

2) I see your point. The blocker here is that batch jobs currently
uses DefaultScheduler. The scheduler does not see slots directly and
does not know which of them are from slow nodes, so it's hard for it
to rescale the vertex according to this information. Besides that, it
cannot help with the stage that first observed and affected the slow
nodes.

Thanks,
Zhu

Chesnay Schepler <ch...@apache.org> 于2022年6月10日周五 17:04写道：
>
> 1)
> It's true that if we handle this entirely in the scheduler we may get a bunch of slow slots from the RM.
> My point is that this isn't necessarily a problem; it depends on how you handle those slots.
>
> We anyway need to account for the possibility that we're not getting any new fast slots from the RM.
> With that in mind, I'd rather not categorically throw away the slow slots we got, but try to make use of them somehow.
>
> 2)
> The idea was to rescale the job when the current stage finishes.
> The speculative execution handles slow nodes being detected in the current stage,
> next stage we use _some_ strategy to handle slow nodes (be it ignoring those, rescaling the job, ...).
>
> 3)
> It's a fair observation that once you push this to the RM you end up with a de-facto blocklist :)
>
> On 08/06/2022 17:11, Zhu Zhu wrote:
>
> Thanks Chesnay for the feedback in the vote thread.
> (https://lists.apache.org/thread/opc7jg3rpxnwotkb0fcn4wnm02m4397o)
>
> I'd like to continue the discussion in this thread so that the
> discussions can be better tracked.
>
> Regarding your questions, here are my thoughts:
> 1. The current locality mechanism does not work well to avoid
> deploying tasks to slow nodes because it cannot proactively reject or
> release slots from the slow nodes. And it cannot help at the resource
> manager side to avoid allocating free slots or launching new
> TaskManagers on slow nodes.
> 2. Dynamically downscaling or upscaling a batch job is usually
> unacceptable because it means to re-run the whole stage.
> 3. Extending the requirement declaration to have a notion of
> "undesirable nodes" is an option. And it is actually how we started.
> Considering implementation details, we found we need that
> - a tracker to collect all the undesirable(detected slow nodes)
> - to filter out slots from undesirable nodes when allocating slots
> from the SlotPool
> - ask the ResourceManager for slots that are not on the undesirable
> nodes. The ResourceManager may further need to ask for new
> TaskManagers that are not on the undesirable nodes.
> Then, with all these functionality, we found that we almost have a
> blocklist mechanism. As blocklist mechanism is a common concept and is
> possible to benefit users, we took this chance to propose the
> blocklist mechanism.
> 4. A cluster-wide shared blocklist is not a must for speculative
> execution at the moment. It is mainly part of a standalone blocklist
> feature to host user specified block items. To avoid sharing job
> specific blocked items between jobs, one way is to add a nullable
> JobID tag for the blocked item.
>
> Thanks,
> Zhu
>
> Zhu Zhu <re...@gmail.com> 于2022年6月7日周二 10:58写道：
>
> Hi Chesnay,
>
> Would you please take a look at the FLIP and discussion to see if all
> your concerns have been addressed?
>
> Thanks,
> Zhu
>
> Zhu Zhu <re...@gmail.com> 于2022年5月28日周六 13:26写道：
>
> Regarding the concern of the SlotManager, my two cents:
> 1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
>   a. In standalone mode, a taskmanager may be temporarily added to
> the blocklist. We do not want the TM to get disconnected and shut down.
> So we need to keep its connection to RM and keep hosting its slots.
>   b. When we want to avoid allocating slots to a slow nodes but do not
> want to kill current running tasks on the nodes (MARK_BLOCKED mode).
>
> There is possible a way to keep the connection of a blocked task manager
> while hide its slots from SlotManager, but I feel it may be even much more
> complicated.
>
> 2. It will not complicate the SlotManager too much. The SlotManager will
> be offered a BlockedTaskManagerChecker when created, and just need
> to use it to filter out blocked slots on slot request. Therefore I think the
> complication is acceptable.
>
> Thanks,
> Zhu
>
> Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
>
> Hi everyone,
>
> I've updated the FLIP according to Chesnay's feedback, changes as follows:
> 1. Change the GET result to a map.
> 2. Only left *endTimestamp* in ADD operation, and change the rest (from
> POST) to PUT
> 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
> encapsulate blocklist related functions.
> 4. Remove *mainThread* from BlocklistTracker, instead provide a
> *removeTimeoutItems* method to be called by outside components。
>
> Best,
> Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
>
> Hi Chesnay,
>
> Thanks for feedback.
>
> 1. Regarding the TM/Node id. Do you mean special characters may appear in
> the rest URL?  Actually, I don't think so. The task manager id in REST API
> should be the *ResourceID* of taskmanager in Flink, there should be no
> special characters, and some existing REST APIs are already using it, e.g.
> GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
> id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
> should also have no special characters.
> 2. Regarding the GET query responses. I agree with you, it makes sense to
> change the GET result to a map.
>
> 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> cover everything, and the endTimestamp is a unix timestamp, there should be
> no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
> add rest api is add or update, PUT can cover this semantics.
>
> 4. Regarding the slot pool/manager. I don't think the current slotpool
> and slotmanager are able to support the MARK_BLOCKED(slots that are
> already allocated will not be affected) action. The reasons are as
> follows：
>
> a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
> from reserved(task assigned) to free(no task assigned), it is necessary to
> check whether the slot should be released immediately(it should be released
> immediately if the task manager is blocked, otherwise it may be allocated
> to other tasks). I think it cannot be supported without being aware of
> the blocklist information. Compared to the solution in FLIP, a more
> appropriate/prefered way may be: Introduce a new slot pool
> implementation for blocklist(may be named BlocklistSlotPool, it
> extends/wrapps the original slot pool), and implement the parts that need
> to be aware of the blocklist in this newly introduced slot pool, and the
> original slot pool basically does not need to change.
>
> b) for slot manager, with the MARK_BLOCKED action, there may be free but
> blocked slots in slot manager (the corresponding TMs cannot be
> released/unregistered because there are still running tasks on them).
> Therefore, we need to filter out the blocked slots when trying to fulfill
> the slot requirements. Therefore it also needs to know the blocklist information.
> A better way may be to abstract a resource allocation strategy, and make
> the blocklist as a special implementation, then pass the resource
> allocation strategy in when constructing the slot manager. Unfortunately,
> the data structures in the two existing slot manager
> implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
> quite different, it is not easy to abstract a common resource allocation
> strategy, so we prefer to keep the current way(i.e. pass the blocklist
> information directly into slot manager).
>
>
> 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
> does not need to be aware of the executor, and the timeout actions can be
> done outside.
>
> Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
>
> I have a number of concerns:
>
> Is the id used for deleting an item the same sent in the initial request
> (and not one returned by Flink)?
> I'm very concerned that the tm/node id can contain special characters.
>
> The GET query should return a map, not a list of items. This makes it
> easier to work with.
>
> The duality of endTimestamp and duration is also concerning.
> If we conclude that endTimestamps can in fact work (and aren't utterly
> unusable due to timezones),
> then this should be able to cover everything and rid us of some
> complexity w.r.t. POSTs to the same ID.
> Additions would be a PUT, changes a PATCH, deletes a DELETE.
>
>
> I also dislike how we're pushing more functionality into the
> slotpool/-manager.
> These components are complex enough as-is, and instead I'd propose a
> separate component that interacts with the SlotPool/-Manager instead,
> for example by removing the slots from that TM.
> The reason being that from the slot-pool perspective it is irrelevant
> whether a slot is gone because the TM was lost, or because it was blocked.
>
>
> The BlocklistTracker shouldn't be modeled as component that is aware of
> the concept of main threads.
> It really has no business knowing that; all it needs is an executor for
> handling timeouts/periodic actions,
> and a way to interact with the JM/RM (which internally can handle the
> scheduling into the main thread).
>
>
> On 20/05/2022 07:20, Lijie Wang wrote:
>
> Hi everyone,
>
> I have started a vote for this FLIP [1]. Please cast your vote there or
>
> ask
>
> additional questions here. [1]
> https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
>
> Best, Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
>
> Hi Konstantin,
>
> We found that Flink REST URL does not support the format ":merge" ,
>
> which
>
> will be recognized as a parameter in the URL(due to start with a
>
> colon).
>
> We will keep the previous way, i.e.
>
> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> and the "id" and "merge" flag are put into the request body.
>
> Best,
> Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
>
> Hi Weihua,
> thanks for feedback.
>
> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> step towards auto-detection.
> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> exception of insufficient resources.
> 3. No. This FLIP won't change the WebUI. The blocklist information
>
> can be
>
> obtained through REST API and metrics.
>
> Best,
> Lijie
>
> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
>
> Hi,
> Thanks for creating this FLIP.
> We have implemented an automatic blocklist detection mechanism
> internally, which is indeed very effective for handling node
>
> failures.
>
> Due to the large number of nodes, although SREs already support
> automatic offline failure nodes, the detection is not 100% accurate
>
> and
>
> there is some delay.
> So the blocklist mechanism can make flink job recover from failure
>
> much
>
> faster.
>
> Here are some of my thoughts:
> 1. In this FLIP, it needs users to locate machine failure manually,
> there is a certain cost of use
> 2. What happens if too many nodes are blocked, resulting in
>
> insufficient
>
> resources? Will there be a special Exception for the user?
> 3. Will we display the blocklist information in the WebUI? The
>
> blocklist
>
> is for cluster level, and if multiple users share a cluster, some
>
> users may
>
> be a little confused when resources are not enough, or when
>
> resources are
>
> applied for more.
>
> Also, Looking forward to the next FLIP on auto-detection.
>
> Best,
> Weihua
>
> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
>
> Hi Konstantin,
>
> Maybe change it to the following:
>
> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> Merge is not allowed. If the {id} already exists, return error.
>
> Otherwise,
>
> create a new item.
>
> 2. POST: http://
>
> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>
> Merge is allowed. If the {id} already exists, merge. Otherwise,
>
> create
>
> a
>
> new item.
>
> WDYT?
>
> Best,
> Lijie
>
> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
>
> Hi Lijie,
>
> hm, maybe the following is more appropriate in that case
>
> POST: http://
>
> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>
> Best,
>
> Konstantin
>
> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> wangdachui9501@gmail.com>:
>
> Hi Konstantin,
> thanks for your feedback.
>
>  From what I understand, PUT should be idempotent. However, we
>
> have a
>
> *timeout* field in the request. This means that initiating the
>
> same
>
> request
>
> at two different times will lead to different resource status
>
> (timestamps
>
> of the items to be removed will be different).
>
> Should we use PUT in this case? WDYT?
>
> Best,
> Lijie
>
> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>
> Hi Lijie,
>
> wouldn't the REST API-idiomatic way for an update/replace be a
>
> PUT
>
> on
>
> the
>
> resource?
>
> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>
> Best,
>
> Konstantin
>
>
>
> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> wangdachui9501@gmail.com>:
>
> Hi everyone,
>
> I've had an offline discussion with Becket Qin and Zhu Zhu, and
>
> made
>
> the
>
> following changes on REST API:
> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
>
> choose
>
> one.
>
> If
>
> both are specified, will return error.
> 2.  If the specified item is already there, the *ADD* operation
>
> has
>
> two
>
> behaviors:  *return error*(default value) or *merge/update*,
>
> and we
>
> add a
>
> flag to the request body to control it. You can find more
>
> details
>
> "Public
>
> Interface" section.
>
> If there is no more feedback, we will start the vote thread next
>
> week.
>
> Best,
> Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>
> Hi Becket Qin,
>
> Thanks for your suggestions.  I have moved the description of
> configurations, metrics and REST API into "Public Interface"
>
> section,
>
> and
>
> made a few updates according to your suggestion.  And in this
>
> FLIP,
>
> there
>
> no public java Interfaces or pluggables that users need to
>
> implement
>
> by
>
> themselves.
>
> Answers for you questions:
> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>
> items
>
> can
>
> only be added through the REST API, so these 2 action are
>
> mentioned
>
> in
>
> the
>
> REST API part (The REST API part has beed moved to public
>
> interface
>
> now).
>
> 2. I agree with you. I have changed the "Cause" field to
>
> String,
>
> and
>
> allow
>
> users to specify it via REST API.
> 3. Yes, it is useful to allow different timeouts. As mentioned
>
> above,
>
> we
>
> will introduce 2 fields : *timeout* and *endTimestamp* into the
>
> ADD
>
> REST
>
> API to specify when to remove the blocked item. These 2 fields
>
> are
>
> optional, if neither is specified, it means that the blocked
>
> item
>
> is
>
> permanent and will not be removed. If both are specified, the
>
> minimum
>
> of
>
> *currentTimestamp+tiemout *and* endTimestamp* will be used as
>
> the
>
> time
>
> to
>
> remove the blocked item. To keep the configurations more
>
> minimal,
>
> we
>
> have
>
> removed the *cluster.resource-blocklist.item.timeout*
>
> configuration
>
> option.
> 4. Yes, the block item will be overridden if the specified item
>
> already
>
> exists. The ADD operation is *ADD or UPDATE*.
> 5. Yes. On JM/RM side, all the blocklist information is
>
> maintained
>
> in
>
> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>
> abstracted
>
> to other interfaces) will be propagated to different
>
> components.
>
> Best,
> Lijie
>
> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>
> Hi Lijie,
>
> Thanks for updating the FLIP. It looks like the public
>
> interface
>
> section
>
> did not fully reflect all the user sensible behavior and API.
>
> Can
>
> you
>
> put
>
> everything that users may be aware of there? That would
>
> include
>
> the
>
> REST
>
> API, metrics, configurations, public java Interfaces or
>
> pluggables
>
> that
>
> users may see or implement by themselves, as well as a brief
>
> summary
>
> of
>
> the
> behavior of the public API.
>
> Besides that, I have a few questions:
>
> 1. According to the conversation in the discussion thread, it
>
> looks
>
> like
>
> the BlockAction will have "MARK_BLOCKLISTED" and
> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
>
> so,
>
> can
>
> you
>
> add
> that to the public interface as well?
>
> 2. At this point, the "Cause" field in the BlockingItem is a
>
> Throwable
>
> and
>
> is not reflected in the REST API. Should that be included in
>
> the
>
> query
>
> response? And should we change that field to be a String so
>
> users
>
> may
>
> specify the cause via the REST API when they block some nodes
>
> /
>
> TMs?
>
> 3. Would it be useful to allow users to have different
>
> timeouts
>
> for
>
> different blocked items? So while there is a default timeout,
>
> users
>
> can
>
> also override it via the REST API when they block an entity.
>
> 4. Regarding the ADD operation, if the specified item is
>
> already
>
> there,
>
> will the block item be overridden? For example, if the user
>
> wants
>
> to
>
> extend
> the timeout of a blocked item, can they just  issue an ADD
>
> command
>
> again?
>
> 5. I am not quite familiar with the details of this, but is
>
> there
>
> a
>
> source
>
> of truth for the blocked list? I think it might be good to
>
> have a
>
> single
>
> source of truth for the blocked list and just propagate that
>
> list
>
> to
>
> different components to take the action of actually blocking
>
> the
>
> resource.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>
> wangdachui9501@gmail.com
>
> wrote:
>
> Hi everyone,
>
> Based on the discussion in the mailing list, I updated the
>
> FLIP
>
> doc,
>
> the
>
> changes include:
> 1. Changed the description of the motivation section to more
>
> clearly
>
> describe the problem this FLIP is trying to solve.
> 2. Only  *Manually* is supported.
> 3. Adopted some suggestions, such as *endTimestamp*.
>
> Best,
> Lijie
>
>
> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>
> Hi Lijie!
>
>
>
>
> *a) “Probably storing inside Zookeeper/Configmap might be
>
> helpfulhere.”
>
> Can you explain it in detail? I don't fully understand that.
>
> In
>
> myopinion,
>
> non-active and active are the same, and no special treatment
>
> isrequired.*
>
> Sorry this was a misunderstanding from my side. I thought we
>
> were
>
> talking
>
> about the HA mode (but not about Active and Standalone
>
> ResourceManager).
>
> And the original question was - how to handle the
>
> blacklisted
>
> nodes
>
> list
>
> at
>
> the moment of leader change? Should we simply forget about
>
> them
>
> or
>
> try to
>
> pre-save that list on the remote storage?
>
> On Sat, 7 May 2022 at 10:51, Yang Wang <
>
> danrtsey.wy@gmail.com
>
> wrote:
>
> Thanks Lijie and ZhuZhu for the explanation.
>
> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>
> it
>
> is
>
> indeed
>
> some functionalities the external tools(e.g. kubectl taint)
>
> could
>
> not
>
> support.
>
>
> Best,
> Yang
>
> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
>
> 22:18写道：
>
> Thanks for your feedback, Jiangang and Martijn.
>
> @Jiangang
>
>
> For auto-detecting, I wonder how to make the strategy
>
> and
>
> mark a
>
> node
>
> blocked?
>
> In fact, we currently plan to not support auto-detection
>
> in
>
> this
>
> FLIP.
>
> The
>
> part about auto-detection may be continued in a separate
>
> FLIP
>
> in
>
> the
>
> future. Some guys have the same concerns as you, and the
>
> correctness
>
> and
>
> necessity of auto-detection may require further discussion
>
> in
>
> the
>
> future.
>
> In session mode, multi jobs can fail on the same bad
>
> node
>
> and
>
> the
>
> node
>
> should be marked blocked.
> By design, the blocklist information will be shared among
>
> all
>
> jobs
>
> in a
>
> cluster/session. The JM will sync blocklist information
>
> with
>
> RM.
>
> @Martijn
>
> I agree with Yang Wang on this.
>
> As Zhu Zhu and I mentioned above, we think the
>
> MARK_BLOCKLISTED(Just
>
> limits
>
> the load of the node and does not  kill all the processes
>
> on
>
> it)
>
> is
>
> also
>
> important, and we think that external systems (*yarn
>
> rmadmin
>
> or
>
> kubectl
>
> taint*) cannot support it. So we think it makes sense even
>
> only
>
> *manually*.
>
> I also agree with Chesnay that magical mechanisms are
>
> indeed
>
> super
>
> hard
>
> to get right.
> Yes, as you see, Jiangang(and a few others) have the same
>
> concern.
>
> However, we currently plan to not support auto-detection
>
> in
>
> this
>
> FLIP,
>
> and
>
> only *manually*. In addition, I'd like to say that the
>
> FLIP
>
> provides
>
> a
>
> mechanism to support MARK_BLOCKLISTED and
> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> the auto-detection may be done by external systems.
>
> Best,
> Lijie
>
> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>
> 19:04写道：
>
> If we only support to block nodes manually, then I
>
> could
>
> not
>
> see
>
> the obvious advantages compared with current SRE's
>
> approach(via
>
> *yarn
>
> rmadmin or kubectl taint*).
>
> I agree with Yang Wang on this.
>
> To me this sounds yet again like one of those magical
>
> mechanisms
>
> that
>
> will rarely work just right.
>
> I also agree with Chesnay that magical mechanisms are
>
> indeed
>
> super
>
> hard
>
> to
>
> get right.
>
> Best regards,
>
> Martijn
>
> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>
> liujiangangpeng@gmail.com
>
> wrote:
>
> Thanks for the valuable design. The auto-detecting can
>
> decrease
>
> great
>
> work
>
> for us. We have implemented the similar feature in our
>
> inner
>
> flink
>
> version.
> Below is something that I care about:
>
>    1. For auto-detecting, I wonder how to make the
>
> strategy
>
> and
>
> mark a
>
> node
>    blocked? Sometimes the blocked node is hard to be
>
> detected,
>
> for
>
> example,
>    the upper node or the down node will be blocked when
>
> network
>
> unreachable.
>    2. I see that the strategy is made in JobMaster
>
> side.
>
> How
>
> about
>
>    implementing the similar logic in resource manager?
>
> In
>
> session
>
> mode,
>
> multi
>    jobs can fail on the same bad node and the node
>
> should
>
> be
>
> marked
>
> blocked.
>    If the job makes the strategy, the node may be not
>
> marked
>
> blocked
>
> if
>
> the
>    fail times don't exceed the threshold.
>
>
> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>
> Thank you for all your feedback!
>
> Besides the answers from Lijie, I'd like to share
>
> some
>
> of
>
> my
>
> thoughts:
>
> 1. Whether to enable automatical blocklist
> Generally speaking, it is not a goal of FLIP-224.
> The automatical way should be something built upon
>
> the
>
> blocklist
>
> mechanism and well decoupled. It was designed to be a
>
> configurable
>
> blocklist strategy, but I think we can further
>
> decouple
>
> it
>
> by
>
> introducing a abnormal node detector, as Becket
>
> suggested,
>
> which
>
> just
>
> uses the blocklist mechanism once bad nodes are
>
> detected.
>
> However,
>
> it
>
> should be a separate FLIP with further dev
>
> discussions
>
> and
>
> feedback
>
> from users. I also agree with Becket that different
>
> users
>
> have
>
> different
>
> requirements, and we should listen to them.
>
> 2. Is it enough to just take away abnormal nodes
>
> externally
>
> My answer is no. As Lijie has mentioned, we need a
>
> way
>
> to
>
> avoid
>
> deploying tasks to temporary hot nodes. In this case,
>
> users
>
> may
>
> just
>
> want to limit the load of the node and do not want to
>
> kill
>
> all
>
> the
>
> processes on it. Another case is the speculative
>
> execution[1]
>
> which
>
> may also leverage this feature to avoid starting
>
> mirror
>
> tasks on
>
> slow
>
> nodes.
>
> Thanks,
> Zhu
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>
> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>
> 15:56写道：
>
> Hi everyone,
>
>
> Thanks for your feedback.
>
>
> There's one detail that I'd like to re-emphasize
>
> here
>
> because
>
> it
>
> can
>
> affect the value and design of the blocklist
>
> mechanism
>
> (perhaps
>
> I
>
> should
>
> highlight it in the FLIP). We propose two actions in
>
> FLIP:
>
> 1) MARK_BLOCKLISTED: Just mark the task manager or
>
> node
>
> as
>
> blocked.
>
> Future slots should not be allocated from the blocked
>
> task
>
> manager
>
> or
>
> node.
>
> But slots that are already allocated will not be
>
> affected.
>
> A
>
> typical
>
> application scenario is to mitigate machine hotspots.
>
> In
>
> this
>
> case,
>
> we
>
> hope
>
> that subsequent resource allocations will not be on
>
> the
>
> hot
>
> machine,
>
> but
>
> tasks currently running on it should not be affected.
>
> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>
> task
>
> manager
>
> or
>
> node
>
> as
>
> blocked, and evacuate all tasks on it. Evacuated
>
> tasks
>
> will
>
> be
>
> restarted on
>
> non-blocked task managers.
>
> For the above 2 actions, the former may more
>
> highlight
>
> the
>
> meaning
>
> of
>
> this FLIP, because the external system cannot do
>
> that.
>
> Regarding *Manually* and *Automatically*, I
>
> basically
>
> agree
>
> with
>
> @Becket
>
> Qin: different users have different answers. Not all
>
> users’
>
> deployment
>
> environments have a special external system that can
>
> perform
>
> the
>
> anomaly
>
> detection. In addition, adding pluggable/optional
>
> auto-detection
>
> doesn't
>
> require much extra work on top of manual
>
> specification.
>
> I will answer your other questions one by one.
>
>
> @Yangze
>
> a) I think you are right, we do not need to expose
>
> the
> `cluster.resource-blocklist.item.timeout-check-interval`
>
> to
>
> users.
>
> b) We can abstract the `notifyException` to a
>
> separate
>
> interface
>
> (maybe
>
> BlocklistExceptionListener), and the
>
> ResourceManagerBlocklistHandler
>
> can
>
> implement it in the future.
>
> @Martijn
>
> a) I also think the manual blocking should be done
>
> by
>
> cluster
>
> operators.
>
> b) I think manual blocking makes sense, because
>
> according
>
> to
>
> my
>
> experience, users are often the first to perceive the
>
> machine
>
> problems
>
> (because of job failover or delay), and they will
>
> contact
>
> cluster
>
> operators
>
> to solve it, or even tell the cluster operators which
>
> machine is
>
> problematic. From this point of view, I think the
>
> people
>
> who
>
> really
>
> need
>
> the manual blocking are the users, and it’s just
>
> performed
>
> by
>
> the
>
> cluster
>
> operator, so I think the manual blocking makes sense.
>
> @Chesnay
>
> We need to touch the logic of JM/SlotPool, because
>
> for
>
> MARK_BLOCKLISTED
>
> , we need to know whether the slot is blocklisted
>
> when
>
> the
>
> task
>
> is
>
> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>
> release
>
> the
>
> slot
>
> directly to avoid assigning other tasks (of this job)
>
> on
>
> it.
>
> If
>
> we
>
> only
>
> maintain the blocklist information on the RM, JM
>
> needs
>
> to
>
> retrieve
>
> it
>
> by
>
> RPC. I think the performance overhead of that is
>
> relatively
>
> large,
>
> so
>
> I
>
> think it's worth maintaining the blocklist
>
> information
>
> on
>
> the JM
>
> side
>
> and
>
> syncing them.
>
> @Роман
>
>     a) “Probably storing inside Zookeeper/Configmap
>
> might
>
> be
>
> helpful
>
> here.”  Can you explain it in detail? I don't fully
>
> understand
>
> that.
>
> In
>
> my
>
> opinion, non-active and active are the same, and no
>
> special
>
> treatment
>
> is
>
> required.
>
> b) I agree with you, the `endTimestamp` makes
>
> sense,
>
> I
>
> will
>
> add
>
> it
>
> to
>
> FLIP.
>
> @Yang
>
> As mentioned above, AFAK, the external system
>
> cannot
>
> support
>
> the
>
> MARK_BLOCKLISTED action.
>
> Looking forward to your further feedback.
>
>
> Best,
>
> Lijie
>
>
> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>
> 21:09写道：
>
> Thanks Lijie and Zhu for creating the proposal.
>
> I want to share some thoughts about Flink cluster
>
> operations.
>
> In the production environment, the SRE(aka Site
>
> Reliability
>
> Engineer)
>
> already has many tools to detect the unstable
>
> nodes,
>
> which
>
> could
>
> take
>
> the
>
> system logs/metrics into consideration.
> Then they use graceful-decomission in YARN and
>
> taint
>
> in
>
> K8s
>
> to
>
> prevent
>
> new
>
> allocations on these unstable nodes.
> At last, they will evict all the containers and
>
> pods
>
> running
>
> on
>
> these
>
> nodes.
>
> This mechanism also works for planned maintenance.
>
> So
>
> I
>
> am
>
> afraid
>
> this
>
> is
>
> not the typical use case for FLIP-224.
>
> If we only support to block nodes manually, then I
>
> could
>
> not
>
> see
>
> the obvious advantages compared with current SRE's
>
> approach(via
>
> *yarn
>
> rmadmin or kubectl taint*).
> At least, we need to have a pluggable component
>
> which
>
> could
>
> expose
>
> the
>
> potential unstable nodes automatically and block
>
> them
>
> if
>
> enabled
>
> explicitly.
>
> Best,
> Yang
>
>
>
> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>
> 16:36写道：
>
> Thanks for the proposal, Lijie.
>
> This is an interesting feature and discussion,
>
> and
>
> somewhat
>
> related
>
> to the
>
> design principle about how people should operate
>
> Flink.
>
> I think there are three things involved in this
>
> FLIP.
>
>      a) Detect and report the unstable node.
>      b) Collect the information of the unstable
>
> node
>
> and
>
> form a
>
> blocklist.
>
>      c) Take the action to block nodes.
>
> My two cents:
>
> 1. It looks like people all agree that Flink
>
> should
>
> have
>
> c).
>
> It
>
> is
>
> not only
>
> useful for cases of node failures, but also
>
> handy
>
> for
>
> some
>
> planned
>
> maintenance.
>
> 2. People have different opinions on b), i.e.
>
> who
>
> should be
>
> the
>
> brain
>
> to
>
> make the decision to block a node. I think this
>
> largely
>
> depends
>
> on
>
> who we
>
> talk to. Different users would probably give
>
> different
>
> answers.
>
> For
>
> people
>
> who do have a centralized node health management
>
> service,
>
> let
>
> Flink
>
> do just
>
> do a) and c) would be preferred. So essentially
>
> Flink
>
> would
>
> be
>
> one
>
> of
>
> the
>
> sources that may detect unstable nodes, report
>
> it
>
> to
>
> that
>
> service,
>
> and then
>
> take the command from that service to block the
>
> problematic
>
> nodes.
>
> On
>
> the
>
> other hand, for users who do not have such a
>
> service,
>
> simply
>
> letting
>
> Flink
>
> be clever by itself to block the suspicious
>
> nodes
>
> might
>
> be
>
> desired
>
> to
>
> ensure the jobs are running smoothly.
>
> So that indicates a) and b) here should be
>
> pluggable /
>
> optional.
>
> In light of this, maybe it would make sense to
>
> have
>
> something
>
> pluggable
>
> like a UnstableNodeReporter which exposes
>
> unstable
>
> nodes
>
> actively.
>
> (A
>
> more
>
> general interface should be JobInfoReporter<T>
>
> which
>
> can be
>
> used
>
> to
>
> report
>
> any information of type <T>. But I'll just keep
>
> the
>
> scope
>
> relevant
>
> to
>
> this
>
> FLIP here). Personally speaking, I think it is
>
> OK
>
> to
>
> have a
>
> default
>
> implementation of a reporter which just tells
>
> Flink
>
> to
>
> take
>
> action
>
> to
>
> block
>
> problematic nodes and also unblocks them after
>
> timeout.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>
> ro.v.boyko@gmail.com
>
> wrote:
>
> Thanks for good initiative, Lijie and Zhu!
>
> If it's possible I'd like to participate in
>
> development.
>
> I agree with 3rd point of Konstantin's reply -
>
> we
>
> should
>
> consider
>
> to move
>
> somehow the information of blocklisted
>
> nodes/TMs
>
> from
>
> active
>
> ResourceManager to non-active ones. Probably
>
> storing
>
> inside
>
> Zookeeper/Configmap might be helpful here.
>
> And I agree with Martijn that a lot of
>
> organizations
>
> don't
>
> want
>
> to
>
> expose
>
> such API for a cluster user group. But I think
>
> it's
>
> necessary
>
> to
>
> have the
>
> mechanism for unblocking the nodes/TMs anyway
>
> for
>
> avoiding
>
> incorrect
>
> automatic behaviour.
>
> And another one small suggestion - I think it
>
> would
>
> be
>
> better
>
> to
>
> extend
>
> the
>
> *BlocklistedItem* class with the
>
> *endTimestamp*
>
> field
>
> and
>
> fill
>
> it
>
> at the
>
> item creation. This simple addition will allow
>
> to:
>
>    -
>
>    Provide the ability to users to setup the
>
> exact
>
> time
>
> of
>
> blocklist end
>
>    through RestAPI
>    -
>
>    Not being tied to a single value of
>    *cluster.resource-blacklist.item.timeout*
>
>
> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>
> <
>
> chesnay@apache.org>
>
> wrote:
>
> I do share the concern between blurring the
>
> lines
>
> a
>
> bit.
>
> That said, I'd prefer to not have any
>
> auto-detection
>
> and
>
> only
>
> have an
>
> opt-in mechanism
> to manually block processes/nodes. To me
>
> this
>
> sounds
>
> yet
>
> again
>
> like one
>
> of those
> magical mechanisms that will rarely work
>
> just
>
> right.
>
> An external system can leverage way more
>
> information
>
> after
>
> all.
>
> Moreover, I'm quite concerned about the
>
> complexity
>
> of
>
> this
>
> proposal.
>
> Tracking on both the RM/JM side; syncing
>
> between
>
> components;
>
> adjustments
>
> to the
> slot and resource protocol.
>
> In a way it seems overly complicated.
>
> If we look at it purely from an active
>
> resource
>
> management
>
> perspective,
>
> then there
> isn't really a need to touch the slot
>
> protocol
>
> at
>
> all
>
> (or
>
> in
>
> fact
>
> to
>
> anything in the JobMaster),
> because there isn't any point in keeping
>
> around
>
> blocked
>
> TMs
>
> in
>
> the
>
> first
>
> place.
> They'd just be idling, potentially shutting
>
> down
>
> after
>
> a
>
> while
>
> by
>
> the
>
> RM
>
> because of
> it (unless we _also_ touch that logic).
> Here the blocking of a process (be it by
>
> blocking
>
> the
>
> process
>
> or
>
> node)
>
> is
>
> equivalent with shutting down the blocked
>
> process(es).
>
> Once the block is lifted we can just spin it
>
> back
>
> up.
>
> And I do wonder whether we couldn't apply
>
> the
>
> same
>
> line
>
> of
>
> thinking to
>
> standalone resource management.
> Here being able to stop/restart a
>
> process/node
>
> manually
>
> should
>
> be
>
> a
>
> core
>
> requirement for a Flink deployment anyway.
>
>
> On 02/05/2022 08:49, Martijn Visser wrote:
>
> Hi everyone,
>
> Thanks for creating this FLIP. I can
>
> understand
>
> the
>
> problem
>
> and
>
> I see
>
> value
>
> in the automatic detection and
>
> blocklisting. I
>
> do
>
> have
>
> some
>
> concerns
>
> with
>
> the ability to manually specify to be
>
> blocked
>
> resources.
>
> I
>
> have
>
> two
>
> concerns;
>
> * Most organizations explicitly have a
>
> separation
>
> of
>
> concerns,
>
> meaning
>
> that
>
> there's a group who's responsible for
>
> managing a
>
> cluster
>
> and
>
> there's
>
> a
>
> user
>
> group who uses that cluster. With the
>
> introduction of
>
> this
>
> mechanism,
>
> the
>
> latter group now can influence the
>
> responsibility
>
> of
>
> the
>
> first
>
> group.
>
> So
>
> it
>
> can be possible that someone from the user
>
> group
>
> blocks
>
> something,
>
> which
>
> causes an outage (which could result in
>
> paging
>
> mechanism
>
> triggering
>
> etc)
>
> which impacts the first group.
> * How big is the group of people who can
>
> go
>
> through
>
> the
>
> process
>
> of
>
> manually
>
> identifying a node that isn't behaving as
>
> it
>
> should
>
> be? I
>
> do
>
> think
>
> this
>
> group is relatively limited. Does it then
>
> make
>
> sense
>
> to
>
> introduce
>
> such
>
> a
>
> feature, which would only be used by a
>
> really
>
> small
>
> user
>
> group
>
> of
>
> Flink?
>
> We
>
> still have to maintain, test and support
>
> such
>
> a
>
> feature.
>
> I'm +1 for the autodetection features, but
>
> I'm
>
> leaning
>
> towards
>
> not
>
> exposing
>
> this to the user group but having this
>
> available
>
> strictly
>
> for
>
> cluster
>
> operators. They could then also set up
>
> their
>
> paging/metrics/logging
>
> system
>
> to take this into account.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
>
> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>
> karmagyz@gmail.com
>
> wrote:
>
> Thanks for driving this, Zhu and Lijie.
>
> +1 for the overall proposal. Just share
>
> some
>
> cents
>
> here:
>
> - Why do we need to expose
>
> cluster.resource-blacklist.item.timeout-check-interval
>
> to
>
> the
>
> user?
>
> I think the semantics of
>
> `cluster.resource-blacklist.item.timeout`
>
> is
>
> sufficient for the user. How to guarantee
>
> the
>
> timeout
>
> mechanism is
>
> Flink's internal implementation. I think
>
> it
>
> will
>
> be
>
> very
>
> confusing
>
> and
>
> we do not need to expose it to users.
>
> - ResourceManager can notify the
>
> exception
>
> of a
>
> task
>
> manager to
>
> `BlacklistHandler` as well.
> For example, the slot allocation might
>
> fail
>
> in
>
> case
>
> the
>
> target
>
> task
>
> manager is busy or has a network jitter.
>
> I
>
> don't
>
> mean
>
> we
>
> need
>
> to
>
> cover
>
> this case in this version, but we can
>
> also
>
> open a
>
> `notifyException`
>
> in
>
> `ResourceManagerBlacklistHandler`.
>
> - Before we sync the blocklist to
>
> ResourceManager,
>
> will
>
> the
>
> slot of
>
> a
>
> blocked task manager continues to be
>
> released
>
> and
>
> allocated?
>
> Best,
> Yangze Guo
>
> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>
> Wang
>
> <
>
> wangdachui9501@gmail.com>
>
> wrote:
>
> Hi Konstantin,
>
> Thanks for your feedback. I will
>
> response
>
> your 4
>
> remarks:
>
> 1) Thanks for reminding me of the
>
> controversy. I
>
> think
>
> “BlockList”
>
> is
>
> good
>
> enough, and I will change it in FLIP.
>
>
> 2) Your suggestion for the REST API is a
>
> good
>
> idea.
>
> Based
>
> on
>
> the
>
> above, I
>
> would change REST API as following:
>
> POST/GET <host>/blocklist/nodes
>
> POST/GET <host>/blocklist/taskmanagers
>
> DELETE
>
> <host>/blocklist/node/<identifier>
>
> DELETE
>
> <host>/blocklist/taskmanager/<identifier>
>
> 3) If a node is blocking/blocklisted, it
>
> means
>
> that
>
> all
>
> task
>
> managers
>
> on
>
> this node are blocklisted. All slots on
>
> these
>
> TMs
>
> are
>
> not
>
> available.
>
> This
>
> is actually a bit like TM losts, but
>
> these
>
> TMs
>
> are
>
> not
>
> really
>
> lost,
>
> they
>
> are in an unavailable status, and they
>
> are
>
> still
>
> registered
>
> in this
>
> flink
>
> cluster. They will be available again
>
> once
>
> the
>
> corresponding
>
> blocklist
>
> item
>
> is removed. This behavior is the same in
>
> active/non-active
>
> clusters.
>
> However in the active clusters, these
>
> TMs
>
> may
>
> be
>
> released
>
> due
>
> to
>
> idle
>
> timeouts.
>
>
> 4) For the item timeout, I prefer to
>
> keep
>
> it.
>
> The
>
> reasons
>
> are
>
> as
>
> following:
>
> a) The timeout will not affect users
>
> adding
>
> or
>
> removing
>
> items
>
> via
>
> REST
>
> API,
>
> and users can disable it by configuring
>
> it
>
> to
>
> Long.MAX_VALUE .
>
> b) Some node problems can recover after
>
> a
>
> period of
>
> time
>
> (such as
>
> machine
>
> hotspots), in which case users may
>
> prefer
>
> that
>
> Flink
>
> can
>
> do
>
> this
>
> automatically instead of requiring the
>
> user
>
> to
>
> do
>
> it
>
> manually.
>
> Best,
>
> Lijie
>
> Konstantin Knauf <kn...@apache.org>
>
> 于2022年4月27日周三
>
> 19:23写道：
>
> Hi Lijie,
>
> I think, this makes sense and +1 to
>
> only
>
> support
>
> manually
>
> blocking
>
> taskmanagers and nodes. Maybe the
>
> different
>
> strategies
>
> can
>
> also be
>
> maintained outside of Apache Flink.
>
> A few remarks:
>
> 1) Can we use another term than
>
> "bla.cklist"
>
> due
>
> to
>
> the
>
> controversy
>
> around
>
> the term? [1] There was also a Jira
>
> Ticket
>
> about
>
> this
>
> topic a
>
> while
>
> back
>
> and there was generally a consensus to
>
> avoid
>
> the
>
> term
>
> blacklist &
>
> whitelist
>
> [2]? We could use "blocklist"
>
> "denylist"
>
> or
>
> "quarantined"
>
> 2) For the REST API, I'd prefer a
>
> slightly
>
> different
>
> design
>
> as
>
> verbs
>
> like
>
> add/remove often considered an
>
> anti-pattern
>
> for
>
> REST
>
> APIs.
>
> POST
>
> on a
>
> list
>
> item is generally the standard to add
>
> items.
>
> DELETE
>
> on
>
> the
>
> individual
>
> resource is standard to remove an item.
>
> POST <host>/quarantine/items
> DELETE
>
> <host>/quarantine/items/<itemidentifier>
>
> We could also consider to separate
>
> taskmanagers
>
> and
>
> nodes
>
> in
>
> the
>
> REST
>
> API
>
> (and internal data structures). Any
>
> opinion
>
> on
>
> this?
>
> POST/GET <host>/quarantine/nodes
> POST/GET <host>/quarantine/taskmanager
> DELETE
>
> <host>/quarantine/nodes/<identifier>
>
> DELETE
>
> <host>/quarantine/taskmanager/<identifier>
>
> 3) How would blocking nodes behave with
>
> non-active
>
> resource
>
> managers,
>
> i.e.
>
> standalone or reactive mode?
>
> 4) To keep the implementation even more
>
> minimal,
>
> do
>
> we
>
> need
>
> the
>
> timeout
>
> behavior? If items are added/removed
>
> manually
>
> we
>
> could
>
> delegate
>
> this
>
> to the
>
> user easily. In my opinion the timeout
>
> behavior
>
> would
>
> better
>
> fit
>
> into
>
> specific strategies at a later point.
>
> Looking forward to your thoughts.
>
> Cheers and thank you,
>
> Konstantin
>
> [1]
>
>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>
> [2]
>
> https://issues.apache.org/jira/browse/FLINK-18209
>
> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>
> schrieb
>
> Lijie
>
> Wang
>
> <
>
> wangdachui9501@gmail.com>:
>
> Hi all,
>
> Flink job failures may happen due to
>
> cluster
>
> node
>
> issues
>
> (insufficient
>
> disk
>
> space, bad hardware, network
>
> abnormalities).
>
> Flink
>
> will
>
> take care
>
> of
>
> the
>
> failures and redeploy the tasks.
>
> However,
>
> due
>
> to
>
> data
>
> locality
>
> and
>
> limited
>
> resources, the new tasks are very
>
> likely
>
> to
>
> be
>
> redeployed
>
> to the
>
> same
>
> nodes, which will result in continuous
>
> task
>
> abnormalities
>
> and
>
> affect
>
> job
>
> progress.
>
> Currently, Flink users need to
>
> manually
>
> identify
>
> the
>
> problematic
>
> node and
>
> take it offline to solve this problem.
>
> But
>
> this
>
> approach
>
> has
>
> following
>
> disadvantages:
>
> 1. Taking a node offline can be a
>
> heavy
>
> process.
>
> Users
>
> may
>
> need
>
> to
>
> contact
>
> cluster administors to do this. The
>
> operation
>
> can
>
> even
>
> be
>
> dangerous
>
> and
>
> not
>
> allowed during some important business
>
> events.
>
> 2. Identifying and solving this kind
>
> of
>
> problems
>
> manually
>
> would
>
> be
>
> slow
>
> and
>
> a waste of human resources.
>
> To solve this problem, Zhu Zhu and I
>
> propose
>
> to
>
> introduce a
>
> blacklist
>
> mechanism for Flink to filter out
>
> problematic
>
> resources.
>
> You can find more details in
>
> FLIP-224[1].
>
> Looking
>
> forward
>
> to your
>
> feedback.
>
> [1]
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>
> Best,
>
> Lijie
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk
>
>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.

1)
It's true that if we handle this entirely in the scheduler we may get a 
bunch of slow slots from the RM.
My point is that this isn't necessarily a problem; it depends on how you 
handle those slots.

We anyway need to account for the possibility that we're not getting any 
new fast slots from the RM.
With that in mind, I'd rather not /categorically/ throw away the slow 
slots we got, but try to make use of them somehow.

2)
The idea was to rescale the job when the current stage finishes.
The speculative execution handles slow nodes being detected in the 
current stage,
next stage we use _some_ strategy to handle slow nodes (be it ignoring 
those, rescaling the job, ...).

3)
It's a fair observation that once you push this to the RM you end up 
with a de-facto blocklist :)

On 08/06/2022 17:11, Zhu Zhu wrote:
> Thanks Chesnay for the feedback in the vote thread.
> (https://lists.apache.org/thread/opc7jg3rpxnwotkb0fcn4wnm02m4397o)
>
> I'd like to continue the discussion in this thread so that the
> discussions can be better tracked.
>
> Regarding your questions, here are my thoughts:
> 1. The current locality mechanism does not work well to avoid
> deploying tasks to slow nodes because it cannot proactively reject or
> release slots from the slow nodes. And it cannot help at the resource
> manager side to avoid allocating free slots or launching new
> TaskManagers on slow nodes.
> 2. Dynamically downscaling or upscaling a batch job is usually
> unacceptable because it means to re-run the whole stage.
> 3. Extending the requirement declaration to have a notion of
> "undesirable nodes" is an option. And it is actually how we started.
> Considering implementation details, we found we need that
> - a tracker to collect all the undesirable(detected slow nodes)
> - to filter out slots from undesirable nodes when allocating slots
> from the SlotPool
> - ask the ResourceManager for slots that are not on the undesirable
> nodes. The ResourceManager may further need to ask for new
> TaskManagers that are not on the undesirable nodes.
> Then, with all these functionality, we found that we almost have a
> blocklist mechanism. As blocklist mechanism is a common concept and is
> possible to benefit users, we took this chance to propose the
> blocklist mechanism.
> 4. A cluster-wide shared blocklist is not a must for speculative
> execution at the moment. It is mainly part of a standalone blocklist
> feature to host user specified block items. To avoid sharing job
> specific blocked items between jobs, one way is to add a nullable
> JobID tag for the blocked item.
>
> Thanks,
> Zhu
>
> Zhu Zhu<re...@gmail.com>  于2022年6月7日周二 10:58写道：
>> Hi Chesnay,
>>
>> Would you please take a look at the FLIP and discussion to see if all
>> your concerns have been addressed?
>>
>> Thanks,
>> Zhu
>>
>> Zhu Zhu<re...@gmail.com>  于2022年5月28日周六 13:26写道：
>>> Regarding the concern of the SlotManager, my two cents:
>>> 1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
>>>    a. In standalone mode, a taskmanager may be temporarily added to
>>> the blocklist. We do not want the TM to get disconnected and shut down.
>>> So we need to keep its connection to RM and keep hosting its slots.
>>>    b. When we want to avoid allocating slots to a slow nodes but do not
>>> want to kill current running tasks on the nodes (MARK_BLOCKED mode).
>>>
>>> There is possible a way to keep the connection of a blocked task manager
>>> while hide its slots from SlotManager, but I feel it may be even much more
>>> complicated.
>>>
>>> 2. It will not complicate the SlotManager too much. The SlotManager will
>>> be offered a BlockedTaskManagerChecker when created, and just need
>>> to use it to filter out blocked slots on slot request. Therefore I think the
>>> complication is acceptable.
>>>
>>> Thanks,
>>> Zhu
>>>
>>> Lijie Wang<wa...@gmail.com>  于2022年5月25日周三 15:26写道：
>>>> Hi everyone,
>>>>
>>>> I've updated the FLIP according to Chesnay's feedback, changes as follows:
>>>> 1. Change the GET result to a map.
>>>> 2. Only left *endTimestamp* in ADD operation, and change the rest (from
>>>> POST) to PUT
>>>> 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
>>>> encapsulate blocklist related functions.
>>>> 4. Remove *mainThread* from BlocklistTracker, instead provide a
>>>> *removeTimeoutItems* method to be called by outside components。
>>>>
>>>> Best,
>>>> Lijie
>>>>
>>>> Lijie Wang<wa...@gmail.com>  于2022年5月23日周一 22:51写道：
>>>>
>>>>> Hi Chesnay,
>>>>>
>>>>> Thanks for feedback.
>>>>>
>>>>> 1. Regarding the TM/Node id. Do you mean special characters may appear in
>>>>> the rest URL?  Actually, I don't think so. The task manager id in REST API
>>>>> should be the *ResourceID* of taskmanager in Flink, there should be no
>>>>> special characters, and some existing REST APIs are already using it, e.g.
>>>>> GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
>>>>> id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
>>>>> should also have no special characters.
>>>>> 2. Regarding the GET query responses. I agree with you, it makes sense to
>>>>> change the GET result to a map.
>>>>>
>>>>> 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
>>>>> cover everything, and the endTimestamp is a unix timestamp, there should be
>>>>> no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
>>>>> add rest api is add or update, PUT can cover this semantics.
>>>>>
>>>>> 4. Regarding the slot pool/manager. I don't think the current slotpool
>>>>> and slotmanager are able to support the MARK_BLOCKED(slots that are
>>>>> already allocated will not be affected) action. The reasons are as
>>>>> follows：
>>>>>
>>>>> a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
>>>>> from reserved(task assigned) to free(no task assigned), it is necessary to
>>>>> check whether the slot should be released immediately(it should be released
>>>>> immediately if the task manager is blocked, otherwise it may be allocated
>>>>> to other tasks). I think it cannot be supported without being aware of
>>>>> the blocklist information. Compared to the solution in FLIP, a more
>>>>> appropriate/prefered way may be: Introduce a new slot pool
>>>>> implementation for blocklist(may be named BlocklistSlotPool, it
>>>>> extends/wrapps the original slot pool), and implement the parts that need
>>>>> to be aware of the blocklist in this newly introduced slot pool, and the
>>>>> original slot pool basically does not need to change.
>>>>>
>>>>> b) for slot manager, with the MARK_BLOCKED action, there may be free but
>>>>> blocked slots in slot manager (the corresponding TMs cannot be
>>>>> released/unregistered because there are still running tasks on them).
>>>>> Therefore, we need to filter out the blocked slots when trying to fulfill
>>>>> the slot requirements. Therefore it also needs to know the blocklist information.
>>>>> A better way may be to abstract a resource allocation strategy, and make
>>>>> the blocklist as a special implementation, then pass the resource
>>>>> allocation strategy in when constructing the slot manager. Unfortunately,
>>>>> the data structures in the two existing slot manager
>>>>> implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
>>>>> quite different, it is not easy to abstract a common resource allocation
>>>>> strategy, so we prefer to keep the current way(i.e. pass the blocklist
>>>>> information directly into slot manager).
>>>>>
>>>>>
>>>>> 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
>>>>> does not need to be aware of the executor, and the timeout actions can be
>>>>> done outside.
>>>>>
>>>>> Chesnay Schepler<ch...@apache.org>  于2022年5月20日周五 17:34写道：
>>>>>
>>>>>> I have a number of concerns:
>>>>>>
>>>>>> Is the id used for deleting an item the same sent in the initial request
>>>>>> (and not one returned by Flink)?
>>>>>> I'm very concerned that the tm/node id can contain special characters.
>>>>>>
>>>>>> The GET query should return a map, not a list of items. This makes it
>>>>>> easier to work with.
>>>>>>
>>>>>> The duality of endTimestamp and duration is also concerning.
>>>>>> If we conclude that endTimestamps can in fact work (and aren't utterly
>>>>>> unusable due to timezones),
>>>>>> then this should be able to cover everything and rid us of some
>>>>>> complexity w.r.t. POSTs to the same ID.
>>>>>> Additions would be a PUT, changes a PATCH, deletes a DELETE.
>>>>>>
>>>>>>
>>>>>> I also dislike how we're pushing more functionality into the
>>>>>> slotpool/-manager.
>>>>>> These components are complex enough as-is, and instead I'd propose a
>>>>>> separate component that interacts with the SlotPool/-Manager instead,
>>>>>> for example by removing the slots from that TM.
>>>>>> The reason being that from the slot-pool perspective it is irrelevant
>>>>>> whether a slot is gone because the TM was lost, or because it was blocked.
>>>>>>
>>>>>>
>>>>>> The BlocklistTracker shouldn't be modeled as component that is aware of
>>>>>> the concept of main threads.
>>>>>> It really has no business knowing that; all it needs is an executor for
>>>>>> handling timeouts/periodic actions,
>>>>>> and a way to interact with the JM/RM (which internally can handle the
>>>>>> scheduling into the main thread).
>>>>>>
>>>>>>
>>>>>> On 20/05/2022 07:20, Lijie Wang wrote:
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I have started a vote for this FLIP [1]. Please cast your vote there or
>>>>>> ask
>>>>>>> additional questions here. [1]
>>>>>>> https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
>>>>>>>
>>>>>>> Best, Lijie
>>>>>>>
>>>>>>> Lijie Wang<wa...@gmail.com>  于2022年5月19日周四 17:34写道：
>>>>>>>
>>>>>>>> Hi Konstantin,
>>>>>>>>
>>>>>>>> We found that Flink REST URL does not support the format ":merge" ,
>>>>>> which
>>>>>>>> will be recognized as a parameter in the URL(due to start with a
>>>>>> colon).
>>>>>>>> We will keep the previous way, i.e.
>>>>>>>>
>>>>>>>> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
>>>>>>>> and the "id" and "merge" flag are put into the request body.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Lijie
>>>>>>>>
>>>>>>>> Lijie Wang<wa...@gmail.com>  于2022年5月18日周三 09:35写道：
>>>>>>>>
>>>>>>>>> Hi Weihua,
>>>>>>>>> thanks for feedback.
>>>>>>>>>
>>>>>>>>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
>>>>>>>>> step towards auto-detection.
>>>>>>>>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
>>>>>>>>> exception of insufficient resources.
>>>>>>>>> 3. No. This FLIP won't change the WebUI. The blocklist information
>>>>>> can be
>>>>>>>>> obtained through REST API and metrics.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Lijie
>>>>>>>>>
>>>>>>>>> Weihua Hu<hu...@gmail.com>  于2022年5月17日周二 21:41写道：
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Thanks for creating this FLIP.
>>>>>>>>>> We have implemented an automatic blocklist detection mechanism
>>>>>>>>>> internally, which is indeed very effective for handling node
>>>>>> failures.
>>>>>>>>>> Due to the large number of nodes, although SREs already support
>>>>>>>>>> automatic offline failure nodes, the detection is not 100% accurate
>>>>>> and
>>>>>>>>>> there is some delay.
>>>>>>>>>> So the blocklist mechanism can make flink job recover from failure
>>>>>> much
>>>>>>>>>> faster.
>>>>>>>>>>
>>>>>>>>>> Here are some of my thoughts:
>>>>>>>>>> 1. In this FLIP, it needs users to locate machine failure manually,
>>>>>>>>>> there is a certain cost of use
>>>>>>>>>> 2. What happens if too many nodes are blocked, resulting in
>>>>>> insufficient
>>>>>>>>>> resources? Will there be a special Exception for the user?
>>>>>>>>>> 3. Will we display the blocklist information in the WebUI? The
>>>>>> blocklist
>>>>>>>>>> is for cluster level, and if multiple users share a cluster, some
>>>>>> users may
>>>>>>>>>> be a little confused when resources are not enough, or when
>>>>>> resources are
>>>>>>>>>> applied for more.
>>>>>>>>>>
>>>>>>>>>> Also, Looking forward to the next FLIP on auto-detection.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Weihua
>>>>>>>>>>
>>>>>>>>>>> 2022年5月16日 下午11:22，Lijie Wang<wa...@gmail.com>  写道：
>>>>>>>>>>>
>>>>>>>>>>> Hi Konstantin,
>>>>>>>>>>>
>>>>>>>>>>> Maybe change it to the following:
>>>>>>>>>>>
>>>>>>>>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>>>>>>>>>> Merge is not allowed. If the {id} already exists, return error.
>>>>>>>>>> Otherwise,
>>>>>>>>>>> create a new item.
>>>>>>>>>>>
>>>>>>>>>>> 2. POST: http://
>>>>>>>>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>>>>>>>>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
>>>>>> create
>>>>>>>>>> a
>>>>>>>>>>> new item.
>>>>>>>>>>>
>>>>>>>>>>> WDYT?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Lijie
>>>>>>>>>>>
>>>>>>>>>>> Konstantin Knauf<kn...@apache.org>  于2022年5月16日周一 20:07写道：
>>>>>>>>>>>
>>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>>
>>>>>>>>>>>> hm, maybe the following is more appropriate in that case
>>>>>>>>>>>>
>>>>>>>>>>>> POST: http://
>>>>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Konstantin
>>>>>>>>>>>>
>>>>>>>>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>>>>>>>>>>>> wangdachui9501@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Konstantin,
>>>>>>>>>>>>> thanks for your feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   From what I understand, PUT should be idempotent. However, we
>>>>>> have a
>>>>>>>>>>>>> *timeout* field in the request. This means that initiating the
>>>>>> same
>>>>>>>>>>>> request
>>>>>>>>>>>>> at two different times will lead to different resource status
>>>>>>>>>> (timestamps
>>>>>>>>>>>>> of the items to be removed will be different).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Should we use PUT in this case? WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>
>>>>>>>>>>>>> Konstantin Knauf<kn...@apache.org>  于2022年5月13日周五 17:20写道：
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a
>>>>>> PUT
>>>>>>>>>> on
>>>>>>>>>>>> the
>>>>>>>>>>>>>> resource?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Konstantin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
>>>>>>>>>> made
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> following changes on REST API:
>>>>>>>>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
>>>>>> choose
>>>>>>>>>>>>> one.
>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>> both are specified, will return error.
>>>>>>>>>>>>>>> 2.  If the specified item is already there, the *ADD* operation
>>>>>> has
>>>>>>>>>>>> two
>>>>>>>>>>>>>>> behaviors:  *return error*(default value) or *merge/update*,
>>>>>> and we
>>>>>>>>>>>>> add a
>>>>>>>>>>>>>>> flag to the request body to control it. You can find more
>>>>>> details
>>>>>>>>>>>>> "Public
>>>>>>>>>>>>>>> Interface" section.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there is no more feedback, we will start the vote thread next
>>>>>>>>>>>> week.
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Lijie Wang<wa...@gmail.com>  于2022年5月10日周二 17:14写道：
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Becket Qin,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your suggestions.  I have moved the description of
>>>>>>>>>>>>>>>> configurations, metrics and REST API into "Public Interface"
>>>>>>>>>>>> section,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> made a few updates according to your suggestion.  And in this
>>>>>>>>>> FLIP,
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>> no public java Interfaces or pluggables that users need to
>>>>>>>>>>>> implement
>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Answers for you questions:
>>>>>>>>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>>>>>>>>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>>>>>>>>>>>> items
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> only be added through the REST API, so these 2 action are
>>>>>>>>>> mentioned
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> REST API part (The REST API part has beed moved to public
>>>>>>>>>> interface
>>>>>>>>>>>>>> now).
>>>>>>>>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to
>>>>>> String,
>>>>>>>>>>>> and
>>>>>>>>>>>>>>> allow
>>>>>>>>>>>>>>>> users to specify it via REST API.
>>>>>>>>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>>>>>>>>>>>> above,
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
>>>>>>>>>> ADD
>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>> API to specify when to remove the blocked item. These 2 fields
>>>>>> are
>>>>>>>>>>>>>>>> optional, if neither is specified, it means that the blocked
>>>>>> item
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> permanent and will not be removed. If both are specified, the
>>>>>>>>>>>> minimum
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
>>>>>> the
>>>>>>>>>>>>> time
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> remove the blocked item. To keep the configurations more
>>>>>> minimal,
>>>>>>>>>>>> we
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
>>>>>>>>>> configuration
>>>>>>>>>>>>>>>> option.
>>>>>>>>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
>>>>>>>>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is
>>>>>> maintained
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>>>>>>>>>>>>>>> abstracted
>>>>>>>>>>>>>>>> to other interfaces) will be propagated to different
>>>>>> components.
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Becket Qin<be...@gmail.com>  于2022年5月10日周二 11:26写道：
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for updating the FLIP. It looks like the public
>>>>>> interface
>>>>>>>>>>>>>> section
>>>>>>>>>>>>>>>>> did not fully reflect all the user sensible behavior and API.
>>>>>> Can
>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> put
>>>>>>>>>>>>>>>>> everything that users may be aware of there? That would
>>>>>> include
>>>>>>>>>>>> the
>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>> API, metrics, configurations, public java Interfaces or
>>>>>>>>>> pluggables
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> users may see or implement by themselves, as well as a brief
>>>>>>>>>>>> summary
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> behavior of the public API.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Besides that, I have a few questions:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. According to the conversation in the discussion thread, it
>>>>>>>>>>>> looks
>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
>>>>>>>>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
>>>>>> so,
>>>>>>>>>>>> can
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>>>> that to the public interface as well?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>>>>>>>>>>>>> Throwable
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> is not reflected in the REST API. Should that be included in
>>>>>> the
>>>>>>>>>>>>> query
>>>>>>>>>>>>>>>>> response? And should we change that field to be a String so
>>>>>> users
>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>> specify the cause via the REST API when they block some nodes
>>>>>> /
>>>>>>>>>>>> TMs?
>>>>>>>>>>>>>>>>> 3. Would it be useful to allow users to have different
>>>>>> timeouts
>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> different blocked items? So while there is a default timeout,
>>>>>>>>>>>> users
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> also override it via the REST API when they block an entity.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is
>>>>>> already
>>>>>>>>>>>>>> there,
>>>>>>>>>>>>>>>>> will the block item be overridden? For example, if the user
>>>>>> wants
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> extend
>>>>>>>>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
>>>>>>>>>> command
>>>>>>>>>>>>>>> again?
>>>>>>>>>>>>>>>>> 5. I am not quite familiar with the details of this, but is
>>>>>> there
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>> of truth for the blocked list? I think it might be good to
>>>>>> have a
>>>>>>>>>>>>>> single
>>>>>>>>>>>>>>>>> source of truth for the blocked list and just propagate that
>>>>>> list
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> different components to take the action of actually blocking
>>>>>> the
>>>>>>>>>>>>>>> resource.
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>>>>>>>>>>>> wangdachui9501@gmail.com
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Based on the discussion in the mailing list, I updated the
>>>>>> FLIP
>>>>>>>>>>>>> doc,
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> changes include:
>>>>>>>>>>>>>>>>>> 1. Changed the description of the motivation section to more
>>>>>>>>>>>>> clearly
>>>>>>>>>>>>>>>>>> describe the problem this FLIP is trying to solve.
>>>>>>>>>>>>>>>>>> 2. Only  *Manually* is supported.
>>>>>>>>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Roman Boyko<ro...@gmail.com>  于2022年5月7日周六 19:25写道：
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Lijie!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
>>>>>>>>>>>>>>>>> helpfulhere.”
>>>>>>>>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>> myopinion,
>>>>>>>>>>>>>>>>>>> non-active and active are the same, and no special treatment
>>>>>>>>>>>>>>>>> isrequired.*
>>>>>>>>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
>>>>>>>>>>>>> were
>>>>>>>>>>>>>>>>> talking
>>>>>>>>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
>>>>>>>>>>>>>>>>> ResourceManager).
>>>>>>>>>>>>>>>>>>> And the original question was - how to handle the
>>>>>> blacklisted
>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> the moment of leader change? Should we simply forget about
>>>>>>>>>>>> them
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>> try to
>>>>>>>>>>>>>>>>>>> pre-save that list on the remote storage?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
>>>>>> danrtsey.wy@gmail.com
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>>>>>>>>>>>> it
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> indeed
>>>>>>>>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> support.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Lijie Wang<wa...@gmail.com>  于2022年5月6日周五
>>>>>> 22:18写道：
>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> @Jiangang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
>>>>>>>>>>>> and
>>>>>>>>>>>>>>> mark a
>>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>>>>> blocked?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
>>>>>>>>>>>> in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
>>>>>>>>>>>>> FLIP
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
>>>>>>>>>>>>>>>>> correctness
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
>>>>>>>>>>>> node
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>>>>> should be marked blocked.
>>>>>>>>>>>>>>>>>>>>> By design, the blocklist information will be shared among
>>>>>>>>>>>>> all
>>>>>>>>>>>>>>> jobs
>>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
>>>>>>>>>>>> with
>>>>>>>>>>>>>> RM.
>>>>>>>>>>>>>>>>>>>>> @Martijn
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I agree with Yang Wang on this.
>>>>>>>>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED(Just
>>>>>>>>>>>>>>>>>>>> limits
>>>>>>>>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
>>>>>>>>>>>> on
>>>>>>>>>>>>>> it)
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>>>>>>>> important, and we think that external systems (*yarn
>>>>>>>>>>>> rmadmin
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> kubectl
>>>>>>>>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>> *manually*.
>>>>>>>>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>>>>>>>>>>>> indeed
>>>>>>>>>>>>>>>>> super
>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>> to get right.
>>>>>>>>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
>>>>>>>>>>>>>>> concern.
>>>>>>>>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
>>>>>>>>>>>> in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> FLIP,
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
>>>>>>>>>>>> FLIP
>>>>>>>>>>>>>>>>> provides
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>>>>>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>>>>>>>>>>>>>>>>>>>>> the auto-detection may be done by external systems.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Martijn Visser<ma...@ververica.com>  于2022年5月6日周五
>>>>>>>>>>>>> 19:04写道：
>>>>>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>>>>>>>>>>>> could
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>>>>>>>>>>>>>> approach(via
>>>>>>>>>>>>>>>>>> *yarn
>>>>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I agree with Yang Wang on this.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
>>>>>>>>>>>>>>>>> mechanisms
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>> will rarely work just right.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>>>>>>>>>>>> indeed
>>>>>>>>>>>>>>>>> super
>>>>>>>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> get right.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>>>>>>>>>>>>>>>>>> liujiangangpeng@gmail.com
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
>>>>>>>>>>>>>>> decrease
>>>>>>>>>>>>>>>>>>> great
>>>>>>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
>>>>>>>>>>>>>> inner
>>>>>>>>>>>>>>>>> flink
>>>>>>>>>>>>>>>>>>>>>>> version.
>>>>>>>>>>>>>>>>>>>>>>> Below is something that I care about:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     1. For auto-detecting, I wonder how to make the
>>>>>>>>>>>>> strategy
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> mark a
>>>>>>>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>>>>>>>     blocked? Sometimes the blocked node is hard to be
>>>>>>>>>>>>>>> detected,
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> example,
>>>>>>>>>>>>>>>>>>>>>>>     the upper node or the down node will be blocked when
>>>>>>>>>>>>>>> network
>>>>>>>>>>>>>>>>>>>>>>> unreachable.
>>>>>>>>>>>>>>>>>>>>>>>     2. I see that the strategy is made in JobMaster
>>>>>>>>>>>> side.
>>>>>>>>>>>>>> How
>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>>>>     implementing the similar logic in resource manager?
>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>>>>>>>>>>> multi
>>>>>>>>>>>>>>>>>>>>>>>     jobs can fail on the same bad node and the node
>>>>>>>>>>>> should
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> marked
>>>>>>>>>>>>>>>>>>>>>>> blocked.
>>>>>>>>>>>>>>>>>>>>>>>     If the job makes the strategy, the node may be not
>>>>>>>>>>>>>> marked
>>>>>>>>>>>>>>>>>> blocked
>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>     fail times don't exceed the threshold.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Zhu Zhu<re...@gmail.com>  于2022年5月5日周四 23:35写道：
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thank you for all your feedback!
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
>>>>>>>>>>>> some
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>> thoughts:
>>>>>>>>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
>>>>>>>>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>>>>>>>>>>>>>>>>>>>>>>>> The automatical way should be something built upon
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> blocklist
>>>>>>>>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
>>>>>>>>>>>>>>>>>> configurable
>>>>>>>>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
>>>>>>>>>>>> decouple
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
>>>>>>>>>>>>>> suggested,
>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
>>>>>>>>>>>>> detected.
>>>>>>>>>>>>>>>>>> However,
>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
>>>>>>>>>>>> discussions
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
>>>>>>>>>>>>>> externally
>>>>>>>>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
>>>>>>>>>>>> way
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
>>>>>>>>>>>>>> kill
>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
>>>>>>>>>>>>>>> execution[1]
>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
>>>>>>>>>>>> mirror
>>>>>>>>>>>>>>>>> tasks on
>>>>>>>>>>>>>>>>>>>> slow
>>>>>>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>>>>>>>>>>>>>>>>>>>>>>> Lijie Wang<wa...@gmail.com>  于2022年5月5日周四
>>>>>>>>>>>>>>> 15:56写道：
>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
>>>>>>>>>>>> here
>>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
>>>>>>>>>>>> mechanism
>>>>>>>>>>>>>>>>> (perhaps
>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
>>>>>>>>>>>>>> FLIP:
>>>>>>>>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>>>>>>>>>>>>> node
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>> blocked.
>>>>>>>>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>> node.
>>>>>>>>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
>>>>>>>>>>>>>> affected.
>>>>>>>>>>>>>>> A
>>>>>>>>>>>>>>>>>>> typical
>>>>>>>>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
>>>>>>>>>>>>> In
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>> hope
>>>>>>>>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
>>>>>>>>>>>> the
>>>>>>>>>>>>>> hot
>>>>>>>>>>>>>>>>>>> machine,
>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
>>>>>>>>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
>>>>>>>>>>>> tasks
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>> restarted on
>>>>>>>>>>>>>>>>>>>>>>>> non-blocked task managers.
>>>>>>>>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
>>>>>>>>>>>>> highlight
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> meaning
>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
>>>>>>>>>>>> basically
>>>>>>>>>>>>>>> agree
>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>> @Becket
>>>>>>>>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
>>>>>>>>>>>>>> users’
>>>>>>>>>>>>>>>>>>>> deployment
>>>>>>>>>>>>>>>>>>>>>>>> environments have a special external system that can
>>>>>>>>>>>>>>> perform
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> anomaly
>>>>>>>>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
>>>>>>>>>>>>>>>>> auto-detection
>>>>>>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>>>>>>>>> require much extra work on top of manual
>>>>>>>>>>>> specification.
>>>>>>>>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> @Yangze
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
>>>>>>>>>>>>> the
>>>>>>>>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
>>>>>>>>>>>>> separate
>>>>>>>>>>>>>>>>>> interface
>>>>>>>>>>>>>>>>>>>>>>> (maybe
>>>>>>>>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
>>>>>>>>>>>>>>>>>>> ResourceManagerBlocklistHandler
>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>> implement it in the future.
>>>>>>>>>>>>>>>>>>>>>>>>> @Martijn
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>>>>> operators.
>>>>>>>>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
>>>>>>>>>>>>>> according
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
>>>>>>>>>>>>>>> machine
>>>>>>>>>>>>>>>>>>>> problems
>>>>>>>>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
>>>>>>>>>>>>> contact
>>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>>>>> operators
>>>>>>>>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
>>>>>>>>>>>>>>>>> machine is
>>>>>>>>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
>>>>>>>>>>>>> people
>>>>>>>>>>>>>>> who
>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
>>>>>>>>>>>>>> performed
>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
>>>>>>>>>>>>>>>>>>>>>>>>> @Chesnay
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
>>>>>>>>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
>>>>>>>>>>>> when
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>>>>>>>>>>>>>> release
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> slot
>>>>>>>>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
>>>>>>>>>>>> needs
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> retrieve
>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
>>>>>>>>>>>>>> relatively
>>>>>>>>>>>>>>>>>> large,
>>>>>>>>>>>>>>>>>>>> so
>>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
>>>>>>>>>>>> information
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>> the JM
>>>>>>>>>>>>>>>>>>>> side
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> syncing them.
>>>>>>>>>>>>>>>>>>>>>>>>> @Роман
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>      a) “Probably storing inside Zookeeper/Configmap
>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> helpful
>>>>>>>>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
>>>>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
>>>>>>>>>>>>>> special
>>>>>>>>>>>>>>>>>>>> treatment
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> required.
>>>>>>>>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
>>>>>>>>>>>> sense,
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>> @Yang
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
>>>>>>>>>>>> cannot
>>>>>>>>>>>>>>>>> support
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Yang Wang<da...@gmail.com>  于2022年5月3日周二
>>>>>>>>>>>>>> 21:09写道：
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>>>>>>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
>>>>>>>>>>>>>>>>> Reliability
>>>>>>>>>>>>>>>>>>>>> Engineer)
>>>>>>>>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
>>>>>>>>>>>> nodes,
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
>>>>>>>>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
>>>>>>>>>>>> taint
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> K8s
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> prevent
>>>>>>>>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
>>>>>>>>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
>>>>>>>>>>>> pods
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
>>>>>>>>>>>>> So
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>> afraid
>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>>>>>>>>>>>>>>>>> approach(via
>>>>>>>>>>>>>>>>>>>>> *yarn
>>>>>>>>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>>>>>>>>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>> expose
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
>>>>>>>>>>>>> them
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>>>>>>>>>> explicitly.
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Becket Qin<be...@gmail.com>  于2022年5月2日周一
>>>>>>>>>>>>>> 16:36写道：
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> somewhat
>>>>>>>>>>>>>>>>>>>>> related
>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
>>>>>>>>>>>>>>> Flink.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>       a) Detect and report the unstable node.
>>>>>>>>>>>>>>>>>>>>>>>>>>>       b) Collect the information of the unstable
>>>>>>>>>>>>> node
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> form a
>>>>>>>>>>>>>>>>>>>>>>>> blocklist.
>>>>>>>>>>>>>>>>>>>>>>>>>>>       c) Take the action to block nodes.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> My two cents:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
>>>>>>>>>>>>> should
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> c).
>>>>>>>>>>>>>>>>>>> It
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> not only
>>>>>>>>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
>>>>>>>>>>>> handy
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>> planned
>>>>>>>>>>>>>>>>>>>>>>>>>>> maintenance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
>>>>>>>>>>>> who
>>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> brain
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
>>>>>>>>>>>>>>> largely
>>>>>>>>>>>>>>>>>>> depends
>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>> who we
>>>>>>>>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> answers.
>>>>>>>>>>>>>>>>>>>>> For
>>>>>>>>>>>>>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
>>>>>>>>>>>>>>>>> service,
>>>>>>>>>>>>>>>>>> let
>>>>>>>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>>>>> do just
>>>>>>>>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
>>>>>>>>>>>> it
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> service,
>>>>>>>>>>>>>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
>>>>>>>>>>>>>>>>> problematic
>>>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>>>>> On
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
>>>>>>>>>>>>>> service,
>>>>>>>>>>>>>>>>>> simply
>>>>>>>>>>>>>>>>>>>>>>> letting
>>>>>>>>>>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>> might
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> desired
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
>>>>>>>>>>>>>> pluggable /
>>>>>>>>>>>>>>>>>>>> optional.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>>>>>>> pluggable
>>>>>>>>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
>>>>>>>>>>>> unstable
>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>>> actively.
>>>>>>>>>>>>>>>>>>>>>>> (A
>>>>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> report
>>>>>>>>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>>>>>>>>> relevant
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
>>>>>>>>>>>> OK
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> have a
>>>>>>>>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>>>>>>> action
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> block
>>>>>>>>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
>>>>>>>>>>>>>>> timeout.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>>>>>>>>>>>>>>>>>>>> ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
>>>>>>>>>>>>>>>>> development.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>>>>>>>>>> to move
>>>>>>>>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
>>>>>>>>>>>> nodes/TMs
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>> active
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
>>>>>>>>>>>>>> storing
>>>>>>>>>>>>>>>>>> inside
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
>>>>>>>>>>>>>> organizations
>>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> expose
>>>>>>>>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>>> necessary
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> avoiding
>>>>>>>>>>>>>>>>>>>>>>> incorrect
>>>>>>>>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> extend
>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>>>>>>>>>>>> *endTimestamp*
>>>>>>>>>>>>>>> field
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> fill
>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>>>>>>>>>>>>> to:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     Provide the ability to users to setup the
>>>>>>>>>>>>> exact
>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> blocklist end
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     through RestAPI
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     Not being tied to a single value of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     *cluster.resource-blacklist.item.timeout*
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> chesnay@apache.org>
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>>>>>>>>>>>>>> lines
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> bit.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>>>>>>>>>>>>>>> auto-detection
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>> have an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>>>>>>>>>>>> this
>>>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>>>>>> yet
>>>>>>>>>>>>>>>>>>>>> again
>>>>>>>>>>>>>>>>>>>>>>>> like one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>>>>>>>>>>>> just
>>>>>>>>>>>>>>> right.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
>>>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>>>>>>>>>>>>>> complexity
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>>>> components;
>>>>>>>>>>>>>>>>>>>>>>>>>>> adjustments
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>>>>> management
>>>>>>>>>>>>>>>>>>>>>>>> perspective,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> then there
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>>>>>>>>>>>> protocol
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> (or
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> fact
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>> blocked
>>>>>>>>>>>>>>>>>>> TMs
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> place.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>>>>>>>>>>>>> down
>>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> RM
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> because of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>>>>>>>>>>>>>> blocking
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>>>>>>>>>>>>>>>>> process(es).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>> up.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>>>>>>>>>>>> the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>> line
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>> thinking to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>>>>>>>>>>>> process/node
>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>> core
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> I see
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> value
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
>>>>>>>>>>>>> blocklisting. I
>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>>> concerns
>>>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>>>>>>>>>>>>> blocked
>>>>>>>>>>>>>>>>>>> resources.
>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> concerns;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>>>>>>>>>>>>>>> separation
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> concerns,
>>>>>>>>>>>>>>>>>>>>>>>>>>> meaning
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>>>>>>>>>>>>>> managing a
>>>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> there's
>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>>>>>>>>>>>>>>>>> introduction of
>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> mechanism,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
>>>>>>>>>>>>>>> responsibility
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>>>>> blocks
>>>>>>>>>>>>>>>>>>>>>>>> something,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>>>>>>>>>>>>> paging
>>>>>>>>>>>>>>>>>>> mechanism
>>>>>>>>>>>>>>>>>>>>>>>> triggering
>>>>>>>>>>>>>>>>>>>>>>>>>>>> etc)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>>>>>>>>>>>> go
>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>> be? I
>>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>> sense
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>> small
>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>>>>>>>>>>>>> such
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>>>>>> leaning
>>>>>>>>>>>>>>>>>>>>>>> towards
>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> exposing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>> strictly
>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>>>>>>>> paging/metrics/logging
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>>>>>>>>>>>>>>>>>>>>> karmagyz@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> cents
>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> user?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
>>>>>>>>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>>>>>>>>>>> mechanism is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
>>>>>>>>>>>> it
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>>>>>>>>>>>> exception
>>>>>>>>>>>>>> of a
>>>>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>>>> manager to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>>>>>>>>>>>> fail
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>>>>>>>>>>>> I
>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> cover
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>>>>>>>>>>>> also
>>>>>>>>>>>>>>> open a
>>>>>>>>>>>>>>>>>>>>>>>> `notifyException`
>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
>>>>>>>>>>>>>>>>> ResourceManager,
>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> slot of
>>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
>>>>>>>>>>>>> released
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> allocated?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>>>>>>>>>>>> Wang
>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
>>>>>>>>>>>> response
>>>>>>>>>>>>>>> your 4
>>>>>>>>>>>>>>>>>>>> remarks:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>>>>>>>>>>>>>>> controversy. I
>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>> “BlockList”
>>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>> Based
>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> above, I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>>> <host>/blocklist/node/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>>>>>> <host>/blocklist/taskmanager/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
>>>>>>>>>>>>>> means
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>>>>>>>> managers
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>> TMs
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>> available.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
>>>>>>>>>>>> these
>>>>>>>>>>>>>> TMs
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>>>>>>> lost,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
>>>>>>>>>>>> are
>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>>>> registered
>>>>>>>>>>>>>>>>>>>>>>>> in this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> flink
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
>>>>>>>>>>>> once
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> corresponding
>>>>>>>>>>>>>>>>>>>>>>>>>>>> blocklist
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> item
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
>>>>>>>>>>>>>>>>>>>>> active/non-active
>>>>>>>>>>>>>>>>>>>>>>>>>>> clusters.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
>>>>>>>>>>>> TMs
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> released
>>>>>>>>>>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> idle
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
>>>>>>>>>>>> keep
>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>>>>>>>>>>>>> adding
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>> removing
>>>>>>>>>>>>>>>>>>>>>>> items
>>>>>>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>>>>>>>>>>>> it
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> Long.MAX_VALUE .
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> period of
>>>>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>>>>>>>> (such as
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> machine
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>>>>>>>>>>>> prefer
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>>>>>>>>>>>>> user
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>> manually.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf<kn...@apache.org>
>>>>>>>>>>>>>>>>> 于2022年4月27日周三
>>>>>>>>>>>>>>>>>>>>>>> 19:23写道：
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>> support
>>>>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>>>>> blocking
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>> also be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
>>>>>>>>>>>>>> "bla.cklist"
>>>>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> controversy
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
>>>>>>>>>>>>> Ticket
>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> topic a
>>>>>>>>>>>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
>>>>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>>>>>>>> blacklist &
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> whitelist
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
>>>>>>>>>>>> "denylist"
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> "quarantined"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
>>>>>>>>>>>>> slightly
>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>>>> verbs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
>>>>>>>>>>>>> anti-pattern
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>> APIs.
>>>>>>>>>>>>>>>>>>>>>>>> POST
>>>>>>>>>>>>>>>>>>>>>>>>>>> on a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
>>>>>>>>>>>>>> items.
>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>>>>>> <host>/quarantine/items/<itemidentifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>>>>>>>>>>>>>>> taskmanagers
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>>>>>>>>>>>>> opinion
>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>>>> <host>/quarantine/nodes/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>>>>>>>>>>>>>>>>> non-active
>>>>>>>>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>>>>>>>>>>>>>> managers,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>>>>>>>>>>>>>>>>> minimal,
>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>>>>>> delegate
>>>>>>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>>>>>>>>>>>>>>> behavior
>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>>>>>>>>>>>>> schrieb
>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>> Wang
>>>>>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>>>>>>>>>>>>>> abnormalities).
>>>>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>> take care
>>>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>>>>>>>>>>>>> However,
>>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>>>> locality
>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> limited
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>>>>>>>>>>>> likely
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> redeployed
>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>> abnormalities
>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> affect
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>> identify
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> problematic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> node and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
>>>>>>>>>>>> heavy
>>>>>>>>>>>>>>>>> process.
>>>>>>>>>>>>>>>>>>>> Users
>>>>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> contact
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
>>>>>>>>>>>>>>> operation
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>> dangerous
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
>>>>>>>>>>>>>>> events.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> problems
>>>>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> slow
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>>>>>>>>>>>>>> propose
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> blacklist
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>>>>>>>>>>>>>>> problematic
>>>>>>>>>>>>>>>>>>>>> resources.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
>>>>>>>>>>>>> FLIP-224[1].
>>>>>>>>>>>>>>>>> Looking
>>>>>>>>>>>>>>>>>>>>> forward
>>>>>>>>>>>>>>>>>>>>>>>> to your
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>>>>>>>> e.:ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> https://twitter.com/snntrable
>>>>>>>>>>>>>> https://github.com/knaufk
>>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> https://twitter.com/snntrable
>>>>>>>>>>>> https://github.com/knaufk
>>>>>>>>>>>>
>>>>>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.

Thanks Chesnay for the feedback in the vote thread.
(https://lists.apache.org/thread/opc7jg3rpxnwotkb0fcn4wnm02m4397o)

I'd like to continue the discussion in this thread so that the
discussions can be better tracked.

Regarding your questions, here are my thoughts:
1. The current locality mechanism does not work well to avoid
deploying tasks to slow nodes because it cannot proactively reject or
release slots from the slow nodes. And it cannot help at the resource
manager side to avoid allocating free slots or launching new
TaskManagers on slow nodes.
2. Dynamically downscaling or upscaling a batch job is usually
unacceptable because it means to re-run the whole stage.
3. Extending the requirement declaration to have a notion of
"undesirable nodes" is an option. And it is actually how we started.
Considering implementation details, we found we need that
- a tracker to collect all the undesirable(detected slow nodes)
- to filter out slots from undesirable nodes when allocating slots
from the SlotPool
- ask the ResourceManager for slots that are not on the undesirable
nodes. The ResourceManager may further need to ask for new
TaskManagers that are not on the undesirable nodes.
Then, with all these functionality, we found that we almost have a
blocklist mechanism. As blocklist mechanism is a common concept and is
possible to benefit users, we took this chance to propose the
blocklist mechanism.
4. A cluster-wide shared blocklist is not a must for speculative
execution at the moment. It is mainly part of a standalone blocklist
feature to host user specified block items. To avoid sharing job
specific blocked items between jobs, one way is to add a nullable
JobID tag for the blocked item.

Thanks,
Zhu

Zhu Zhu <re...@gmail.com> 于2022年6月7日周二 10:58写道：
>
> Hi Chesnay,
>
> Would you please take a look at the FLIP and discussion to see if all
> your concerns have been addressed?
>
> Thanks,
> Zhu
>
> Zhu Zhu <re...@gmail.com> 于2022年5月28日周六 13:26写道：
> >
> > Regarding the concern of the SlotManager, my two cents:
> > 1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
> >   a. In standalone mode, a taskmanager may be temporarily added to
> > the blocklist. We do not want the TM to get disconnected and shut down.
> > So we need to keep its connection to RM and keep hosting its slots.
> >   b. When we want to avoid allocating slots to a slow nodes but do not
> > want to kill current running tasks on the nodes (MARK_BLOCKED mode).
> >
> > There is possible a way to keep the connection of a blocked task manager
> > while hide its slots from SlotManager, but I feel it may be even much more
> > complicated.
> >
> > 2. It will not complicate the SlotManager too much. The SlotManager will
> > be offered a BlockedTaskManagerChecker when created, and just need
> > to use it to filter out blocked slots on slot request. Therefore I think the
> > complication is acceptable.
> >
> > Thanks,
> > Zhu
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
> > >
> > > Hi everyone,
> > >
> > > I've updated the FLIP according to Chesnay's feedback, changes as follows:
> > > 1. Change the GET result to a map.
> > > 2. Only left *endTimestamp* in ADD operation, and change the rest (from
> > > POST) to PUT
> > > 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
> > > encapsulate blocklist related functions.
> > > 4. Remove *mainThread* from BlocklistTracker, instead provide a
> > > *removeTimeoutItems* method to be called by outside components。
> > >
> > > Best,
> > > Lijie
> > >
> > > Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
> > >
> > > > Hi Chesnay,
> > > >
> > > > Thanks for feedback.
> > > >
> > > > 1. Regarding the TM/Node id. Do you mean special characters may appear in
> > > > the rest URL?  Actually, I don't think so. The task manager id in REST API
> > > > should be the *ResourceID* of taskmanager in Flink, there should be no
> > > > special characters, and some existing REST APIs are already using it, e.g.
> > > > GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
> > > > id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
> > > > should also have no special characters.
> > > > 2. Regarding the GET query responses. I agree with you, it makes sense to
> > > > change the GET result to a map.
> > > >
> > > > 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> > > > cover everything, and the endTimestamp is a unix timestamp, there should be
> > > > no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
> > > > add rest api is add or update, PUT can cover this semantics.
> > > >
> > > > 4. Regarding the slot pool/manager. I don't think the current slotpool
> > > > and slotmanager are able to support the MARK_BLOCKED(slots that are
> > > > already allocated will not be affected) action. The reasons are as
> > > > follows：
> > > >
> > > > a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
> > > > from reserved(task assigned) to free(no task assigned), it is necessary to
> > > > check whether the slot should be released immediately(it should be released
> > > > immediately if the task manager is blocked, otherwise it may be allocated
> > > > to other tasks). I think it cannot be supported without being aware of
> > > > the blocklist information. Compared to the solution in FLIP, a more
> > > > appropriate/prefered way may be: Introduce a new slot pool
> > > > implementation for blocklist(may be named BlocklistSlotPool, it
> > > > extends/wrapps the original slot pool), and implement the parts that need
> > > > to be aware of the blocklist in this newly introduced slot pool, and the
> > > > original slot pool basically does not need to change.
> > > >
> > > > b) for slot manager, with the MARK_BLOCKED action, there may be free but
> > > > blocked slots in slot manager (the corresponding TMs cannot be
> > > > released/unregistered because there are still running tasks on them).
> > > > Therefore, we need to filter out the blocked slots when trying to fulfill
> > > > the slot requirements. Therefore it also needs to know the blocklist information.
> > > > A better way may be to abstract a resource allocation strategy, and make
> > > > the blocklist as a special implementation, then pass the resource
> > > > allocation strategy in when constructing the slot manager. Unfortunately,
> > > > the data structures in the two existing slot manager
> > > > implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
> > > > quite different, it is not easy to abstract a common resource allocation
> > > > strategy, so we prefer to keep the current way(i.e. pass the blocklist
> > > > information directly into slot manager).
> > > >
> > > >
> > > > 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
> > > > does not need to be aware of the executor, and the timeout actions can be
> > > > done outside.
> > > >
> > > > Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
> > > >
> > > >> I have a number of concerns:
> > > >>
> > > >> Is the id used for deleting an item the same sent in the initial request
> > > >> (and not one returned by Flink)?
> > > >> I'm very concerned that the tm/node id can contain special characters.
> > > >>
> > > >> The GET query should return a map, not a list of items. This makes it
> > > >> easier to work with.
> > > >>
> > > >> The duality of endTimestamp and duration is also concerning.
> > > >> If we conclude that endTimestamps can in fact work (and aren't utterly
> > > >> unusable due to timezones),
> > > >> then this should be able to cover everything and rid us of some
> > > >> complexity w.r.t. POSTs to the same ID.
> > > >> Additions would be a PUT, changes a PATCH, deletes a DELETE.
> > > >>
> > > >>
> > > >> I also dislike how we're pushing more functionality into the
> > > >> slotpool/-manager.
> > > >> These components are complex enough as-is, and instead I'd propose a
> > > >> separate component that interacts with the SlotPool/-Manager instead,
> > > >> for example by removing the slots from that TM.
> > > >> The reason being that from the slot-pool perspective it is irrelevant
> > > >> whether a slot is gone because the TM was lost, or because it was blocked.
> > > >>
> > > >>
> > > >> The BlocklistTracker shouldn't be modeled as component that is aware of
> > > >> the concept of main threads.
> > > >> It really has no business knowing that; all it needs is an executor for
> > > >> handling timeouts/periodic actions,
> > > >> and a way to interact with the JM/RM (which internally can handle the
> > > >> scheduling into the main thread).
> > > >>
> > > >>
> > > >> On 20/05/2022 07:20, Lijie Wang wrote:
> > > >> > Hi everyone,
> > > >> >
> > > >> > I have started a vote for this FLIP [1]. Please cast your vote there or
> > > >> ask
> > > >> > additional questions here. [1]
> > > >> > https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> > > >> >
> > > >> > Best, Lijie
> > > >> >
> > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
> > > >> >
> > > >> >> Hi Konstantin,
> > > >> >>
> > > >> >> We found that Flink REST URL does not support the format ":merge" ,
> > > >> which
> > > >> >> will be recognized as a parameter in the URL(due to start with a
> > > >> colon).
> > > >> >>
> > > >> >> We will keep the previous way, i.e.
> > > >> >>
> > > >> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> > > >> >> and the "id" and "merge" flag are put into the request body.
> > > >> >>
> > > >> >> Best,
> > > >> >> Lijie
> > > >> >>
> > > >> >> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
> > > >> >>
> > > >> >>> Hi Weihua,
> > > >> >>> thanks for feedback.
> > > >> >>>
> > > >> >>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> > > >> >>> step towards auto-detection.
> > > >> >>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> > > >> >>> exception of insufficient resources.
> > > >> >>> 3. No. This FLIP won't change the WebUI. The blocklist information
> > > >> can be
> > > >> >>> obtained through REST API and metrics.
> > > >> >>>
> > > >> >>> Best,
> > > >> >>> Lijie
> > > >> >>>
> > > >> >>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
> > > >> >>>
> > > >> >>>> Hi,
> > > >> >>>> Thanks for creating this FLIP.
> > > >> >>>> We have implemented an automatic blocklist detection mechanism
> > > >> >>>> internally, which is indeed very effective for handling node
> > > >> failures.
> > > >> >>>> Due to the large number of nodes, although SREs already support
> > > >> >>>> automatic offline failure nodes, the detection is not 100% accurate
> > > >> and
> > > >> >>>> there is some delay.
> > > >> >>>> So the blocklist mechanism can make flink job recover from failure
> > > >> much
> > > >> >>>> faster.
> > > >> >>>>
> > > >> >>>> Here are some of my thoughts:
> > > >> >>>> 1. In this FLIP, it needs users to locate machine failure manually,
> > > >> >>>> there is a certain cost of use
> > > >> >>>> 2. What happens if too many nodes are blocked, resulting in
> > > >> insufficient
> > > >> >>>> resources? Will there be a special Exception for the user?
> > > >> >>>> 3. Will we display the blocklist information in the WebUI? The
> > > >> blocklist
> > > >> >>>> is for cluster level, and if multiple users share a cluster, some
> > > >> users may
> > > >> >>>> be a little confused when resources are not enough, or when
> > > >> resources are
> > > >> >>>> applied for more.
> > > >> >>>>
> > > >> >>>> Also, Looking forward to the next FLIP on auto-detection.
> > > >> >>>>
> > > >> >>>> Best,
> > > >> >>>> Weihua
> > > >> >>>>
> > > >> >>>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> > > >> >>>>>
> > > >> >>>>> Hi Konstantin,
> > > >> >>>>>
> > > >> >>>>> Maybe change it to the following:
> > > >> >>>>>
> > > >> >>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > > >> >>>>> Merge is not allowed. If the {id} already exists, return error.
> > > >> >>>> Otherwise,
> > > >> >>>>> create a new item.
> > > >> >>>>>
> > > >> >>>>> 2. POST: http://
> > > >> >>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> > > >> >>>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
> > > >> create
> > > >> >>>> a
> > > >> >>>>> new item.
> > > >> >>>>>
> > > >> >>>>> WDYT?
> > > >> >>>>>
> > > >> >>>>> Best,
> > > >> >>>>> Lijie
> > > >> >>>>>
> > > >> >>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> > > >> >>>>>
> > > >> >>>>>> Hi Lijie,
> > > >> >>>>>>
> > > >> >>>>>> hm, maybe the following is more appropriate in that case
> > > >> >>>>>>
> > > >> >>>>>> POST: http://
> > > >> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> > > >> >>>>>>
> > > >> >>>>>> Best,
> > > >> >>>>>>
> > > >> >>>>>> Konstantin
> > > >> >>>>>>
> > > >> >>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> > > >> >>>>>> wangdachui9501@gmail.com>:
> > > >> >>>>>>
> > > >> >>>>>>> Hi Konstantin,
> > > >> >>>>>>> thanks for your feedback.
> > > >> >>>>>>>
> > > >> >>>>>>>  From what I understand, PUT should be idempotent. However, we
> > > >> have a
> > > >> >>>>>>> *timeout* field in the request. This means that initiating the
> > > >> same
> > > >> >>>>>> request
> > > >> >>>>>>> at two different times will lead to different resource status
> > > >> >>>> (timestamps
> > > >> >>>>>>> of the items to be removed will be different).
> > > >> >>>>>>>
> > > >> >>>>>>> Should we use PUT in this case? WDYT?
> > > >> >>>>>>>
> > > >> >>>>>>> Best,
> > > >> >>>>>>> Lijie
> > > >> >>>>>>>
> > > >> >>>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> > > >> >>>>>>>
> > > >> >>>>>>>> Hi Lijie,
> > > >> >>>>>>>>
> > > >> >>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a
> > > >> PUT
> > > >> >>>> on
> > > >> >>>>>> the
> > > >> >>>>>>>> resource?
> > > >> >>>>>>>>
> > > >> >>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > > >> >>>>>>>>
> > > >> >>>>>>>> Best,
> > > >> >>>>>>>>
> > > >> >>>>>>>> Konstantin
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> > > >> >>>>>>>> wangdachui9501@gmail.com>:
> > > >> >>>>>>>>
> > > >> >>>>>>>>> Hi everyone,
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
> > > >> >>>> made
> > > >> >>>>>>> the
> > > >> >>>>>>>>> following changes on REST API:
> > > >> >>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
> > > >> choose
> > > >> >>>>>>> one.
> > > >> >>>>>>>> If
> > > >> >>>>>>>>> both are specified, will return error.
> > > >> >>>>>>>>> 2.  If the specified item is already there, the *ADD* operation
> > > >> has
> > > >> >>>>>> two
> > > >> >>>>>>>>> behaviors:  *return error*(default value) or *merge/update*,
> > > >> and we
> > > >> >>>>>>> add a
> > > >> >>>>>>>>> flag to the request body to control it. You can find more
> > > >> details
> > > >> >>>>>>> "Public
> > > >> >>>>>>>>> Interface" section.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> If there is no more feedback, we will start the vote thread next
> > > >> >>>>>> week.
> > > >> >>>>>>>>> Best,
> > > >> >>>>>>>>> Lijie
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>> Hi Becket Qin,
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Thanks for your suggestions.  I have moved the description of
> > > >> >>>>>>>>>> configurations, metrics and REST API into "Public Interface"
> > > >> >>>>>> section,
> > > >> >>>>>>>> and
> > > >> >>>>>>>>>> made a few updates according to your suggestion.  And in this
> > > >> >>>> FLIP,
> > > >> >>>>>>>> there
> > > >> >>>>>>>>>> no public java Interfaces or pluggables that users need to
> > > >> >>>>>> implement
> > > >> >>>>>>> by
> > > >> >>>>>>>>>> themselves.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Answers for you questions:
> > > >> >>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > > >> >>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> > > >> >>>>>> items
> > > >> >>>>>>>> can
> > > >> >>>>>>>>>> only be added through the REST API, so these 2 action are
> > > >> >>>> mentioned
> > > >> >>>>>>> in
> > > >> >>>>>>>>> the
> > > >> >>>>>>>>>> REST API part (The REST API part has beed moved to public
> > > >> >>>> interface
> > > >> >>>>>>>> now).
> > > >> >>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to
> > > >> String,
> > > >> >>>>>> and
> > > >> >>>>>>>>> allow
> > > >> >>>>>>>>>> users to specify it via REST API.
> > > >> >>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
> > > >> >>>>>> above,
> > > >> >>>>>>>> we
> > > >> >>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
> > > >> >>>> ADD
> > > >> >>>>>>>> REST
> > > >> >>>>>>>>>> API to specify when to remove the blocked item. These 2 fields
> > > >> are
> > > >> >>>>>>>>>> optional, if neither is specified, it means that the blocked
> > > >> item
> > > >> >>>>>> is
> > > >> >>>>>>>>>> permanent and will not be removed. If both are specified, the
> > > >> >>>>>> minimum
> > > >> >>>>>>>> of
> > > >> >>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
> > > >> the
> > > >> >>>>>>> time
> > > >> >>>>>>>> to
> > > >> >>>>>>>>>> remove the blocked item. To keep the configurations more
> > > >> minimal,
> > > >> >>>>>> we
> > > >> >>>>>>>> have
> > > >> >>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
> > > >> >>>> configuration
> > > >> >>>>>>>>>> option.
> > > >> >>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
> > > >> >>>>>>> already
> > > >> >>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
> > > >> >>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is
> > > >> maintained
> > > >> >>>>>> in
> > > >> >>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> > > >> >>>>>>>>> abstracted
> > > >> >>>>>>>>>> to other interfaces) will be propagated to different
> > > >> components.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Best,
> > > >> >>>>>>>>>> Lijie
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>> Hi Lijie,
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Thanks for updating the FLIP. It looks like the public
> > > >> interface
> > > >> >>>>>>>> section
> > > >> >>>>>>>>>>> did not fully reflect all the user sensible behavior and API.
> > > >> Can
> > > >> >>>>>>> you
> > > >> >>>>>>>>> put
> > > >> >>>>>>>>>>> everything that users may be aware of there? That would
> > > >> include
> > > >> >>>>>> the
> > > >> >>>>>>>> REST
> > > >> >>>>>>>>>>> API, metrics, configurations, public java Interfaces or
> > > >> >>>> pluggables
> > > >> >>>>>>>> that
> > > >> >>>>>>>>>>> users may see or implement by themselves, as well as a brief
> > > >> >>>>>> summary
> > > >> >>>>>>>> of
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>> behavior of the public API.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Besides that, I have a few questions:
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> 1. According to the conversation in the discussion thread, it
> > > >> >>>>>> looks
> > > >> >>>>>>>> like
> > > >> >>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
> > > >> >>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
> > > >> so,
> > > >> >>>>>> can
> > > >> >>>>>>>> you
> > > >> >>>>>>>>>>> add
> > > >> >>>>>>>>>>> that to the public interface as well?
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
> > > >> >>>>>>> Throwable
> > > >> >>>>>>>>> and
> > > >> >>>>>>>>>>> is not reflected in the REST API. Should that be included in
> > > >> the
> > > >> >>>>>>> query
> > > >> >>>>>>>>>>> response? And should we change that field to be a String so
> > > >> users
> > > >> >>>>>>> may
> > > >> >>>>>>>>>>> specify the cause via the REST API when they block some nodes
> > > >> /
> > > >> >>>>>> TMs?
> > > >> >>>>>>>>>>> 3. Would it be useful to allow users to have different
> > > >> timeouts
> > > >> >>>>>> for
> > > >> >>>>>>>>>>> different blocked items? So while there is a default timeout,
> > > >> >>>>>> users
> > > >> >>>>>>>> can
> > > >> >>>>>>>>>>> also override it via the REST API when they block an entity.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is
> > > >> already
> > > >> >>>>>>>> there,
> > > >> >>>>>>>>>>> will the block item be overridden? For example, if the user
> > > >> wants
> > > >> >>>>>> to
> > > >> >>>>>>>>>>> extend
> > > >> >>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
> > > >> >>>> command
> > > >> >>>>>>>>> again?
> > > >> >>>>>>>>>>> 5. I am not quite familiar with the details of this, but is
> > > >> there
> > > >> >>>>>> a
> > > >> >>>>>>>>> source
> > > >> >>>>>>>>>>> of truth for the blocked list? I think it might be good to
> > > >> have a
> > > >> >>>>>>>> single
> > > >> >>>>>>>>>>> source of truth for the blocked list and just propagate that
> > > >> list
> > > >> >>>>>> to
> > > >> >>>>>>>>>>> different components to take the action of actually blocking
> > > >> the
> > > >> >>>>>>>>> resource.
> > > >> >>>>>>>>>>> Thanks,
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Jiangjie (Becket) Qin
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> > > >> >>>>>> wangdachui9501@gmail.com
> > > >> >>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>>> Hi everyone,
> > > >> >>>>>>>>>>>>
> > > >> >>>>>>>>>>>> Based on the discussion in the mailing list, I updated the
> > > >> FLIP
> > > >> >>>>>>> doc,
> > > >> >>>>>>>>> the
> > > >> >>>>>>>>>>>> changes include:
> > > >> >>>>>>>>>>>> 1. Changed the description of the motivation section to more
> > > >> >>>>>>> clearly
> > > >> >>>>>>>>>>>> describe the problem this FLIP is trying to solve.
> > > >> >>>>>>>>>>>> 2. Only  *Manually* is supported.
> > > >> >>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
> > > >> >>>>>>>>>>>>
> > > >> >>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>> Lijie
> > > >> >>>>>>>>>>>>
> > > >> >>>>>>>>>>>>
> > > >> >>>>>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> > > >> >>>>>>>>>>>>
> > > >> >>>>>>>>>>>>> Hi Lijie!
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
> > > >> >>>>>>>>>>> helpfulhere.”
> > > >> >>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
> > > >> >>>>>> In
> > > >> >>>>>>>>>>>> myopinion,
> > > >> >>>>>>>>>>>>> non-active and active are the same, and no special treatment
> > > >> >>>>>>>>>>> isrequired.*
> > > >> >>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
> > > >> >>>>>>> were
> > > >> >>>>>>>>>>> talking
> > > >> >>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
> > > >> >>>>>>>>>>> ResourceManager).
> > > >> >>>>>>>>>>>>> And the original question was - how to handle the
> > > >> blacklisted
> > > >> >>>>>>>> nodes
> > > >> >>>>>>>>>>> list
> > > >> >>>>>>>>>>>> at
> > > >> >>>>>>>>>>>>> the moment of leader change? Should we simply forget about
> > > >> >>>>>> them
> > > >> >>>>>>> or
> > > >> >>>>>>>>>>> try to
> > > >> >>>>>>>>>>>>> pre-save that list on the remote storage?
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
> > > >> danrtsey.wy@gmail.com
> > > >> >>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> > > >> >>>>>> it
> > > >> >>>>>>> is
> > > >> >>>>>>>>>>> indeed
> > > >> >>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
> > > >> >>>>>>>> could
> > > >> >>>>>>>>>>> not
> > > >> >>>>>>>>>>>>>> support.
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>> Yang
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
> > > >> 22:18写道：
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> @Jiangang
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
> > > >> >>>>>> and
> > > >> >>>>>>>>> mark a
> > > >> >>>>>>>>>>>> node
> > > >> >>>>>>>>>>>>>>> blocked?
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
> > > >> >>>>>> in
> > > >> >>>>>>>> this
> > > >> >>>>>>>>>>>> FLIP.
> > > >> >>>>>>>>>>>>>> The
> > > >> >>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
> > > >> >>>>>>> FLIP
> > > >> >>>>>>>> in
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
> > > >> >>>>>>>>>>> correctness
> > > >> >>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
> > > >> >>>>>>> in
> > > >> >>>>>>>>> the
> > > >> >>>>>>>>>>>>> future.
> > > >> >>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
> > > >> >>>>>> node
> > > >> >>>>>>>> and
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>> node
> > > >> >>>>>>>>>>>>>>> should be marked blocked.
> > > >> >>>>>>>>>>>>>>> By design, the blocklist information will be shared among
> > > >> >>>>>>> all
> > > >> >>>>>>>>> jobs
> > > >> >>>>>>>>>>>> in a
> > > >> >>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
> > > >> >>>>>> with
> > > >> >>>>>>>> RM.
> > > >> >>>>>>>>>>>>>>> @Martijn
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> > > >> >>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
> > > >> >>>>>>>>>>> MARK_BLOCKLISTED(Just
> > > >> >>>>>>>>>>>>>> limits
> > > >> >>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
> > > >> >>>>>> on
> > > >> >>>>>>>> it)
> > > >> >>>>>>>>>>> is
> > > >> >>>>>>>>>>>>> also
> > > >> >>>>>>>>>>>>>>> important, and we think that external systems (*yarn
> > > >> >>>>>> rmadmin
> > > >> >>>>>>>> or
> > > >> >>>>>>>>>>>> kubectl
> > > >> >>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
> > > >> >>>>>>>> only
> > > >> >>>>>>>>>>>>>> *manually*.
> > > >> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> > > >> >>>>>>> indeed
> > > >> >>>>>>>>>>> super
> > > >> >>>>>>>>>>>>> hard
> > > >> >>>>>>>>>>>>>>> to get right.
> > > >> >>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
> > > >> >>>>>>>>> concern.
> > > >> >>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
> > > >> >>>>>> in
> > > >> >>>>>>>> this
> > > >> >>>>>>>>>>>> FLIP,
> > > >> >>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
> > > >> >>>>>> FLIP
> > > >> >>>>>>>>>>> provides
> > > >> >>>>>>>>>>>> a
> > > >> >>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
> > > >> >>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > > >> >>>>>>>>>>>>>>> the auto-detection may be done by external systems.
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>>> Lijie
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> > > >> >>>>>>> 19:04写道：
> > > >> >>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> > > >> >>>>>> could
> > > >> >>>>>>>> not
> > > >> >>>>>>>>>>> see
> > > >> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> > > >> >>>>>>>>> approach(via
> > > >> >>>>>>>>>>>> *yarn
> > > >> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
> > > >> >>>>>>>>>>> mechanisms
> > > >> >>>>>>>>>>>>>> that
> > > >> >>>>>>>>>>>>>>>> will rarely work just right.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> > > >> >>>>>>> indeed
> > > >> >>>>>>>>>>> super
> > > >> >>>>>>>>>>>>> hard
> > > >> >>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>> get right.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> Best regards,
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> Martijn
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > > >> >>>>>>>>>>>> liujiangangpeng@gmail.com
> > > >> >>>>>>>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
> > > >> >>>>>>>>> decrease
> > > >> >>>>>>>>>>>>> great
> > > >> >>>>>>>>>>>>>>> work
> > > >> >>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
> > > >> >>>>>>>> inner
> > > >> >>>>>>>>>>> flink
> > > >> >>>>>>>>>>>>>>>>> version.
> > > >> >>>>>>>>>>>>>>>>> Below is something that I care about:
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>    1. For auto-detecting, I wonder how to make the
> > > >> >>>>>>> strategy
> > > >> >>>>>>>>> and
> > > >> >>>>>>>>>>>>> mark a
> > > >> >>>>>>>>>>>>>>>>> node
> > > >> >>>>>>>>>>>>>>>>>    blocked? Sometimes the blocked node is hard to be
> > > >> >>>>>>>>> detected,
> > > >> >>>>>>>>>>> for
> > > >> >>>>>>>>>>>>>>>>> example,
> > > >> >>>>>>>>>>>>>>>>>    the upper node or the down node will be blocked when
> > > >> >>>>>>>>> network
> > > >> >>>>>>>>>>>>>>>>> unreachable.
> > > >> >>>>>>>>>>>>>>>>>    2. I see that the strategy is made in JobMaster
> > > >> >>>>>> side.
> > > >> >>>>>>>> How
> > > >> >>>>>>>>>>> about
> > > >> >>>>>>>>>>>>>>>>>    implementing the similar logic in resource manager?
> > > >> >>>>>> In
> > > >> >>>>>>>>>>> session
> > > >> >>>>>>>>>>>>>> mode,
> > > >> >>>>>>>>>>>>>>>>> multi
> > > >> >>>>>>>>>>>>>>>>>    jobs can fail on the same bad node and the node
> > > >> >>>>>> should
> > > >> >>>>>>>> be
> > > >> >>>>>>>>>>>> marked
> > > >> >>>>>>>>>>>>>>>>> blocked.
> > > >> >>>>>>>>>>>>>>>>>    If the job makes the strategy, the node may be not
> > > >> >>>>>>>> marked
> > > >> >>>>>>>>>>>> blocked
> > > >> >>>>>>>>>>>>>> if
> > > >> >>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>    fail times don't exceed the threshold.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>> Thank you for all your feedback!
> > > >> >>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
> > > >> >>>>>> some
> > > >> >>>>>>> of
> > > >> >>>>>>>>> my
> > > >> >>>>>>>>>>>>>> thoughts:
> > > >> >>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
> > > >> >>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
> > > >> >>>>>>>>>>>>>>>>>> The automatical way should be something built upon
> > > >> >>>>>> the
> > > >> >>>>>>>>>>> blocklist
> > > >> >>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
> > > >> >>>>>>>>>>>> configurable
> > > >> >>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
> > > >> >>>>>> decouple
> > > >> >>>>>>>> it
> > > >> >>>>>>>>> by
> > > >> >>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
> > > >> >>>>>>>> suggested,
> > > >> >>>>>>>>>>> which
> > > >> >>>>>>>>>>>>>> just
> > > >> >>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
> > > >> >>>>>>> detected.
> > > >> >>>>>>>>>>>> However,
> > > >> >>>>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
> > > >> >>>>>> discussions
> > > >> >>>>>>>> and
> > > >> >>>>>>>>>>>>> feedback
> > > >> >>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
> > > >> >>>>>>> users
> > > >> >>>>>>>>>>> have
> > > >> >>>>>>>>>>>>>>> different
> > > >> >>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
> > > >> >>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
> > > >> >>>>>>>> externally
> > > >> >>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
> > > >> >>>>>> way
> > > >> >>>>>>> to
> > > >> >>>>>>>>>>> avoid
> > > >> >>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
> > > >> >>>>>>>> users
> > > >> >>>>>>>>>>> may
> > > >> >>>>>>>>>>>>> just
> > > >> >>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
> > > >> >>>>>>>> kill
> > > >> >>>>>>>>>>> all
> > > >> >>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
> > > >> >>>>>>>>> execution[1]
> > > >> >>>>>>>>>>>>> which
> > > >> >>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
> > > >> >>>>>> mirror
> > > >> >>>>>>>>>>> tasks on
> > > >> >>>>>>>>>>>>>> slow
> > > >> >>>>>>>>>>>>>>>>>> nodes.
> > > >> >>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>> Thanks,
> > > >> >>>>>>>>>>>>>>>>>> Zhu
> > > >> >>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>> [1]
> > > >> >>>>>>>>>>>>>>>>>>
> > > >> >>>>
> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > >> >>>>>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> > > >> >>>>>>>>> 15:56写道：
> > > >> >>>>>>>>>>>>>>>>>>> Hi everyone,
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> Thanks for your feedback.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
> > > >> >>>>>> here
> > > >> >>>>>>>>>>> because
> > > >> >>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>> can
> > > >> >>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
> > > >> >>>>>> mechanism
> > > >> >>>>>>>>>>> (perhaps
> > > >> >>>>>>>>>>>> I
> > > >> >>>>>>>>>>>>>>> should
> > > >> >>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
> > > >> >>>>>>>> FLIP:
> > > >> >>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
> > > >> >>>>>>> node
> > > >> >>>>>>>>> as
> > > >> >>>>>>>>>>>>>> blocked.
> > > >> >>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
> > > >> >>>>>>>> task
> > > >> >>>>>>>>>>>> manager
> > > >> >>>>>>>>>>>>>> or
> > > >> >>>>>>>>>>>>>>>>> node.
> > > >> >>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
> > > >> >>>>>>>> affected.
> > > >> >>>>>>>>> A
> > > >> >>>>>>>>>>>>> typical
> > > >> >>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
> > > >> >>>>>>> In
> > > >> >>>>>>>>> this
> > > >> >>>>>>>>>>>>> case,
> > > >> >>>>>>>>>>>>>> we
> > > >> >>>>>>>>>>>>>>>>> hope
> > > >> >>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
> > > >> >>>>>> the
> > > >> >>>>>>>> hot
> > > >> >>>>>>>>>>>>> machine,
> > > >> >>>>>>>>>>>>>>> but
> > > >> >>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
> > > >> >>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> > > >> >>>>>> task
> > > >> >>>>>>>>>>> manager
> > > >> >>>>>>>>>>>> or
> > > >> >>>>>>>>>>>>>>> node
> > > >> >>>>>>>>>>>>>>>>> as
> > > >> >>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
> > > >> >>>>>> tasks
> > > >> >>>>>>>> will
> > > >> >>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>>>>> restarted on
> > > >> >>>>>>>>>>>>>>>>>> non-blocked task managers.
> > > >> >>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
> > > >> >>>>>>> highlight
> > > >> >>>>>>>>> the
> > > >> >>>>>>>>>>>>> meaning
> > > >> >>>>>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
> > > >> >>>>>> that.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
> > > >> >>>>>> basically
> > > >> >>>>>>>>> agree
> > > >> >>>>>>>>>>>> with
> > > >> >>>>>>>>>>>>>>>>> @Becket
> > > >> >>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
> > > >> >>>>>>>> users’
> > > >> >>>>>>>>>>>>>> deployment
> > > >> >>>>>>>>>>>>>>>>>> environments have a special external system that can
> > > >> >>>>>>>>> perform
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>> anomaly
> > > >> >>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
> > > >> >>>>>>>>>>> auto-detection
> > > >> >>>>>>>>>>>>>>> doesn't
> > > >> >>>>>>>>>>>>>>>>>> require much extra work on top of manual
> > > >> >>>>>> specification.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> @Yangze
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
> > > >> >>>>>>> the
> > > >> >>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
> > > >> >>>>>>>> to
> > > >> >>>>>>>>>>>> users.
> > > >> >>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
> > > >> >>>>>>> separate
> > > >> >>>>>>>>>>>> interface
> > > >> >>>>>>>>>>>>>>>>> (maybe
> > > >> >>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
> > > >> >>>>>>>>>>>>> ResourceManagerBlocklistHandler
> > > >> >>>>>>>>>>>>>>> can
> > > >> >>>>>>>>>>>>>>>>>> implement it in the future.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> @Martijn
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
> > > >> >>>>>> by
> > > >> >>>>>>>>>>> cluster
> > > >> >>>>>>>>>>>>>>>>> operators.
> > > >> >>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
> > > >> >>>>>>>> according
> > > >> >>>>>>>>>>> to
> > > >> >>>>>>>>>>>> my
> > > >> >>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
> > > >> >>>>>>>>> machine
> > > >> >>>>>>>>>>>>>> problems
> > > >> >>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
> > > >> >>>>>>> contact
> > > >> >>>>>>>>>>>> cluster
> > > >> >>>>>>>>>>>>>>>>> operators
> > > >> >>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
> > > >> >>>>>>>>>>> machine is
> > > >> >>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
> > > >> >>>>>>> people
> > > >> >>>>>>>>> who
> > > >> >>>>>>>>>>>>> really
> > > >> >>>>>>>>>>>>>>> need
> > > >> >>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
> > > >> >>>>>>>> performed
> > > >> >>>>>>>>> by
> > > >> >>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> cluster
> > > >> >>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> @Chesnay
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
> > > >> >>>>>>> for
> > > >> >>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
> > > >> >>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
> > > >> >>>>>> when
> > > >> >>>>>>>> the
> > > >> >>>>>>>>>>> task
> > > >> >>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> > > >> >>>>>>>> release
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>> slot
> > > >> >>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
> > > >> >>>>>>> on
> > > >> >>>>>>>>> it.
> > > >> >>>>>>>>>>> If
> > > >> >>>>>>>>>>>> we
> > > >> >>>>>>>>>>>>>>> only
> > > >> >>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
> > > >> >>>>>> needs
> > > >> >>>>>>> to
> > > >> >>>>>>>>>>>> retrieve
> > > >> >>>>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>>> by
> > > >> >>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
> > > >> >>>>>>>> relatively
> > > >> >>>>>>>>>>>> large,
> > > >> >>>>>>>>>>>>>> so
> > > >> >>>>>>>>>>>>>>> I
> > > >> >>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
> > > >> >>>>>> information
> > > >> >>>>>>> on
> > > >> >>>>>>>>>>> the JM
> > > >> >>>>>>>>>>>>>> side
> > > >> >>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>> syncing them.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> @Роман
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>     a) “Probably storing inside Zookeeper/Configmap
> > > >> >>>>>>>> might
> > > >> >>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>> helpful
> > > >> >>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
> > > >> >>>>>>>>>>> understand
> > > >> >>>>>>>>>>>>> that.
> > > >> >>>>>>>>>>>>>>> In
> > > >> >>>>>>>>>>>>>>>>> my
> > > >> >>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
> > > >> >>>>>>>> special
> > > >> >>>>>>>>>>>>>> treatment
> > > >> >>>>>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>> required.
> > > >> >>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
> > > >> >>>>>> sense,
> > > >> >>>>>>> I
> > > >> >>>>>>>>> will
> > > >> >>>>>>>>>>>> add
> > > >> >>>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> FLIP.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> @Yang
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
> > > >> >>>>>> cannot
> > > >> >>>>>>>>>>> support
> > > >> >>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> Lijie
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
> > > >> >>>>>>>> 21:09写道：
> > > >> >>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
> > > >> >>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
> > > >> >>>>>>>>>>> operations.
> > > >> >>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
> > > >> >>>>>>>>>>> Reliability
> > > >> >>>>>>>>>>>>>>> Engineer)
> > > >> >>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
> > > >> >>>>>> nodes,
> > > >> >>>>>>>>> which
> > > >> >>>>>>>>>>>>> could
> > > >> >>>>>>>>>>>>>>> take
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
> > > >> >>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
> > > >> >>>>>> taint
> > > >> >>>>>>>> in
> > > >> >>>>>>>>>>> K8s
> > > >> >>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>> prevent
> > > >> >>>>>>>>>>>>>>>>>> new
> > > >> >>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
> > > >> >>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
> > > >> >>>>>> pods
> > > >> >>>>>>>>>>> running
> > > >> >>>>>>>>>>>> on
> > > >> >>>>>>>>>>>>>>> these
> > > >> >>>>>>>>>>>>>>>>>> nodes.
> > > >> >>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
> > > >> >>>>>>> So
> > > >> >>>>>>>> I
> > > >> >>>>>>>>> am
> > > >> >>>>>>>>>>>>> afraid
> > > >> >>>>>>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
> > > >> >>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> > > >> >>>>>>>> could
> > > >> >>>>>>>>>>> not
> > > >> >>>>>>>>>>>>> see
> > > >> >>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> > > >> >>>>>>>>>>>> approach(via
> > > >> >>>>>>>>>>>>>>> *yarn
> > > >> >>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> > > >> >>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
> > > >> >>>>>>> which
> > > >> >>>>>>>>>>> could
> > > >> >>>>>>>>>>>>>> expose
> > > >> >>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
> > > >> >>>>>>> them
> > > >> >>>>>>>> if
> > > >> >>>>>>>>>>>>> enabled
> > > >> >>>>>>>>>>>>>>>>>> explicitly.
> > > >> >>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>>>>>>>> Yang
> > > >> >>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> > > >> >>>>>>>> 16:36写道：
> > > >> >>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
> > > >> >>>>>> and
> > > >> >>>>>>>>>>> somewhat
> > > >> >>>>>>>>>>>>>>> related
> > > >> >>>>>>>>>>>>>>>>>> to the
> > > >> >>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
> > > >> >>>>>>>>> Flink.
> > > >> >>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
> > > >> >>>>>>>> FLIP.
> > > >> >>>>>>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
> > > >> >>>>>>>>>>>>>>>>>>>>>      b) Collect the information of the unstable
> > > >> >>>>>>> node
> > > >> >>>>>>>>> and
> > > >> >>>>>>>>>>>>> form a
> > > >> >>>>>>>>>>>>>>>>>> blocklist.
> > > >> >>>>>>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> My two cents:
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
> > > >> >>>>>>> should
> > > >> >>>>>>>>> have
> > > >> >>>>>>>>>>>> c).
> > > >> >>>>>>>>>>>>> It
> > > >> >>>>>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>> not only
> > > >> >>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
> > > >> >>>>>> handy
> > > >> >>>>>>>> for
> > > >> >>>>>>>>>>> some
> > > >> >>>>>>>>>>>>>>> planned
> > > >> >>>>>>>>>>>>>>>>>>>>> maintenance.
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
> > > >> >>>>>> who
> > > >> >>>>>>>>>>> should be
> > > >> >>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> brain
> > > >> >>>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
> > > >> >>>>>>>>> largely
> > > >> >>>>>>>>>>>>> depends
> > > >> >>>>>>>>>>>>>>> on
> > > >> >>>>>>>>>>>>>>>>>> who we
> > > >> >>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
> > > >> >>>>>>>> different
> > > >> >>>>>>>>>>>>> answers.
> > > >> >>>>>>>>>>>>>>> For
> > > >> >>>>>>>>>>>>>>>>>> people
> > > >> >>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
> > > >> >>>>>>>>>>> service,
> > > >> >>>>>>>>>>>> let
> > > >> >>>>>>>>>>>>>>> Flink
> > > >> >>>>>>>>>>>>>>>>>> do just
> > > >> >>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
> > > >> >>>>>>>> Flink
> > > >> >>>>>>>>>>> would
> > > >> >>>>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>>> one
> > > >> >>>>>>>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
> > > >> >>>>>> it
> > > >> >>>>>>> to
> > > >> >>>>>>>>>>> that
> > > >> >>>>>>>>>>>>>>> service,
> > > >> >>>>>>>>>>>>>>>>>> and then
> > > >> >>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
> > > >> >>>>>>>>>>> problematic
> > > >> >>>>>>>>>>>>>>> nodes.
> > > >> >>>>>>>>>>>>>>>>> On
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
> > > >> >>>>>>>> service,
> > > >> >>>>>>>>>>>> simply
> > > >> >>>>>>>>>>>>>>>>> letting
> > > >> >>>>>>>>>>>>>>>>>> Flink
> > > >> >>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
> > > >> >>>>>> nodes
> > > >> >>>>>>>>> might
> > > >> >>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>>> desired
> > > >> >>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
> > > >> >>>>>>>> pluggable /
> > > >> >>>>>>>>>>>>>> optional.
> > > >> >>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
> > > >> >>>>>>> have
> > > >> >>>>>>>>>>>> something
> > > >> >>>>>>>>>>>>>>>>>> pluggable
> > > >> >>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
> > > >> >>>>>> unstable
> > > >> >>>>>>>>> nodes
> > > >> >>>>>>>>>>>>>>> actively.
> > > >> >>>>>>>>>>>>>>>>> (A
> > > >> >>>>>>>>>>>>>>>>>> more
> > > >> >>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
> > > >> >>>>>>> which
> > > >> >>>>>>>>>>> can be
> > > >> >>>>>>>>>>>>>> used
> > > >> >>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> report
> > > >> >>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
> > > >> >>>>>>> the
> > > >> >>>>>>>>>>> scope
> > > >> >>>>>>>>>>>>>>> relevant
> > > >> >>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
> > > >> >>>>>> OK
> > > >> >>>>>>> to
> > > >> >>>>>>>>>>> have a
> > > >> >>>>>>>>>>>>>>> default
> > > >> >>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
> > > >> >>>>>>> Flink
> > > >> >>>>>>>> to
> > > >> >>>>>>>>>>> take
> > > >> >>>>>>>>>>>>>>> action
> > > >> >>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> block
> > > >> >>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
> > > >> >>>>>>>>> timeout.
> > > >> >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > > >> >>>>>>>>>>>>>> ro.v.boyko@gmail.com
> > > >> >>>>>>>>>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
> > > >> >>>>>>>>>>> development.
> > > >> >>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
> > > >> >>>>>>> we
> > > >> >>>>>>>>>>> should
> > > >> >>>>>>>>>>>>>>> consider
> > > >> >>>>>>>>>>>>>>>>>> to move
> > > >> >>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
> > > >> >>>>>> nodes/TMs
> > > >> >>>>>>>>> from
> > > >> >>>>>>>>>>>>> active
> > > >> >>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
> > > >> >>>>>>>> storing
> > > >> >>>>>>>>>>>> inside
> > > >> >>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
> > > >> >>>>>>>> organizations
> > > >> >>>>>>>>>>>> don't
> > > >> >>>>>>>>>>>>>> want
> > > >> >>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> expose
> > > >> >>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
> > > >> >>>>>>>> it's
> > > >> >>>>>>>>>>>>> necessary
> > > >> >>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> have the
> > > >> >>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
> > > >> >>>>>>> for
> > > >> >>>>>>>>>>>> avoiding
> > > >> >>>>>>>>>>>>>>>>> incorrect
> > > >> >>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
> > > >> >>>>>>>> would
> > > >> >>>>>>>>> be
> > > >> >>>>>>>>>>>>> better
> > > >> >>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> extend
> > > >> >>>>>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
> > > >> >>>>>> *endTimestamp*
> > > >> >>>>>>>>> field
> > > >> >>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>> fill
> > > >> >>>>>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>>>>>> at the
> > > >> >>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
> > > >> >>>>>>> to:
> > > >> >>>>>>>>>>>>>>>>>>>>>>    -
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>    Provide the ability to users to setup the
> > > >> >>>>>>> exact
> > > >> >>>>>>>>>>> time
> > > >> >>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>> blocklist end
> > > >> >>>>>>>>>>>>>>>>>>>>>>    through RestAPI
> > > >> >>>>>>>>>>>>>>>>>>>>>>    -
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>    Not being tied to a single value of
> > > >> >>>>>>>>>>>>>>>>>>>>>>    *cluster.resource-blacklist.item.timeout*
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> > > >> >>>>>> <
> > > >> >>>>>>>>>>>>>>>>> chesnay@apache.org>
> > > >> >>>>>>>>>>>>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
> > > >> >>>>>>>> lines
> > > >> >>>>>>>>> a
> > > >> >>>>>>>>>>>> bit.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
> > > >> >>>>>>>>> auto-detection
> > > >> >>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>> only
> > > >> >>>>>>>>>>>>>>>>>> have an
> > > >> >>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
> > > >> >>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
> > > >> >>>>>> this
> > > >> >>>>>>>>> sounds
> > > >> >>>>>>>>>>>> yet
> > > >> >>>>>>>>>>>>>>> again
> > > >> >>>>>>>>>>>>>>>>>> like one
> > > >> >>>>>>>>>>>>>>>>>>>>>>> of those
> > > >> >>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
> > > >> >>>>>> just
> > > >> >>>>>>>>> right.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
> > > >> >>>>>>>>> information
> > > >> >>>>>>>>>>>>> after
> > > >> >>>>>>>>>>>>>>> all.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
> > > >> >>>>>>>> complexity
> > > >> >>>>>>>>>>> of
> > > >> >>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>> proposal.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
> > > >> >>>>>>> between
> > > >> >>>>>>>>>>>>>> components;
> > > >> >>>>>>>>>>>>>>>>>>>>> adjustments
> > > >> >>>>>>>>>>>>>>>>>>>>>>> to the
> > > >> >>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
> > > >> >>>>>>> resource
> > > >> >>>>>>>>>>>>> management
> > > >> >>>>>>>>>>>>>>>>>> perspective,
> > > >> >>>>>>>>>>>>>>>>>>>>>>> then there
> > > >> >>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
> > > >> >>>>>> protocol
> > > >> >>>>>>>> at
> > > >> >>>>>>>>>>> all
> > > >> >>>>>>>>>>>> (or
> > > >> >>>>>>>>>>>>>> in
> > > >> >>>>>>>>>>>>>>>>> fact
> > > >> >>>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
> > > >> >>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
> > > >> >>>>>>> around
> > > >> >>>>>>>>>>> blocked
> > > >> >>>>>>>>>>>>> TMs
> > > >> >>>>>>>>>>>>>>> in
> > > >> >>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>> first
> > > >> >>>>>>>>>>>>>>>>>>>>>>> place.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
> > > >> >>>>>>> down
> > > >> >>>>>>>>>>> after
> > > >> >>>>>>>>>>>> a
> > > >> >>>>>>>>>>>>>>> while
> > > >> >>>>>>>>>>>>>>>>> by
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>> RM
> > > >> >>>>>>>>>>>>>>>>>>>>>>> because of
> > > >> >>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
> > > >> >>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
> > > >> >>>>>>>> blocking
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>> process
> > > >> >>>>>>>>>>>>>>>>> or
> > > >> >>>>>>>>>>>>>>>>>> node)
> > > >> >>>>>>>>>>>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
> > > >> >>>>>>>>>>> process(es).
> > > >> >>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
> > > >> >>>>>>>> back
> > > >> >>>>>>>>>>> up.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
> > > >> >>>>>> the
> > > >> >>>>>>>> same
> > > >> >>>>>>>>>>> line
> > > >> >>>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>> thinking to
> > > >> >>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
> > > >> >>>>>> process/node
> > > >> >>>>>>>>>>> manually
> > > >> >>>>>>>>>>>>>>> should
> > > >> >>>>>>>>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>>>>>> a
> > > >> >>>>>>>>>>>>>>>>>>>>> core
> > > >> >>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
> > > >> >>>>>>>> understand
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>> problem
> > > >> >>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>> I see
> > > >> >>>>>>>>>>>>>>>>>>>>>>> value
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
> > > >> >>>>>>> blocklisting. I
> > > >> >>>>>>>>> do
> > > >> >>>>>>>>>>>> have
> > > >> >>>>>>>>>>>>>> some
> > > >> >>>>>>>>>>>>>>>>>> concerns
> > > >> >>>>>>>>>>>>>>>>>>>>>> with
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
> > > >> >>>>>>> blocked
> > > >> >>>>>>>>>>>>> resources.
> > > >> >>>>>>>>>>>>>> I
> > > >> >>>>>>>>>>>>>>>>> have
> > > >> >>>>>>>>>>>>>>>>>> two
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> concerns;
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
> > > >> >>>>>>>>> separation
> > > >> >>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>> concerns,
> > > >> >>>>>>>>>>>>>>>>>>>>> meaning
> > > >> >>>>>>>>>>>>>>>>>>>>>>> that
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
> > > >> >>>>>>>> managing a
> > > >> >>>>>>>>>>>>> cluster
> > > >> >>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>> there's
> > > >> >>>>>>>>>>>>>>>>>>>>> a
> > > >> >>>>>>>>>>>>>>>>>>>>>>> user
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
> > > >> >>>>>>>>>>> introduction of
> > > >> >>>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>> mechanism,
> > > >> >>>>>>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
> > > >> >>>>>>>>> responsibility
> > > >> >>>>>>>>>>> of
> > > >> >>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> first
> > > >> >>>>>>>>>>>>>>>>>> group.
> > > >> >>>>>>>>>>>>>>>>>>>>>> So
> > > >> >>>>>>>>>>>>>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
> > > >> >>>>>>>> group
> > > >> >>>>>>>>>>>> blocks
> > > >> >>>>>>>>>>>>>>>>>> something,
> > > >> >>>>>>>>>>>>>>>>>>>>>> which
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
> > > >> >>>>>>> paging
> > > >> >>>>>>>>>>>>> mechanism
> > > >> >>>>>>>>>>>>>>>>>> triggering
> > > >> >>>>>>>>>>>>>>>>>>>>>> etc)
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
> > > >> >>>>>> go
> > > >> >>>>>>>>>>> through
> > > >> >>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> process
> > > >> >>>>>>>>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>>>>>>> manually
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
> > > >> >>>>>> it
> > > >> >>>>>>>>>>> should
> > > >> >>>>>>>>>>>>> be? I
> > > >> >>>>>>>>>>>>>>> do
> > > >> >>>>>>>>>>>>>>>>>> think
> > > >> >>>>>>>>>>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
> > > >> >>>>>>> make
> > > >> >>>>>>>>>>> sense
> > > >> >>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>> introduce
> > > >> >>>>>>>>>>>>>>>>>>>>> such
> > > >> >>>>>>>>>>>>>>>>>>>>>> a
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
> > > >> >>>>>>> really
> > > >> >>>>>>>>>>> small
> > > >> >>>>>>>>>>>>> user
> > > >> >>>>>>>>>>>>>>>>> group
> > > >> >>>>>>>>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>>>>>> Flink?
> > > >> >>>>>>>>>>>>>>>>>>>>>>> We
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
> > > >> >>>>>>> such
> > > >> >>>>>>>> a
> > > >> >>>>>>>>>>>>> feature.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
> > > >> >>>>>>> I'm
> > > >> >>>>>>>>>>>> leaning
> > > >> >>>>>>>>>>>>>>>>> towards
> > > >> >>>>>>>>>>>>>>>>>> not
> > > >> >>>>>>>>>>>>>>>>>>>>>>> exposing
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
> > > >> >>>>>>>> available
> > > >> >>>>>>>>>>>>> strictly
> > > >> >>>>>>>>>>>>>>> for
> > > >> >>>>>>>>>>>>>>>>>> cluster
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
> > > >> >>>>>> their
> > > >> >>>>>>>>>>>>>>>>>> paging/metrics/logging
> > > >> >>>>>>>>>>>>>>>>>>>>>>> system
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > >> >>>>>>>>>>>>>>> karmagyz@gmail.com
> > > >> >>>>>>>>>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
> > > >> >>>>>>> some
> > > >> >>>>>>>>>>> cents
> > > >> >>>>>>>>>>>>> here:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
> > > >> >>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>> user?
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
> > > >> >>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
> > > >> >>>>>>>>>>>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
> > > >> >>>>>>> the
> > > >> >>>>>>>>>>>> timeout
> > > >> >>>>>>>>>>>>>>>>>> mechanism is
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
> > > >> >>>>>> it
> > > >> >>>>>>>>> will
> > > >> >>>>>>>>>>> be
> > > >> >>>>>>>>>>>>> very
> > > >> >>>>>>>>>>>>>>>>>> confusing
> > > >> >>>>>>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
> > > >> >>>>>> exception
> > > >> >>>>>>>> of a
> > > >> >>>>>>>>>>> task
> > > >> >>>>>>>>>>>>>>>>> manager to
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
> > > >> >>>>>> fail
> > > >> >>>>>>>> in
> > > >> >>>>>>>>>>> case
> > > >> >>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> target
> > > >> >>>>>>>>>>>>>>>>>> task
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
> > > >> >>>>>> I
> > > >> >>>>>>>>> don't
> > > >> >>>>>>>>>>>> mean
> > > >> >>>>>>>>>>>>> we
> > > >> >>>>>>>>>>>>>>>>> need
> > > >> >>>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>>>>> cover
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
> > > >> >>>>>> also
> > > >> >>>>>>>>> open a
> > > >> >>>>>>>>>>>>>>>>>> `notifyException`
> > > >> >>>>>>>>>>>>>>>>>>>>> in
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
> > > >> >>>>>>>>>>> ResourceManager,
> > > >> >>>>>>>>>>>>> will
> > > >> >>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>> slot of
> > > >> >>>>>>>>>>>>>>>>>>>>> a
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
> > > >> >>>>>>> released
> > > >> >>>>>>>>> and
> > > >> >>>>>>>>>>>>>>> allocated?
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> > > >> >>>>>> Wang
> > > >> >>>>>>> <
> > > >> >>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
> > > >> >>>>>> response
> > > >> >>>>>>>>> your 4
> > > >> >>>>>>>>>>>>>> remarks:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
> > > >> >>>>>>>>> controversy. I
> > > >> >>>>>>>>>>>>> think
> > > >> >>>>>>>>>>>>>>>>>> “BlockList”
> > > >> >>>>>>>>>>>>>>>>>>>>> is
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> good
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
> > > >> >>>>>>>> good
> > > >> >>>>>>>>>>> idea.
> > > >> >>>>>>>>>>>>>> Based
> > > >> >>>>>>>>>>>>>>>>> on
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>>> above, I
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > > >> >>>>>> <host>/blocklist/node/<identifier>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > > >> >>>>>>>>> <host>/blocklist/taskmanager/<identifier>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
> > > >> >>>>>>>> means
> > > >> >>>>>>>>>>> that
> > > >> >>>>>>>>>>>>> all
> > > >> >>>>>>>>>>>>>>>>> task
> > > >> >>>>>>>>>>>>>>>>>>>>> managers
> > > >> >>>>>>>>>>>>>>>>>>>>>>> on
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
> > > >> >>>>>>>> these
> > > >> >>>>>>>>>>> TMs
> > > >> >>>>>>>>>>>> are
> > > >> >>>>>>>>>>>>>> not
> > > >> >>>>>>>>>>>>>>>>>>>>> available.
> > > >> >>>>>>>>>>>>>>>>>>>>>>> This
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
> > > >> >>>>>> these
> > > >> >>>>>>>> TMs
> > > >> >>>>>>>>>>> are
> > > >> >>>>>>>>>>>>> not
> > > >> >>>>>>>>>>>>>>>>> really
> > > >> >>>>>>>>>>>>>>>>>> lost,
> > > >> >>>>>>>>>>>>>>>>>>>>>>> they
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
> > > >> >>>>>> are
> > > >> >>>>>>>>> still
> > > >> >>>>>>>>>>>>>>> registered
> > > >> >>>>>>>>>>>>>>>>>> in this
> > > >> >>>>>>>>>>>>>>>>>>>>>>> flink
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
> > > >> >>>>>> once
> > > >> >>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> corresponding
> > > >> >>>>>>>>>>>>>>>>>>>>>> blocklist
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> item
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
> > > >> >>>>>>>>>>>>>>> active/non-active
> > > >> >>>>>>>>>>>>>>>>>>>>> clusters.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
> > > >> >>>>>> TMs
> > > >> >>>>>>>> may
> > > >> >>>>>>>>> be
> > > >> >>>>>>>>>>>>>> released
> > > >> >>>>>>>>>>>>>>>>> due
> > > >> >>>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>>>>> idle
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
> > > >> >>>>>> keep
> > > >> >>>>>>>> it.
> > > >> >>>>>>>>>>> The
> > > >> >>>>>>>>>>>>>> reasons
> > > >> >>>>>>>>>>>>>>>>> are
> > > >> >>>>>>>>>>>>>>>>>> as
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> following:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
> > > >> >>>>>>> adding
> > > >> >>>>>>>> or
> > > >> >>>>>>>>>>>>> removing
> > > >> >>>>>>>>>>>>>>>>> items
> > > >> >>>>>>>>>>>>>>>>>> via
> > > >> >>>>>>>>>>>>>>>>>>>>>> REST
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> API,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
> > > >> >>>>>> it
> > > >> >>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>> Long.MAX_VALUE .
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
> > > >> >>>>>> a
> > > >> >>>>>>>>>>> period of
> > > >> >>>>>>>>>>>>>> time
> > > >> >>>>>>>>>>>>>>>>>> (such as
> > > >> >>>>>>>>>>>>>>>>>>>>>>> machine
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
> > > >> >>>>>> prefer
> > > >> >>>>>>>> that
> > > >> >>>>>>>>>>>> Flink
> > > >> >>>>>>>>>>>>>> can
> > > >> >>>>>>>>>>>>>>> do
> > > >> >>>>>>>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
> > > >> >>>>>>> user
> > > >> >>>>>>>> to
> > > >> >>>>>>>>>>> do
> > > >> >>>>>>>>>>>> it
> > > >> >>>>>>>>>>>>>>>>> manually.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
> > > >> >>>>>>>>>>> 于2022年4月27日周三
> > > >> >>>>>>>>>>>>>>>>> 19:23写道：
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
> > > >> >>>>>> only
> > > >> >>>>>>>>>>> support
> > > >> >>>>>>>>>>>>>>> manually
> > > >> >>>>>>>>>>>>>>>>>> blocking
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
> > > >> >>>>>>> different
> > > >> >>>>>>>>>>>>> strategies
> > > >> >>>>>>>>>>>>>>> can
> > > >> >>>>>>>>>>>>>>>>>> also be
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
> > > >> >>>>>>>> "bla.cklist"
> > > >> >>>>>>>>>>> due
> > > >> >>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>> controversy
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> around
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
> > > >> >>>>>>> Ticket
> > > >> >>>>>>>>>>> about
> > > >> >>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>> topic a
> > > >> >>>>>>>>>>>>>>>>>>>>> while
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> back
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
> > > >> >>>>>>>> avoid
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>> term
> > > >> >>>>>>>>>>>>>>>>>> blacklist &
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> whitelist
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
> > > >> >>>>>> "denylist"
> > > >> >>>>>>> or
> > > >> >>>>>>>>>>>>>>> "quarantined"
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
> > > >> >>>>>>> slightly
> > > >> >>>>>>>>>>>> different
> > > >> >>>>>>>>>>>>>>>>> design
> > > >> >>>>>>>>>>>>>>>>>> as
> > > >> >>>>>>>>>>>>>>>>>>>>> verbs
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> like
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
> > > >> >>>>>>> anti-pattern
> > > >> >>>>>>>>> for
> > > >> >>>>>>>>>>>> REST
> > > >> >>>>>>>>>>>>>>> APIs.
> > > >> >>>>>>>>>>>>>>>>>> POST
> > > >> >>>>>>>>>>>>>>>>>>>>> on a
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> list
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
> > > >> >>>>>>>> items.
> > > >> >>>>>>>>>>>> DELETE
> > > >> >>>>>>>>>>>>> on
> > > >> >>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>> individual
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > > >> >>>>>>>>> <host>/quarantine/items/<itemidentifier>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
> > > >> >>>>>>>>> taskmanagers
> > > >> >>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>> nodes
> > > >> >>>>>>>>>>>>>>>>> in
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>> REST
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> API
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
> > > >> >>>>>>> opinion
> > > >> >>>>>>>>> on
> > > >> >>>>>>>>>>>> this?
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > > >> >>>>>>> <host>/quarantine/nodes/<identifier>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > > >> >>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
> > > >> >>>>>>>>>>> non-active
> > > >> >>>>>>>>>>>>>>>>> resource
> > > >> >>>>>>>>>>>>>>>>>>>>>> managers,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> i.e.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
> > > >> >>>>>>>>>>> minimal,
> > > >> >>>>>>>>>>>> do
> > > >> >>>>>>>>>>>>> we
> > > >> >>>>>>>>>>>>>>>>> need
> > > >> >>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>>> timeout
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
> > > >> >>>>>>>> manually
> > > >> >>>>>>>>>>> we
> > > >> >>>>>>>>>>>>> could
> > > >> >>>>>>>>>>>>>>>>>> delegate
> > > >> >>>>>>>>>>>>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> to the
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
> > > >> >>>>>>>>> behavior
> > > >> >>>>>>>>>>>>> would
> > > >> >>>>>>>>>>>>>>>>> better
> > > >> >>>>>>>>>>>>>>>>>> fit
> > > >> >>>>>>>>>>>>>>>>>>>>>> into
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>
> > > >> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
> > > >> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> > > >> >>>>>>> schrieb
> > > >> >>>>>>>>>>> Lijie
> > > >> >>>>>>>>>>>>> Wang
> > > >> >>>>>>>>>>>>>> <
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
> > > >> >>>>>>>> cluster
> > > >> >>>>>>>>>>> node
> > > >> >>>>>>>>>>>>>>> issues
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> disk
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
> > > >> >>>>>>>> abnormalities).
> > > >> >>>>>>>>>>>> Flink
> > > >> >>>>>>>>>>>>>> will
> > > >> >>>>>>>>>>>>>>>>>> take care
> > > >> >>>>>>>>>>>>>>>>>>>>>> of
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
> > > >> >>>>>>> However,
> > > >> >>>>>>>>> due
> > > >> >>>>>>>>>>> to
> > > >> >>>>>>>>>>>>> data
> > > >> >>>>>>>>>>>>>>>>>> locality
> > > >> >>>>>>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> limited
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
> > > >> >>>>>> likely
> > > >> >>>>>>>> to
> > > >> >>>>>>>>> be
> > > >> >>>>>>>>>>>>>>> redeployed
> > > >> >>>>>>>>>>>>>>>>>> to the
> > > >> >>>>>>>>>>>>>>>>>>>>>> same
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
> > > >> >>>>>>>> task
> > > >> >>>>>>>>>>>>>>> abnormalities
> > > >> >>>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>>>>>> affect
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> job
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
> > > >> >>>>>> manually
> > > >> >>>>>>>>>>> identify
> > > >> >>>>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>>> problematic
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> node and
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
> > > >> >>>>>>> But
> > > >> >>>>>>>>>>> this
> > > >> >>>>>>>>>>>>>>> approach
> > > >> >>>>>>>>>>>>>>>>> has
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> following
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
> > > >> >>>>>> heavy
> > > >> >>>>>>>>>>> process.
> > > >> >>>>>>>>>>>>>> Users
> > > >> >>>>>>>>>>>>>>>>> may
> > > >> >>>>>>>>>>>>>>>>>> need
> > > >> >>>>>>>>>>>>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> contact
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
> > > >> >>>>>>>>> operation
> > > >> >>>>>>>>>>> can
> > > >> >>>>>>>>>>>>>> even
> > > >> >>>>>>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>>>>>>>>> dangerous
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> not
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
> > > >> >>>>>>>>> events.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
> > > >> >>>>>> of
> > > >> >>>>>>>>>>> problems
> > > >> >>>>>>>>>>>>>>> manually
> > > >> >>>>>>>>>>>>>>>>>> would
> > > >> >>>>>>>>>>>>>>>>>>>>> be
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>> slow
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> and
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
> > > >> >>>>>>>> propose
> > > >> >>>>>>>>>>> to
> > > >> >>>>>>>>>>>>>>>>> introduce a
> > > >> >>>>>>>>>>>>>>>>>>>>>> blacklist
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
> > > >> >>>>>>>>> problematic
> > > >> >>>>>>>>>>>>>>> resources.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
> > > >> >>>>>>> FLIP-224[1].
> > > >> >>>>>>>>>>> Looking
> > > >> >>>>>>>>>>>>>>> forward
> > > >> >>>>>>>>>>>>>>>>>> to your
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>
> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> > > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>> --
> > > >> >>>>>>>>>>>>> Best regards,
> > > >> >>>>>>>>>>>>> Roman Boyko
> > > >> >>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> --
> > > >> >>>>>>>> https://twitter.com/snntrable
> > > >> >>>>>>>> https://github.com/knaufk
> > > >> >>>>>>>>
> > > >> >>>>>>
> > > >> >>>>>> --
> > > >> >>>>>> https://twitter.com/snntrable
> > > >> >>>>>> https://github.com/knaufk
> > > >> >>>>>>
> > > >> >>>>
> > > >>
> > > >>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.

Hi Chesnay,

Would you please take a look at the FLIP and discussion to see if all
your concerns have been addressed?

Thanks,
Zhu

Zhu Zhu <re...@gmail.com> 于2022年5月28日周六 13:26写道：
>
> Regarding the concern of the SlotManager, my two cents:
> 1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
>   a. In standalone mode, a taskmanager may be temporarily added to
> the blocklist. We do not want the TM to get disconnected and shut down.
> So we need to keep its connection to RM and keep hosting its slots.
>   b. When we want to avoid allocating slots to a slow nodes but do not
> want to kill current running tasks on the nodes (MARK_BLOCKED mode).
>
> There is possible a way to keep the connection of a blocked task manager
> while hide its slots from SlotManager, but I feel it may be even much more
> complicated.
>
> 2. It will not complicate the SlotManager too much. The SlotManager will
> be offered a BlockedTaskManagerChecker when created, and just need
> to use it to filter out blocked slots on slot request. Therefore I think the
> complication is acceptable.
>
> Thanks,
> Zhu
>
> Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
> >
> > Hi everyone,
> >
> > I've updated the FLIP according to Chesnay's feedback, changes as follows:
> > 1. Change the GET result to a map.
> > 2. Only left *endTimestamp* in ADD operation, and change the rest (from
> > POST) to PUT
> > 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
> > encapsulate blocklist related functions.
> > 4. Remove *mainThread* from BlocklistTracker, instead provide a
> > *removeTimeoutItems* method to be called by outside components。
> >
> > Best,
> > Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
> >
> > > Hi Chesnay,
> > >
> > > Thanks for feedback.
> > >
> > > 1. Regarding the TM/Node id. Do you mean special characters may appear in
> > > the rest URL?  Actually, I don't think so. The task manager id in REST API
> > > should be the *ResourceID* of taskmanager in Flink, there should be no
> > > special characters, and some existing REST APIs are already using it, e.g.
> > > GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
> > > id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
> > > should also have no special characters.
> > > 2. Regarding the GET query responses. I agree with you, it makes sense to
> > > change the GET result to a map.
> > >
> > > 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> > > cover everything, and the endTimestamp is a unix timestamp, there should be
> > > no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
> > > add rest api is add or update, PUT can cover this semantics.
> > >
> > > 4. Regarding the slot pool/manager. I don't think the current slotpool
> > > and slotmanager are able to support the MARK_BLOCKED(slots that are
> > > already allocated will not be affected) action. The reasons are as
> > > follows：
> > >
> > > a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
> > > from reserved(task assigned) to free(no task assigned), it is necessary to
> > > check whether the slot should be released immediately(it should be released
> > > immediately if the task manager is blocked, otherwise it may be allocated
> > > to other tasks). I think it cannot be supported without being aware of
> > > the blocklist information. Compared to the solution in FLIP, a more
> > > appropriate/prefered way may be: Introduce a new slot pool
> > > implementation for blocklist(may be named BlocklistSlotPool, it
> > > extends/wrapps the original slot pool), and implement the parts that need
> > > to be aware of the blocklist in this newly introduced slot pool, and the
> > > original slot pool basically does not need to change.
> > >
> > > b) for slot manager, with the MARK_BLOCKED action, there may be free but
> > > blocked slots in slot manager (the corresponding TMs cannot be
> > > released/unregistered because there are still running tasks on them).
> > > Therefore, we need to filter out the blocked slots when trying to fulfill
> > > the slot requirements. Therefore it also needs to know the blocklist information.
> > > A better way may be to abstract a resource allocation strategy, and make
> > > the blocklist as a special implementation, then pass the resource
> > > allocation strategy in when constructing the slot manager. Unfortunately,
> > > the data structures in the two existing slot manager
> > > implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
> > > quite different, it is not easy to abstract a common resource allocation
> > > strategy, so we prefer to keep the current way(i.e. pass the blocklist
> > > information directly into slot manager).
> > >
> > >
> > > 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
> > > does not need to be aware of the executor, and the timeout actions can be
> > > done outside.
> > >
> > > Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
> > >
> > >> I have a number of concerns:
> > >>
> > >> Is the id used for deleting an item the same sent in the initial request
> > >> (and not one returned by Flink)?
> > >> I'm very concerned that the tm/node id can contain special characters.
> > >>
> > >> The GET query should return a map, not a list of items. This makes it
> > >> easier to work with.
> > >>
> > >> The duality of endTimestamp and duration is also concerning.
> > >> If we conclude that endTimestamps can in fact work (and aren't utterly
> > >> unusable due to timezones),
> > >> then this should be able to cover everything and rid us of some
> > >> complexity w.r.t. POSTs to the same ID.
> > >> Additions would be a PUT, changes a PATCH, deletes a DELETE.
> > >>
> > >>
> > >> I also dislike how we're pushing more functionality into the
> > >> slotpool/-manager.
> > >> These components are complex enough as-is, and instead I'd propose a
> > >> separate component that interacts with the SlotPool/-Manager instead,
> > >> for example by removing the slots from that TM.
> > >> The reason being that from the slot-pool perspective it is irrelevant
> > >> whether a slot is gone because the TM was lost, or because it was blocked.
> > >>
> > >>
> > >> The BlocklistTracker shouldn't be modeled as component that is aware of
> > >> the concept of main threads.
> > >> It really has no business knowing that; all it needs is an executor for
> > >> handling timeouts/periodic actions,
> > >> and a way to interact with the JM/RM (which internally can handle the
> > >> scheduling into the main thread).
> > >>
> > >>
> > >> On 20/05/2022 07:20, Lijie Wang wrote:
> > >> > Hi everyone,
> > >> >
> > >> > I have started a vote for this FLIP [1]. Please cast your vote there or
> > >> ask
> > >> > additional questions here. [1]
> > >> > https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> > >> >
> > >> > Best, Lijie
> > >> >
> > >> > Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
> > >> >
> > >> >> Hi Konstantin,
> > >> >>
> > >> >> We found that Flink REST URL does not support the format ":merge" ,
> > >> which
> > >> >> will be recognized as a parameter in the URL(due to start with a
> > >> colon).
> > >> >>
> > >> >> We will keep the previous way, i.e.
> > >> >>
> > >> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> > >> >> and the "id" and "merge" flag are put into the request body.
> > >> >>
> > >> >> Best,
> > >> >> Lijie
> > >> >>
> > >> >> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
> > >> >>
> > >> >>> Hi Weihua,
> > >> >>> thanks for feedback.
> > >> >>>
> > >> >>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> > >> >>> step towards auto-detection.
> > >> >>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> > >> >>> exception of insufficient resources.
> > >> >>> 3. No. This FLIP won't change the WebUI. The blocklist information
> > >> can be
> > >> >>> obtained through REST API and metrics.
> > >> >>>
> > >> >>> Best,
> > >> >>> Lijie
> > >> >>>
> > >> >>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
> > >> >>>
> > >> >>>> Hi,
> > >> >>>> Thanks for creating this FLIP.
> > >> >>>> We have implemented an automatic blocklist detection mechanism
> > >> >>>> internally, which is indeed very effective for handling node
> > >> failures.
> > >> >>>> Due to the large number of nodes, although SREs already support
> > >> >>>> automatic offline failure nodes, the detection is not 100% accurate
> > >> and
> > >> >>>> there is some delay.
> > >> >>>> So the blocklist mechanism can make flink job recover from failure
> > >> much
> > >> >>>> faster.
> > >> >>>>
> > >> >>>> Here are some of my thoughts:
> > >> >>>> 1. In this FLIP, it needs users to locate machine failure manually,
> > >> >>>> there is a certain cost of use
> > >> >>>> 2. What happens if too many nodes are blocked, resulting in
> > >> insufficient
> > >> >>>> resources? Will there be a special Exception for the user?
> > >> >>>> 3. Will we display the blocklist information in the WebUI? The
> > >> blocklist
> > >> >>>> is for cluster level, and if multiple users share a cluster, some
> > >> users may
> > >> >>>> be a little confused when resources are not enough, or when
> > >> resources are
> > >> >>>> applied for more.
> > >> >>>>
> > >> >>>> Also, Looking forward to the next FLIP on auto-detection.
> > >> >>>>
> > >> >>>> Best,
> > >> >>>> Weihua
> > >> >>>>
> > >> >>>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> > >> >>>>>
> > >> >>>>> Hi Konstantin,
> > >> >>>>>
> > >> >>>>> Maybe change it to the following:
> > >> >>>>>
> > >> >>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > >> >>>>> Merge is not allowed. If the {id} already exists, return error.
> > >> >>>> Otherwise,
> > >> >>>>> create a new item.
> > >> >>>>>
> > >> >>>>> 2. POST: http://
> > >> >>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> > >> >>>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
> > >> create
> > >> >>>> a
> > >> >>>>> new item.
> > >> >>>>>
> > >> >>>>> WDYT?
> > >> >>>>>
> > >> >>>>> Best,
> > >> >>>>> Lijie
> > >> >>>>>
> > >> >>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> > >> >>>>>
> > >> >>>>>> Hi Lijie,
> > >> >>>>>>
> > >> >>>>>> hm, maybe the following is more appropriate in that case
> > >> >>>>>>
> > >> >>>>>> POST: http://
> > >> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> > >> >>>>>>
> > >> >>>>>> Best,
> > >> >>>>>>
> > >> >>>>>> Konstantin
> > >> >>>>>>
> > >> >>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> > >> >>>>>> wangdachui9501@gmail.com>:
> > >> >>>>>>
> > >> >>>>>>> Hi Konstantin,
> > >> >>>>>>> thanks for your feedback.
> > >> >>>>>>>
> > >> >>>>>>>  From what I understand, PUT should be idempotent. However, we
> > >> have a
> > >> >>>>>>> *timeout* field in the request. This means that initiating the
> > >> same
> > >> >>>>>> request
> > >> >>>>>>> at two different times will lead to different resource status
> > >> >>>> (timestamps
> > >> >>>>>>> of the items to be removed will be different).
> > >> >>>>>>>
> > >> >>>>>>> Should we use PUT in this case? WDYT?
> > >> >>>>>>>
> > >> >>>>>>> Best,
> > >> >>>>>>> Lijie
> > >> >>>>>>>
> > >> >>>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> > >> >>>>>>>
> > >> >>>>>>>> Hi Lijie,
> > >> >>>>>>>>
> > >> >>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a
> > >> PUT
> > >> >>>> on
> > >> >>>>>> the
> > >> >>>>>>>> resource?
> > >> >>>>>>>>
> > >> >>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > >> >>>>>>>>
> > >> >>>>>>>> Best,
> > >> >>>>>>>>
> > >> >>>>>>>> Konstantin
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> > >> >>>>>>>> wangdachui9501@gmail.com>:
> > >> >>>>>>>>
> > >> >>>>>>>>> Hi everyone,
> > >> >>>>>>>>>
> > >> >>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
> > >> >>>> made
> > >> >>>>>>> the
> > >> >>>>>>>>> following changes on REST API:
> > >> >>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
> > >> choose
> > >> >>>>>>> one.
> > >> >>>>>>>> If
> > >> >>>>>>>>> both are specified, will return error.
> > >> >>>>>>>>> 2.  If the specified item is already there, the *ADD* operation
> > >> has
> > >> >>>>>> two
> > >> >>>>>>>>> behaviors:  *return error*(default value) or *merge/update*,
> > >> and we
> > >> >>>>>>> add a
> > >> >>>>>>>>> flag to the request body to control it. You can find more
> > >> details
> > >> >>>>>>> "Public
> > >> >>>>>>>>> Interface" section.
> > >> >>>>>>>>>
> > >> >>>>>>>>> If there is no more feedback, we will start the vote thread next
> > >> >>>>>> week.
> > >> >>>>>>>>> Best,
> > >> >>>>>>>>> Lijie
> > >> >>>>>>>>>
> > >> >>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Hi Becket Qin,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Thanks for your suggestions.  I have moved the description of
> > >> >>>>>>>>>> configurations, metrics and REST API into "Public Interface"
> > >> >>>>>> section,
> > >> >>>>>>>> and
> > >> >>>>>>>>>> made a few updates according to your suggestion.  And in this
> > >> >>>> FLIP,
> > >> >>>>>>>> there
> > >> >>>>>>>>>> no public java Interfaces or pluggables that users need to
> > >> >>>>>> implement
> > >> >>>>>>> by
> > >> >>>>>>>>>> themselves.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Answers for you questions:
> > >> >>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > >> >>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> > >> >>>>>> items
> > >> >>>>>>>> can
> > >> >>>>>>>>>> only be added through the REST API, so these 2 action are
> > >> >>>> mentioned
> > >> >>>>>>> in
> > >> >>>>>>>>> the
> > >> >>>>>>>>>> REST API part (The REST API part has beed moved to public
> > >> >>>> interface
> > >> >>>>>>>> now).
> > >> >>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to
> > >> String,
> > >> >>>>>> and
> > >> >>>>>>>>> allow
> > >> >>>>>>>>>> users to specify it via REST API.
> > >> >>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
> > >> >>>>>> above,
> > >> >>>>>>>> we
> > >> >>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
> > >> >>>> ADD
> > >> >>>>>>>> REST
> > >> >>>>>>>>>> API to specify when to remove the blocked item. These 2 fields
> > >> are
> > >> >>>>>>>>>> optional, if neither is specified, it means that the blocked
> > >> item
> > >> >>>>>> is
> > >> >>>>>>>>>> permanent and will not be removed. If both are specified, the
> > >> >>>>>> minimum
> > >> >>>>>>>> of
> > >> >>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
> > >> the
> > >> >>>>>>> time
> > >> >>>>>>>> to
> > >> >>>>>>>>>> remove the blocked item. To keep the configurations more
> > >> minimal,
> > >> >>>>>> we
> > >> >>>>>>>> have
> > >> >>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
> > >> >>>> configuration
> > >> >>>>>>>>>> option.
> > >> >>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
> > >> >>>>>>> already
> > >> >>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
> > >> >>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is
> > >> maintained
> > >> >>>>>> in
> > >> >>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> > >> >>>>>>>>> abstracted
> > >> >>>>>>>>>> to other interfaces) will be propagated to different
> > >> components.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Best,
> > >> >>>>>>>>>> Lijie
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>> Hi Lijie,
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Thanks for updating the FLIP. It looks like the public
> > >> interface
> > >> >>>>>>>> section
> > >> >>>>>>>>>>> did not fully reflect all the user sensible behavior and API.
> > >> Can
> > >> >>>>>>> you
> > >> >>>>>>>>> put
> > >> >>>>>>>>>>> everything that users may be aware of there? That would
> > >> include
> > >> >>>>>> the
> > >> >>>>>>>> REST
> > >> >>>>>>>>>>> API, metrics, configurations, public java Interfaces or
> > >> >>>> pluggables
> > >> >>>>>>>> that
> > >> >>>>>>>>>>> users may see or implement by themselves, as well as a brief
> > >> >>>>>> summary
> > >> >>>>>>>> of
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>> behavior of the public API.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Besides that, I have a few questions:
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> 1. According to the conversation in the discussion thread, it
> > >> >>>>>> looks
> > >> >>>>>>>> like
> > >> >>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
> > >> >>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
> > >> so,
> > >> >>>>>> can
> > >> >>>>>>>> you
> > >> >>>>>>>>>>> add
> > >> >>>>>>>>>>> that to the public interface as well?
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
> > >> >>>>>>> Throwable
> > >> >>>>>>>>> and
> > >> >>>>>>>>>>> is not reflected in the REST API. Should that be included in
> > >> the
> > >> >>>>>>> query
> > >> >>>>>>>>>>> response? And should we change that field to be a String so
> > >> users
> > >> >>>>>>> may
> > >> >>>>>>>>>>> specify the cause via the REST API when they block some nodes
> > >> /
> > >> >>>>>> TMs?
> > >> >>>>>>>>>>> 3. Would it be useful to allow users to have different
> > >> timeouts
> > >> >>>>>> for
> > >> >>>>>>>>>>> different blocked items? So while there is a default timeout,
> > >> >>>>>> users
> > >> >>>>>>>> can
> > >> >>>>>>>>>>> also override it via the REST API when they block an entity.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is
> > >> already
> > >> >>>>>>>> there,
> > >> >>>>>>>>>>> will the block item be overridden? For example, if the user
> > >> wants
> > >> >>>>>> to
> > >> >>>>>>>>>>> extend
> > >> >>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
> > >> >>>> command
> > >> >>>>>>>>> again?
> > >> >>>>>>>>>>> 5. I am not quite familiar with the details of this, but is
> > >> there
> > >> >>>>>> a
> > >> >>>>>>>>> source
> > >> >>>>>>>>>>> of truth for the blocked list? I think it might be good to
> > >> have a
> > >> >>>>>>>> single
> > >> >>>>>>>>>>> source of truth for the blocked list and just propagate that
> > >> list
> > >> >>>>>> to
> > >> >>>>>>>>>>> different components to take the action of actually blocking
> > >> the
> > >> >>>>>>>>> resource.
> > >> >>>>>>>>>>> Thanks,
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Jiangjie (Becket) Qin
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> > >> >>>>>> wangdachui9501@gmail.com
> > >> >>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>> Hi everyone,
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> Based on the discussion in the mailing list, I updated the
> > >> FLIP
> > >> >>>>>>> doc,
> > >> >>>>>>>>> the
> > >> >>>>>>>>>>>> changes include:
> > >> >>>>>>>>>>>> 1. Changed the description of the motivation section to more
> > >> >>>>>>> clearly
> > >> >>>>>>>>>>>> describe the problem this FLIP is trying to solve.
> > >> >>>>>>>>>>>> 2. Only  *Manually* is supported.
> > >> >>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>> Lijie
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>>> Hi Lijie!
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
> > >> >>>>>>>>>>> helpfulhere.”
> > >> >>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
> > >> >>>>>> In
> > >> >>>>>>>>>>>> myopinion,
> > >> >>>>>>>>>>>>> non-active and active are the same, and no special treatment
> > >> >>>>>>>>>>> isrequired.*
> > >> >>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
> > >> >>>>>>> were
> > >> >>>>>>>>>>> talking
> > >> >>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
> > >> >>>>>>>>>>> ResourceManager).
> > >> >>>>>>>>>>>>> And the original question was - how to handle the
> > >> blacklisted
> > >> >>>>>>>> nodes
> > >> >>>>>>>>>>> list
> > >> >>>>>>>>>>>> at
> > >> >>>>>>>>>>>>> the moment of leader change? Should we simply forget about
> > >> >>>>>> them
> > >> >>>>>>> or
> > >> >>>>>>>>>>> try to
> > >> >>>>>>>>>>>>> pre-save that list on the remote storage?
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
> > >> danrtsey.wy@gmail.com
> > >> >>>>>>>>> wrote:
> > >> >>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> > >> >>>>>> it
> > >> >>>>>>> is
> > >> >>>>>>>>>>> indeed
> > >> >>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
> > >> >>>>>>>> could
> > >> >>>>>>>>>>> not
> > >> >>>>>>>>>>>>>> support.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>> Yang
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
> > >> 22:18写道：
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> @Jiangang
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
> > >> >>>>>> and
> > >> >>>>>>>>> mark a
> > >> >>>>>>>>>>>> node
> > >> >>>>>>>>>>>>>>> blocked?
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
> > >> >>>>>> in
> > >> >>>>>>>> this
> > >> >>>>>>>>>>>> FLIP.
> > >> >>>>>>>>>>>>>> The
> > >> >>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
> > >> >>>>>>> FLIP
> > >> >>>>>>>> in
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
> > >> >>>>>>>>>>> correctness
> > >> >>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
> > >> >>>>>>> in
> > >> >>>>>>>>> the
> > >> >>>>>>>>>>>>> future.
> > >> >>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
> > >> >>>>>> node
> > >> >>>>>>>> and
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>> node
> > >> >>>>>>>>>>>>>>> should be marked blocked.
> > >> >>>>>>>>>>>>>>> By design, the blocklist information will be shared among
> > >> >>>>>>> all
> > >> >>>>>>>>> jobs
> > >> >>>>>>>>>>>> in a
> > >> >>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
> > >> >>>>>> with
> > >> >>>>>>>> RM.
> > >> >>>>>>>>>>>>>>> @Martijn
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> > >> >>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
> > >> >>>>>>>>>>> MARK_BLOCKLISTED(Just
> > >> >>>>>>>>>>>>>> limits
> > >> >>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
> > >> >>>>>> on
> > >> >>>>>>>> it)
> > >> >>>>>>>>>>> is
> > >> >>>>>>>>>>>>> also
> > >> >>>>>>>>>>>>>>> important, and we think that external systems (*yarn
> > >> >>>>>> rmadmin
> > >> >>>>>>>> or
> > >> >>>>>>>>>>>> kubectl
> > >> >>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
> > >> >>>>>>>> only
> > >> >>>>>>>>>>>>>> *manually*.
> > >> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> > >> >>>>>>> indeed
> > >> >>>>>>>>>>> super
> > >> >>>>>>>>>>>>> hard
> > >> >>>>>>>>>>>>>>> to get right.
> > >> >>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
> > >> >>>>>>>>> concern.
> > >> >>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
> > >> >>>>>> in
> > >> >>>>>>>> this
> > >> >>>>>>>>>>>> FLIP,
> > >> >>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
> > >> >>>>>> FLIP
> > >> >>>>>>>>>>> provides
> > >> >>>>>>>>>>>> a
> > >> >>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
> > >> >>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > >> >>>>>>>>>>>>>>> the auto-detection may be done by external systems.
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>>> Lijie
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> > >> >>>>>>> 19:04写道：
> > >> >>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> > >> >>>>>> could
> > >> >>>>>>>> not
> > >> >>>>>>>>>>> see
> > >> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> > >> >>>>>>>>> approach(via
> > >> >>>>>>>>>>>> *yarn
> > >> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
> > >> >>>>>>>>>>> mechanisms
> > >> >>>>>>>>>>>>>> that
> > >> >>>>>>>>>>>>>>>> will rarely work just right.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> > >> >>>>>>> indeed
> > >> >>>>>>>>>>> super
> > >> >>>>>>>>>>>>> hard
> > >> >>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>> get right.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> Best regards,
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> Martijn
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > >> >>>>>>>>>>>> liujiangangpeng@gmail.com
> > >> >>>>>>>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
> > >> >>>>>>>>> decrease
> > >> >>>>>>>>>>>>> great
> > >> >>>>>>>>>>>>>>> work
> > >> >>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
> > >> >>>>>>>> inner
> > >> >>>>>>>>>>> flink
> > >> >>>>>>>>>>>>>>>>> version.
> > >> >>>>>>>>>>>>>>>>> Below is something that I care about:
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>    1. For auto-detecting, I wonder how to make the
> > >> >>>>>>> strategy
> > >> >>>>>>>>> and
> > >> >>>>>>>>>>>>> mark a
> > >> >>>>>>>>>>>>>>>>> node
> > >> >>>>>>>>>>>>>>>>>    blocked? Sometimes the blocked node is hard to be
> > >> >>>>>>>>> detected,
> > >> >>>>>>>>>>> for
> > >> >>>>>>>>>>>>>>>>> example,
> > >> >>>>>>>>>>>>>>>>>    the upper node or the down node will be blocked when
> > >> >>>>>>>>> network
> > >> >>>>>>>>>>>>>>>>> unreachable.
> > >> >>>>>>>>>>>>>>>>>    2. I see that the strategy is made in JobMaster
> > >> >>>>>> side.
> > >> >>>>>>>> How
> > >> >>>>>>>>>>> about
> > >> >>>>>>>>>>>>>>>>>    implementing the similar logic in resource manager?
> > >> >>>>>> In
> > >> >>>>>>>>>>> session
> > >> >>>>>>>>>>>>>> mode,
> > >> >>>>>>>>>>>>>>>>> multi
> > >> >>>>>>>>>>>>>>>>>    jobs can fail on the same bad node and the node
> > >> >>>>>> should
> > >> >>>>>>>> be
> > >> >>>>>>>>>>>> marked
> > >> >>>>>>>>>>>>>>>>> blocked.
> > >> >>>>>>>>>>>>>>>>>    If the job makes the strategy, the node may be not
> > >> >>>>>>>> marked
> > >> >>>>>>>>>>>> blocked
> > >> >>>>>>>>>>>>>> if
> > >> >>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>    fail times don't exceed the threshold.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>> Thank you for all your feedback!
> > >> >>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
> > >> >>>>>> some
> > >> >>>>>>> of
> > >> >>>>>>>>> my
> > >> >>>>>>>>>>>>>> thoughts:
> > >> >>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
> > >> >>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
> > >> >>>>>>>>>>>>>>>>>> The automatical way should be something built upon
> > >> >>>>>> the
> > >> >>>>>>>>>>> blocklist
> > >> >>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
> > >> >>>>>>>>>>>> configurable
> > >> >>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
> > >> >>>>>> decouple
> > >> >>>>>>>> it
> > >> >>>>>>>>> by
> > >> >>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
> > >> >>>>>>>> suggested,
> > >> >>>>>>>>>>> which
> > >> >>>>>>>>>>>>>> just
> > >> >>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
> > >> >>>>>>> detected.
> > >> >>>>>>>>>>>> However,
> > >> >>>>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
> > >> >>>>>> discussions
> > >> >>>>>>>> and
> > >> >>>>>>>>>>>>> feedback
> > >> >>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
> > >> >>>>>>> users
> > >> >>>>>>>>>>> have
> > >> >>>>>>>>>>>>>>> different
> > >> >>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
> > >> >>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
> > >> >>>>>>>> externally
> > >> >>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
> > >> >>>>>> way
> > >> >>>>>>> to
> > >> >>>>>>>>>>> avoid
> > >> >>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
> > >> >>>>>>>> users
> > >> >>>>>>>>>>> may
> > >> >>>>>>>>>>>>> just
> > >> >>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
> > >> >>>>>>>> kill
> > >> >>>>>>>>>>> all
> > >> >>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
> > >> >>>>>>>>> execution[1]
> > >> >>>>>>>>>>>>> which
> > >> >>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
> > >> >>>>>> mirror
> > >> >>>>>>>>>>> tasks on
> > >> >>>>>>>>>>>>>> slow
> > >> >>>>>>>>>>>>>>>>>> nodes.
> > >> >>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>> Thanks,
> > >> >>>>>>>>>>>>>>>>>> Zhu
> > >> >>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>> [1]
> > >> >>>>>>>>>>>>>>>>>>
> > >> >>>>
> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > >> >>>>>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> > >> >>>>>>>>> 15:56写道：
> > >> >>>>>>>>>>>>>>>>>>> Hi everyone,
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> Thanks for your feedback.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
> > >> >>>>>> here
> > >> >>>>>>>>>>> because
> > >> >>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>> can
> > >> >>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
> > >> >>>>>> mechanism
> > >> >>>>>>>>>>> (perhaps
> > >> >>>>>>>>>>>> I
> > >> >>>>>>>>>>>>>>> should
> > >> >>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
> > >> >>>>>>>> FLIP:
> > >> >>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
> > >> >>>>>>> node
> > >> >>>>>>>>> as
> > >> >>>>>>>>>>>>>> blocked.
> > >> >>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
> > >> >>>>>>>> task
> > >> >>>>>>>>>>>> manager
> > >> >>>>>>>>>>>>>> or
> > >> >>>>>>>>>>>>>>>>> node.
> > >> >>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
> > >> >>>>>>>> affected.
> > >> >>>>>>>>> A
> > >> >>>>>>>>>>>>> typical
> > >> >>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
> > >> >>>>>>> In
> > >> >>>>>>>>> this
> > >> >>>>>>>>>>>>> case,
> > >> >>>>>>>>>>>>>> we
> > >> >>>>>>>>>>>>>>>>> hope
> > >> >>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
> > >> >>>>>> the
> > >> >>>>>>>> hot
> > >> >>>>>>>>>>>>> machine,
> > >> >>>>>>>>>>>>>>> but
> > >> >>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
> > >> >>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> > >> >>>>>> task
> > >> >>>>>>>>>>> manager
> > >> >>>>>>>>>>>> or
> > >> >>>>>>>>>>>>>>> node
> > >> >>>>>>>>>>>>>>>>> as
> > >> >>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
> > >> >>>>>> tasks
> > >> >>>>>>>> will
> > >> >>>>>>>>>>> be
> > >> >>>>>>>>>>>>>>>>> restarted on
> > >> >>>>>>>>>>>>>>>>>> non-blocked task managers.
> > >> >>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
> > >> >>>>>>> highlight
> > >> >>>>>>>>> the
> > >> >>>>>>>>>>>>> meaning
> > >> >>>>>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
> > >> >>>>>> that.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
> > >> >>>>>> basically
> > >> >>>>>>>>> agree
> > >> >>>>>>>>>>>> with
> > >> >>>>>>>>>>>>>>>>> @Becket
> > >> >>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
> > >> >>>>>>>> users’
> > >> >>>>>>>>>>>>>> deployment
> > >> >>>>>>>>>>>>>>>>>> environments have a special external system that can
> > >> >>>>>>>>> perform
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>> anomaly
> > >> >>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
> > >> >>>>>>>>>>> auto-detection
> > >> >>>>>>>>>>>>>>> doesn't
> > >> >>>>>>>>>>>>>>>>>> require much extra work on top of manual
> > >> >>>>>> specification.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> @Yangze
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
> > >> >>>>>>> the
> > >> >>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
> > >> >>>>>>>> to
> > >> >>>>>>>>>>>> users.
> > >> >>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
> > >> >>>>>>> separate
> > >> >>>>>>>>>>>> interface
> > >> >>>>>>>>>>>>>>>>> (maybe
> > >> >>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
> > >> >>>>>>>>>>>>> ResourceManagerBlocklistHandler
> > >> >>>>>>>>>>>>>>> can
> > >> >>>>>>>>>>>>>>>>>> implement it in the future.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> @Martijn
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
> > >> >>>>>> by
> > >> >>>>>>>>>>> cluster
> > >> >>>>>>>>>>>>>>>>> operators.
> > >> >>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
> > >> >>>>>>>> according
> > >> >>>>>>>>>>> to
> > >> >>>>>>>>>>>> my
> > >> >>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
> > >> >>>>>>>>> machine
> > >> >>>>>>>>>>>>>> problems
> > >> >>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
> > >> >>>>>>> contact
> > >> >>>>>>>>>>>> cluster
> > >> >>>>>>>>>>>>>>>>> operators
> > >> >>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
> > >> >>>>>>>>>>> machine is
> > >> >>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
> > >> >>>>>>> people
> > >> >>>>>>>>> who
> > >> >>>>>>>>>>>>> really
> > >> >>>>>>>>>>>>>>> need
> > >> >>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
> > >> >>>>>>>> performed
> > >> >>>>>>>>> by
> > >> >>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>> cluster
> > >> >>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> @Chesnay
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
> > >> >>>>>>> for
> > >> >>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
> > >> >>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
> > >> >>>>>> when
> > >> >>>>>>>> the
> > >> >>>>>>>>>>> task
> > >> >>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> > >> >>>>>>>> release
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>> slot
> > >> >>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
> > >> >>>>>>> on
> > >> >>>>>>>>> it.
> > >> >>>>>>>>>>> If
> > >> >>>>>>>>>>>> we
> > >> >>>>>>>>>>>>>>> only
> > >> >>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
> > >> >>>>>> needs
> > >> >>>>>>> to
> > >> >>>>>>>>>>>> retrieve
> > >> >>>>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>>> by
> > >> >>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
> > >> >>>>>>>> relatively
> > >> >>>>>>>>>>>> large,
> > >> >>>>>>>>>>>>>> so
> > >> >>>>>>>>>>>>>>> I
> > >> >>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
> > >> >>>>>> information
> > >> >>>>>>> on
> > >> >>>>>>>>>>> the JM
> > >> >>>>>>>>>>>>>> side
> > >> >>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>> syncing them.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> @Роман
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>     a) “Probably storing inside Zookeeper/Configmap
> > >> >>>>>>>> might
> > >> >>>>>>>>>>> be
> > >> >>>>>>>>>>>>>> helpful
> > >> >>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
> > >> >>>>>>>>>>> understand
> > >> >>>>>>>>>>>>> that.
> > >> >>>>>>>>>>>>>>> In
> > >> >>>>>>>>>>>>>>>>> my
> > >> >>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
> > >> >>>>>>>> special
> > >> >>>>>>>>>>>>>> treatment
> > >> >>>>>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>> required.
> > >> >>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
> > >> >>>>>> sense,
> > >> >>>>>>> I
> > >> >>>>>>>>> will
> > >> >>>>>>>>>>>> add
> > >> >>>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> FLIP.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> @Yang
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
> > >> >>>>>> cannot
> > >> >>>>>>>>>>> support
> > >> >>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> Lijie
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
> > >> >>>>>>>> 21:09写道：
> > >> >>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
> > >> >>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
> > >> >>>>>>>>>>> operations.
> > >> >>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
> > >> >>>>>>>>>>> Reliability
> > >> >>>>>>>>>>>>>>> Engineer)
> > >> >>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
> > >> >>>>>> nodes,
> > >> >>>>>>>>> which
> > >> >>>>>>>>>>>>> could
> > >> >>>>>>>>>>>>>>> take
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
> > >> >>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
> > >> >>>>>> taint
> > >> >>>>>>>> in
> > >> >>>>>>>>>>> K8s
> > >> >>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>> prevent
> > >> >>>>>>>>>>>>>>>>>> new
> > >> >>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
> > >> >>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
> > >> >>>>>> pods
> > >> >>>>>>>>>>> running
> > >> >>>>>>>>>>>> on
> > >> >>>>>>>>>>>>>>> these
> > >> >>>>>>>>>>>>>>>>>> nodes.
> > >> >>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
> > >> >>>>>>> So
> > >> >>>>>>>> I
> > >> >>>>>>>>> am
> > >> >>>>>>>>>>>>> afraid
> > >> >>>>>>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
> > >> >>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> > >> >>>>>>>> could
> > >> >>>>>>>>>>> not
> > >> >>>>>>>>>>>>> see
> > >> >>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> > >> >>>>>>>>>>>> approach(via
> > >> >>>>>>>>>>>>>>> *yarn
> > >> >>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> > >> >>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
> > >> >>>>>>> which
> > >> >>>>>>>>>>> could
> > >> >>>>>>>>>>>>>> expose
> > >> >>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
> > >> >>>>>>> them
> > >> >>>>>>>> if
> > >> >>>>>>>>>>>>> enabled
> > >> >>>>>>>>>>>>>>>>>> explicitly.
> > >> >>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>>>>>>>> Yang
> > >> >>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> > >> >>>>>>>> 16:36写道：
> > >> >>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
> > >> >>>>>> and
> > >> >>>>>>>>>>> somewhat
> > >> >>>>>>>>>>>>>>> related
> > >> >>>>>>>>>>>>>>>>>> to the
> > >> >>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
> > >> >>>>>>>>> Flink.
> > >> >>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
> > >> >>>>>>>> FLIP.
> > >> >>>>>>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
> > >> >>>>>>>>>>>>>>>>>>>>>      b) Collect the information of the unstable
> > >> >>>>>>> node
> > >> >>>>>>>>> and
> > >> >>>>>>>>>>>>> form a
> > >> >>>>>>>>>>>>>>>>>> blocklist.
> > >> >>>>>>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> My two cents:
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
> > >> >>>>>>> should
> > >> >>>>>>>>> have
> > >> >>>>>>>>>>>> c).
> > >> >>>>>>>>>>>>> It
> > >> >>>>>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>> not only
> > >> >>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
> > >> >>>>>> handy
> > >> >>>>>>>> for
> > >> >>>>>>>>>>> some
> > >> >>>>>>>>>>>>>>> planned
> > >> >>>>>>>>>>>>>>>>>>>>> maintenance.
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
> > >> >>>>>> who
> > >> >>>>>>>>>>> should be
> > >> >>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>> brain
> > >> >>>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
> > >> >>>>>>>>> largely
> > >> >>>>>>>>>>>>> depends
> > >> >>>>>>>>>>>>>>> on
> > >> >>>>>>>>>>>>>>>>>> who we
> > >> >>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
> > >> >>>>>>>> different
> > >> >>>>>>>>>>>>> answers.
> > >> >>>>>>>>>>>>>>> For
> > >> >>>>>>>>>>>>>>>>>> people
> > >> >>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
> > >> >>>>>>>>>>> service,
> > >> >>>>>>>>>>>> let
> > >> >>>>>>>>>>>>>>> Flink
> > >> >>>>>>>>>>>>>>>>>> do just
> > >> >>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
> > >> >>>>>>>> Flink
> > >> >>>>>>>>>>> would
> > >> >>>>>>>>>>>>> be
> > >> >>>>>>>>>>>>>>> one
> > >> >>>>>>>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
> > >> >>>>>> it
> > >> >>>>>>> to
> > >> >>>>>>>>>>> that
> > >> >>>>>>>>>>>>>>> service,
> > >> >>>>>>>>>>>>>>>>>> and then
> > >> >>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
> > >> >>>>>>>>>>> problematic
> > >> >>>>>>>>>>>>>>> nodes.
> > >> >>>>>>>>>>>>>>>>> On
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
> > >> >>>>>>>> service,
> > >> >>>>>>>>>>>> simply
> > >> >>>>>>>>>>>>>>>>> letting
> > >> >>>>>>>>>>>>>>>>>> Flink
> > >> >>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
> > >> >>>>>> nodes
> > >> >>>>>>>>> might
> > >> >>>>>>>>>>> be
> > >> >>>>>>>>>>>>>>> desired
> > >> >>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
> > >> >>>>>>>> pluggable /
> > >> >>>>>>>>>>>>>> optional.
> > >> >>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
> > >> >>>>>>> have
> > >> >>>>>>>>>>>> something
> > >> >>>>>>>>>>>>>>>>>> pluggable
> > >> >>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
> > >> >>>>>> unstable
> > >> >>>>>>>>> nodes
> > >> >>>>>>>>>>>>>>> actively.
> > >> >>>>>>>>>>>>>>>>> (A
> > >> >>>>>>>>>>>>>>>>>> more
> > >> >>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
> > >> >>>>>>> which
> > >> >>>>>>>>>>> can be
> > >> >>>>>>>>>>>>>> used
> > >> >>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> report
> > >> >>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
> > >> >>>>>>> the
> > >> >>>>>>>>>>> scope
> > >> >>>>>>>>>>>>>>> relevant
> > >> >>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
> > >> >>>>>> OK
> > >> >>>>>>> to
> > >> >>>>>>>>>>> have a
> > >> >>>>>>>>>>>>>>> default
> > >> >>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
> > >> >>>>>>> Flink
> > >> >>>>>>>> to
> > >> >>>>>>>>>>> take
> > >> >>>>>>>>>>>>>>> action
> > >> >>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> block
> > >> >>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
> > >> >>>>>>>>> timeout.
> > >> >>>>>>>>>>>>>>>>>>>>> Thanks,
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > >> >>>>>>>>>>>>>> ro.v.boyko@gmail.com
> > >> >>>>>>>>>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
> > >> >>>>>>>>>>> development.
> > >> >>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
> > >> >>>>>>> we
> > >> >>>>>>>>>>> should
> > >> >>>>>>>>>>>>>>> consider
> > >> >>>>>>>>>>>>>>>>>> to move
> > >> >>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
> > >> >>>>>> nodes/TMs
> > >> >>>>>>>>> from
> > >> >>>>>>>>>>>>> active
> > >> >>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
> > >> >>>>>>>> storing
> > >> >>>>>>>>>>>> inside
> > >> >>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
> > >> >>>>>>>> organizations
> > >> >>>>>>>>>>>> don't
> > >> >>>>>>>>>>>>>> want
> > >> >>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> expose
> > >> >>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
> > >> >>>>>>>> it's
> > >> >>>>>>>>>>>>> necessary
> > >> >>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> have the
> > >> >>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
> > >> >>>>>>> for
> > >> >>>>>>>>>>>> avoiding
> > >> >>>>>>>>>>>>>>>>> incorrect
> > >> >>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
> > >> >>>>>>>> would
> > >> >>>>>>>>> be
> > >> >>>>>>>>>>>>> better
> > >> >>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> extend
> > >> >>>>>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
> > >> >>>>>> *endTimestamp*
> > >> >>>>>>>>> field
> > >> >>>>>>>>>>> and
> > >> >>>>>>>>>>>>>> fill
> > >> >>>>>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>>>>>> at the
> > >> >>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
> > >> >>>>>>> to:
> > >> >>>>>>>>>>>>>>>>>>>>>>    -
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>    Provide the ability to users to setup the
> > >> >>>>>>> exact
> > >> >>>>>>>>>>> time
> > >> >>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>> blocklist end
> > >> >>>>>>>>>>>>>>>>>>>>>>    through RestAPI
> > >> >>>>>>>>>>>>>>>>>>>>>>    -
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>    Not being tied to a single value of
> > >> >>>>>>>>>>>>>>>>>>>>>>    *cluster.resource-blacklist.item.timeout*
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> > >> >>>>>> <
> > >> >>>>>>>>>>>>>>>>> chesnay@apache.org>
> > >> >>>>>>>>>>>>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
> > >> >>>>>>>> lines
> > >> >>>>>>>>> a
> > >> >>>>>>>>>>>> bit.
> > >> >>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
> > >> >>>>>>>>> auto-detection
> > >> >>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>> only
> > >> >>>>>>>>>>>>>>>>>> have an
> > >> >>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
> > >> >>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
> > >> >>>>>> this
> > >> >>>>>>>>> sounds
> > >> >>>>>>>>>>>> yet
> > >> >>>>>>>>>>>>>>> again
> > >> >>>>>>>>>>>>>>>>>> like one
> > >> >>>>>>>>>>>>>>>>>>>>>>> of those
> > >> >>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
> > >> >>>>>> just
> > >> >>>>>>>>> right.
> > >> >>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
> > >> >>>>>>>>> information
> > >> >>>>>>>>>>>>> after
> > >> >>>>>>>>>>>>>>> all.
> > >> >>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
> > >> >>>>>>>> complexity
> > >> >>>>>>>>>>> of
> > >> >>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>> proposal.
> > >> >>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
> > >> >>>>>>> between
> > >> >>>>>>>>>>>>>> components;
> > >> >>>>>>>>>>>>>>>>>>>>> adjustments
> > >> >>>>>>>>>>>>>>>>>>>>>>> to the
> > >> >>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
> > >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
> > >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
> > >> >>>>>>> resource
> > >> >>>>>>>>>>>>> management
> > >> >>>>>>>>>>>>>>>>>> perspective,
> > >> >>>>>>>>>>>>>>>>>>>>>>> then there
> > >> >>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
> > >> >>>>>> protocol
> > >> >>>>>>>> at
> > >> >>>>>>>>>>> all
> > >> >>>>>>>>>>>> (or
> > >> >>>>>>>>>>>>>> in
> > >> >>>>>>>>>>>>>>>>> fact
> > >> >>>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
> > >> >>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
> > >> >>>>>>> around
> > >> >>>>>>>>>>> blocked
> > >> >>>>>>>>>>>>> TMs
> > >> >>>>>>>>>>>>>>> in
> > >> >>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>> first
> > >> >>>>>>>>>>>>>>>>>>>>>>> place.
> > >> >>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
> > >> >>>>>>> down
> > >> >>>>>>>>>>> after
> > >> >>>>>>>>>>>> a
> > >> >>>>>>>>>>>>>>> while
> > >> >>>>>>>>>>>>>>>>> by
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>> RM
> > >> >>>>>>>>>>>>>>>>>>>>>>> because of
> > >> >>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
> > >> >>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
> > >> >>>>>>>> blocking
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>>> process
> > >> >>>>>>>>>>>>>>>>> or
> > >> >>>>>>>>>>>>>>>>>> node)
> > >> >>>>>>>>>>>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
> > >> >>>>>>>>>>> process(es).
> > >> >>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
> > >> >>>>>>>> back
> > >> >>>>>>>>>>> up.
> > >> >>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
> > >> >>>>>> the
> > >> >>>>>>>> same
> > >> >>>>>>>>>>> line
> > >> >>>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>> thinking to
> > >> >>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
> > >> >>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
> > >> >>>>>> process/node
> > >> >>>>>>>>>>> manually
> > >> >>>>>>>>>>>>>>> should
> > >> >>>>>>>>>>>>>>>>> be
> > >> >>>>>>>>>>>>>>>>>> a
> > >> >>>>>>>>>>>>>>>>>>>>> core
> > >> >>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
> > >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
> > >> >>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
> > >> >>>>>>>> understand
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>>> problem
> > >> >>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>> I see
> > >> >>>>>>>>>>>>>>>>>>>>>>> value
> > >> >>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
> > >> >>>>>>> blocklisting. I
> > >> >>>>>>>>> do
> > >> >>>>>>>>>>>> have
> > >> >>>>>>>>>>>>>> some
> > >> >>>>>>>>>>>>>>>>>> concerns
> > >> >>>>>>>>>>>>>>>>>>>>>> with
> > >> >>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
> > >> >>>>>>> blocked
> > >> >>>>>>>>>>>>> resources.
> > >> >>>>>>>>>>>>>> I
> > >> >>>>>>>>>>>>>>>>> have
> > >> >>>>>>>>>>>>>>>>>> two
> > >> >>>>>>>>>>>>>>>>>>>>>>>> concerns;
> > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
> > >> >>>>>>>>> separation
> > >> >>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>> concerns,
> > >> >>>>>>>>>>>>>>>>>>>>> meaning
> > >> >>>>>>>>>>>>>>>>>>>>>>> that
> > >> >>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
> > >> >>>>>>>> managing a
> > >> >>>>>>>>>>>>> cluster
> > >> >>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>> there's
> > >> >>>>>>>>>>>>>>>>>>>>> a
> > >> >>>>>>>>>>>>>>>>>>>>>>> user
> > >> >>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
> > >> >>>>>>>>>>> introduction of
> > >> >>>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>> mechanism,
> > >> >>>>>>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
> > >> >>>>>>>>> responsibility
> > >> >>>>>>>>>>> of
> > >> >>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>> first
> > >> >>>>>>>>>>>>>>>>>> group.
> > >> >>>>>>>>>>>>>>>>>>>>>> So
> > >> >>>>>>>>>>>>>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
> > >> >>>>>>>> group
> > >> >>>>>>>>>>>> blocks
> > >> >>>>>>>>>>>>>>>>>> something,
> > >> >>>>>>>>>>>>>>>>>>>>>> which
> > >> >>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
> > >> >>>>>>> paging
> > >> >>>>>>>>>>>>> mechanism
> > >> >>>>>>>>>>>>>>>>>> triggering
> > >> >>>>>>>>>>>>>>>>>>>>>> etc)
> > >> >>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
> > >> >>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
> > >> >>>>>> go
> > >> >>>>>>>>>>> through
> > >> >>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>> process
> > >> >>>>>>>>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>>>>>>> manually
> > >> >>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
> > >> >>>>>> it
> > >> >>>>>>>>>>> should
> > >> >>>>>>>>>>>>> be? I
> > >> >>>>>>>>>>>>>>> do
> > >> >>>>>>>>>>>>>>>>>> think
> > >> >>>>>>>>>>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
> > >> >>>>>>> make
> > >> >>>>>>>>>>> sense
> > >> >>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>> introduce
> > >> >>>>>>>>>>>>>>>>>>>>> such
> > >> >>>>>>>>>>>>>>>>>>>>>> a
> > >> >>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
> > >> >>>>>>> really
> > >> >>>>>>>>>>> small
> > >> >>>>>>>>>>>>> user
> > >> >>>>>>>>>>>>>>>>> group
> > >> >>>>>>>>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>>>>>> Flink?
> > >> >>>>>>>>>>>>>>>>>>>>>>> We
> > >> >>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
> > >> >>>>>>> such
> > >> >>>>>>>> a
> > >> >>>>>>>>>>>>> feature.
> > >> >>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
> > >> >>>>>>> I'm
> > >> >>>>>>>>>>>> leaning
> > >> >>>>>>>>>>>>>>>>> towards
> > >> >>>>>>>>>>>>>>>>>> not
> > >> >>>>>>>>>>>>>>>>>>>>>>> exposing
> > >> >>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
> > >> >>>>>>>> available
> > >> >>>>>>>>>>>>> strictly
> > >> >>>>>>>>>>>>>>> for
> > >> >>>>>>>>>>>>>>>>>> cluster
> > >> >>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
> > >> >>>>>> their
> > >> >>>>>>>>>>>>>>>>>> paging/metrics/logging
> > >> >>>>>>>>>>>>>>>>>>>>>>> system
> > >> >>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
> > >> >>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> > >> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > >> >>>>>>>>>>>>>>> karmagyz@gmail.com
> > >> >>>>>>>>>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
> > >> >>>>>>> some
> > >> >>>>>>>>>>> cents
> > >> >>>>>>>>>>>>> here:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
> > >> >>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>> user?
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
> > >> >>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
> > >> >>>>>>>>>>>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
> > >> >>>>>>> the
> > >> >>>>>>>>>>>> timeout
> > >> >>>>>>>>>>>>>>>>>> mechanism is
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
> > >> >>>>>> it
> > >> >>>>>>>>> will
> > >> >>>>>>>>>>> be
> > >> >>>>>>>>>>>>> very
> > >> >>>>>>>>>>>>>>>>>> confusing
> > >> >>>>>>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
> > >> >>>>>> exception
> > >> >>>>>>>> of a
> > >> >>>>>>>>>>> task
> > >> >>>>>>>>>>>>>>>>> manager to
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
> > >> >>>>>> fail
> > >> >>>>>>>> in
> > >> >>>>>>>>>>> case
> > >> >>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>> target
> > >> >>>>>>>>>>>>>>>>>> task
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
> > >> >>>>>> I
> > >> >>>>>>>>> don't
> > >> >>>>>>>>>>>> mean
> > >> >>>>>>>>>>>>> we
> > >> >>>>>>>>>>>>>>>>> need
> > >> >>>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>>>>> cover
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
> > >> >>>>>> also
> > >> >>>>>>>>> open a
> > >> >>>>>>>>>>>>>>>>>> `notifyException`
> > >> >>>>>>>>>>>>>>>>>>>>> in
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
> > >> >>>>>>>>>>> ResourceManager,
> > >> >>>>>>>>>>>>> will
> > >> >>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>> slot of
> > >> >>>>>>>>>>>>>>>>>>>>> a
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
> > >> >>>>>>> released
> > >> >>>>>>>>> and
> > >> >>>>>>>>>>>>>>> allocated?
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> > >> >>>>>> Wang
> > >> >>>>>>> <
> > >> >>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
> > >> >>>>>> response
> > >> >>>>>>>>> your 4
> > >> >>>>>>>>>>>>>> remarks:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
> > >> >>>>>>>>> controversy. I
> > >> >>>>>>>>>>>>> think
> > >> >>>>>>>>>>>>>>>>>> “BlockList”
> > >> >>>>>>>>>>>>>>>>>>>>> is
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> good
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
> > >> >>>>>>>> good
> > >> >>>>>>>>>>> idea.
> > >> >>>>>>>>>>>>>> Based
> > >> >>>>>>>>>>>>>>>>> on
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>>> above, I
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > >> >>>>>> <host>/blocklist/node/<identifier>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > >> >>>>>>>>> <host>/blocklist/taskmanager/<identifier>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
> > >> >>>>>>>> means
> > >> >>>>>>>>>>> that
> > >> >>>>>>>>>>>>> all
> > >> >>>>>>>>>>>>>>>>> task
> > >> >>>>>>>>>>>>>>>>>>>>> managers
> > >> >>>>>>>>>>>>>>>>>>>>>>> on
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
> > >> >>>>>>>> these
> > >> >>>>>>>>>>> TMs
> > >> >>>>>>>>>>>> are
> > >> >>>>>>>>>>>>>> not
> > >> >>>>>>>>>>>>>>>>>>>>> available.
> > >> >>>>>>>>>>>>>>>>>>>>>>> This
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
> > >> >>>>>> these
> > >> >>>>>>>> TMs
> > >> >>>>>>>>>>> are
> > >> >>>>>>>>>>>>> not
> > >> >>>>>>>>>>>>>>>>> really
> > >> >>>>>>>>>>>>>>>>>> lost,
> > >> >>>>>>>>>>>>>>>>>>>>>>> they
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
> > >> >>>>>> are
> > >> >>>>>>>>> still
> > >> >>>>>>>>>>>>>>> registered
> > >> >>>>>>>>>>>>>>>>>> in this
> > >> >>>>>>>>>>>>>>>>>>>>>>> flink
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
> > >> >>>>>> once
> > >> >>>>>>>> the
> > >> >>>>>>>>>>>>>>>>> corresponding
> > >> >>>>>>>>>>>>>>>>>>>>>> blocklist
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> item
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
> > >> >>>>>>>>>>>>>>> active/non-active
> > >> >>>>>>>>>>>>>>>>>>>>> clusters.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
> > >> >>>>>> TMs
> > >> >>>>>>>> may
> > >> >>>>>>>>> be
> > >> >>>>>>>>>>>>>> released
> > >> >>>>>>>>>>>>>>>>> due
> > >> >>>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>>>>> idle
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
> > >> >>>>>> keep
> > >> >>>>>>>> it.
> > >> >>>>>>>>>>> The
> > >> >>>>>>>>>>>>>> reasons
> > >> >>>>>>>>>>>>>>>>> are
> > >> >>>>>>>>>>>>>>>>>> as
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> following:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
> > >> >>>>>>> adding
> > >> >>>>>>>> or
> > >> >>>>>>>>>>>>> removing
> > >> >>>>>>>>>>>>>>>>> items
> > >> >>>>>>>>>>>>>>>>>> via
> > >> >>>>>>>>>>>>>>>>>>>>>> REST
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> API,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
> > >> >>>>>> it
> > >> >>>>>>>> to
> > >> >>>>>>>>>>>>>>>>> Long.MAX_VALUE .
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
> > >> >>>>>> a
> > >> >>>>>>>>>>> period of
> > >> >>>>>>>>>>>>>> time
> > >> >>>>>>>>>>>>>>>>>> (such as
> > >> >>>>>>>>>>>>>>>>>>>>>>> machine
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
> > >> >>>>>> prefer
> > >> >>>>>>>> that
> > >> >>>>>>>>>>>> Flink
> > >> >>>>>>>>>>>>>> can
> > >> >>>>>>>>>>>>>>> do
> > >> >>>>>>>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
> > >> >>>>>>> user
> > >> >>>>>>>> to
> > >> >>>>>>>>>>> do
> > >> >>>>>>>>>>>> it
> > >> >>>>>>>>>>>>>>>>> manually.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
> > >> >>>>>>>>>>> 于2022年4月27日周三
> > >> >>>>>>>>>>>>>>>>> 19:23写道：
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
> > >> >>>>>> only
> > >> >>>>>>>>>>> support
> > >> >>>>>>>>>>>>>>> manually
> > >> >>>>>>>>>>>>>>>>>> blocking
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
> > >> >>>>>>> different
> > >> >>>>>>>>>>>>> strategies
> > >> >>>>>>>>>>>>>>> can
> > >> >>>>>>>>>>>>>>>>>> also be
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
> > >> >>>>>>>> "bla.cklist"
> > >> >>>>>>>>>>> due
> > >> >>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>> controversy
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> around
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
> > >> >>>>>>> Ticket
> > >> >>>>>>>>>>> about
> > >> >>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>> topic a
> > >> >>>>>>>>>>>>>>>>>>>>> while
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> back
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
> > >> >>>>>>>> avoid
> > >> >>>>>>>>>>> the
> > >> >>>>>>>>>>>>> term
> > >> >>>>>>>>>>>>>>>>>> blacklist &
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> whitelist
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
> > >> >>>>>> "denylist"
> > >> >>>>>>> or
> > >> >>>>>>>>>>>>>>> "quarantined"
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
> > >> >>>>>>> slightly
> > >> >>>>>>>>>>>> different
> > >> >>>>>>>>>>>>>>>>> design
> > >> >>>>>>>>>>>>>>>>>> as
> > >> >>>>>>>>>>>>>>>>>>>>> verbs
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> like
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
> > >> >>>>>>> anti-pattern
> > >> >>>>>>>>> for
> > >> >>>>>>>>>>>> REST
> > >> >>>>>>>>>>>>>>> APIs.
> > >> >>>>>>>>>>>>>>>>>> POST
> > >> >>>>>>>>>>>>>>>>>>>>> on a
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> list
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
> > >> >>>>>>>> items.
> > >> >>>>>>>>>>>> DELETE
> > >> >>>>>>>>>>>>> on
> > >> >>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>> individual
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > >> >>>>>>>>> <host>/quarantine/items/<itemidentifier>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
> > >> >>>>>>>>> taskmanagers
> > >> >>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>> nodes
> > >> >>>>>>>>>>>>>>>>> in
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>> REST
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> API
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
> > >> >>>>>>> opinion
> > >> >>>>>>>>> on
> > >> >>>>>>>>>>>> this?
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > >> >>>>>>> <host>/quarantine/nodes/<identifier>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> > >> >>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
> > >> >>>>>>>>>>> non-active
> > >> >>>>>>>>>>>>>>>>> resource
> > >> >>>>>>>>>>>>>>>>>>>>>> managers,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> i.e.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
> > >> >>>>>>>>>>> minimal,
> > >> >>>>>>>>>>>> do
> > >> >>>>>>>>>>>>> we
> > >> >>>>>>>>>>>>>>>>> need
> > >> >>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>>> timeout
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
> > >> >>>>>>>> manually
> > >> >>>>>>>>>>> we
> > >> >>>>>>>>>>>>> could
> > >> >>>>>>>>>>>>>>>>>> delegate
> > >> >>>>>>>>>>>>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> to the
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
> > >> >>>>>>>>> behavior
> > >> >>>>>>>>>>>>> would
> > >> >>>>>>>>>>>>>>>>> better
> > >> >>>>>>>>>>>>>>>>>> fit
> > >> >>>>>>>>>>>>>>>>>>>>>> into
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>
> > >> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
> > >> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> > >> >>>>>>> schrieb
> > >> >>>>>>>>>>> Lijie
> > >> >>>>>>>>>>>>> Wang
> > >> >>>>>>>>>>>>>> <
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
> > >> >>>>>>>> cluster
> > >> >>>>>>>>>>> node
> > >> >>>>>>>>>>>>>>> issues
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> disk
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
> > >> >>>>>>>> abnormalities).
> > >> >>>>>>>>>>>> Flink
> > >> >>>>>>>>>>>>>> will
> > >> >>>>>>>>>>>>>>>>>> take care
> > >> >>>>>>>>>>>>>>>>>>>>>> of
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
> > >> >>>>>>> However,
> > >> >>>>>>>>> due
> > >> >>>>>>>>>>> to
> > >> >>>>>>>>>>>>> data
> > >> >>>>>>>>>>>>>>>>>> locality
> > >> >>>>>>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> limited
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
> > >> >>>>>> likely
> > >> >>>>>>>> to
> > >> >>>>>>>>> be
> > >> >>>>>>>>>>>>>>> redeployed
> > >> >>>>>>>>>>>>>>>>>> to the
> > >> >>>>>>>>>>>>>>>>>>>>>> same
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
> > >> >>>>>>>> task
> > >> >>>>>>>>>>>>>>> abnormalities
> > >> >>>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>>>>>> affect
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> job
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
> > >> >>>>>> manually
> > >> >>>>>>>>>>> identify
> > >> >>>>>>>>>>>>> the
> > >> >>>>>>>>>>>>>>>>>> problematic
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> node and
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
> > >> >>>>>>> But
> > >> >>>>>>>>>>> this
> > >> >>>>>>>>>>>>>>> approach
> > >> >>>>>>>>>>>>>>>>> has
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> following
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
> > >> >>>>>> heavy
> > >> >>>>>>>>>>> process.
> > >> >>>>>>>>>>>>>> Users
> > >> >>>>>>>>>>>>>>>>> may
> > >> >>>>>>>>>>>>>>>>>> need
> > >> >>>>>>>>>>>>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> contact
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
> > >> >>>>>>>>> operation
> > >> >>>>>>>>>>> can
> > >> >>>>>>>>>>>>>> even
> > >> >>>>>>>>>>>>>>> be
> > >> >>>>>>>>>>>>>>>>>>>>> dangerous
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> not
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
> > >> >>>>>>>>> events.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
> > >> >>>>>> of
> > >> >>>>>>>>>>> problems
> > >> >>>>>>>>>>>>>>> manually
> > >> >>>>>>>>>>>>>>>>>> would
> > >> >>>>>>>>>>>>>>>>>>>>> be
> > >> >>>>>>>>>>>>>>>>>>>>>>>>> slow
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> and
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
> > >> >>>>>>>> propose
> > >> >>>>>>>>>>> to
> > >> >>>>>>>>>>>>>>>>> introduce a
> > >> >>>>>>>>>>>>>>>>>>>>>> blacklist
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
> > >> >>>>>>>>> problematic
> > >> >>>>>>>>>>>>>>> resources.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
> > >> >>>>>>> FLIP-224[1].
> > >> >>>>>>>>>>> Looking
> > >> >>>>>>>>>>>>>>> forward
> > >> >>>>>>>>>>>>>>>>>> to your
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>
> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>> --
> > >> >>>>>>>>>>>>> Best regards,
> > >> >>>>>>>>>>>>> Roman Boyko
> > >> >>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> --
> > >> >>>>>>>> https://twitter.com/snntrable
> > >> >>>>>>>> https://github.com/knaufk
> > >> >>>>>>>>
> > >> >>>>>>
> > >> >>>>>> --
> > >> >>>>>> https://twitter.com/snntrable
> > >> >>>>>> https://github.com/knaufk
> > >> >>>>>>
> > >> >>>>
> > >>
> > >>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.

Regarding the concern of the SlotManager, my two cents:
1. it is necessary for the SlotManager to host blocked slots, in 2 cases:
  a. In standalone mode, a taskmanager may be temporarily added to
the blocklist. We do not want the TM to get disconnected and shut down.
So we need to keep its connection to RM and keep hosting its slots.
  b. When we want to avoid allocating slots to a slow nodes but do not
want to kill current running tasks on the nodes (MARK_BLOCKED mode).

There is possible a way to keep the connection of a blocked task manager
while hide its slots from SlotManager, but I feel it may be even much more
complicated.

2. It will not complicate the SlotManager too much. The SlotManager will
be offered a BlockedTaskManagerChecker when created, and just need
to use it to filter out blocked slots on slot request. Therefore I think the
complication is acceptable.

Thanks,
Zhu

Lijie Wang <wa...@gmail.com> 于2022年5月25日周三 15:26写道：
>
> Hi everyone,
>
> I've updated the FLIP according to Chesnay's feedback, changes as follows:
> 1. Change the GET result to a map.
> 2. Only left *endTimestamp* in ADD operation, and change the rest (from
> POST) to PUT
> 3. Introduce a new slot pool implementation(BlocklistSlotPool) to
> encapsulate blocklist related functions.
> 4. Remove *mainThread* from BlocklistTracker, instead provide a
> *removeTimeoutItems* method to be called by outside components。
>
> Best,
> Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：
>
> > Hi Chesnay,
> >
> > Thanks for feedback.
> >
> > 1. Regarding the TM/Node id. Do you mean special characters may appear in
> > the rest URL?  Actually, I don't think so. The task manager id in REST API
> > should be the *ResourceID* of taskmanager in Flink, there should be no
> > special characters, and some existing REST APIs are already using it, e.g.
> > GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
> > id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
> > should also have no special characters.
> > 2. Regarding the GET query responses. I agree with you, it makes sense to
> > change the GET result to a map.
> >
> > 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> > cover everything, and the endTimestamp is a unix timestamp, there should be
> > no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
> > add rest api is add or update, PUT can cover this semantics.
> >
> > 4. Regarding the slot pool/manager. I don't think the current slotpool
> > and slotmanager are able to support the MARK_BLOCKED(slots that are
> > already allocated will not be affected) action. The reasons are as
> > follows：
> >
> > a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
> > from reserved(task assigned) to free(no task assigned), it is necessary to
> > check whether the slot should be released immediately(it should be released
> > immediately if the task manager is blocked, otherwise it may be allocated
> > to other tasks). I think it cannot be supported without being aware of
> > the blocklist information. Compared to the solution in FLIP, a more
> > appropriate/prefered way may be: Introduce a new slot pool
> > implementation for blocklist(may be named BlocklistSlotPool, it
> > extends/wrapps the original slot pool), and implement the parts that need
> > to be aware of the blocklist in this newly introduced slot pool, and the
> > original slot pool basically does not need to change.
> >
> > b) for slot manager, with the MARK_BLOCKED action, there may be free but
> > blocked slots in slot manager (the corresponding TMs cannot be
> > released/unregistered because there are still running tasks on them).
> > Therefore, we need to filter out the blocked slots when trying to fulfill
> > the slot requirements. Therefore it also needs to know the blocklist information.
> > A better way may be to abstract a resource allocation strategy, and make
> > the blocklist as a special implementation, then pass the resource
> > allocation strategy in when constructing the slot manager. Unfortunately,
> > the data structures in the two existing slot manager
> > implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
> > quite different, it is not easy to abstract a common resource allocation
> > strategy, so we prefer to keep the current way(i.e. pass the blocklist
> > information directly into slot manager).
> >
> >
> > 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
> > does not need to be aware of the executor, and the timeout actions can be
> > done outside.
> >
> > Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
> >
> >> I have a number of concerns:
> >>
> >> Is the id used for deleting an item the same sent in the initial request
> >> (and not one returned by Flink)?
> >> I'm very concerned that the tm/node id can contain special characters.
> >>
> >> The GET query should return a map, not a list of items. This makes it
> >> easier to work with.
> >>
> >> The duality of endTimestamp and duration is also concerning.
> >> If we conclude that endTimestamps can in fact work (and aren't utterly
> >> unusable due to timezones),
> >> then this should be able to cover everything and rid us of some
> >> complexity w.r.t. POSTs to the same ID.
> >> Additions would be a PUT, changes a PATCH, deletes a DELETE.
> >>
> >>
> >> I also dislike how we're pushing more functionality into the
> >> slotpool/-manager.
> >> These components are complex enough as-is, and instead I'd propose a
> >> separate component that interacts with the SlotPool/-Manager instead,
> >> for example by removing the slots from that TM.
> >> The reason being that from the slot-pool perspective it is irrelevant
> >> whether a slot is gone because the TM was lost, or because it was blocked.
> >>
> >>
> >> The BlocklistTracker shouldn't be modeled as component that is aware of
> >> the concept of main threads.
> >> It really has no business knowing that; all it needs is an executor for
> >> handling timeouts/periodic actions,
> >> and a way to interact with the JM/RM (which internally can handle the
> >> scheduling into the main thread).
> >>
> >>
> >> On 20/05/2022 07:20, Lijie Wang wrote:
> >> > Hi everyone,
> >> >
> >> > I have started a vote for this FLIP [1]. Please cast your vote there or
> >> ask
> >> > additional questions here. [1]
> >> > https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> >> >
> >> > Best, Lijie
> >> >
> >> > Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
> >> >
> >> >> Hi Konstantin,
> >> >>
> >> >> We found that Flink REST URL does not support the format ":merge" ,
> >> which
> >> >> will be recognized as a parameter in the URL(due to start with a
> >> colon).
> >> >>
> >> >> We will keep the previous way, i.e.
> >> >>
> >> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> >> >> and the "id" and "merge" flag are put into the request body.
> >> >>
> >> >> Best,
> >> >> Lijie
> >> >>
> >> >> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
> >> >>
> >> >>> Hi Weihua,
> >> >>> thanks for feedback.
> >> >>>
> >> >>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> >> >>> step towards auto-detection.
> >> >>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> >> >>> exception of insufficient resources.
> >> >>> 3. No. This FLIP won't change the WebUI. The blocklist information
> >> can be
> >> >>> obtained through REST API and metrics.
> >> >>>
> >> >>> Best,
> >> >>> Lijie
> >> >>>
> >> >>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
> >> >>>
> >> >>>> Hi,
> >> >>>> Thanks for creating this FLIP.
> >> >>>> We have implemented an automatic blocklist detection mechanism
> >> >>>> internally, which is indeed very effective for handling node
> >> failures.
> >> >>>> Due to the large number of nodes, although SREs already support
> >> >>>> automatic offline failure nodes, the detection is not 100% accurate
> >> and
> >> >>>> there is some delay.
> >> >>>> So the blocklist mechanism can make flink job recover from failure
> >> much
> >> >>>> faster.
> >> >>>>
> >> >>>> Here are some of my thoughts:
> >> >>>> 1. In this FLIP, it needs users to locate machine failure manually,
> >> >>>> there is a certain cost of use
> >> >>>> 2. What happens if too many nodes are blocked, resulting in
> >> insufficient
> >> >>>> resources? Will there be a special Exception for the user?
> >> >>>> 3. Will we display the blocklist information in the WebUI? The
> >> blocklist
> >> >>>> is for cluster level, and if multiple users share a cluster, some
> >> users may
> >> >>>> be a little confused when resources are not enough, or when
> >> resources are
> >> >>>> applied for more.
> >> >>>>
> >> >>>> Also, Looking forward to the next FLIP on auto-detection.
> >> >>>>
> >> >>>> Best,
> >> >>>> Weihua
> >> >>>>
> >> >>>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> >> >>>>>
> >> >>>>> Hi Konstantin,
> >> >>>>>
> >> >>>>> Maybe change it to the following:
> >> >>>>>
> >> >>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >> >>>>> Merge is not allowed. If the {id} already exists, return error.
> >> >>>> Otherwise,
> >> >>>>> create a new item.
> >> >>>>>
> >> >>>>> 2. POST: http://
> >> >>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >> >>>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
> >> create
> >> >>>> a
> >> >>>>> new item.
> >> >>>>>
> >> >>>>> WDYT?
> >> >>>>>
> >> >>>>> Best,
> >> >>>>> Lijie
> >> >>>>>
> >> >>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> >> >>>>>
> >> >>>>>> Hi Lijie,
> >> >>>>>>
> >> >>>>>> hm, maybe the following is more appropriate in that case
> >> >>>>>>
> >> >>>>>> POST: http://
> >> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>>
> >> >>>>>> Konstantin
> >> >>>>>>
> >> >>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> >> >>>>>> wangdachui9501@gmail.com>:
> >> >>>>>>
> >> >>>>>>> Hi Konstantin,
> >> >>>>>>> thanks for your feedback.
> >> >>>>>>>
> >> >>>>>>>  From what I understand, PUT should be idempotent. However, we
> >> have a
> >> >>>>>>> *timeout* field in the request. This means that initiating the
> >> same
> >> >>>>>> request
> >> >>>>>>> at two different times will lead to different resource status
> >> >>>> (timestamps
> >> >>>>>>> of the items to be removed will be different).
> >> >>>>>>>
> >> >>>>>>> Should we use PUT in this case? WDYT?
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>> Lijie
> >> >>>>>>>
> >> >>>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> >> >>>>>>>
> >> >>>>>>>> Hi Lijie,
> >> >>>>>>>>
> >> >>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a
> >> PUT
> >> >>>> on
> >> >>>>>> the
> >> >>>>>>>> resource?
> >> >>>>>>>>
> >> >>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >> >>>>>>>>
> >> >>>>>>>> Best,
> >> >>>>>>>>
> >> >>>>>>>> Konstantin
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> >> >>>>>>>> wangdachui9501@gmail.com>:
> >> >>>>>>>>
> >> >>>>>>>>> Hi everyone,
> >> >>>>>>>>>
> >> >>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
> >> >>>> made
> >> >>>>>>> the
> >> >>>>>>>>> following changes on REST API:
> >> >>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
> >> choose
> >> >>>>>>> one.
> >> >>>>>>>> If
> >> >>>>>>>>> both are specified, will return error.
> >> >>>>>>>>> 2.  If the specified item is already there, the *ADD* operation
> >> has
> >> >>>>>> two
> >> >>>>>>>>> behaviors:  *return error*(default value) or *merge/update*,
> >> and we
> >> >>>>>>> add a
> >> >>>>>>>>> flag to the request body to control it. You can find more
> >> details
> >> >>>>>>> "Public
> >> >>>>>>>>> Interface" section.
> >> >>>>>>>>>
> >> >>>>>>>>> If there is no more feedback, we will start the vote thread next
> >> >>>>>> week.
> >> >>>>>>>>> Best,
> >> >>>>>>>>> Lijie
> >> >>>>>>>>>
> >> >>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Becket Qin,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks for your suggestions.  I have moved the description of
> >> >>>>>>>>>> configurations, metrics and REST API into "Public Interface"
> >> >>>>>> section,
> >> >>>>>>>> and
> >> >>>>>>>>>> made a few updates according to your suggestion.  And in this
> >> >>>> FLIP,
> >> >>>>>>>> there
> >> >>>>>>>>>> no public java Interfaces or pluggables that users need to
> >> >>>>>> implement
> >> >>>>>>> by
> >> >>>>>>>>>> themselves.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Answers for you questions:
> >> >>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> >> >>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> >> >>>>>> items
> >> >>>>>>>> can
> >> >>>>>>>>>> only be added through the REST API, so these 2 action are
> >> >>>> mentioned
> >> >>>>>>> in
> >> >>>>>>>>> the
> >> >>>>>>>>>> REST API part (The REST API part has beed moved to public
> >> >>>> interface
> >> >>>>>>>> now).
> >> >>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to
> >> String,
> >> >>>>>> and
> >> >>>>>>>>> allow
> >> >>>>>>>>>> users to specify it via REST API.
> >> >>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
> >> >>>>>> above,
> >> >>>>>>>> we
> >> >>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
> >> >>>> ADD
> >> >>>>>>>> REST
> >> >>>>>>>>>> API to specify when to remove the blocked item. These 2 fields
> >> are
> >> >>>>>>>>>> optional, if neither is specified, it means that the blocked
> >> item
> >> >>>>>> is
> >> >>>>>>>>>> permanent and will not be removed. If both are specified, the
> >> >>>>>> minimum
> >> >>>>>>>> of
> >> >>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
> >> the
> >> >>>>>>> time
> >> >>>>>>>> to
> >> >>>>>>>>>> remove the blocked item. To keep the configurations more
> >> minimal,
> >> >>>>>> we
> >> >>>>>>>> have
> >> >>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
> >> >>>> configuration
> >> >>>>>>>>>> option.
> >> >>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
> >> >>>>>>> already
> >> >>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
> >> >>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is
> >> maintained
> >> >>>>>> in
> >> >>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> >> >>>>>>>>> abstracted
> >> >>>>>>>>>> to other interfaces) will be propagated to different
> >> components.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Best,
> >> >>>>>>>>>> Lijie
> >> >>>>>>>>>>
> >> >>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi Lijie,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks for updating the FLIP. It looks like the public
> >> interface
> >> >>>>>>>> section
> >> >>>>>>>>>>> did not fully reflect all the user sensible behavior and API.
> >> Can
> >> >>>>>>> you
> >> >>>>>>>>> put
> >> >>>>>>>>>>> everything that users may be aware of there? That would
> >> include
> >> >>>>>> the
> >> >>>>>>>> REST
> >> >>>>>>>>>>> API, metrics, configurations, public java Interfaces or
> >> >>>> pluggables
> >> >>>>>>>> that
> >> >>>>>>>>>>> users may see or implement by themselves, as well as a brief
> >> >>>>>> summary
> >> >>>>>>>> of
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>> behavior of the public API.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Besides that, I have a few questions:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 1. According to the conversation in the discussion thread, it
> >> >>>>>> looks
> >> >>>>>>>> like
> >> >>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
> >> >>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
> >> so,
> >> >>>>>> can
> >> >>>>>>>> you
> >> >>>>>>>>>>> add
> >> >>>>>>>>>>> that to the public interface as well?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
> >> >>>>>>> Throwable
> >> >>>>>>>>> and
> >> >>>>>>>>>>> is not reflected in the REST API. Should that be included in
> >> the
> >> >>>>>>> query
> >> >>>>>>>>>>> response? And should we change that field to be a String so
> >> users
> >> >>>>>>> may
> >> >>>>>>>>>>> specify the cause via the REST API when they block some nodes
> >> /
> >> >>>>>> TMs?
> >> >>>>>>>>>>> 3. Would it be useful to allow users to have different
> >> timeouts
> >> >>>>>> for
> >> >>>>>>>>>>> different blocked items? So while there is a default timeout,
> >> >>>>>> users
> >> >>>>>>>> can
> >> >>>>>>>>>>> also override it via the REST API when they block an entity.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is
> >> already
> >> >>>>>>>> there,
> >> >>>>>>>>>>> will the block item be overridden? For example, if the user
> >> wants
> >> >>>>>> to
> >> >>>>>>>>>>> extend
> >> >>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
> >> >>>> command
> >> >>>>>>>>> again?
> >> >>>>>>>>>>> 5. I am not quite familiar with the details of this, but is
> >> there
> >> >>>>>> a
> >> >>>>>>>>> source
> >> >>>>>>>>>>> of truth for the blocked list? I think it might be good to
> >> have a
> >> >>>>>>>> single
> >> >>>>>>>>>>> source of truth for the blocked list and just propagate that
> >> list
> >> >>>>>> to
> >> >>>>>>>>>>> different components to take the action of actually blocking
> >> the
> >> >>>>>>>>> resource.
> >> >>>>>>>>>>> Thanks,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Jiangjie (Becket) Qin
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> >> >>>>>> wangdachui9501@gmail.com
> >> >>>>>>>>>>> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>> Hi everyone,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Based on the discussion in the mailing list, I updated the
> >> FLIP
> >> >>>>>>> doc,
> >> >>>>>>>>> the
> >> >>>>>>>>>>>> changes include:
> >> >>>>>>>>>>>> 1. Changed the description of the motivation section to more
> >> >>>>>>> clearly
> >> >>>>>>>>>>>> describe the problem this FLIP is trying to solve.
> >> >>>>>>>>>>>> 2. Only  *Manually* is supported.
> >> >>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Best,
> >> >>>>>>>>>>>> Lijie
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>> Hi Lijie!
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
> >> >>>>>>>>>>> helpfulhere.”
> >> >>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
> >> >>>>>> In
> >> >>>>>>>>>>>> myopinion,
> >> >>>>>>>>>>>>> non-active and active are the same, and no special treatment
> >> >>>>>>>>>>> isrequired.*
> >> >>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
> >> >>>>>>> were
> >> >>>>>>>>>>> talking
> >> >>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
> >> >>>>>>>>>>> ResourceManager).
> >> >>>>>>>>>>>>> And the original question was - how to handle the
> >> blacklisted
> >> >>>>>>>> nodes
> >> >>>>>>>>>>> list
> >> >>>>>>>>>>>> at
> >> >>>>>>>>>>>>> the moment of leader change? Should we simply forget about
> >> >>>>>> them
> >> >>>>>>> or
> >> >>>>>>>>>>> try to
> >> >>>>>>>>>>>>> pre-save that list on the remote storage?
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
> >> danrtsey.wy@gmail.com
> >> >>>>>>>>> wrote:
> >> >>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> >> >>>>>> it
> >> >>>>>>> is
> >> >>>>>>>>>>> indeed
> >> >>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
> >> >>>>>>>> could
> >> >>>>>>>>>>> not
> >> >>>>>>>>>>>>>> support.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>> Yang
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
> >> 22:18写道：
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> @Jiangang
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
> >> >>>>>> and
> >> >>>>>>>>> mark a
> >> >>>>>>>>>>>> node
> >> >>>>>>>>>>>>>>> blocked?
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
> >> >>>>>> in
> >> >>>>>>>> this
> >> >>>>>>>>>>>> FLIP.
> >> >>>>>>>>>>>>>> The
> >> >>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
> >> >>>>>>> FLIP
> >> >>>>>>>> in
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
> >> >>>>>>>>>>> correctness
> >> >>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
> >> >>>>>>> in
> >> >>>>>>>>> the
> >> >>>>>>>>>>>>> future.
> >> >>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
> >> >>>>>> node
> >> >>>>>>>> and
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>> node
> >> >>>>>>>>>>>>>>> should be marked blocked.
> >> >>>>>>>>>>>>>>> By design, the blocklist information will be shared among
> >> >>>>>>> all
> >> >>>>>>>>> jobs
> >> >>>>>>>>>>>> in a
> >> >>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
> >> >>>>>> with
> >> >>>>>>>> RM.
> >> >>>>>>>>>>>>>>> @Martijn
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> >> >>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
> >> >>>>>>>>>>> MARK_BLOCKLISTED(Just
> >> >>>>>>>>>>>>>> limits
> >> >>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
> >> >>>>>> on
> >> >>>>>>>> it)
> >> >>>>>>>>>>> is
> >> >>>>>>>>>>>>> also
> >> >>>>>>>>>>>>>>> important, and we think that external systems (*yarn
> >> >>>>>> rmadmin
> >> >>>>>>>> or
> >> >>>>>>>>>>>> kubectl
> >> >>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
> >> >>>>>>>> only
> >> >>>>>>>>>>>>>> *manually*.
> >> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> >> >>>>>>> indeed
> >> >>>>>>>>>>> super
> >> >>>>>>>>>>>>> hard
> >> >>>>>>>>>>>>>>> to get right.
> >> >>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
> >> >>>>>>>>> concern.
> >> >>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
> >> >>>>>> in
> >> >>>>>>>> this
> >> >>>>>>>>>>>> FLIP,
> >> >>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
> >> >>>>>> FLIP
> >> >>>>>>>>>>> provides
> >> >>>>>>>>>>>> a
> >> >>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
> >> >>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> >> >>>>>>>>>>>>>>> the auto-detection may be done by external systems.
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>> Lijie
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> >> >>>>>>> 19:04写道：
> >> >>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> >> >>>>>> could
> >> >>>>>>>> not
> >> >>>>>>>>>>> see
> >> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> >> >>>>>>>>> approach(via
> >> >>>>>>>>>>>> *yarn
> >> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
> >> >>>>>>>>>>> mechanisms
> >> >>>>>>>>>>>>>> that
> >> >>>>>>>>>>>>>>>> will rarely work just right.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> >> >>>>>>> indeed
> >> >>>>>>>>>>> super
> >> >>>>>>>>>>>>> hard
> >> >>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>> get right.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Martijn
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> >> >>>>>>>>>>>> liujiangangpeng@gmail.com
> >> >>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
> >> >>>>>>>>> decrease
> >> >>>>>>>>>>>>> great
> >> >>>>>>>>>>>>>>> work
> >> >>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
> >> >>>>>>>> inner
> >> >>>>>>>>>>> flink
> >> >>>>>>>>>>>>>>>>> version.
> >> >>>>>>>>>>>>>>>>> Below is something that I care about:
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>    1. For auto-detecting, I wonder how to make the
> >> >>>>>>> strategy
> >> >>>>>>>>> and
> >> >>>>>>>>>>>>> mark a
> >> >>>>>>>>>>>>>>>>> node
> >> >>>>>>>>>>>>>>>>>    blocked? Sometimes the blocked node is hard to be
> >> >>>>>>>>> detected,
> >> >>>>>>>>>>> for
> >> >>>>>>>>>>>>>>>>> example,
> >> >>>>>>>>>>>>>>>>>    the upper node or the down node will be blocked when
> >> >>>>>>>>> network
> >> >>>>>>>>>>>>>>>>> unreachable.
> >> >>>>>>>>>>>>>>>>>    2. I see that the strategy is made in JobMaster
> >> >>>>>> side.
> >> >>>>>>>> How
> >> >>>>>>>>>>> about
> >> >>>>>>>>>>>>>>>>>    implementing the similar logic in resource manager?
> >> >>>>>> In
> >> >>>>>>>>>>> session
> >> >>>>>>>>>>>>>> mode,
> >> >>>>>>>>>>>>>>>>> multi
> >> >>>>>>>>>>>>>>>>>    jobs can fail on the same bad node and the node
> >> >>>>>> should
> >> >>>>>>>> be
> >> >>>>>>>>>>>> marked
> >> >>>>>>>>>>>>>>>>> blocked.
> >> >>>>>>>>>>>>>>>>>    If the job makes the strategy, the node may be not
> >> >>>>>>>> marked
> >> >>>>>>>>>>>> blocked
> >> >>>>>>>>>>>>>> if
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>    fail times don't exceed the threshold.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Thank you for all your feedback!
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
> >> >>>>>> some
> >> >>>>>>> of
> >> >>>>>>>>> my
> >> >>>>>>>>>>>>>> thoughts:
> >> >>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
> >> >>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
> >> >>>>>>>>>>>>>>>>>> The automatical way should be something built upon
> >> >>>>>> the
> >> >>>>>>>>>>> blocklist
> >> >>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
> >> >>>>>>>>>>>> configurable
> >> >>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
> >> >>>>>> decouple
> >> >>>>>>>> it
> >> >>>>>>>>> by
> >> >>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
> >> >>>>>>>> suggested,
> >> >>>>>>>>>>> which
> >> >>>>>>>>>>>>>> just
> >> >>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
> >> >>>>>>> detected.
> >> >>>>>>>>>>>> However,
> >> >>>>>>>>>>>>>> it
> >> >>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
> >> >>>>>> discussions
> >> >>>>>>>> and
> >> >>>>>>>>>>>>> feedback
> >> >>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
> >> >>>>>>> users
> >> >>>>>>>>>>> have
> >> >>>>>>>>>>>>>>> different
> >> >>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
> >> >>>>>>>> externally
> >> >>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
> >> >>>>>> way
> >> >>>>>>> to
> >> >>>>>>>>>>> avoid
> >> >>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
> >> >>>>>>>> users
> >> >>>>>>>>>>> may
> >> >>>>>>>>>>>>> just
> >> >>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
> >> >>>>>>>> kill
> >> >>>>>>>>>>> all
> >> >>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
> >> >>>>>>>>> execution[1]
> >> >>>>>>>>>>>>> which
> >> >>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
> >> >>>>>> mirror
> >> >>>>>>>>>>> tasks on
> >> >>>>>>>>>>>>>> slow
> >> >>>>>>>>>>>>>>>>>> nodes.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Thanks,
> >> >>>>>>>>>>>>>>>>>> Zhu
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> [1]
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> >>>>>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> >> >>>>>>>>> 15:56写道：
> >> >>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Thanks for your feedback.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
> >> >>>>>> here
> >> >>>>>>>>>>> because
> >> >>>>>>>>>>>> it
> >> >>>>>>>>>>>>>> can
> >> >>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
> >> >>>>>> mechanism
> >> >>>>>>>>>>> (perhaps
> >> >>>>>>>>>>>> I
> >> >>>>>>>>>>>>>>> should
> >> >>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
> >> >>>>>>>> FLIP:
> >> >>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
> >> >>>>>>> node
> >> >>>>>>>>> as
> >> >>>>>>>>>>>>>> blocked.
> >> >>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
> >> >>>>>>>> task
> >> >>>>>>>>>>>> manager
> >> >>>>>>>>>>>>>> or
> >> >>>>>>>>>>>>>>>>> node.
> >> >>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
> >> >>>>>>>> affected.
> >> >>>>>>>>> A
> >> >>>>>>>>>>>>> typical
> >> >>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
> >> >>>>>>> In
> >> >>>>>>>>> this
> >> >>>>>>>>>>>>> case,
> >> >>>>>>>>>>>>>> we
> >> >>>>>>>>>>>>>>>>> hope
> >> >>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
> >> >>>>>> the
> >> >>>>>>>> hot
> >> >>>>>>>>>>>>> machine,
> >> >>>>>>>>>>>>>>> but
> >> >>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
> >> >>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> >> >>>>>> task
> >> >>>>>>>>>>> manager
> >> >>>>>>>>>>>> or
> >> >>>>>>>>>>>>>>> node
> >> >>>>>>>>>>>>>>>>> as
> >> >>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
> >> >>>>>> tasks
> >> >>>>>>>> will
> >> >>>>>>>>>>> be
> >> >>>>>>>>>>>>>>>>> restarted on
> >> >>>>>>>>>>>>>>>>>> non-blocked task managers.
> >> >>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
> >> >>>>>>> highlight
> >> >>>>>>>>> the
> >> >>>>>>>>>>>>> meaning
> >> >>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
> >> >>>>>> that.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
> >> >>>>>> basically
> >> >>>>>>>>> agree
> >> >>>>>>>>>>>> with
> >> >>>>>>>>>>>>>>>>> @Becket
> >> >>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
> >> >>>>>>>> users’
> >> >>>>>>>>>>>>>> deployment
> >> >>>>>>>>>>>>>>>>>> environments have a special external system that can
> >> >>>>>>>>> perform
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>>>> anomaly
> >> >>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
> >> >>>>>>>>>>> auto-detection
> >> >>>>>>>>>>>>>>> doesn't
> >> >>>>>>>>>>>>>>>>>> require much extra work on top of manual
> >> >>>>>> specification.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> @Yangze
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
> >> >>>>>>> the
> >> >>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
> >> >>>>>>>> to
> >> >>>>>>>>>>>> users.
> >> >>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
> >> >>>>>>> separate
> >> >>>>>>>>>>>> interface
> >> >>>>>>>>>>>>>>>>> (maybe
> >> >>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
> >> >>>>>>>>>>>>> ResourceManagerBlocklistHandler
> >> >>>>>>>>>>>>>>> can
> >> >>>>>>>>>>>>>>>>>> implement it in the future.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> @Martijn
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
> >> >>>>>> by
> >> >>>>>>>>>>> cluster
> >> >>>>>>>>>>>>>>>>> operators.
> >> >>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
> >> >>>>>>>> according
> >> >>>>>>>>>>> to
> >> >>>>>>>>>>>> my
> >> >>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
> >> >>>>>>>>> machine
> >> >>>>>>>>>>>>>> problems
> >> >>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
> >> >>>>>>> contact
> >> >>>>>>>>>>>> cluster
> >> >>>>>>>>>>>>>>>>> operators
> >> >>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
> >> >>>>>>>>>>> machine is
> >> >>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
> >> >>>>>>> people
> >> >>>>>>>>> who
> >> >>>>>>>>>>>>> really
> >> >>>>>>>>>>>>>>> need
> >> >>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
> >> >>>>>>>> performed
> >> >>>>>>>>> by
> >> >>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> cluster
> >> >>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> @Chesnay
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
> >> >>>>>>> for
> >> >>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
> >> >>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
> >> >>>>>> when
> >> >>>>>>>> the
> >> >>>>>>>>>>> task
> >> >>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> >> >>>>>>>> release
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>> slot
> >> >>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
> >> >>>>>>> on
> >> >>>>>>>>> it.
> >> >>>>>>>>>>> If
> >> >>>>>>>>>>>> we
> >> >>>>>>>>>>>>>>> only
> >> >>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
> >> >>>>>> needs
> >> >>>>>>> to
> >> >>>>>>>>>>>> retrieve
> >> >>>>>>>>>>>>>> it
> >> >>>>>>>>>>>>>>> by
> >> >>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
> >> >>>>>>>> relatively
> >> >>>>>>>>>>>> large,
> >> >>>>>>>>>>>>>> so
> >> >>>>>>>>>>>>>>> I
> >> >>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
> >> >>>>>> information
> >> >>>>>>> on
> >> >>>>>>>>>>> the JM
> >> >>>>>>>>>>>>>> side
> >> >>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>> syncing them.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> @Роман
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>     a) “Probably storing inside Zookeeper/Configmap
> >> >>>>>>>> might
> >> >>>>>>>>>>> be
> >> >>>>>>>>>>>>>> helpful
> >> >>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
> >> >>>>>>>>>>> understand
> >> >>>>>>>>>>>>> that.
> >> >>>>>>>>>>>>>>> In
> >> >>>>>>>>>>>>>>>>> my
> >> >>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
> >> >>>>>>>> special
> >> >>>>>>>>>>>>>> treatment
> >> >>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>> required.
> >> >>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
> >> >>>>>> sense,
> >> >>>>>>> I
> >> >>>>>>>>> will
> >> >>>>>>>>>>>> add
> >> >>>>>>>>>>>>> it
> >> >>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> FLIP.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> @Yang
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
> >> >>>>>> cannot
> >> >>>>>>>>>>> support
> >> >>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Lijie
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
> >> >>>>>>>> 21:09写道：
> >> >>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
> >> >>>>>>>>>>> operations.
> >> >>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
> >> >>>>>>>>>>> Reliability
> >> >>>>>>>>>>>>>>> Engineer)
> >> >>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
> >> >>>>>> nodes,
> >> >>>>>>>>> which
> >> >>>>>>>>>>>>> could
> >> >>>>>>>>>>>>>>> take
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
> >> >>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
> >> >>>>>> taint
> >> >>>>>>>> in
> >> >>>>>>>>>>> K8s
> >> >>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>> prevent
> >> >>>>>>>>>>>>>>>>>> new
> >> >>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
> >> >>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
> >> >>>>>> pods
> >> >>>>>>>>>>> running
> >> >>>>>>>>>>>> on
> >> >>>>>>>>>>>>>>> these
> >> >>>>>>>>>>>>>>>>>> nodes.
> >> >>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
> >> >>>>>>> So
> >> >>>>>>>> I
> >> >>>>>>>>> am
> >> >>>>>>>>>>>>> afraid
> >> >>>>>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> >> >>>>>>>> could
> >> >>>>>>>>>>> not
> >> >>>>>>>>>>>>> see
> >> >>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> >> >>>>>>>>>>>> approach(via
> >> >>>>>>>>>>>>>>> *yarn
> >> >>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> >> >>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
> >> >>>>>>> which
> >> >>>>>>>>>>> could
> >> >>>>>>>>>>>>>> expose
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
> >> >>>>>>> them
> >> >>>>>>>> if
> >> >>>>>>>>>>>>> enabled
> >> >>>>>>>>>>>>>>>>>> explicitly.
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>>>> Yang
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> >> >>>>>>>> 16:36写道：
> >> >>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
> >> >>>>>> and
> >> >>>>>>>>>>> somewhat
> >> >>>>>>>>>>>>>>> related
> >> >>>>>>>>>>>>>>>>>> to the
> >> >>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
> >> >>>>>>>>> Flink.
> >> >>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
> >> >>>>>>>> FLIP.
> >> >>>>>>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
> >> >>>>>>>>>>>>>>>>>>>>>      b) Collect the information of the unstable
> >> >>>>>>> node
> >> >>>>>>>>> and
> >> >>>>>>>>>>>>> form a
> >> >>>>>>>>>>>>>>>>>> blocklist.
> >> >>>>>>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> My two cents:
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
> >> >>>>>>> should
> >> >>>>>>>>> have
> >> >>>>>>>>>>>> c).
> >> >>>>>>>>>>>>> It
> >> >>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>> not only
> >> >>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
> >> >>>>>> handy
> >> >>>>>>>> for
> >> >>>>>>>>>>> some
> >> >>>>>>>>>>>>>>> planned
> >> >>>>>>>>>>>>>>>>>>>>> maintenance.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
> >> >>>>>> who
> >> >>>>>>>>>>> should be
> >> >>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> brain
> >> >>>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
> >> >>>>>>>>> largely
> >> >>>>>>>>>>>>> depends
> >> >>>>>>>>>>>>>>> on
> >> >>>>>>>>>>>>>>>>>> who we
> >> >>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
> >> >>>>>>>> different
> >> >>>>>>>>>>>>> answers.
> >> >>>>>>>>>>>>>>> For
> >> >>>>>>>>>>>>>>>>>> people
> >> >>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
> >> >>>>>>>>>>> service,
> >> >>>>>>>>>>>> let
> >> >>>>>>>>>>>>>>> Flink
> >> >>>>>>>>>>>>>>>>>> do just
> >> >>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
> >> >>>>>>>> Flink
> >> >>>>>>>>>>> would
> >> >>>>>>>>>>>>> be
> >> >>>>>>>>>>>>>>> one
> >> >>>>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
> >> >>>>>> it
> >> >>>>>>> to
> >> >>>>>>>>>>> that
> >> >>>>>>>>>>>>>>> service,
> >> >>>>>>>>>>>>>>>>>> and then
> >> >>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
> >> >>>>>>>>>>> problematic
> >> >>>>>>>>>>>>>>> nodes.
> >> >>>>>>>>>>>>>>>>> On
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
> >> >>>>>>>> service,
> >> >>>>>>>>>>>> simply
> >> >>>>>>>>>>>>>>>>> letting
> >> >>>>>>>>>>>>>>>>>> Flink
> >> >>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
> >> >>>>>> nodes
> >> >>>>>>>>> might
> >> >>>>>>>>>>> be
> >> >>>>>>>>>>>>>>> desired
> >> >>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
> >> >>>>>>>> pluggable /
> >> >>>>>>>>>>>>>> optional.
> >> >>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
> >> >>>>>>> have
> >> >>>>>>>>>>>> something
> >> >>>>>>>>>>>>>>>>>> pluggable
> >> >>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
> >> >>>>>> unstable
> >> >>>>>>>>> nodes
> >> >>>>>>>>>>>>>>> actively.
> >> >>>>>>>>>>>>>>>>> (A
> >> >>>>>>>>>>>>>>>>>> more
> >> >>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
> >> >>>>>>> which
> >> >>>>>>>>>>> can be
> >> >>>>>>>>>>>>>> used
> >> >>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> report
> >> >>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
> >> >>>>>>> the
> >> >>>>>>>>>>> scope
> >> >>>>>>>>>>>>>>> relevant
> >> >>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
> >> >>>>>> OK
> >> >>>>>>> to
> >> >>>>>>>>>>> have a
> >> >>>>>>>>>>>>>>> default
> >> >>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
> >> >>>>>>> Flink
> >> >>>>>>>> to
> >> >>>>>>>>>>> take
> >> >>>>>>>>>>>>>>> action
> >> >>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> block
> >> >>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
> >> >>>>>>>>> timeout.
> >> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> >> >>>>>>>>>>>>>> ro.v.boyko@gmail.com
> >> >>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
> >> >>>>>>>>>>> development.
> >> >>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
> >> >>>>>>> we
> >> >>>>>>>>>>> should
> >> >>>>>>>>>>>>>>> consider
> >> >>>>>>>>>>>>>>>>>> to move
> >> >>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
> >> >>>>>> nodes/TMs
> >> >>>>>>>>> from
> >> >>>>>>>>>>>>> active
> >> >>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
> >> >>>>>>>> storing
> >> >>>>>>>>>>>> inside
> >> >>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
> >> >>>>>>>> organizations
> >> >>>>>>>>>>>> don't
> >> >>>>>>>>>>>>>> want
> >> >>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> expose
> >> >>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
> >> >>>>>>>> it's
> >> >>>>>>>>>>>>> necessary
> >> >>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> have the
> >> >>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
> >> >>>>>>> for
> >> >>>>>>>>>>>> avoiding
> >> >>>>>>>>>>>>>>>>> incorrect
> >> >>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
> >> >>>>>>>> would
> >> >>>>>>>>> be
> >> >>>>>>>>>>>>> better
> >> >>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> extend
> >> >>>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
> >> >>>>>> *endTimestamp*
> >> >>>>>>>>> field
> >> >>>>>>>>>>> and
> >> >>>>>>>>>>>>>> fill
> >> >>>>>>>>>>>>>>> it
> >> >>>>>>>>>>>>>>>>>> at the
> >> >>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
> >> >>>>>>> to:
> >> >>>>>>>>>>>>>>>>>>>>>>    -
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>    Provide the ability to users to setup the
> >> >>>>>>> exact
> >> >>>>>>>>>>> time
> >> >>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>> blocklist end
> >> >>>>>>>>>>>>>>>>>>>>>>    through RestAPI
> >> >>>>>>>>>>>>>>>>>>>>>>    -
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>    Not being tied to a single value of
> >> >>>>>>>>>>>>>>>>>>>>>>    *cluster.resource-blacklist.item.timeout*
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> >> >>>>>> <
> >> >>>>>>>>>>>>>>>>> chesnay@apache.org>
> >> >>>>>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
> >> >>>>>>>> lines
> >> >>>>>>>>> a
> >> >>>>>>>>>>>> bit.
> >> >>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
> >> >>>>>>>>> auto-detection
> >> >>>>>>>>>>>> and
> >> >>>>>>>>>>>>>> only
> >> >>>>>>>>>>>>>>>>>> have an
> >> >>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
> >> >>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
> >> >>>>>> this
> >> >>>>>>>>> sounds
> >> >>>>>>>>>>>> yet
> >> >>>>>>>>>>>>>>> again
> >> >>>>>>>>>>>>>>>>>> like one
> >> >>>>>>>>>>>>>>>>>>>>>>> of those
> >> >>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
> >> >>>>>> just
> >> >>>>>>>>> right.
> >> >>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
> >> >>>>>>>>> information
> >> >>>>>>>>>>>>> after
> >> >>>>>>>>>>>>>>> all.
> >> >>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
> >> >>>>>>>> complexity
> >> >>>>>>>>>>> of
> >> >>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>> proposal.
> >> >>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
> >> >>>>>>> between
> >> >>>>>>>>>>>>>> components;
> >> >>>>>>>>>>>>>>>>>>>>> adjustments
> >> >>>>>>>>>>>>>>>>>>>>>>> to the
> >> >>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
> >> >>>>>>> resource
> >> >>>>>>>>>>>>> management
> >> >>>>>>>>>>>>>>>>>> perspective,
> >> >>>>>>>>>>>>>>>>>>>>>>> then there
> >> >>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
> >> >>>>>> protocol
> >> >>>>>>>> at
> >> >>>>>>>>>>> all
> >> >>>>>>>>>>>> (or
> >> >>>>>>>>>>>>>> in
> >> >>>>>>>>>>>>>>>>> fact
> >> >>>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
> >> >>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
> >> >>>>>>> around
> >> >>>>>>>>>>> blocked
> >> >>>>>>>>>>>>> TMs
> >> >>>>>>>>>>>>>>> in
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> first
> >> >>>>>>>>>>>>>>>>>>>>>>> place.
> >> >>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
> >> >>>>>>> down
> >> >>>>>>>>>>> after
> >> >>>>>>>>>>>> a
> >> >>>>>>>>>>>>>>> while
> >> >>>>>>>>>>>>>>>>> by
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> RM
> >> >>>>>>>>>>>>>>>>>>>>>>> because of
> >> >>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
> >> >>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
> >> >>>>>>>> blocking
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>>> process
> >> >>>>>>>>>>>>>>>>> or
> >> >>>>>>>>>>>>>>>>>> node)
> >> >>>>>>>>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
> >> >>>>>>>>>>> process(es).
> >> >>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
> >> >>>>>>>> back
> >> >>>>>>>>>>> up.
> >> >>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
> >> >>>>>> the
> >> >>>>>>>> same
> >> >>>>>>>>>>> line
> >> >>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>> thinking to
> >> >>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
> >> >>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
> >> >>>>>> process/node
> >> >>>>>>>>>>> manually
> >> >>>>>>>>>>>>>>> should
> >> >>>>>>>>>>>>>>>>> be
> >> >>>>>>>>>>>>>>>>>> a
> >> >>>>>>>>>>>>>>>>>>>>> core
> >> >>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
> >> >>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
> >> >>>>>>>> understand
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>>> problem
> >> >>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>> I see
> >> >>>>>>>>>>>>>>>>>>>>>>> value
> >> >>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
> >> >>>>>>> blocklisting. I
> >> >>>>>>>>> do
> >> >>>>>>>>>>>> have
> >> >>>>>>>>>>>>>> some
> >> >>>>>>>>>>>>>>>>>> concerns
> >> >>>>>>>>>>>>>>>>>>>>>> with
> >> >>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
> >> >>>>>>> blocked
> >> >>>>>>>>>>>>> resources.
> >> >>>>>>>>>>>>>> I
> >> >>>>>>>>>>>>>>>>> have
> >> >>>>>>>>>>>>>>>>>> two
> >> >>>>>>>>>>>>>>>>>>>>>>>> concerns;
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
> >> >>>>>>>>> separation
> >> >>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>> concerns,
> >> >>>>>>>>>>>>>>>>>>>>> meaning
> >> >>>>>>>>>>>>>>>>>>>>>>> that
> >> >>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
> >> >>>>>>>> managing a
> >> >>>>>>>>>>>>> cluster
> >> >>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>> there's
> >> >>>>>>>>>>>>>>>>>>>>> a
> >> >>>>>>>>>>>>>>>>>>>>>>> user
> >> >>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
> >> >>>>>>>>>>> introduction of
> >> >>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>> mechanism,
> >> >>>>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
> >> >>>>>>>>> responsibility
> >> >>>>>>>>>>> of
> >> >>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> first
> >> >>>>>>>>>>>>>>>>>> group.
> >> >>>>>>>>>>>>>>>>>>>>>> So
> >> >>>>>>>>>>>>>>>>>>>>>>> it
> >> >>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
> >> >>>>>>>> group
> >> >>>>>>>>>>>> blocks
> >> >>>>>>>>>>>>>>>>>> something,
> >> >>>>>>>>>>>>>>>>>>>>>> which
> >> >>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
> >> >>>>>>> paging
> >> >>>>>>>>>>>>> mechanism
> >> >>>>>>>>>>>>>>>>>> triggering
> >> >>>>>>>>>>>>>>>>>>>>>> etc)
> >> >>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
> >> >>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
> >> >>>>>> go
> >> >>>>>>>>>>> through
> >> >>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> process
> >> >>>>>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>>>>>>> manually
> >> >>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
> >> >>>>>> it
> >> >>>>>>>>>>> should
> >> >>>>>>>>>>>>> be? I
> >> >>>>>>>>>>>>>>> do
> >> >>>>>>>>>>>>>>>>>> think
> >> >>>>>>>>>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
> >> >>>>>>> make
> >> >>>>>>>>>>> sense
> >> >>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>> introduce
> >> >>>>>>>>>>>>>>>>>>>>> such
> >> >>>>>>>>>>>>>>>>>>>>>> a
> >> >>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
> >> >>>>>>> really
> >> >>>>>>>>>>> small
> >> >>>>>>>>>>>>> user
> >> >>>>>>>>>>>>>>>>> group
> >> >>>>>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>>>>>> Flink?
> >> >>>>>>>>>>>>>>>>>>>>>>> We
> >> >>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
> >> >>>>>>> such
> >> >>>>>>>> a
> >> >>>>>>>>>>>>> feature.
> >> >>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
> >> >>>>>>> I'm
> >> >>>>>>>>>>>> leaning
> >> >>>>>>>>>>>>>>>>> towards
> >> >>>>>>>>>>>>>>>>>> not
> >> >>>>>>>>>>>>>>>>>>>>>>> exposing
> >> >>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
> >> >>>>>>>> available
> >> >>>>>>>>>>>>> strictly
> >> >>>>>>>>>>>>>>> for
> >> >>>>>>>>>>>>>>>>>> cluster
> >> >>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
> >> >>>>>> their
> >> >>>>>>>>>>>>>>>>>> paging/metrics/logging
> >> >>>>>>>>>>>>>>>>>>>>>>> system
> >> >>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
> >> >>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> >> >>>>>>>>>>>>>>> karmagyz@gmail.com
> >> >>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
> >> >>>>>>> some
> >> >>>>>>>>>>> cents
> >> >>>>>>>>>>>>> here:
> >> >>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
> >> >>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>> user?
> >> >>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
> >> >>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
> >> >>>>>>>>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
> >> >>>>>>> the
> >> >>>>>>>>>>>> timeout
> >> >>>>>>>>>>>>>>>>>> mechanism is
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
> >> >>>>>> it
> >> >>>>>>>>> will
> >> >>>>>>>>>>> be
> >> >>>>>>>>>>>>> very
> >> >>>>>>>>>>>>>>>>>> confusing
> >> >>>>>>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
> >> >>>>>> exception
> >> >>>>>>>> of a
> >> >>>>>>>>>>> task
> >> >>>>>>>>>>>>>>>>> manager to
> >> >>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
> >> >>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
> >> >>>>>> fail
> >> >>>>>>>> in
> >> >>>>>>>>>>> case
> >> >>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> target
> >> >>>>>>>>>>>>>>>>>> task
> >> >>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
> >> >>>>>> I
> >> >>>>>>>>> don't
> >> >>>>>>>>>>>> mean
> >> >>>>>>>>>>>>> we
> >> >>>>>>>>>>>>>>>>> need
> >> >>>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>> cover
> >> >>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
> >> >>>>>> also
> >> >>>>>>>>> open a
> >> >>>>>>>>>>>>>>>>>> `notifyException`
> >> >>>>>>>>>>>>>>>>>>>>> in
> >> >>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
> >> >>>>>>>>>>> ResourceManager,
> >> >>>>>>>>>>>>> will
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>> slot of
> >> >>>>>>>>>>>>>>>>>>>>> a
> >> >>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
> >> >>>>>>> released
> >> >>>>>>>>> and
> >> >>>>>>>>>>>>>>> allocated?
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
> >> >>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> >> >>>>>> Wang
> >> >>>>>>> <
> >> >>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
> >> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
> >> >>>>>> response
> >> >>>>>>>>> your 4
> >> >>>>>>>>>>>>>> remarks:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
> >> >>>>>>>>> controversy. I
> >> >>>>>>>>>>>>> think
> >> >>>>>>>>>>>>>>>>>> “BlockList”
> >> >>>>>>>>>>>>>>>>>>>>> is
> >> >>>>>>>>>>>>>>>>>>>>>>>>> good
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
> >> >>>>>>>> good
> >> >>>>>>>>>>> idea.
> >> >>>>>>>>>>>>>> Based
> >> >>>>>>>>>>>>>>>>> on
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>> above, I
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >> >>>>>> <host>/blocklist/node/<identifier>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >> >>>>>>>>> <host>/blocklist/taskmanager/<identifier>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
> >> >>>>>>>> means
> >> >>>>>>>>>>> that
> >> >>>>>>>>>>>>> all
> >> >>>>>>>>>>>>>>>>> task
> >> >>>>>>>>>>>>>>>>>>>>> managers
> >> >>>>>>>>>>>>>>>>>>>>>>> on
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
> >> >>>>>>>> these
> >> >>>>>>>>>>> TMs
> >> >>>>>>>>>>>> are
> >> >>>>>>>>>>>>>> not
> >> >>>>>>>>>>>>>>>>>>>>> available.
> >> >>>>>>>>>>>>>>>>>>>>>>> This
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
> >> >>>>>> these
> >> >>>>>>>> TMs
> >> >>>>>>>>>>> are
> >> >>>>>>>>>>>>> not
> >> >>>>>>>>>>>>>>>>> really
> >> >>>>>>>>>>>>>>>>>> lost,
> >> >>>>>>>>>>>>>>>>>>>>>>> they
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
> >> >>>>>> are
> >> >>>>>>>>> still
> >> >>>>>>>>>>>>>>> registered
> >> >>>>>>>>>>>>>>>>>> in this
> >> >>>>>>>>>>>>>>>>>>>>>>> flink
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
> >> >>>>>> once
> >> >>>>>>>> the
> >> >>>>>>>>>>>>>>>>> corresponding
> >> >>>>>>>>>>>>>>>>>>>>>> blocklist
> >> >>>>>>>>>>>>>>>>>>>>>>>>> item
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
> >> >>>>>>>>>>>>>>> active/non-active
> >> >>>>>>>>>>>>>>>>>>>>> clusters.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
> >> >>>>>> TMs
> >> >>>>>>>> may
> >> >>>>>>>>> be
> >> >>>>>>>>>>>>>> released
> >> >>>>>>>>>>>>>>>>> due
> >> >>>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>> idle
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
> >> >>>>>> keep
> >> >>>>>>>> it.
> >> >>>>>>>>>>> The
> >> >>>>>>>>>>>>>> reasons
> >> >>>>>>>>>>>>>>>>> are
> >> >>>>>>>>>>>>>>>>>> as
> >> >>>>>>>>>>>>>>>>>>>>>>>>> following:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
> >> >>>>>>> adding
> >> >>>>>>>> or
> >> >>>>>>>>>>>>> removing
> >> >>>>>>>>>>>>>>>>> items
> >> >>>>>>>>>>>>>>>>>> via
> >> >>>>>>>>>>>>>>>>>>>>>> REST
> >> >>>>>>>>>>>>>>>>>>>>>>>>> API,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
> >> >>>>>> it
> >> >>>>>>>> to
> >> >>>>>>>>>>>>>>>>> Long.MAX_VALUE .
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
> >> >>>>>> a
> >> >>>>>>>>>>> period of
> >> >>>>>>>>>>>>>> time
> >> >>>>>>>>>>>>>>>>>> (such as
> >> >>>>>>>>>>>>>>>>>>>>>>> machine
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
> >> >>>>>> prefer
> >> >>>>>>>> that
> >> >>>>>>>>>>>> Flink
> >> >>>>>>>>>>>>>> can
> >> >>>>>>>>>>>>>>> do
> >> >>>>>>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
> >> >>>>>>> user
> >> >>>>>>>> to
> >> >>>>>>>>>>> do
> >> >>>>>>>>>>>> it
> >> >>>>>>>>>>>>>>>>> manually.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
> >> >>>>>>>>>>> 于2022年4月27日周三
> >> >>>>>>>>>>>>>>>>> 19:23写道：
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
> >> >>>>>> only
> >> >>>>>>>>>>> support
> >> >>>>>>>>>>>>>>> manually
> >> >>>>>>>>>>>>>>>>>> blocking
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
> >> >>>>>>> different
> >> >>>>>>>>>>>>> strategies
> >> >>>>>>>>>>>>>>> can
> >> >>>>>>>>>>>>>>>>>> also be
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
> >> >>>>>>>> "bla.cklist"
> >> >>>>>>>>>>> due
> >> >>>>>>>>>>>> to
> >> >>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>> controversy
> >> >>>>>>>>>>>>>>>>>>>>>>>>> around
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
> >> >>>>>>> Ticket
> >> >>>>>>>>>>> about
> >> >>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>> topic a
> >> >>>>>>>>>>>>>>>>>>>>> while
> >> >>>>>>>>>>>>>>>>>>>>>>>>> back
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
> >> >>>>>>>> avoid
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>> term
> >> >>>>>>>>>>>>>>>>>> blacklist &
> >> >>>>>>>>>>>>>>>>>>>>>>>>> whitelist
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
> >> >>>>>> "denylist"
> >> >>>>>>> or
> >> >>>>>>>>>>>>>>> "quarantined"
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
> >> >>>>>>> slightly
> >> >>>>>>>>>>>> different
> >> >>>>>>>>>>>>>>>>> design
> >> >>>>>>>>>>>>>>>>>> as
> >> >>>>>>>>>>>>>>>>>>>>> verbs
> >> >>>>>>>>>>>>>>>>>>>>>>>>> like
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
> >> >>>>>>> anti-pattern
> >> >>>>>>>>> for
> >> >>>>>>>>>>>> REST
> >> >>>>>>>>>>>>>>> APIs.
> >> >>>>>>>>>>>>>>>>>> POST
> >> >>>>>>>>>>>>>>>>>>>>> on a
> >> >>>>>>>>>>>>>>>>>>>>>>>>> list
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
> >> >>>>>>>> items.
> >> >>>>>>>>>>>> DELETE
> >> >>>>>>>>>>>>> on
> >> >>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>> individual
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >> >>>>>>>>> <host>/quarantine/items/<itemidentifier>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
> >> >>>>>>>>> taskmanagers
> >> >>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>> nodes
> >> >>>>>>>>>>>>>>>>> in
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>> REST
> >> >>>>>>>>>>>>>>>>>>>>>>>>> API
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
> >> >>>>>>> opinion
> >> >>>>>>>>> on
> >> >>>>>>>>>>>> this?
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >> >>>>>>> <host>/quarantine/nodes/<identifier>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >> >>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
> >> >>>>>>>>>>> non-active
> >> >>>>>>>>>>>>>>>>> resource
> >> >>>>>>>>>>>>>>>>>>>>>> managers,
> >> >>>>>>>>>>>>>>>>>>>>>>>>> i.e.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
> >> >>>>>>>>>>> minimal,
> >> >>>>>>>>>>>> do
> >> >>>>>>>>>>>>> we
> >> >>>>>>>>>>>>>>>>> need
> >> >>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>> timeout
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
> >> >>>>>>>> manually
> >> >>>>>>>>>>> we
> >> >>>>>>>>>>>>> could
> >> >>>>>>>>>>>>>>>>>> delegate
> >> >>>>>>>>>>>>>>>>>>>>> this
> >> >>>>>>>>>>>>>>>>>>>>>>>>> to the
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
> >> >>>>>>>>> behavior
> >> >>>>>>>>>>>>> would
> >> >>>>>>>>>>>>>>>>> better
> >> >>>>>>>>>>>>>>>>>> fit
> >> >>>>>>>>>>>>>>>>>>>>>> into
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>
> >> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
> >> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> >> >>>>>>> schrieb
> >> >>>>>>>>>>> Lijie
> >> >>>>>>>>>>>>> Wang
> >> >>>>>>>>>>>>>> <
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
> >> >>>>>>>> cluster
> >> >>>>>>>>>>> node
> >> >>>>>>>>>>>>>>> issues
> >> >>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> disk
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
> >> >>>>>>>> abnormalities).
> >> >>>>>>>>>>>> Flink
> >> >>>>>>>>>>>>>> will
> >> >>>>>>>>>>>>>>>>>> take care
> >> >>>>>>>>>>>>>>>>>>>>>> of
> >> >>>>>>>>>>>>>>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
> >> >>>>>>> However,
> >> >>>>>>>>> due
> >> >>>>>>>>>>> to
> >> >>>>>>>>>>>>> data
> >> >>>>>>>>>>>>>>>>>> locality
> >> >>>>>>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> limited
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
> >> >>>>>> likely
> >> >>>>>>>> to
> >> >>>>>>>>> be
> >> >>>>>>>>>>>>>>> redeployed
> >> >>>>>>>>>>>>>>>>>> to the
> >> >>>>>>>>>>>>>>>>>>>>>> same
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
> >> >>>>>>>> task
> >> >>>>>>>>>>>>>>> abnormalities
> >> >>>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>>> affect
> >> >>>>>>>>>>>>>>>>>>>>>>>>> job
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
> >> >>>>>> manually
> >> >>>>>>>>>>> identify
> >> >>>>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>>> problematic
> >> >>>>>>>>>>>>>>>>>>>>>>>>> node and
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
> >> >>>>>>> But
> >> >>>>>>>>>>> this
> >> >>>>>>>>>>>>>>> approach
> >> >>>>>>>>>>>>>>>>> has
> >> >>>>>>>>>>>>>>>>>>>>>>>>> following
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
> >> >>>>>> heavy
> >> >>>>>>>>>>> process.
> >> >>>>>>>>>>>>>> Users
> >> >>>>>>>>>>>>>>>>> may
> >> >>>>>>>>>>>>>>>>>> need
> >> >>>>>>>>>>>>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> contact
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
> >> >>>>>>>>> operation
> >> >>>>>>>>>>> can
> >> >>>>>>>>>>>>>> even
> >> >>>>>>>>>>>>>>> be
> >> >>>>>>>>>>>>>>>>>>>>> dangerous
> >> >>>>>>>>>>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> not
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
> >> >>>>>>>>> events.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
> >> >>>>>> of
> >> >>>>>>>>>>> problems
> >> >>>>>>>>>>>>>>> manually
> >> >>>>>>>>>>>>>>>>>> would
> >> >>>>>>>>>>>>>>>>>>>>> be
> >> >>>>>>>>>>>>>>>>>>>>>>>>> slow
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
> >> >>>>>>>> propose
> >> >>>>>>>>>>> to
> >> >>>>>>>>>>>>>>>>> introduce a
> >> >>>>>>>>>>>>>>>>>>>>>> blacklist
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
> >> >>>>>>>>> problematic
> >> >>>>>>>>>>>>>>> resources.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
> >> >>>>>>> FLIP-224[1].
> >> >>>>>>>>>>> Looking
> >> >>>>>>>>>>>>>>> forward
> >> >>>>>>>>>>>>>>>>>> to your
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> --
> >> >>>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>>> Roman Boyko
> >> >>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >> >>>>>>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> https://twitter.com/snntrable
> >> >>>>>>>> https://github.com/knaufk
> >> >>>>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> https://twitter.com/snntrable
> >> >>>>>> https://github.com/knaufk
> >> >>>>>>
> >> >>>>
> >>
> >>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi everyone,

I've updated the FLIP according to Chesnay's feedback, changes as follows:
1. Change the GET result to a map.
2. Only left *endTimestamp* in ADD operation, and change the rest (from
POST) to PUT
3. Introduce a new slot pool implementation(BlocklistSlotPool) to
encapsulate blocklist related functions.
4. Remove *mainThread* from BlocklistTracker, instead provide a
*removeTimeoutItems* method to be called by outside components。

Best,
Lijie

Lijie Wang <wa...@gmail.com> 于2022年5月23日周一 22:51写道：

> Hi Chesnay,
>
> Thanks for feedback.
>
> 1. Regarding the TM/Node id. Do you mean special characters may appear in
> the rest URL?  Actually, I don't think so. The task manager id in REST API
> should be the *ResourceID* of taskmanager in Flink, there should be no
> special characters, and some existing REST APIs are already using it, e.g.
> GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
> id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
> should also have no special characters.
> 2. Regarding the GET query responses. I agree with you, it makes sense to
> change the GET result to a map.
>
> 3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
> cover everything, and the endTimestamp is a unix timestamp, there should be
> no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
> add rest api is add or update, PUT can cover this semantics.
>
> 4. Regarding the slot pool/manager. I don't think the current slotpool
> and slotmanager are able to support the MARK_BLOCKED(slots that are
> already allocated will not be affected) action. The reasons are as
> follows：
>
> a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
> from reserved(task assigned) to free(no task assigned), it is necessary to
> check whether the slot should be released immediately(it should be released
> immediately if the task manager is blocked, otherwise it may be allocated
> to other tasks). I think it cannot be supported without being aware of
> the blocklist information. Compared to the solution in FLIP, a more
> appropriate/prefered way may be: Introduce a new slot pool
> implementation for blocklist(may be named BlocklistSlotPool, it
> extends/wrapps the original slot pool), and implement the parts that need
> to be aware of the blocklist in this newly introduced slot pool, and the
> original slot pool basically does not need to change.
>
> b) for slot manager, with the MARK_BLOCKED action, there may be free but
> blocked slots in slot manager (the corresponding TMs cannot be
> released/unregistered because there are still running tasks on them).
> Therefore, we need to filter out the blocked slots when trying to fulfill
> the slot requirements. Therefore it also needs to know the blocklist information.
> A better way may be to abstract a resource allocation strategy, and make
> the blocklist as a special implementation, then pass the resource
> allocation strategy in when constructing the slot manager. Unfortunately,
> the data structures in the two existing slot manager
> implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
> quite different, it is not easy to abstract a common resource allocation
> strategy, so we prefer to keep the current way(i.e. pass the blocklist
> information directly into slot manager).
>
>
> 5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
> does not need to be aware of the executor, and the timeout actions can be
> done outside.
>
> Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：
>
>> I have a number of concerns:
>>
>> Is the id used for deleting an item the same sent in the initial request
>> (and not one returned by Flink)?
>> I'm very concerned that the tm/node id can contain special characters.
>>
>> The GET query should return a map, not a list of items. This makes it
>> easier to work with.
>>
>> The duality of endTimestamp and duration is also concerning.
>> If we conclude that endTimestamps can in fact work (and aren't utterly
>> unusable due to timezones),
>> then this should be able to cover everything and rid us of some
>> complexity w.r.t. POSTs to the same ID.
>> Additions would be a PUT, changes a PATCH, deletes a DELETE.
>>
>>
>> I also dislike how we're pushing more functionality into the
>> slotpool/-manager.
>> These components are complex enough as-is, and instead I'd propose a
>> separate component that interacts with the SlotPool/-Manager instead,
>> for example by removing the slots from that TM.
>> The reason being that from the slot-pool perspective it is irrelevant
>> whether a slot is gone because the TM was lost, or because it was blocked.
>>
>>
>> The BlocklistTracker shouldn't be modeled as component that is aware of
>> the concept of main threads.
>> It really has no business knowing that; all it needs is an executor for
>> handling timeouts/periodic actions,
>> and a way to interact with the JM/RM (which internally can handle the
>> scheduling into the main thread).
>>
>>
>> On 20/05/2022 07:20, Lijie Wang wrote:
>> > Hi everyone,
>> >
>> > I have started a vote for this FLIP [1]. Please cast your vote there or
>> ask
>> > additional questions here. [1]
>> > https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
>> >
>> > Best, Lijie
>> >
>> > Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
>> >
>> >> Hi Konstantin,
>> >>
>> >> We found that Flink REST URL does not support the format ":merge" ,
>> which
>> >> will be recognized as a parameter in the URL(due to start with a
>> colon).
>> >>
>> >> We will keep the previous way, i.e.
>> >>
>> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
>> >> and the "id" and "merge" flag are put into the request body.
>> >>
>> >> Best,
>> >> Lijie
>> >>
>> >> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
>> >>
>> >>> Hi Weihua,
>> >>> thanks for feedback.
>> >>>
>> >>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
>> >>> step towards auto-detection.
>> >>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
>> >>> exception of insufficient resources.
>> >>> 3. No. This FLIP won't change the WebUI. The blocklist information
>> can be
>> >>> obtained through REST API and metrics.
>> >>>
>> >>> Best,
>> >>> Lijie
>> >>>
>> >>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
>> >>>
>> >>>> Hi,
>> >>>> Thanks for creating this FLIP.
>> >>>> We have implemented an automatic blocklist detection mechanism
>> >>>> internally, which is indeed very effective for handling node
>> failures.
>> >>>> Due to the large number of nodes, although SREs already support
>> >>>> automatic offline failure nodes, the detection is not 100% accurate
>> and
>> >>>> there is some delay.
>> >>>> So the blocklist mechanism can make flink job recover from failure
>> much
>> >>>> faster.
>> >>>>
>> >>>> Here are some of my thoughts:
>> >>>> 1. In this FLIP, it needs users to locate machine failure manually,
>> >>>> there is a certain cost of use
>> >>>> 2. What happens if too many nodes are blocked, resulting in
>> insufficient
>> >>>> resources? Will there be a special Exception for the user?
>> >>>> 3. Will we display the blocklist information in the WebUI? The
>> blocklist
>> >>>> is for cluster level, and if multiple users share a cluster, some
>> users may
>> >>>> be a little confused when resources are not enough, or when
>> resources are
>> >>>> applied for more.
>> >>>>
>> >>>> Also, Looking forward to the next FLIP on auto-detection.
>> >>>>
>> >>>> Best,
>> >>>> Weihua
>> >>>>
>> >>>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
>> >>>>>
>> >>>>> Hi Konstantin,
>> >>>>>
>> >>>>> Maybe change it to the following:
>> >>>>>
>> >>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> >>>>> Merge is not allowed. If the {id} already exists, return error.
>> >>>> Otherwise,
>> >>>>> create a new item.
>> >>>>>
>> >>>>> 2. POST: http://
>> >>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> >>>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
>> create
>> >>>> a
>> >>>>> new item.
>> >>>>>
>> >>>>> WDYT?
>> >>>>>
>> >>>>> Best,
>> >>>>> Lijie
>> >>>>>
>> >>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
>> >>>>>
>> >>>>>> Hi Lijie,
>> >>>>>>
>> >>>>>> hm, maybe the following is more appropriate in that case
>> >>>>>>
>> >>>>>> POST: http://
>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> >>>>>>
>> >>>>>> Best,
>> >>>>>>
>> >>>>>> Konstantin
>> >>>>>>
>> >>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>> >>>>>> wangdachui9501@gmail.com>:
>> >>>>>>
>> >>>>>>> Hi Konstantin,
>> >>>>>>> thanks for your feedback.
>> >>>>>>>
>> >>>>>>>  From what I understand, PUT should be idempotent. However, we
>> have a
>> >>>>>>> *timeout* field in the request. This means that initiating the
>> same
>> >>>>>> request
>> >>>>>>> at two different times will lead to different resource status
>> >>>> (timestamps
>> >>>>>>> of the items to be removed will be different).
>> >>>>>>>
>> >>>>>>> Should we use PUT in this case? WDYT?
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Lijie
>> >>>>>>>
>> >>>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>> >>>>>>>
>> >>>>>>>> Hi Lijie,
>> >>>>>>>>
>> >>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a
>> PUT
>> >>>> on
>> >>>>>> the
>> >>>>>>>> resource?
>> >>>>>>>>
>> >>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>>
>> >>>>>>>> Konstantin
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>> >>>>>>>> wangdachui9501@gmail.com>:
>> >>>>>>>>
>> >>>>>>>>> Hi everyone,
>> >>>>>>>>>
>> >>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
>> >>>> made
>> >>>>>>> the
>> >>>>>>>>> following changes on REST API:
>> >>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
>> choose
>> >>>>>>> one.
>> >>>>>>>> If
>> >>>>>>>>> both are specified, will return error.
>> >>>>>>>>> 2.  If the specified item is already there, the *ADD* operation
>> has
>> >>>>>> two
>> >>>>>>>>> behaviors:  *return error*(default value) or *merge/update*,
>> and we
>> >>>>>>> add a
>> >>>>>>>>> flag to the request body to control it. You can find more
>> details
>> >>>>>>> "Public
>> >>>>>>>>> Interface" section.
>> >>>>>>>>>
>> >>>>>>>>> If there is no more feedback, we will start the vote thread next
>> >>>>>> week.
>> >>>>>>>>> Best,
>> >>>>>>>>> Lijie
>> >>>>>>>>>
>> >>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>> >>>>>>>>>
>> >>>>>>>>>> Hi Becket Qin,
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for your suggestions.  I have moved the description of
>> >>>>>>>>>> configurations, metrics and REST API into "Public Interface"
>> >>>>>> section,
>> >>>>>>>> and
>> >>>>>>>>>> made a few updates according to your suggestion.  And in this
>> >>>> FLIP,
>> >>>>>>>> there
>> >>>>>>>>>> no public java Interfaces or pluggables that users need to
>> >>>>>> implement
>> >>>>>>> by
>> >>>>>>>>>> themselves.
>> >>>>>>>>>>
>> >>>>>>>>>> Answers for you questions:
>> >>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>> >>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>> >>>>>> items
>> >>>>>>>> can
>> >>>>>>>>>> only be added through the REST API, so these 2 action are
>> >>>> mentioned
>> >>>>>>> in
>> >>>>>>>>> the
>> >>>>>>>>>> REST API part (The REST API part has beed moved to public
>> >>>> interface
>> >>>>>>>> now).
>> >>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to
>> String,
>> >>>>>> and
>> >>>>>>>>> allow
>> >>>>>>>>>> users to specify it via REST API.
>> >>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>> >>>>>> above,
>> >>>>>>>> we
>> >>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
>> >>>> ADD
>> >>>>>>>> REST
>> >>>>>>>>>> API to specify when to remove the blocked item. These 2 fields
>> are
>> >>>>>>>>>> optional, if neither is specified, it means that the blocked
>> item
>> >>>>>> is
>> >>>>>>>>>> permanent and will not be removed. If both are specified, the
>> >>>>>> minimum
>> >>>>>>>> of
>> >>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
>> the
>> >>>>>>> time
>> >>>>>>>> to
>> >>>>>>>>>> remove the blocked item. To keep the configurations more
>> minimal,
>> >>>>>> we
>> >>>>>>>> have
>> >>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
>> >>>> configuration
>> >>>>>>>>>> option.
>> >>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
>> >>>>>>> already
>> >>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
>> >>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is
>> maintained
>> >>>>>> in
>> >>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>> >>>>>>>>> abstracted
>> >>>>>>>>>> to other interfaces) will be propagated to different
>> components.
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Lijie
>> >>>>>>>>>>
>> >>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Lijie,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for updating the FLIP. It looks like the public
>> interface
>> >>>>>>>> section
>> >>>>>>>>>>> did not fully reflect all the user sensible behavior and API.
>> Can
>> >>>>>>> you
>> >>>>>>>>> put
>> >>>>>>>>>>> everything that users may be aware of there? That would
>> include
>> >>>>>> the
>> >>>>>>>> REST
>> >>>>>>>>>>> API, metrics, configurations, public java Interfaces or
>> >>>> pluggables
>> >>>>>>>> that
>> >>>>>>>>>>> users may see or implement by themselves, as well as a brief
>> >>>>>> summary
>> >>>>>>>> of
>> >>>>>>>>>>> the
>> >>>>>>>>>>> behavior of the public API.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Besides that, I have a few questions:
>> >>>>>>>>>>>
>> >>>>>>>>>>> 1. According to the conversation in the discussion thread, it
>> >>>>>> looks
>> >>>>>>>> like
>> >>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
>> >>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If
>> so,
>> >>>>>> can
>> >>>>>>>> you
>> >>>>>>>>>>> add
>> >>>>>>>>>>> that to the public interface as well?
>> >>>>>>>>>>>
>> >>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>> >>>>>>> Throwable
>> >>>>>>>>> and
>> >>>>>>>>>>> is not reflected in the REST API. Should that be included in
>> the
>> >>>>>>> query
>> >>>>>>>>>>> response? And should we change that field to be a String so
>> users
>> >>>>>>> may
>> >>>>>>>>>>> specify the cause via the REST API when they block some nodes
>> /
>> >>>>>> TMs?
>> >>>>>>>>>>> 3. Would it be useful to allow users to have different
>> timeouts
>> >>>>>> for
>> >>>>>>>>>>> different blocked items? So while there is a default timeout,
>> >>>>>> users
>> >>>>>>>> can
>> >>>>>>>>>>> also override it via the REST API when they block an entity.
>> >>>>>>>>>>>
>> >>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is
>> already
>> >>>>>>>> there,
>> >>>>>>>>>>> will the block item be overridden? For example, if the user
>> wants
>> >>>>>> to
>> >>>>>>>>>>> extend
>> >>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
>> >>>> command
>> >>>>>>>>> again?
>> >>>>>>>>>>> 5. I am not quite familiar with the details of this, but is
>> there
>> >>>>>> a
>> >>>>>>>>> source
>> >>>>>>>>>>> of truth for the blocked list? I think it might be good to
>> have a
>> >>>>>>>> single
>> >>>>>>>>>>> source of truth for the blocked list and just propagate that
>> list
>> >>>>>> to
>> >>>>>>>>>>> different components to take the action of actually blocking
>> the
>> >>>>>>>>> resource.
>> >>>>>>>>>>> Thanks,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>> >>>>>> wangdachui9501@gmail.com
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Based on the discussion in the mailing list, I updated the
>> FLIP
>> >>>>>>> doc,
>> >>>>>>>>> the
>> >>>>>>>>>>>> changes include:
>> >>>>>>>>>>>> 1. Changed the description of the motivation section to more
>> >>>>>>> clearly
>> >>>>>>>>>>>> describe the problem this FLIP is trying to solve.
>> >>>>>>>>>>>> 2. Only  *Manually* is supported.
>> >>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Best,
>> >>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi Lijie!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
>> >>>>>>>>>>> helpfulhere.”
>> >>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
>> >>>>>> In
>> >>>>>>>>>>>> myopinion,
>> >>>>>>>>>>>>> non-active and active are the same, and no special treatment
>> >>>>>>>>>>> isrequired.*
>> >>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
>> >>>>>>> were
>> >>>>>>>>>>> talking
>> >>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
>> >>>>>>>>>>> ResourceManager).
>> >>>>>>>>>>>>> And the original question was - how to handle the
>> blacklisted
>> >>>>>>>> nodes
>> >>>>>>>>>>> list
>> >>>>>>>>>>>> at
>> >>>>>>>>>>>>> the moment of leader change? Should we simply forget about
>> >>>>>> them
>> >>>>>>> or
>> >>>>>>>>>>> try to
>> >>>>>>>>>>>>> pre-save that list on the remote storage?
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
>> danrtsey.wy@gmail.com
>> >>>>>>>>> wrote:
>> >>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>> >>>>>> it
>> >>>>>>> is
>> >>>>>>>>>>> indeed
>> >>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
>> >>>>>>>> could
>> >>>>>>>>>>> not
>> >>>>>>>>>>>>>> support.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>> Yang
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五
>> 22:18写道：
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Jiangang
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
>> >>>>>> and
>> >>>>>>>>> mark a
>> >>>>>>>>>>>> node
>> >>>>>>>>>>>>>>> blocked?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
>> >>>>>> in
>> >>>>>>>> this
>> >>>>>>>>>>>> FLIP.
>> >>>>>>>>>>>>>> The
>> >>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
>> >>>>>>> FLIP
>> >>>>>>>> in
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
>> >>>>>>>>>>> correctness
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
>> >>>>>>> in
>> >>>>>>>>> the
>> >>>>>>>>>>>>> future.
>> >>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
>> >>>>>> node
>> >>>>>>>> and
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>> node
>> >>>>>>>>>>>>>>> should be marked blocked.
>> >>>>>>>>>>>>>>> By design, the blocklist information will be shared among
>> >>>>>>> all
>> >>>>>>>>> jobs
>> >>>>>>>>>>>> in a
>> >>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
>> >>>>>> with
>> >>>>>>>> RM.
>> >>>>>>>>>>>>>>> @Martijn
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
>> >>>>>>>>>>> MARK_BLOCKLISTED(Just
>> >>>>>>>>>>>>>> limits
>> >>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
>> >>>>>> on
>> >>>>>>>> it)
>> >>>>>>>>>>> is
>> >>>>>>>>>>>>> also
>> >>>>>>>>>>>>>>> important, and we think that external systems (*yarn
>> >>>>>> rmadmin
>> >>>>>>>> or
>> >>>>>>>>>>>> kubectl
>> >>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
>> >>>>>>>> only
>> >>>>>>>>>>>>>> *manually*.
>> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>> >>>>>>> indeed
>> >>>>>>>>>>> super
>> >>>>>>>>>>>>> hard
>> >>>>>>>>>>>>>>> to get right.
>> >>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
>> >>>>>>>>> concern.
>> >>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
>> >>>>>> in
>> >>>>>>>> this
>> >>>>>>>>>>>> FLIP,
>> >>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
>> >>>>>> FLIP
>> >>>>>>>>>>> provides
>> >>>>>>>>>>>> a
>> >>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>> >>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>> >>>>>>>>>>>>>>> the auto-detection may be done by external systems.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>> >>>>>>> 19:04写道：
>> >>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>> >>>>>> could
>> >>>>>>>> not
>> >>>>>>>>>>> see
>> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>> >>>>>>>>> approach(via
>> >>>>>>>>>>>> *yarn
>> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
>> >>>>>>>>>>> mechanisms
>> >>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>> will rarely work just right.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>> >>>>>>> indeed
>> >>>>>>>>>>> super
>> >>>>>>>>>>>>> hard
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>> get right.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Martijn
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>> >>>>>>>>>>>> liujiangangpeng@gmail.com
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
>> >>>>>>>>> decrease
>> >>>>>>>>>>>>> great
>> >>>>>>>>>>>>>>> work
>> >>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
>> >>>>>>>> inner
>> >>>>>>>>>>> flink
>> >>>>>>>>>>>>>>>>> version.
>> >>>>>>>>>>>>>>>>> Below is something that I care about:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>    1. For auto-detecting, I wonder how to make the
>> >>>>>>> strategy
>> >>>>>>>>> and
>> >>>>>>>>>>>>> mark a
>> >>>>>>>>>>>>>>>>> node
>> >>>>>>>>>>>>>>>>>    blocked? Sometimes the blocked node is hard to be
>> >>>>>>>>> detected,
>> >>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>> example,
>> >>>>>>>>>>>>>>>>>    the upper node or the down node will be blocked when
>> >>>>>>>>> network
>> >>>>>>>>>>>>>>>>> unreachable.
>> >>>>>>>>>>>>>>>>>    2. I see that the strategy is made in JobMaster
>> >>>>>> side.
>> >>>>>>>> How
>> >>>>>>>>>>> about
>> >>>>>>>>>>>>>>>>>    implementing the similar logic in resource manager?
>> >>>>>> In
>> >>>>>>>>>>> session
>> >>>>>>>>>>>>>> mode,
>> >>>>>>>>>>>>>>>>> multi
>> >>>>>>>>>>>>>>>>>    jobs can fail on the same bad node and the node
>> >>>>>> should
>> >>>>>>>> be
>> >>>>>>>>>>>> marked
>> >>>>>>>>>>>>>>>>> blocked.
>> >>>>>>>>>>>>>>>>>    If the job makes the strategy, the node may be not
>> >>>>>>>> marked
>> >>>>>>>>>>>> blocked
>> >>>>>>>>>>>>>> if
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>    fail times don't exceed the threshold.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thank you for all your feedback!
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
>> >>>>>> some
>> >>>>>>> of
>> >>>>>>>>> my
>> >>>>>>>>>>>>>> thoughts:
>> >>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
>> >>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>> >>>>>>>>>>>>>>>>>> The automatical way should be something built upon
>> >>>>>> the
>> >>>>>>>>>>> blocklist
>> >>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
>> >>>>>>>>>>>> configurable
>> >>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
>> >>>>>> decouple
>> >>>>>>>> it
>> >>>>>>>>> by
>> >>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
>> >>>>>>>> suggested,
>> >>>>>>>>>>> which
>> >>>>>>>>>>>>>> just
>> >>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
>> >>>>>>> detected.
>> >>>>>>>>>>>> However,
>> >>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
>> >>>>>> discussions
>> >>>>>>>> and
>> >>>>>>>>>>>>> feedback
>> >>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
>> >>>>>>> users
>> >>>>>>>>>>> have
>> >>>>>>>>>>>>>>> different
>> >>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
>> >>>>>>>> externally
>> >>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
>> >>>>>> way
>> >>>>>>> to
>> >>>>>>>>>>> avoid
>> >>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
>> >>>>>>>> users
>> >>>>>>>>>>> may
>> >>>>>>>>>>>>> just
>> >>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
>> >>>>>>>> kill
>> >>>>>>>>>>> all
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
>> >>>>>>>>> execution[1]
>> >>>>>>>>>>>>> which
>> >>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
>> >>>>>> mirror
>> >>>>>>>>>>> tasks on
>> >>>>>>>>>>>>>> slow
>> >>>>>>>>>>>>>>>>>> nodes.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>> Zhu
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>>>>>
>> >>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> >>>>>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>> >>>>>>>>> 15:56写道：
>> >>>>>>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Thanks for your feedback.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
>> >>>>>> here
>> >>>>>>>>>>> because
>> >>>>>>>>>>>> it
>> >>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
>> >>>>>> mechanism
>> >>>>>>>>>>> (perhaps
>> >>>>>>>>>>>> I
>> >>>>>>>>>>>>>>> should
>> >>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
>> >>>>>>>> FLIP:
>> >>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>> >>>>>>> node
>> >>>>>>>>> as
>> >>>>>>>>>>>>>> blocked.
>> >>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
>> >>>>>>>> task
>> >>>>>>>>>>>> manager
>> >>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>>> node.
>> >>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
>> >>>>>>>> affected.
>> >>>>>>>>> A
>> >>>>>>>>>>>>> typical
>> >>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
>> >>>>>>> In
>> >>>>>>>>> this
>> >>>>>>>>>>>>> case,
>> >>>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>> hope
>> >>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
>> >>>>>> the
>> >>>>>>>> hot
>> >>>>>>>>>>>>> machine,
>> >>>>>>>>>>>>>>> but
>> >>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
>> >>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>> >>>>>> task
>> >>>>>>>>>>> manager
>> >>>>>>>>>>>> or
>> >>>>>>>>>>>>>>> node
>> >>>>>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
>> >>>>>> tasks
>> >>>>>>>> will
>> >>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>> restarted on
>> >>>>>>>>>>>>>>>>>> non-blocked task managers.
>> >>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
>> >>>>>>> highlight
>> >>>>>>>>> the
>> >>>>>>>>>>>>> meaning
>> >>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
>> >>>>>> that.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
>> >>>>>> basically
>> >>>>>>>>> agree
>> >>>>>>>>>>>> with
>> >>>>>>>>>>>>>>>>> @Becket
>> >>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
>> >>>>>>>> users’
>> >>>>>>>>>>>>>> deployment
>> >>>>>>>>>>>>>>>>>> environments have a special external system that can
>> >>>>>>>>> perform
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>>> anomaly
>> >>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
>> >>>>>>>>>>> auto-detection
>> >>>>>>>>>>>>>>> doesn't
>> >>>>>>>>>>>>>>>>>> require much extra work on top of manual
>> >>>>>> specification.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> @Yangze
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
>> >>>>>>> the
>> >>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
>> >>>>>>>> to
>> >>>>>>>>>>>> users.
>> >>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
>> >>>>>>> separate
>> >>>>>>>>>>>> interface
>> >>>>>>>>>>>>>>>>> (maybe
>> >>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
>> >>>>>>>>>>>>> ResourceManagerBlocklistHandler
>> >>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>> implement it in the future.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> @Martijn
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
>> >>>>>> by
>> >>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>> operators.
>> >>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
>> >>>>>>>> according
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> my
>> >>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
>> >>>>>>>>> machine
>> >>>>>>>>>>>>>> problems
>> >>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
>> >>>>>>> contact
>> >>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>> operators
>> >>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
>> >>>>>>>>>>> machine is
>> >>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
>> >>>>>>> people
>> >>>>>>>>> who
>> >>>>>>>>>>>>> really
>> >>>>>>>>>>>>>>> need
>> >>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
>> >>>>>>>> performed
>> >>>>>>>>> by
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> @Chesnay
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
>> >>>>>>> for
>> >>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
>> >>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
>> >>>>>> when
>> >>>>>>>> the
>> >>>>>>>>>>> task
>> >>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>> >>>>>>>> release
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>> slot
>> >>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
>> >>>>>>> on
>> >>>>>>>>> it.
>> >>>>>>>>>>> If
>> >>>>>>>>>>>> we
>> >>>>>>>>>>>>>>> only
>> >>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
>> >>>>>> needs
>> >>>>>>> to
>> >>>>>>>>>>>> retrieve
>> >>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>> by
>> >>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
>> >>>>>>>> relatively
>> >>>>>>>>>>>> large,
>> >>>>>>>>>>>>>> so
>> >>>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
>> >>>>>> information
>> >>>>>>> on
>> >>>>>>>>>>> the JM
>> >>>>>>>>>>>>>> side
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> syncing them.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> @Роман
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>     a) “Probably storing inside Zookeeper/Configmap
>> >>>>>>>> might
>> >>>>>>>>>>> be
>> >>>>>>>>>>>>>> helpful
>> >>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
>> >>>>>>>>>>> understand
>> >>>>>>>>>>>>> that.
>> >>>>>>>>>>>>>>> In
>> >>>>>>>>>>>>>>>>> my
>> >>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
>> >>>>>>>> special
>> >>>>>>>>>>>>>> treatment
>> >>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>> required.
>> >>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
>> >>>>>> sense,
>> >>>>>>> I
>> >>>>>>>>> will
>> >>>>>>>>>>>> add
>> >>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> FLIP.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> @Yang
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
>> >>>>>> cannot
>> >>>>>>>>>>> support
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>> >>>>>>>> 21:09写道：
>> >>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>> >>>>>>>>>>> operations.
>> >>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
>> >>>>>>>>>>> Reliability
>> >>>>>>>>>>>>>>> Engineer)
>> >>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
>> >>>>>> nodes,
>> >>>>>>>>> which
>> >>>>>>>>>>>>> could
>> >>>>>>>>>>>>>>> take
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
>> >>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
>> >>>>>> taint
>> >>>>>>>> in
>> >>>>>>>>>>> K8s
>> >>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> prevent
>> >>>>>>>>>>>>>>>>>> new
>> >>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
>> >>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
>> >>>>>> pods
>> >>>>>>>>>>> running
>> >>>>>>>>>>>> on
>> >>>>>>>>>>>>>>> these
>> >>>>>>>>>>>>>>>>>> nodes.
>> >>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
>> >>>>>>> So
>> >>>>>>>> I
>> >>>>>>>>> am
>> >>>>>>>>>>>>> afraid
>> >>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>> >>>>>>>> could
>> >>>>>>>>>>> not
>> >>>>>>>>>>>>> see
>> >>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>> >>>>>>>>>>>> approach(via
>> >>>>>>>>>>>>>>> *yarn
>> >>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>> >>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
>> >>>>>>> which
>> >>>>>>>>>>> could
>> >>>>>>>>>>>>>> expose
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
>> >>>>>>> them
>> >>>>>>>> if
>> >>>>>>>>>>>>> enabled
>> >>>>>>>>>>>>>>>>>> explicitly.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>> Yang
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>> >>>>>>>> 16:36写道：
>> >>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
>> >>>>>> and
>> >>>>>>>>>>> somewhat
>> >>>>>>>>>>>>>>> related
>> >>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
>> >>>>>>>>> Flink.
>> >>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
>> >>>>>>>> FLIP.
>> >>>>>>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
>> >>>>>>>>>>>>>>>>>>>>>      b) Collect the information of the unstable
>> >>>>>>> node
>> >>>>>>>>> and
>> >>>>>>>>>>>>> form a
>> >>>>>>>>>>>>>>>>>> blocklist.
>> >>>>>>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> My two cents:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
>> >>>>>>> should
>> >>>>>>>>> have
>> >>>>>>>>>>>> c).
>> >>>>>>>>>>>>> It
>> >>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>> not only
>> >>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
>> >>>>>> handy
>> >>>>>>>> for
>> >>>>>>>>>>> some
>> >>>>>>>>>>>>>>> planned
>> >>>>>>>>>>>>>>>>>>>>> maintenance.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
>> >>>>>> who
>> >>>>>>>>>>> should be
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> brain
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
>> >>>>>>>>> largely
>> >>>>>>>>>>>>> depends
>> >>>>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>>>> who we
>> >>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
>> >>>>>>>> different
>> >>>>>>>>>>>>> answers.
>> >>>>>>>>>>>>>>> For
>> >>>>>>>>>>>>>>>>>> people
>> >>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
>> >>>>>>>>>>> service,
>> >>>>>>>>>>>> let
>> >>>>>>>>>>>>>>> Flink
>> >>>>>>>>>>>>>>>>>> do just
>> >>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
>> >>>>>>>> Flink
>> >>>>>>>>>>> would
>> >>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>> one
>> >>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
>> >>>>>> it
>> >>>>>>> to
>> >>>>>>>>>>> that
>> >>>>>>>>>>>>>>> service,
>> >>>>>>>>>>>>>>>>>> and then
>> >>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
>> >>>>>>>>>>> problematic
>> >>>>>>>>>>>>>>> nodes.
>> >>>>>>>>>>>>>>>>> On
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
>> >>>>>>>> service,
>> >>>>>>>>>>>> simply
>> >>>>>>>>>>>>>>>>> letting
>> >>>>>>>>>>>>>>>>>> Flink
>> >>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
>> >>>>>> nodes
>> >>>>>>>>> might
>> >>>>>>>>>>> be
>> >>>>>>>>>>>>>>> desired
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
>> >>>>>>>> pluggable /
>> >>>>>>>>>>>>>> optional.
>> >>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
>> >>>>>>> have
>> >>>>>>>>>>>> something
>> >>>>>>>>>>>>>>>>>> pluggable
>> >>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
>> >>>>>> unstable
>> >>>>>>>>> nodes
>> >>>>>>>>>>>>>>> actively.
>> >>>>>>>>>>>>>>>>> (A
>> >>>>>>>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
>> >>>>>>> which
>> >>>>>>>>>>> can be
>> >>>>>>>>>>>>>> used
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> report
>> >>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
>> >>>>>>> the
>> >>>>>>>>>>> scope
>> >>>>>>>>>>>>>>> relevant
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
>> >>>>>> OK
>> >>>>>>> to
>> >>>>>>>>>>> have a
>> >>>>>>>>>>>>>>> default
>> >>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
>> >>>>>>> Flink
>> >>>>>>>> to
>> >>>>>>>>>>> take
>> >>>>>>>>>>>>>>> action
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> block
>> >>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
>> >>>>>>>>> timeout.
>> >>>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>> >>>>>>>>>>>>>> ro.v.boyko@gmail.com
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
>> >>>>>>>>>>> development.
>> >>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
>> >>>>>>> we
>> >>>>>>>>>>> should
>> >>>>>>>>>>>>>>> consider
>> >>>>>>>>>>>>>>>>>> to move
>> >>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
>> >>>>>> nodes/TMs
>> >>>>>>>>> from
>> >>>>>>>>>>>>> active
>> >>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
>> >>>>>>>> storing
>> >>>>>>>>>>>> inside
>> >>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
>> >>>>>>>> organizations
>> >>>>>>>>>>>> don't
>> >>>>>>>>>>>>>> want
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> expose
>> >>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
>> >>>>>>>> it's
>> >>>>>>>>>>>>> necessary
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> have the
>> >>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
>> >>>>>>> for
>> >>>>>>>>>>>> avoiding
>> >>>>>>>>>>>>>>>>> incorrect
>> >>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>> >>>>>>>> would
>> >>>>>>>>> be
>> >>>>>>>>>>>>> better
>> >>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> extend
>> >>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>> >>>>>> *endTimestamp*
>> >>>>>>>>> field
>> >>>>>>>>>>> and
>> >>>>>>>>>>>>>> fill
>> >>>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>> at the
>> >>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>> >>>>>>> to:
>> >>>>>>>>>>>>>>>>>>>>>>    -
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>    Provide the ability to users to setup the
>> >>>>>>> exact
>> >>>>>>>>>>> time
>> >>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> blocklist end
>> >>>>>>>>>>>>>>>>>>>>>>    through RestAPI
>> >>>>>>>>>>>>>>>>>>>>>>    -
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>    Not being tied to a single value of
>> >>>>>>>>>>>>>>>>>>>>>>    *cluster.resource-blacklist.item.timeout*
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>> >>>>>> <
>> >>>>>>>>>>>>>>>>> chesnay@apache.org>
>> >>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>> >>>>>>>> lines
>> >>>>>>>>> a
>> >>>>>>>>>>>> bit.
>> >>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>> >>>>>>>>> auto-detection
>> >>>>>>>>>>>> and
>> >>>>>>>>>>>>>> only
>> >>>>>>>>>>>>>>>>>> have an
>> >>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
>> >>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>> >>>>>> this
>> >>>>>>>>> sounds
>> >>>>>>>>>>>> yet
>> >>>>>>>>>>>>>>> again
>> >>>>>>>>>>>>>>>>>> like one
>> >>>>>>>>>>>>>>>>>>>>>>> of those
>> >>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>> >>>>>> just
>> >>>>>>>>> right.
>> >>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
>> >>>>>>>>> information
>> >>>>>>>>>>>>> after
>> >>>>>>>>>>>>>>> all.
>> >>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>> >>>>>>>> complexity
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>> proposal.
>> >>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>> >>>>>>> between
>> >>>>>>>>>>>>>> components;
>> >>>>>>>>>>>>>>>>>>>>> adjustments
>> >>>>>>>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>> >>>>>>> resource
>> >>>>>>>>>>>>> management
>> >>>>>>>>>>>>>>>>>> perspective,
>> >>>>>>>>>>>>>>>>>>>>>>> then there
>> >>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>> >>>>>> protocol
>> >>>>>>>> at
>> >>>>>>>>>>> all
>> >>>>>>>>>>>> (or
>> >>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>> fact
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>> >>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>> >>>>>>> around
>> >>>>>>>>>>> blocked
>> >>>>>>>>>>>>> TMs
>> >>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> first
>> >>>>>>>>>>>>>>>>>>>>>>> place.
>> >>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>> >>>>>>> down
>> >>>>>>>>>>> after
>> >>>>>>>>>>>> a
>> >>>>>>>>>>>>>>> while
>> >>>>>>>>>>>>>>>>> by
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> RM
>> >>>>>>>>>>>>>>>>>>>>>>> because of
>> >>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>> >>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>> >>>>>>>> blocking
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>> process
>> >>>>>>>>>>>>>>>>> or
>> >>>>>>>>>>>>>>>>>> node)
>> >>>>>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>> >>>>>>>>>>> process(es).
>> >>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>> >>>>>>>> back
>> >>>>>>>>>>> up.
>> >>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>> >>>>>> the
>> >>>>>>>> same
>> >>>>>>>>>>> line
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> thinking to
>> >>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
>> >>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>> >>>>>> process/node
>> >>>>>>>>>>> manually
>> >>>>>>>>>>>>>>> should
>> >>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>> core
>> >>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>> >>>>>>>> understand
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>> problem
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> I see
>> >>>>>>>>>>>>>>>>>>>>>>> value
>> >>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
>> >>>>>>> blocklisting. I
>> >>>>>>>>> do
>> >>>>>>>>>>>> have
>> >>>>>>>>>>>>>> some
>> >>>>>>>>>>>>>>>>>> concerns
>> >>>>>>>>>>>>>>>>>>>>>> with
>> >>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>> >>>>>>> blocked
>> >>>>>>>>>>>>> resources.
>> >>>>>>>>>>>>>> I
>> >>>>>>>>>>>>>>>>> have
>> >>>>>>>>>>>>>>>>>> two
>> >>>>>>>>>>>>>>>>>>>>>>>> concerns;
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>> >>>>>>>>> separation
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>> concerns,
>> >>>>>>>>>>>>>>>>>>>>> meaning
>> >>>>>>>>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>> >>>>>>>> managing a
>> >>>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> there's
>> >>>>>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>> >>>>>>>>>>> introduction of
>> >>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>> mechanism,
>> >>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
>> >>>>>>>>> responsibility
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> first
>> >>>>>>>>>>>>>>>>>> group.
>> >>>>>>>>>>>>>>>>>>>>>> So
>> >>>>>>>>>>>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>> >>>>>>>> group
>> >>>>>>>>>>>> blocks
>> >>>>>>>>>>>>>>>>>> something,
>> >>>>>>>>>>>>>>>>>>>>>> which
>> >>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>> >>>>>>> paging
>> >>>>>>>>>>>>> mechanism
>> >>>>>>>>>>>>>>>>>> triggering
>> >>>>>>>>>>>>>>>>>>>>>> etc)
>> >>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
>> >>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>> >>>>>> go
>> >>>>>>>>>>> through
>> >>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> process
>> >>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>>>> manually
>> >>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>> >>>>>> it
>> >>>>>>>>>>> should
>> >>>>>>>>>>>>> be? I
>> >>>>>>>>>>>>>>> do
>> >>>>>>>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>> >>>>>>> make
>> >>>>>>>>>>> sense
>> >>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>> introduce
>> >>>>>>>>>>>>>>>>>>>>> such
>> >>>>>>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>> >>>>>>> really
>> >>>>>>>>>>> small
>> >>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>> group
>> >>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>>> Flink?
>> >>>>>>>>>>>>>>>>>>>>>>> We
>> >>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>> >>>>>>> such
>> >>>>>>>> a
>> >>>>>>>>>>>>> feature.
>> >>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>> >>>>>>> I'm
>> >>>>>>>>>>>> leaning
>> >>>>>>>>>>>>>>>>> towards
>> >>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>> exposing
>> >>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
>> >>>>>>>> available
>> >>>>>>>>>>>>> strictly
>> >>>>>>>>>>>>>>> for
>> >>>>>>>>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>> >>>>>> their
>> >>>>>>>>>>>>>>>>>> paging/metrics/logging
>> >>>>>>>>>>>>>>>>>>>>>>> system
>> >>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>> >>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>> >>>>>>>>>>>>>>> karmagyz@gmail.com
>> >>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>> >>>>>>> some
>> >>>>>>>>>>> cents
>> >>>>>>>>>>>>> here:
>> >>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> user?
>> >>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
>> >>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
>> >>>>>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
>> >>>>>>> the
>> >>>>>>>>>>>> timeout
>> >>>>>>>>>>>>>>>>>> mechanism is
>> >>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
>> >>>>>> it
>> >>>>>>>>> will
>> >>>>>>>>>>> be
>> >>>>>>>>>>>>> very
>> >>>>>>>>>>>>>>>>>> confusing
>> >>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>> >>>>>> exception
>> >>>>>>>> of a
>> >>>>>>>>>>> task
>> >>>>>>>>>>>>>>>>> manager to
>> >>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>> >>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>> >>>>>> fail
>> >>>>>>>> in
>> >>>>>>>>>>> case
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> target
>> >>>>>>>>>>>>>>>>>> task
>> >>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>> >>>>>> I
>> >>>>>>>>> don't
>> >>>>>>>>>>>> mean
>> >>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>> need
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> cover
>> >>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>> >>>>>> also
>> >>>>>>>>> open a
>> >>>>>>>>>>>>>>>>>> `notifyException`
>> >>>>>>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
>> >>>>>>>>>>> ResourceManager,
>> >>>>>>>>>>>>> will
>> >>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> slot of
>> >>>>>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
>> >>>>>>> released
>> >>>>>>>>> and
>> >>>>>>>>>>>>>>> allocated?
>> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
>> >>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>> >>>>>> Wang
>> >>>>>>> <
>> >>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
>> >>>>>> response
>> >>>>>>>>> your 4
>> >>>>>>>>>>>>>> remarks:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>> >>>>>>>>> controversy. I
>> >>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>>> “BlockList”
>> >>>>>>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>>>>>> good
>> >>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
>> >>>>>>>> good
>> >>>>>>>>>>> idea.
>> >>>>>>>>>>>>>> Based
>> >>>>>>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> above, I
>> >>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>> <host>/blocklist/node/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>>>>> <host>/blocklist/taskmanager/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
>> >>>>>>>> means
>> >>>>>>>>>>> that
>> >>>>>>>>>>>>> all
>> >>>>>>>>>>>>>>>>> task
>> >>>>>>>>>>>>>>>>>>>>> managers
>> >>>>>>>>>>>>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
>> >>>>>>>> these
>> >>>>>>>>>>> TMs
>> >>>>>>>>>>>> are
>> >>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>> available.
>> >>>>>>>>>>>>>>>>>>>>>>> This
>> >>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
>> >>>>>> these
>> >>>>>>>> TMs
>> >>>>>>>>>>> are
>> >>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>> really
>> >>>>>>>>>>>>>>>>>> lost,
>> >>>>>>>>>>>>>>>>>>>>>>> they
>> >>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
>> >>>>>> are
>> >>>>>>>>> still
>> >>>>>>>>>>>>>>> registered
>> >>>>>>>>>>>>>>>>>> in this
>> >>>>>>>>>>>>>>>>>>>>>>> flink
>> >>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
>> >>>>>> once
>> >>>>>>>> the
>> >>>>>>>>>>>>>>>>> corresponding
>> >>>>>>>>>>>>>>>>>>>>>> blocklist
>> >>>>>>>>>>>>>>>>>>>>>>>>> item
>> >>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
>> >>>>>>>>>>>>>>> active/non-active
>> >>>>>>>>>>>>>>>>>>>>> clusters.
>> >>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
>> >>>>>> TMs
>> >>>>>>>> may
>> >>>>>>>>> be
>> >>>>>>>>>>>>>> released
>> >>>>>>>>>>>>>>>>> due
>> >>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>> idle
>> >>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
>> >>>>>> keep
>> >>>>>>>> it.
>> >>>>>>>>>>> The
>> >>>>>>>>>>>>>> reasons
>> >>>>>>>>>>>>>>>>> are
>> >>>>>>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>>>>>>>>>> following:
>> >>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>> >>>>>>> adding
>> >>>>>>>> or
>> >>>>>>>>>>>>> removing
>> >>>>>>>>>>>>>>>>> items
>> >>>>>>>>>>>>>>>>>> via
>> >>>>>>>>>>>>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>>>>>>>>>>>> API,
>> >>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>> >>>>>> it
>> >>>>>>>> to
>> >>>>>>>>>>>>>>>>> Long.MAX_VALUE .
>> >>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>> >>>>>> a
>> >>>>>>>>>>> period of
>> >>>>>>>>>>>>>> time
>> >>>>>>>>>>>>>>>>>> (such as
>> >>>>>>>>>>>>>>>>>>>>>>> machine
>> >>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>> >>>>>> prefer
>> >>>>>>>> that
>> >>>>>>>>>>>> Flink
>> >>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>> do
>> >>>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>> >>>>>>> user
>> >>>>>>>> to
>> >>>>>>>>>>> do
>> >>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>> manually.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
>> >>>>>>>>>>> 于2022年4月27日周三
>> >>>>>>>>>>>>>>>>> 19:23写道：
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>> >>>>>> only
>> >>>>>>>>>>> support
>> >>>>>>>>>>>>>>> manually
>> >>>>>>>>>>>>>>>>>> blocking
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>> >>>>>>> different
>> >>>>>>>>>>>>> strategies
>> >>>>>>>>>>>>>>> can
>> >>>>>>>>>>>>>>>>>> also be
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
>> >>>>>>>> "bla.cklist"
>> >>>>>>>>>>> due
>> >>>>>>>>>>>> to
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>> controversy
>> >>>>>>>>>>>>>>>>>>>>>>>>> around
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
>> >>>>>>> Ticket
>> >>>>>>>>>>> about
>> >>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>> topic a
>> >>>>>>>>>>>>>>>>>>>>> while
>> >>>>>>>>>>>>>>>>>>>>>>>>> back
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
>> >>>>>>>> avoid
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>> term
>> >>>>>>>>>>>>>>>>>> blacklist &
>> >>>>>>>>>>>>>>>>>>>>>>>>> whitelist
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
>> >>>>>> "denylist"
>> >>>>>>> or
>> >>>>>>>>>>>>>>> "quarantined"
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
>> >>>>>>> slightly
>> >>>>>>>>>>>> different
>> >>>>>>>>>>>>>>>>> design
>> >>>>>>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>>>>>> verbs
>> >>>>>>>>>>>>>>>>>>>>>>>>> like
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
>> >>>>>>> anti-pattern
>> >>>>>>>>> for
>> >>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>> APIs.
>> >>>>>>>>>>>>>>>>>> POST
>> >>>>>>>>>>>>>>>>>>>>> on a
>> >>>>>>>>>>>>>>>>>>>>>>>>> list
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
>> >>>>>>>> items.
>> >>>>>>>>>>>> DELETE
>> >>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>> individual
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>>>>> <host>/quarantine/items/<itemidentifier>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>> >>>>>>>>> taskmanagers
>> >>>>>>>>>>>> and
>> >>>>>>>>>>>>>>> nodes
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>>>>>>>>>>>> API
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>> >>>>>>> opinion
>> >>>>>>>>> on
>> >>>>>>>>>>>> this?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>>> <host>/quarantine/nodes/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>> >>>>>>>>>>> non-active
>> >>>>>>>>>>>>>>>>> resource
>> >>>>>>>>>>>>>>>>>>>>>> managers,
>> >>>>>>>>>>>>>>>>>>>>>>>>> i.e.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>> >>>>>>>>>>> minimal,
>> >>>>>>>>>>>> do
>> >>>>>>>>>>>>> we
>> >>>>>>>>>>>>>>>>> need
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>> timeout
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>> >>>>>>>> manually
>> >>>>>>>>>>> we
>> >>>>>>>>>>>>> could
>> >>>>>>>>>>>>>>>>>> delegate
>> >>>>>>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>> >>>>>>>>> behavior
>> >>>>>>>>>>>>> would
>> >>>>>>>>>>>>>>>>> better
>> >>>>>>>>>>>>>>>>>> fit
>> >>>>>>>>>>>>>>>>>>>>>> into
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>
>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
>> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>> >>>>>>> schrieb
>> >>>>>>>>>>> Lijie
>> >>>>>>>>>>>>> Wang
>> >>>>>>>>>>>>>> <
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>> >>>>>>>> cluster
>> >>>>>>>>>>> node
>> >>>>>>>>>>>>>>> issues
>> >>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> disk
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>> >>>>>>>> abnormalities).
>> >>>>>>>>>>>> Flink
>> >>>>>>>>>>>>>> will
>> >>>>>>>>>>>>>>>>>> take care
>> >>>>>>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>> >>>>>>> However,
>> >>>>>>>>> due
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>> data
>> >>>>>>>>>>>>>>>>>> locality
>> >>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> limited
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>> >>>>>> likely
>> >>>>>>>> to
>> >>>>>>>>> be
>> >>>>>>>>>>>>>>> redeployed
>> >>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>>>>> same
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>> >>>>>>>> task
>> >>>>>>>>>>>>>>> abnormalities
>> >>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>> affect
>> >>>>>>>>>>>>>>>>>>>>>>>>> job
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
>> >>>>>> manually
>> >>>>>>>>>>> identify
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> problematic
>> >>>>>>>>>>>>>>>>>>>>>>>>> node and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
>> >>>>>>> But
>> >>>>>>>>>>> this
>> >>>>>>>>>>>>>>> approach
>> >>>>>>>>>>>>>>>>> has
>> >>>>>>>>>>>>>>>>>>>>>>>>> following
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
>> >>>>>> heavy
>> >>>>>>>>>>> process.
>> >>>>>>>>>>>>>> Users
>> >>>>>>>>>>>>>>>>> may
>> >>>>>>>>>>>>>>>>>> need
>> >>>>>>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> contact
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
>> >>>>>>>>> operation
>> >>>>>>>>>>> can
>> >>>>>>>>>>>>>> even
>> >>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>> dangerous
>> >>>>>>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
>> >>>>>>>>> events.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
>> >>>>>> of
>> >>>>>>>>>>> problems
>> >>>>>>>>>>>>>>> manually
>> >>>>>>>>>>>>>>>>>> would
>> >>>>>>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>>>>>> slow
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>> >>>>>>>> propose
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> introduce a
>> >>>>>>>>>>>>>>>>>>>>>> blacklist
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>> >>>>>>>>> problematic
>> >>>>>>>>>>>>>>> resources.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
>> >>>>>>> FLIP-224[1].
>> >>>>>>>>>>> Looking
>> >>>>>>>>>>>>>>> forward
>> >>>>>>>>>>>>>>>>>> to your
>> >>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> --
>> >>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>> Roman Boyko
>> >>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>> >>>>>>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> https://twitter.com/snntrable
>> >>>>>>>> https://github.com/knaufk
>> >>>>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> https://twitter.com/snntrable
>> >>>>>> https://github.com/knaufk
>> >>>>>>
>> >>>>
>>
>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi Chesnay,

Thanks for feedback.

1. Regarding the TM/Node id. Do you mean special characters may appear in
the rest URL?  Actually, I don't think so. The task manager id in REST API
should be the *ResourceID* of taskmanager in Flink, there should be no
special characters, and some existing REST APIs are already using it, e.g.
GET: http://{jm_rest_address:port}/taskmanagers/<taskmanagerid>. The node
id should be an IP of a machine or node name in Yarn/Kubernetes, I think it
should also have no special characters.
2. Regarding the GET query responses. I agree with you, it makes sense to
change the GET result to a map.

3. Regarding the endTimestamp.  I also agree with you, endTimestamp can
cover everything, and the endTimestamp is a unix timestamp, there should be
no timezone issues. But I think PUT and DELETE are enough, no PATCH.  The
add rest api is add or update, PUT can cover this semantics.

4. Regarding the slot pool/manager. I don't think the current slotpool and
slotmanager are able to support the MARK_BLOCKED(slots that are already
allocated will not be affected) action. The reasons are as follows：

a) for slot pool, with the MARK_BLOCKED action, when a slot state changes
from reserved(task assigned) to free(no task assigned), it is necessary to
check whether the slot should be released immediately(it should be released
immediately if the task manager is blocked, otherwise it may be allocated
to other tasks). I think it cannot be supported without being aware of the
blocklist information. Compared to the solution in FLIP, a more
appropriate/prefered way may be: Introduce a new slot pool  implementation
for blocklist(may be named BlocklistSlotPool, it extends/wrapps the
original slot pool), and implement the parts that need to be aware of the
blocklist in this newly introduced slot pool, and the original slot pool
basically does not need to change.

b) for slot manager, with the MARK_BLOCKED action, there may be free but
blocked slots in slot manager (the corresponding TMs cannot be
released/unregistered because there are still running tasks on them).
Therefore, we need to filter out the blocked slots when trying to fulfill
the slot requirements. Therefore it also needs to know the blocklist
information.
A better way may be to abstract a resource allocation strategy, and make
the blocklist as a special implementation, then pass the resource
allocation strategy in when constructing the slot manager. Unfortunately,
the data structures in the two existing slot manager
implementations(*DeclarativeSlotManager* and *FineGrainedSlotManager*) are
quite different, it is not easy to abstract a common resource allocation
strategy, so we prefer to keep the current way(i.e. pass the blocklist
information directly into slot manager).


5. Regarding the BlocklistTracker. I also agree with you, the BlocklistTracker
does not need to be aware of the executor, and the timeout actions can be
done outside.

Chesnay Schepler <ch...@apache.org> 于2022年5月20日周五 17:34写道：

> I have a number of concerns:
>
> Is the id used for deleting an item the same sent in the initial request
> (and not one returned by Flink)?
> I'm very concerned that the tm/node id can contain special characters.
>
> The GET query should return a map, not a list of items. This makes it
> easier to work with.
>
> The duality of endTimestamp and duration is also concerning.
> If we conclude that endTimestamps can in fact work (and aren't utterly
> unusable due to timezones),
> then this should be able to cover everything and rid us of some
> complexity w.r.t. POSTs to the same ID.
> Additions would be a PUT, changes a PATCH, deletes a DELETE.
>
>
> I also dislike how we're pushing more functionality into the
> slotpool/-manager.
> These components are complex enough as-is, and instead I'd propose a
> separate component that interacts with the SlotPool/-Manager instead,
> for example by removing the slots from that TM.
> The reason being that from the slot-pool perspective it is irrelevant
> whether a slot is gone because the TM was lost, or because it was blocked.
>
>
> The BlocklistTracker shouldn't be modeled as component that is aware of
> the concept of main threads.
> It really has no business knowing that; all it needs is an executor for
> handling timeouts/periodic actions,
> and a way to interact with the JM/RM (which internally can handle the
> scheduling into the main thread).
>
>
> On 20/05/2022 07:20, Lijie Wang wrote:
> > Hi everyone,
> >
> > I have started a vote for this FLIP [1]. Please cast your vote there or
> ask
> > additional questions here. [1]
> > https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> >
> > Best, Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
> >
> >> Hi Konstantin,
> >>
> >> We found that Flink REST URL does not support the format ":merge" ,
> which
> >> will be recognized as a parameter in the URL(due to start with a colon).
> >>
> >> We will keep the previous way, i.e.
> >>
> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> >> and the "id" and "merge" flag are put into the request body.
> >>
> >> Best,
> >> Lijie
> >>
> >> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
> >>
> >>> Hi Weihua,
> >>> thanks for feedback.
> >>>
> >>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
> >>> step towards auto-detection.
> >>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> >>> exception of insufficient resources.
> >>> 3. No. This FLIP won't change the WebUI. The blocklist information can
> be
> >>> obtained through REST API and metrics.
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
> >>>
> >>>> Hi,
> >>>> Thanks for creating this FLIP.
> >>>> We have implemented an automatic blocklist detection mechanism
> >>>> internally, which is indeed very effective for handling node failures.
> >>>> Due to the large number of nodes, although SREs already support
> >>>> automatic offline failure nodes, the detection is not 100% accurate
> and
> >>>> there is some delay.
> >>>> So the blocklist mechanism can make flink job recover from failure
> much
> >>>> faster.
> >>>>
> >>>> Here are some of my thoughts:
> >>>> 1. In this FLIP, it needs users to locate machine failure manually,
> >>>> there is a certain cost of use
> >>>> 2. What happens if too many nodes are blocked, resulting in
> insufficient
> >>>> resources? Will there be a special Exception for the user?
> >>>> 3. Will we display the blocklist information in the WebUI? The
> blocklist
> >>>> is for cluster level, and if multiple users share a cluster, some
> users may
> >>>> be a little confused when resources are not enough, or when resources
> are
> >>>> applied for more.
> >>>>
> >>>> Also, Looking forward to the next FLIP on auto-detection.
> >>>>
> >>>> Best,
> >>>> Weihua
> >>>>
> >>>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> >>>>>
> >>>>> Hi Konstantin,
> >>>>>
> >>>>> Maybe change it to the following:
> >>>>>
> >>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >>>>> Merge is not allowed. If the {id} already exists, return error.
> >>>> Otherwise,
> >>>>> create a new item.
> >>>>>
> >>>>> 2. POST: http://
> >>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >>>>> Merge is allowed. If the {id} already exists, merge. Otherwise,
> create
> >>>> a
> >>>>> new item.
> >>>>>
> >>>>> WDYT?
> >>>>>
> >>>>> Best,
> >>>>> Lijie
> >>>>>
> >>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> >>>>>
> >>>>>> Hi Lijie,
> >>>>>>
> >>>>>> hm, maybe the following is more appropriate in that case
> >>>>>>
> >>>>>> POST: http://
> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Konstantin
> >>>>>>
> >>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> >>>>>> wangdachui9501@gmail.com>:
> >>>>>>
> >>>>>>> Hi Konstantin,
> >>>>>>> thanks for your feedback.
> >>>>>>>
> >>>>>>>  From what I understand, PUT should be idempotent. However, we
> have a
> >>>>>>> *timeout* field in the request. This means that initiating the same
> >>>>>> request
> >>>>>>> at two different times will lead to different resource status
> >>>> (timestamps
> >>>>>>> of the items to be removed will be different).
> >>>>>>>
> >>>>>>> Should we use PUT in this case? WDYT?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Lijie
> >>>>>>>
> >>>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> >>>>>>>
> >>>>>>>> Hi Lijie,
> >>>>>>>>
> >>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT
> >>>> on
> >>>>>> the
> >>>>>>>> resource?
> >>>>>>>>
> >>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Konstantin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> >>>>>>>> wangdachui9501@gmail.com>:
> >>>>>>>>
> >>>>>>>>> Hi everyone,
> >>>>>>>>>
> >>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
> >>>> made
> >>>>>>> the
> >>>>>>>>> following changes on REST API:
> >>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only
> choose
> >>>>>>> one.
> >>>>>>>> If
> >>>>>>>>> both are specified, will return error.
> >>>>>>>>> 2.  If the specified item is already there, the *ADD* operation
> has
> >>>>>> two
> >>>>>>>>> behaviors:  *return error*(default value) or *merge/update*, and
> we
> >>>>>>> add a
> >>>>>>>>> flag to the request body to control it. You can find more details
> >>>>>>> "Public
> >>>>>>>>> Interface" section.
> >>>>>>>>>
> >>>>>>>>> If there is no more feedback, we will start the vote thread next
> >>>>>> week.
> >>>>>>>>> Best,
> >>>>>>>>> Lijie
> >>>>>>>>>
> >>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> >>>>>>>>>
> >>>>>>>>>> Hi Becket Qin,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for your suggestions.  I have moved the description of
> >>>>>>>>>> configurations, metrics and REST API into "Public Interface"
> >>>>>> section,
> >>>>>>>> and
> >>>>>>>>>> made a few updates according to your suggestion.  And in this
> >>>> FLIP,
> >>>>>>>> there
> >>>>>>>>>> no public java Interfaces or pluggables that users need to
> >>>>>> implement
> >>>>>>> by
> >>>>>>>>>> themselves.
> >>>>>>>>>>
> >>>>>>>>>> Answers for you questions:
> >>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> >>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> >>>>>> items
> >>>>>>>> can
> >>>>>>>>>> only be added through the REST API, so these 2 action are
> >>>> mentioned
> >>>>>>> in
> >>>>>>>>> the
> >>>>>>>>>> REST API part (The REST API part has beed moved to public
> >>>> interface
> >>>>>>>> now).
> >>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to String,
> >>>>>> and
> >>>>>>>>> allow
> >>>>>>>>>> users to specify it via REST API.
> >>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
> >>>>>> above,
> >>>>>>>> we
> >>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
> >>>> ADD
> >>>>>>>> REST
> >>>>>>>>>> API to specify when to remove the blocked item. These 2 fields
> are
> >>>>>>>>>> optional, if neither is specified, it means that the blocked
> item
> >>>>>> is
> >>>>>>>>>> permanent and will not be removed. If both are specified, the
> >>>>>> minimum
> >>>>>>>> of
> >>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as
> the
> >>>>>>> time
> >>>>>>>> to
> >>>>>>>>>> remove the blocked item. To keep the configurations more
> minimal,
> >>>>>> we
> >>>>>>>> have
> >>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
> >>>> configuration
> >>>>>>>>>> option.
> >>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
> >>>>>>> already
> >>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
> >>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is
> maintained
> >>>>>> in
> >>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> >>>>>>>>> abstracted
> >>>>>>>>>> to other interfaces) will be propagated to different components.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Lijie
> >>>>>>>>>>
> >>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> >>>>>>>>>>
> >>>>>>>>>>> Hi Lijie,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for updating the FLIP. It looks like the public
> interface
> >>>>>>>> section
> >>>>>>>>>>> did not fully reflect all the user sensible behavior and API.
> Can
> >>>>>>> you
> >>>>>>>>> put
> >>>>>>>>>>> everything that users may be aware of there? That would include
> >>>>>> the
> >>>>>>>> REST
> >>>>>>>>>>> API, metrics, configurations, public java Interfaces or
> >>>> pluggables
> >>>>>>>> that
> >>>>>>>>>>> users may see or implement by themselves, as well as a brief
> >>>>>> summary
> >>>>>>>> of
> >>>>>>>>>>> the
> >>>>>>>>>>> behavior of the public API.
> >>>>>>>>>>>
> >>>>>>>>>>> Besides that, I have a few questions:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. According to the conversation in the discussion thread, it
> >>>>>> looks
> >>>>>>>> like
> >>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
> >>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
> >>>>>> can
> >>>>>>>> you
> >>>>>>>>>>> add
> >>>>>>>>>>> that to the public interface as well?
> >>>>>>>>>>>
> >>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
> >>>>>>> Throwable
> >>>>>>>>> and
> >>>>>>>>>>> is not reflected in the REST API. Should that be included in
> the
> >>>>>>> query
> >>>>>>>>>>> response? And should we change that field to be a String so
> users
> >>>>>>> may
> >>>>>>>>>>> specify the cause via the REST API when they block some nodes /
> >>>>>> TMs?
> >>>>>>>>>>> 3. Would it be useful to allow users to have different timeouts
> >>>>>> for
> >>>>>>>>>>> different blocked items? So while there is a default timeout,
> >>>>>> users
> >>>>>>>> can
> >>>>>>>>>>> also override it via the REST API when they block an entity.
> >>>>>>>>>>>
> >>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is
> already
> >>>>>>>> there,
> >>>>>>>>>>> will the block item be overridden? For example, if the user
> wants
> >>>>>> to
> >>>>>>>>>>> extend
> >>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
> >>>> command
> >>>>>>>>> again?
> >>>>>>>>>>> 5. I am not quite familiar with the details of this, but is
> there
> >>>>>> a
> >>>>>>>>> source
> >>>>>>>>>>> of truth for the blocked list? I think it might be good to
> have a
> >>>>>>>> single
> >>>>>>>>>>> source of truth for the blocked list and just propagate that
> list
> >>>>>> to
> >>>>>>>>>>> different components to take the action of actually blocking
> the
> >>>>>>>>> resource.
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> >>>>>> wangdachui9501@gmail.com
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Based on the discussion in the mailing list, I updated the
> FLIP
> >>>>>>> doc,
> >>>>>>>>> the
> >>>>>>>>>>>> changes include:
> >>>>>>>>>>>> 1. Changed the description of the motivation section to more
> >>>>>>> clearly
> >>>>>>>>>>>> describe the problem this FLIP is trying to solve.
> >>>>>>>>>>>> 2. Only  *Manually* is supported.
> >>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Lijie
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Lijie!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
> >>>>>>>>>>> helpfulhere.”
> >>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
> >>>>>> In
> >>>>>>>>>>>> myopinion,
> >>>>>>>>>>>>> non-active and active are the same, and no special treatment
> >>>>>>>>>>> isrequired.*
> >>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
> >>>>>>> were
> >>>>>>>>>>> talking
> >>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
> >>>>>>>>>>> ResourceManager).
> >>>>>>>>>>>>> And the original question was - how to handle the blacklisted
> >>>>>>>> nodes
> >>>>>>>>>>> list
> >>>>>>>>>>>> at
> >>>>>>>>>>>>> the moment of leader change? Should we simply forget about
> >>>>>> them
> >>>>>>> or
> >>>>>>>>>>> try to
> >>>>>>>>>>>>> pre-save that list on the remote storage?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <
> danrtsey.wy@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> >>>>>> it
> >>>>>>> is
> >>>>>>>>>>> indeed
> >>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
> >>>>>>>> could
> >>>>>>>>>>> not
> >>>>>>>>>>>>>> support.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Yang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Jiangang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
> >>>>>> and
> >>>>>>>>> mark a
> >>>>>>>>>>>> node
> >>>>>>>>>>>>>>> blocked?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
> >>>>>> in
> >>>>>>>> this
> >>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
> >>>>>>> FLIP
> >>>>>>>> in
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
> >>>>>>>>>>> correctness
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
> >>>>>>> in
> >>>>>>>>> the
> >>>>>>>>>>>>> future.
> >>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
> >>>>>> node
> >>>>>>>> and
> >>>>>>>>>>> the
> >>>>>>>>>>>>> node
> >>>>>>>>>>>>>>> should be marked blocked.
> >>>>>>>>>>>>>>> By design, the blocklist information will be shared among
> >>>>>>> all
> >>>>>>>>> jobs
> >>>>>>>>>>>> in a
> >>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
> >>>>>> with
> >>>>>>>> RM.
> >>>>>>>>>>>>>>> @Martijn
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> >>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
> >>>>>>>>>>> MARK_BLOCKLISTED(Just
> >>>>>>>>>>>>>> limits
> >>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
> >>>>>> on
> >>>>>>>> it)
> >>>>>>>>>>> is
> >>>>>>>>>>>>> also
> >>>>>>>>>>>>>>> important, and we think that external systems (*yarn
> >>>>>> rmadmin
> >>>>>>>> or
> >>>>>>>>>>>> kubectl
> >>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
> >>>>>>>> only
> >>>>>>>>>>>>>> *manually*.
> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> >>>>>>> indeed
> >>>>>>>>>>> super
> >>>>>>>>>>>>> hard
> >>>>>>>>>>>>>>> to get right.
> >>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
> >>>>>>>>> concern.
> >>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
> >>>>>> in
> >>>>>>>> this
> >>>>>>>>>>>> FLIP,
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
> >>>>>> FLIP
> >>>>>>>>>>> provides
> >>>>>>>>>>>> a
> >>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
> >>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> >>>>>>>>>>>>>>> the auto-detection may be done by external systems.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> >>>>>>> 19:04写道：
> >>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> >>>>>> could
> >>>>>>>> not
> >>>>>>>>>>> see
> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> >>>>>>>>> approach(via
> >>>>>>>>>>>> *yarn
> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I agree with Yang Wang on this.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
> >>>>>>>>>>> mechanisms
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> will rarely work just right.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> >>>>>>> indeed
> >>>>>>>>>>> super
> >>>>>>>>>>>>> hard
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> get right.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Martijn
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> >>>>>>>>>>>> liujiangangpeng@gmail.com
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
> >>>>>>>>> decrease
> >>>>>>>>>>>>> great
> >>>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
> >>>>>>>> inner
> >>>>>>>>>>> flink
> >>>>>>>>>>>>>>>>> version.
> >>>>>>>>>>>>>>>>> Below is something that I care about:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>    1. For auto-detecting, I wonder how to make the
> >>>>>>> strategy
> >>>>>>>>> and
> >>>>>>>>>>>>> mark a
> >>>>>>>>>>>>>>>>> node
> >>>>>>>>>>>>>>>>>    blocked? Sometimes the blocked node is hard to be
> >>>>>>>>> detected,
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> example,
> >>>>>>>>>>>>>>>>>    the upper node or the down node will be blocked when
> >>>>>>>>> network
> >>>>>>>>>>>>>>>>> unreachable.
> >>>>>>>>>>>>>>>>>    2. I see that the strategy is made in JobMaster
> >>>>>> side.
> >>>>>>>> How
> >>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>    implementing the similar logic in resource manager?
> >>>>>> In
> >>>>>>>>>>> session
> >>>>>>>>>>>>>> mode,
> >>>>>>>>>>>>>>>>> multi
> >>>>>>>>>>>>>>>>>    jobs can fail on the same bad node and the node
> >>>>>> should
> >>>>>>>> be
> >>>>>>>>>>>> marked
> >>>>>>>>>>>>>>>>> blocked.
> >>>>>>>>>>>>>>>>>    If the job makes the strategy, the node may be not
> >>>>>>>> marked
> >>>>>>>>>>>> blocked
> >>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>    fail times don't exceed the threshold.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thank you for all your feedback!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
> >>>>>> some
> >>>>>>> of
> >>>>>>>>> my
> >>>>>>>>>>>>>> thoughts:
> >>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
> >>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
> >>>>>>>>>>>>>>>>>> The automatical way should be something built upon
> >>>>>> the
> >>>>>>>>>>> blocklist
> >>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
> >>>>>>>>>>>> configurable
> >>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
> >>>>>> decouple
> >>>>>>>> it
> >>>>>>>>> by
> >>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
> >>>>>>>> suggested,
> >>>>>>>>>>> which
> >>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
> >>>>>>> detected.
> >>>>>>>>>>>> However,
> >>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
> >>>>>> discussions
> >>>>>>>> and
> >>>>>>>>>>>>> feedback
> >>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
> >>>>>>> users
> >>>>>>>>>>> have
> >>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
> >>>>>>>> externally
> >>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
> >>>>>> way
> >>>>>>> to
> >>>>>>>>>>> avoid
> >>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
> >>>>>>>> users
> >>>>>>>>>>> may
> >>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
> >>>>>>>> kill
> >>>>>>>>>>> all
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
> >>>>>>>>> execution[1]
> >>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
> >>>>>> mirror
> >>>>>>>>>>> tasks on
> >>>>>>>>>>>>>> slow
> >>>>>>>>>>>>>>>>>> nodes.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Zhu
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>
> >>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >>>>>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> >>>>>>>>> 15:56写道：
> >>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for your feedback.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
> >>>>>> here
> >>>>>>>>>>> because
> >>>>>>>>>>>> it
> >>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
> >>>>>> mechanism
> >>>>>>>>>>> (perhaps
> >>>>>>>>>>>> I
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
> >>>>>>>> FLIP:
> >>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
> >>>>>>> node
> >>>>>>>>> as
> >>>>>>>>>>>>>> blocked.
> >>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
> >>>>>>>> task
> >>>>>>>>>>>> manager
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>> node.
> >>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
> >>>>>>>> affected.
> >>>>>>>>> A
> >>>>>>>>>>>>> typical
> >>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
> >>>>>>> In
> >>>>>>>>> this
> >>>>>>>>>>>>> case,
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> hope
> >>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
> >>>>>> the
> >>>>>>>> hot
> >>>>>>>>>>>>> machine,
> >>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
> >>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> >>>>>> task
> >>>>>>>>>>> manager
> >>>>>>>>>>>> or
> >>>>>>>>>>>>>>> node
> >>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
> >>>>>> tasks
> >>>>>>>> will
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> restarted on
> >>>>>>>>>>>>>>>>>> non-blocked task managers.
> >>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
> >>>>>>> highlight
> >>>>>>>>> the
> >>>>>>>>>>>>> meaning
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
> >>>>>> that.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
> >>>>>> basically
> >>>>>>>>> agree
> >>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>> @Becket
> >>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
> >>>>>>>> users’
> >>>>>>>>>>>>>> deployment
> >>>>>>>>>>>>>>>>>> environments have a special external system that can
> >>>>>>>>> perform
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> anomaly
> >>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
> >>>>>>>>>>> auto-detection
> >>>>>>>>>>>>>>> doesn't
> >>>>>>>>>>>>>>>>>> require much extra work on top of manual
> >>>>>> specification.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> @Yangze
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
> >>>>>>> the
> >>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
> >>>>>>>> to
> >>>>>>>>>>>> users.
> >>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
> >>>>>>> separate
> >>>>>>>>>>>> interface
> >>>>>>>>>>>>>>>>> (maybe
> >>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
> >>>>>>>>>>>>> ResourceManagerBlocklistHandler
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> implement it in the future.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> @Martijn
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
> >>>>>> by
> >>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>> operators.
> >>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
> >>>>>>>> according
> >>>>>>>>>>> to
> >>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
> >>>>>>>>> machine
> >>>>>>>>>>>>>> problems
> >>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
> >>>>>>> contact
> >>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>> operators
> >>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
> >>>>>>>>>>> machine is
> >>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
> >>>>>>> people
> >>>>>>>>> who
> >>>>>>>>>>>>> really
> >>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
> >>>>>>>> performed
> >>>>>>>>> by
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> @Chesnay
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
> >>>>>>> for
> >>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
> >>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
> >>>>>> when
> >>>>>>>> the
> >>>>>>>>>>> task
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> >>>>>>>> release
> >>>>>>>>>>> the
> >>>>>>>>>>>>> slot
> >>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
> >>>>>>> on
> >>>>>>>>> it.
> >>>>>>>>>>> If
> >>>>>>>>>>>> we
> >>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
> >>>>>> needs
> >>>>>>> to
> >>>>>>>>>>>> retrieve
> >>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
> >>>>>>>> relatively
> >>>>>>>>>>>> large,
> >>>>>>>>>>>>>> so
> >>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
> >>>>>> information
> >>>>>>> on
> >>>>>>>>>>> the JM
> >>>>>>>>>>>>>> side
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> syncing them.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> @Роман
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>     a) “Probably storing inside Zookeeper/Configmap
> >>>>>>>> might
> >>>>>>>>>>> be
> >>>>>>>>>>>>>> helpful
> >>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
> >>>>>>>>>>> understand
> >>>>>>>>>>>>> that.
> >>>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
> >>>>>>>> special
> >>>>>>>>>>>>>> treatment
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> required.
> >>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
> >>>>>> sense,
> >>>>>>> I
> >>>>>>>>> will
> >>>>>>>>>>>> add
> >>>>>>>>>>>>> it
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> @Yang
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
> >>>>>> cannot
> >>>>>>>>>>> support
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
> >>>>>>>> 21:09写道：
> >>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
> >>>>>>>>>>> operations.
> >>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
> >>>>>>>>>>> Reliability
> >>>>>>>>>>>>>>> Engineer)
> >>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
> >>>>>> nodes,
> >>>>>>>>> which
> >>>>>>>>>>>>> could
> >>>>>>>>>>>>>>> take
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
> >>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
> >>>>>> taint
> >>>>>>>> in
> >>>>>>>>>>> K8s
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> prevent
> >>>>>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
> >>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
> >>>>>> pods
> >>>>>>>>>>> running
> >>>>>>>>>>>> on
> >>>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>> nodes.
> >>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
> >>>>>>> So
> >>>>>>>> I
> >>>>>>>>> am
> >>>>>>>>>>>>> afraid
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> >>>>>>>> could
> >>>>>>>>>>> not
> >>>>>>>>>>>>> see
> >>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> >>>>>>>>>>>> approach(via
> >>>>>>>>>>>>>>> *yarn
> >>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> >>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
> >>>>>>> which
> >>>>>>>>>>> could
> >>>>>>>>>>>>>> expose
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
> >>>>>>> them
> >>>>>>>> if
> >>>>>>>>>>>>> enabled
> >>>>>>>>>>>>>>>>>> explicitly.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>> Yang
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> >>>>>>>> 16:36写道：
> >>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
> >>>>>> and
> >>>>>>>>>>> somewhat
> >>>>>>>>>>>>>>> related
> >>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
> >>>>>>>>> Flink.
> >>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
> >>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
> >>>>>>>>>>>>>>>>>>>>>      b) Collect the information of the unstable
> >>>>>>> node
> >>>>>>>>> and
> >>>>>>>>>>>>> form a
> >>>>>>>>>>>>>>>>>> blocklist.
> >>>>>>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> My two cents:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
> >>>>>>> should
> >>>>>>>>> have
> >>>>>>>>>>>> c).
> >>>>>>>>>>>>> It
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> not only
> >>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
> >>>>>> handy
> >>>>>>>> for
> >>>>>>>>>>> some
> >>>>>>>>>>>>>>> planned
> >>>>>>>>>>>>>>>>>>>>> maintenance.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
> >>>>>> who
> >>>>>>>>>>> should be
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> brain
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
> >>>>>>>>> largely
> >>>>>>>>>>>>> depends
> >>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>> who we
> >>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
> >>>>>>>> different
> >>>>>>>>>>>>> answers.
> >>>>>>>>>>>>>>> For
> >>>>>>>>>>>>>>>>>> people
> >>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
> >>>>>>>>>>> service,
> >>>>>>>>>>>> let
> >>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>> do just
> >>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
> >>>>>>>> Flink
> >>>>>>>>>>> would
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
> >>>>>> it
> >>>>>>> to
> >>>>>>>>>>> that
> >>>>>>>>>>>>>>> service,
> >>>>>>>>>>>>>>>>>> and then
> >>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
> >>>>>>>>>>> problematic
> >>>>>>>>>>>>>>> nodes.
> >>>>>>>>>>>>>>>>> On
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
> >>>>>>>> service,
> >>>>>>>>>>>> simply
> >>>>>>>>>>>>>>>>> letting
> >>>>>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
> >>>>>> nodes
> >>>>>>>>> might
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>> desired
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
> >>>>>>>> pluggable /
> >>>>>>>>>>>>>> optional.
> >>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
> >>>>>>> have
> >>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>> pluggable
> >>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
> >>>>>> unstable
> >>>>>>>>> nodes
> >>>>>>>>>>>>>>> actively.
> >>>>>>>>>>>>>>>>> (A
> >>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
> >>>>>>> which
> >>>>>>>>>>> can be
> >>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> report
> >>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
> >>>>>>> the
> >>>>>>>>>>> scope
> >>>>>>>>>>>>>>> relevant
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
> >>>>>> OK
> >>>>>>> to
> >>>>>>>>>>> have a
> >>>>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
> >>>>>>> Flink
> >>>>>>>> to
> >>>>>>>>>>> take
> >>>>>>>>>>>>>>> action
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> block
> >>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
> >>>>>>>>> timeout.
> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> >>>>>>>>>>>>>> ro.v.boyko@gmail.com
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
> >>>>>>>>>>> development.
> >>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
> >>>>>>> we
> >>>>>>>>>>> should
> >>>>>>>>>>>>>>> consider
> >>>>>>>>>>>>>>>>>> to move
> >>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
> >>>>>> nodes/TMs
> >>>>>>>>> from
> >>>>>>>>>>>>> active
> >>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
> >>>>>>>> storing
> >>>>>>>>>>>> inside
> >>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
> >>>>>>>> organizations
> >>>>>>>>>>>> don't
> >>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> expose
> >>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
> >>>>>>>> it's
> >>>>>>>>>>>>> necessary
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> have the
> >>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
> >>>>>>> for
> >>>>>>>>>>>> avoiding
> >>>>>>>>>>>>>>>>> incorrect
> >>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
> >>>>>>>> would
> >>>>>>>>> be
> >>>>>>>>>>>>> better
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> extend
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
> >>>>>> *endTimestamp*
> >>>>>>>>> field
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> fill
> >>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
> >>>>>>> to:
> >>>>>>>>>>>>>>>>>>>>>>    -
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>    Provide the ability to users to setup the
> >>>>>>> exact
> >>>>>>>>>>> time
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> blocklist end
> >>>>>>>>>>>>>>>>>>>>>>    through RestAPI
> >>>>>>>>>>>>>>>>>>>>>>    -
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>    Not being tied to a single value of
> >>>>>>>>>>>>>>>>>>>>>>    *cluster.resource-blacklist.item.timeout*
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> >>>>>> <
> >>>>>>>>>>>>>>>>> chesnay@apache.org>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
> >>>>>>>> lines
> >>>>>>>>> a
> >>>>>>>>>>>> bit.
> >>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
> >>>>>>>>> auto-detection
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>> have an
> >>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
> >>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
> >>>>>> this
> >>>>>>>>> sounds
> >>>>>>>>>>>> yet
> >>>>>>>>>>>>>>> again
> >>>>>>>>>>>>>>>>>> like one
> >>>>>>>>>>>>>>>>>>>>>>> of those
> >>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
> >>>>>> just
> >>>>>>>>> right.
> >>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
> >>>>>>>>> information
> >>>>>>>>>>>>> after
> >>>>>>>>>>>>>>> all.
> >>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
> >>>>>>>> complexity
> >>>>>>>>>>> of
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> proposal.
> >>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
> >>>>>>> between
> >>>>>>>>>>>>>> components;
> >>>>>>>>>>>>>>>>>>>>> adjustments
> >>>>>>>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
> >>>>>>> resource
> >>>>>>>>>>>>> management
> >>>>>>>>>>>>>>>>>> perspective,
> >>>>>>>>>>>>>>>>>>>>>>> then there
> >>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
> >>>>>> protocol
> >>>>>>>> at
> >>>>>>>>>>> all
> >>>>>>>>>>>> (or
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> fact
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
> >>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
> >>>>>>> around
> >>>>>>>>>>> blocked
> >>>>>>>>>>>>> TMs
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> first
> >>>>>>>>>>>>>>>>>>>>>>> place.
> >>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
> >>>>>>> down
> >>>>>>>>>>> after
> >>>>>>>>>>>> a
> >>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> RM
> >>>>>>>>>>>>>>>>>>>>>>> because of
> >>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
> >>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
> >>>>>>>> blocking
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> process
> >>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>> node)
> >>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
> >>>>>>>>>>> process(es).
> >>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
> >>>>>>>> back
> >>>>>>>>>>> up.
> >>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
> >>>>>> the
> >>>>>>>> same
> >>>>>>>>>>> line
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> thinking to
> >>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
> >>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
> >>>>>> process/node
> >>>>>>>>>>> manually
> >>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> core
> >>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
> >>>>>>>> understand
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> problem
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> I see
> >>>>>>>>>>>>>>>>>>>>>>> value
> >>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
> >>>>>>> blocklisting. I
> >>>>>>>>> do
> >>>>>>>>>>>> have
> >>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>> concerns
> >>>>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
> >>>>>>> blocked
> >>>>>>>>>>>>> resources.
> >>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>>>>>>>>> concerns;
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
> >>>>>>>>> separation
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> concerns,
> >>>>>>>>>>>>>>>>>>>>> meaning
> >>>>>>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
> >>>>>>>> managing a
> >>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> there's
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
> >>>>>>>>>>> introduction of
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> mechanism,
> >>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
> >>>>>>>>> responsibility
> >>>>>>>>>>> of
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> first
> >>>>>>>>>>>>>>>>>> group.
> >>>>>>>>>>>>>>>>>>>>>> So
> >>>>>>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
> >>>>>>>> group
> >>>>>>>>>>>> blocks
> >>>>>>>>>>>>>>>>>> something,
> >>>>>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
> >>>>>>> paging
> >>>>>>>>>>>>> mechanism
> >>>>>>>>>>>>>>>>>> triggering
> >>>>>>>>>>>>>>>>>>>>>> etc)
> >>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
> >>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
> >>>>>> go
> >>>>>>>>>>> through
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> process
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>> manually
> >>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
> >>>>>> it
> >>>>>>>>>>> should
> >>>>>>>>>>>>> be? I
> >>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
> >>>>>>> make
> >>>>>>>>>>> sense
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> introduce
> >>>>>>>>>>>>>>>>>>>>> such
> >>>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
> >>>>>>> really
> >>>>>>>>>>> small
> >>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>> group
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>> Flink?
> >>>>>>>>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
> >>>>>>> such
> >>>>>>>> a
> >>>>>>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
> >>>>>>> I'm
> >>>>>>>>>>>> leaning
> >>>>>>>>>>>>>>>>> towards
> >>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>> exposing
> >>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
> >>>>>>>> available
> >>>>>>>>>>>>> strictly
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
> >>>>>> their
> >>>>>>>>>>>>>>>>>> paging/metrics/logging
> >>>>>>>>>>>>>>>>>>>>>>> system
> >>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
> >>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> >>>>>>>>>>>>>>> karmagyz@gmail.com
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
> >>>>>>> some
> >>>>>>>>>>> cents
> >>>>>>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> user?
> >>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
> >>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
> >>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
> >>>>>>> the
> >>>>>>>>>>>> timeout
> >>>>>>>>>>>>>>>>>> mechanism is
> >>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
> >>>>>> it
> >>>>>>>>> will
> >>>>>>>>>>> be
> >>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>> confusing
> >>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
> >>>>>> exception
> >>>>>>>> of a
> >>>>>>>>>>> task
> >>>>>>>>>>>>>>>>> manager to
> >>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
> >>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
> >>>>>> fail
> >>>>>>>> in
> >>>>>>>>>>> case
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> target
> >>>>>>>>>>>>>>>>>> task
> >>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
> >>>>>> I
> >>>>>>>>> don't
> >>>>>>>>>>>> mean
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> cover
> >>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
> >>>>>> also
> >>>>>>>>> open a
> >>>>>>>>>>>>>>>>>> `notifyException`
> >>>>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
> >>>>>>>>>>> ResourceManager,
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> slot of
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
> >>>>>>> released
> >>>>>>>>> and
> >>>>>>>>>>>>>>> allocated?
> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> >>>>>> Wang
> >>>>>>> <
> >>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
> >>>>>> response
> >>>>>>>>> your 4
> >>>>>>>>>>>>>> remarks:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
> >>>>>>>>> controversy. I
> >>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>> “BlockList”
> >>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
> >>>>>>>> good
> >>>>>>>>>>> idea.
> >>>>>>>>>>>>>> Based
> >>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> above, I
> >>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>>> <host>/blocklist/node/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>>>>>> <host>/blocklist/taskmanager/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
> >>>>>>>> means
> >>>>>>>>>>> that
> >>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>> task
> >>>>>>>>>>>>>>>>>>>>> managers
> >>>>>>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
> >>>>>>>> these
> >>>>>>>>>>> TMs
> >>>>>>>>>>>> are
> >>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>> available.
> >>>>>>>>>>>>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
> >>>>>> these
> >>>>>>>> TMs
> >>>>>>>>>>> are
> >>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>>> lost,
> >>>>>>>>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
> >>>>>> are
> >>>>>>>>> still
> >>>>>>>>>>>>>>> registered
> >>>>>>>>>>>>>>>>>> in this
> >>>>>>>>>>>>>>>>>>>>>>> flink
> >>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
> >>>>>> once
> >>>>>>>> the
> >>>>>>>>>>>>>>>>> corresponding
> >>>>>>>>>>>>>>>>>>>>>> blocklist
> >>>>>>>>>>>>>>>>>>>>>>>>> item
> >>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
> >>>>>>>>>>>>>>> active/non-active
> >>>>>>>>>>>>>>>>>>>>> clusters.
> >>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
> >>>>>> TMs
> >>>>>>>> may
> >>>>>>>>> be
> >>>>>>>>>>>>>> released
> >>>>>>>>>>>>>>>>> due
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> idle
> >>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
> >>>>>> keep
> >>>>>>>> it.
> >>>>>>>>>>> The
> >>>>>>>>>>>>>> reasons
> >>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>>>> following:
> >>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
> >>>>>>> adding
> >>>>>>>> or
> >>>>>>>>>>>>> removing
> >>>>>>>>>>>>>>>>> items
> >>>>>>>>>>>>>>>>>> via
> >>>>>>>>>>>>>>>>>>>>>> REST
> >>>>>>>>>>>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
> >>>>>> it
> >>>>>>>> to
> >>>>>>>>>>>>>>>>> Long.MAX_VALUE .
> >>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
> >>>>>> a
> >>>>>>>>>>> period of
> >>>>>>>>>>>>>> time
> >>>>>>>>>>>>>>>>>> (such as
> >>>>>>>>>>>>>>>>>>>>>>> machine
> >>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
> >>>>>> prefer
> >>>>>>>> that
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
> >>>>>>> user
> >>>>>>>> to
> >>>>>>>>>>> do
> >>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> manually.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
> >>>>>>>>>>> 于2022年4月27日周三
> >>>>>>>>>>>>>>>>> 19:23写道：
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
> >>>>>> only
> >>>>>>>>>>> support
> >>>>>>>>>>>>>>> manually
> >>>>>>>>>>>>>>>>>> blocking
> >>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
> >>>>>>> different
> >>>>>>>>>>>>> strategies
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> also be
> >>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
> >>>>>>>> "bla.cklist"
> >>>>>>>>>>> due
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> controversy
> >>>>>>>>>>>>>>>>>>>>>>>>> around
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
> >>>>>>> Ticket
> >>>>>>>>>>> about
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>> topic a
> >>>>>>>>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>>>>>>>>>> back
> >>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
> >>>>>>>> avoid
> >>>>>>>>>>> the
> >>>>>>>>>>>>> term
> >>>>>>>>>>>>>>>>>> blacklist &
> >>>>>>>>>>>>>>>>>>>>>>>>> whitelist
> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
> >>>>>> "denylist"
> >>>>>>> or
> >>>>>>>>>>>>>>> "quarantined"
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
> >>>>>>> slightly
> >>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>> design
> >>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>> verbs
> >>>>>>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
> >>>>>>> anti-pattern
> >>>>>>>>> for
> >>>>>>>>>>>> REST
> >>>>>>>>>>>>>>> APIs.
> >>>>>>>>>>>>>>>>>> POST
> >>>>>>>>>>>>>>>>>>>>> on a
> >>>>>>>>>>>>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
> >>>>>>>> items.
> >>>>>>>>>>>> DELETE
> >>>>>>>>>>>>> on
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> individual
> >>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>>>>>> <host>/quarantine/items/<itemidentifier>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
> >>>>>>>>> taskmanagers
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>> nodes
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> REST
> >>>>>>>>>>>>>>>>>>>>>>>>> API
> >>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
> >>>>>>> opinion
> >>>>>>>>> on
> >>>>>>>>>>>> this?
> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
> >>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>>>> <host>/quarantine/nodes/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
> >>>>>>>>>>> non-active
> >>>>>>>>>>>>>>>>> resource
> >>>>>>>>>>>>>>>>>>>>>> managers,
> >>>>>>>>>>>>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
> >>>>>>>>>>> minimal,
> >>>>>>>>>>>> do
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>> timeout
> >>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
> >>>>>>>> manually
> >>>>>>>>>>> we
> >>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>> delegate
> >>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
> >>>>>>>>> behavior
> >>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>> better
> >>>>>>>>>>>>>>>>>> fit
> >>>>>>>>>>>>>>>>>>>>>> into
> >>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> >>>>>>> schrieb
> >>>>>>>>>>> Lijie
> >>>>>>>>>>>>> Wang
> >>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
> >>>>>>>> cluster
> >>>>>>>>>>> node
> >>>>>>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
> >>>>>>>>>>>>>>>>>>>>>>>>>>> disk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
> >>>>>>>> abnormalities).
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>> take care
> >>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
> >>>>>>> However,
> >>>>>>>>> due
> >>>>>>>>>>> to
> >>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>> locality
> >>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>> limited
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
> >>>>>> likely
> >>>>>>>> to
> >>>>>>>>> be
> >>>>>>>>>>>>>>> redeployed
> >>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
> >>>>>>>> task
> >>>>>>>>>>>>>>> abnormalities
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>> affect
> >>>>>>>>>>>>>>>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
> >>>>>> manually
> >>>>>>>>>>> identify
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> problematic
> >>>>>>>>>>>>>>>>>>>>>>>>> node and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
> >>>>>>> But
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>> approach
> >>>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>>>>>>>>> following
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
> >>>>>> heavy
> >>>>>>>>>>> process.
> >>>>>>>>>>>>>> Users
> >>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>>>>>> contact
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
> >>>>>>>>> operation
> >>>>>>>>>>> can
> >>>>>>>>>>>>>> even
> >>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>> dangerous
> >>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
> >>>>>>>>> events.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
> >>>>>> of
> >>>>>>>>>>> problems
> >>>>>>>>>>>>>>> manually
> >>>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>>>>> slow
> >>>>>>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
> >>>>>>>> propose
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> introduce a
> >>>>>>>>>>>>>>>>>>>>>> blacklist
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
> >>>>>>>>> problematic
> >>>>>>>>>>>>>>> resources.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
> >>>>>>> FLIP-224[1].
> >>>>>>>>>>> Looking
> >>>>>>>>>>>>>>> forward
> >>>>>>>>>>>>>>>>>> to your
> >>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>> Roman Boyko
> >>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> https://twitter.com/snntrable
> >>>>>>>> https://github.com/knaufk
> >>>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> https://twitter.com/snntrable
> >>>>>> https://github.com/knaufk
> >>>>>>
> >>>>
>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.

I have a number of concerns:

Is the id used for deleting an item the same sent in the initial request 
(and not one returned by Flink)?
I'm very concerned that the tm/node id can contain special characters.

The GET query should return a map, not a list of items. This makes it 
easier to work with.

The duality of endTimestamp and duration is also concerning.
If we conclude that endTimestamps can in fact work (and aren't utterly 
unusable due to timezones),
then this should be able to cover everything and rid us of some 
complexity w.r.t. POSTs to the same ID.
Additions would be a PUT, changes a PATCH, deletes a DELETE.


I also dislike how we're pushing more functionality into the 
slotpool/-manager.
These components are complex enough as-is, and instead I'd propose a 
separate component that interacts with the SlotPool/-Manager instead, 
for example by removing the slots from that TM.
The reason being that from the slot-pool perspective it is irrelevant 
whether a slot is gone because the TM was lost, or because it was blocked.


The BlocklistTracker shouldn't be modeled as component that is aware of 
the concept of main threads.
It really has no business knowing that; all it needs is an executor for 
handling timeouts/periodic actions,
and a way to interact with the JM/RM (which internally can handle the 
scheduling into the main thread).


On 20/05/2022 07:20, Lijie Wang wrote:
> Hi everyone,
>
> I have started a vote for this FLIP [1]. Please cast your vote there or ask
> additional questions here. [1]
> https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
>
> Best, Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：
>
>> Hi Konstantin,
>>
>> We found that Flink REST URL does not support the format ":merge" , which
>> will be recognized as a parameter in the URL(due to start with a colon).
>>
>> We will keep the previous way, i.e.
>>
>> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
>> and the "id" and "merge" flag are put into the request body.
>>
>> Best,
>> Lijie
>>
>> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
>>
>>> Hi Weihua,
>>> thanks for feedback.
>>>
>>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
>>> step towards auto-detection.
>>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
>>> exception of insufficient resources.
>>> 3. No. This FLIP won't change the WebUI. The blocklist information can be
>>> obtained through REST API and metrics.
>>>
>>> Best,
>>> Lijie
>>>
>>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
>>>
>>>> Hi,
>>>> Thanks for creating this FLIP.
>>>> We have implemented an automatic blocklist detection mechanism
>>>> internally, which is indeed very effective for handling node failures.
>>>> Due to the large number of nodes, although SREs already support
>>>> automatic offline failure nodes, the detection is not 100% accurate and
>>>> there is some delay.
>>>> So the blocklist mechanism can make flink job recover from failure much
>>>> faster.
>>>>
>>>> Here are some of my thoughts:
>>>> 1. In this FLIP, it needs users to locate machine failure manually,
>>>> there is a certain cost of use
>>>> 2. What happens if too many nodes are blocked, resulting in insufficient
>>>> resources? Will there be a special Exception for the user?
>>>> 3. Will we display the blocklist information in the WebUI? The blocklist
>>>> is for cluster level, and if multiple users share a cluster, some users may
>>>> be a little confused when resources are not enough, or when resources are
>>>> applied for more.
>>>>
>>>> Also, Looking forward to the next FLIP on auto-detection.
>>>>
>>>> Best,
>>>> Weihua
>>>>
>>>>> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
>>>>>
>>>>> Hi Konstantin,
>>>>>
>>>>> Maybe change it to the following:
>>>>>
>>>>> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>>>> Merge is not allowed. If the {id} already exists, return error.
>>>> Otherwise,
>>>>> create a new item.
>>>>>
>>>>> 2. POST: http://
>>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>>>> Merge is allowed. If the {id} already exists, merge. Otherwise, create
>>>> a
>>>>> new item.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> Best,
>>>>> Lijie
>>>>>
>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
>>>>>
>>>>>> Hi Lijie,
>>>>>>
>>>>>> hm, maybe the following is more appropriate in that case
>>>>>>
>>>>>> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Konstantin
>>>>>>
>>>>>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>>>>>> wangdachui9501@gmail.com>:
>>>>>>
>>>>>>> Hi Konstantin,
>>>>>>> thanks for your feedback.
>>>>>>>
>>>>>>>  From what I understand, PUT should be idempotent. However, we have a
>>>>>>> *timeout* field in the request. This means that initiating the same
>>>>>> request
>>>>>>> at two different times will lead to different resource status
>>>> (timestamps
>>>>>>> of the items to be removed will be different).
>>>>>>>
>>>>>>> Should we use PUT in this case? WDYT?
>>>>>>>
>>>>>>> Best,
>>>>>>> Lijie
>>>>>>>
>>>>>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>>>>>>>
>>>>>>>> Hi Lijie,
>>>>>>>>
>>>>>>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT
>>>> on
>>>>>> the
>>>>>>>> resource?
>>>>>>>>
>>>>>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>>>>>>>> wangdachui9501@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
>>>> made
>>>>>>> the
>>>>>>>>> following changes on REST API:
>>>>>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
>>>>>>> one.
>>>>>>>> If
>>>>>>>>> both are specified, will return error.
>>>>>>>>> 2.  If the specified item is already there, the *ADD* operation has
>>>>>> two
>>>>>>>>> behaviors:  *return error*(default value) or *merge/update*, and we
>>>>>>> add a
>>>>>>>>> flag to the request body to control it. You can find more details
>>>>>>> "Public
>>>>>>>>> Interface" section.
>>>>>>>>>
>>>>>>>>> If there is no more feedback, we will start the vote thread next
>>>>>> week.
>>>>>>>>> Best,
>>>>>>>>> Lijie
>>>>>>>>>
>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>>>>>>>>>
>>>>>>>>>> Hi Becket Qin,
>>>>>>>>>>
>>>>>>>>>> Thanks for your suggestions.  I have moved the description of
>>>>>>>>>> configurations, metrics and REST API into "Public Interface"
>>>>>> section,
>>>>>>>> and
>>>>>>>>>> made a few updates according to your suggestion.  And in this
>>>> FLIP,
>>>>>>>> there
>>>>>>>>>> no public java Interfaces or pluggables that users need to
>>>>>> implement
>>>>>>> by
>>>>>>>>>> themselves.
>>>>>>>>>>
>>>>>>>>>> Answers for you questions:
>>>>>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>>>>>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>>>>>> items
>>>>>>>> can
>>>>>>>>>> only be added through the REST API, so these 2 action are
>>>> mentioned
>>>>>>> in
>>>>>>>>> the
>>>>>>>>>> REST API part (The REST API part has beed moved to public
>>>> interface
>>>>>>>> now).
>>>>>>>>>> 2. I agree with you. I have changed the "Cause" field to String,
>>>>>> and
>>>>>>>>> allow
>>>>>>>>>> users to specify it via REST API.
>>>>>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>>>>>> above,
>>>>>>>> we
>>>>>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
>>>> ADD
>>>>>>>> REST
>>>>>>>>>> API to specify when to remove the blocked item. These 2 fields are
>>>>>>>>>> optional, if neither is specified, it means that the blocked item
>>>>>> is
>>>>>>>>>> permanent and will not be removed. If both are specified, the
>>>>>> minimum
>>>>>>>> of
>>>>>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as the
>>>>>>> time
>>>>>>>> to
>>>>>>>>>> remove the blocked item. To keep the configurations more minimal,
>>>>>> we
>>>>>>>> have
>>>>>>>>>> removed the *cluster.resource-blocklist.item.timeout*
>>>> configuration
>>>>>>>>>> option.
>>>>>>>>>> 4. Yes, the block item will be overridden if the specified item
>>>>>>> already
>>>>>>>>>> exists. The ADD operation is *ADD or UPDATE*.
>>>>>>>>>> 5. Yes. On JM/RM side, all the blocklist information is maintained
>>>>>> in
>>>>>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>>>>>>>>> abstracted
>>>>>>>>>> to other interfaces) will be propagated to different components.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Lijie
>>>>>>>>>>
>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>>>>>>>>>>
>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the FLIP. It looks like the public interface
>>>>>>>> section
>>>>>>>>>>> did not fully reflect all the user sensible behavior and API. Can
>>>>>>> you
>>>>>>>>> put
>>>>>>>>>>> everything that users may be aware of there? That would include
>>>>>> the
>>>>>>>> REST
>>>>>>>>>>> API, metrics, configurations, public java Interfaces or
>>>> pluggables
>>>>>>>> that
>>>>>>>>>>> users may see or implement by themselves, as well as a brief
>>>>>> summary
>>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>> behavior of the public API.
>>>>>>>>>>>
>>>>>>>>>>> Besides that, I have a few questions:
>>>>>>>>>>>
>>>>>>>>>>> 1. According to the conversation in the discussion thread, it
>>>>>> looks
>>>>>>>> like
>>>>>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
>>>>>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
>>>>>> can
>>>>>>>> you
>>>>>>>>>>> add
>>>>>>>>>>> that to the public interface as well?
>>>>>>>>>>>
>>>>>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>>>>>>> Throwable
>>>>>>>>> and
>>>>>>>>>>> is not reflected in the REST API. Should that be included in the
>>>>>>> query
>>>>>>>>>>> response? And should we change that field to be a String so users
>>>>>>> may
>>>>>>>>>>> specify the cause via the REST API when they block some nodes /
>>>>>> TMs?
>>>>>>>>>>> 3. Would it be useful to allow users to have different timeouts
>>>>>> for
>>>>>>>>>>> different blocked items? So while there is a default timeout,
>>>>>> users
>>>>>>>> can
>>>>>>>>>>> also override it via the REST API when they block an entity.
>>>>>>>>>>>
>>>>>>>>>>> 4. Regarding the ADD operation, if the specified item is already
>>>>>>>> there,
>>>>>>>>>>> will the block item be overridden? For example, if the user wants
>>>>>> to
>>>>>>>>>>> extend
>>>>>>>>>>> the timeout of a blocked item, can they just  issue an ADD
>>>> command
>>>>>>>>> again?
>>>>>>>>>>> 5. I am not quite familiar with the details of this, but is there
>>>>>> a
>>>>>>>>> source
>>>>>>>>>>> of truth for the blocked list? I think it might be good to have a
>>>>>>>> single
>>>>>>>>>>> source of truth for the blocked list and just propagate that list
>>>>>> to
>>>>>>>>>>> different components to take the action of actually blocking the
>>>>>>>>> resource.
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>>>>>> wangdachui9501@gmail.com
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> Based on the discussion in the mailing list, I updated the FLIP
>>>>>>> doc,
>>>>>>>>> the
>>>>>>>>>>>> changes include:
>>>>>>>>>>>> 1. Changed the description of the motivation section to more
>>>>>>> clearly
>>>>>>>>>>>> describe the problem this FLIP is trying to solve.
>>>>>>>>>>>> 2. Only  *Manually* is supported.
>>>>>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Lijie
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Lijie!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
>>>>>>>>>>> helpfulhere.”
>>>>>>>>>>>>> Can you explain it in detail? I don't fully understand that.
>>>>>> In
>>>>>>>>>>>> myopinion,
>>>>>>>>>>>>> non-active and active are the same, and no special treatment
>>>>>>>>>>> isrequired.*
>>>>>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
>>>>>>> were
>>>>>>>>>>> talking
>>>>>>>>>>>>> about the HA mode (but not about Active and Standalone
>>>>>>>>>>> ResourceManager).
>>>>>>>>>>>>> And the original question was - how to handle the blacklisted
>>>>>>>> nodes
>>>>>>>>>>> list
>>>>>>>>>>>> at
>>>>>>>>>>>>> the moment of leader change? Should we simply forget about
>>>>>> them
>>>>>>> or
>>>>>>>>>>> try to
>>>>>>>>>>>>> pre-save that list on the remote storage?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey.wy@gmail.com
>>>>>>>>> wrote:
>>>>>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>>>>>> it
>>>>>>> is
>>>>>>>>>>> indeed
>>>>>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
>>>>>>>> could
>>>>>>>>>>> not
>>>>>>>>>>>>>> support.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Jiangang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
>>>>>> and
>>>>>>>>> mark a
>>>>>>>>>>>> node
>>>>>>>>>>>>>>> blocked?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In fact, we currently plan to not support auto-detection
>>>>>> in
>>>>>>>> this
>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>> part about auto-detection may be continued in a separate
>>>>>>> FLIP
>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>>>> future. Some guys have the same concerns as you, and the
>>>>>>>>>>> correctness
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> necessity of auto-detection may require further discussion
>>>>>>> in
>>>>>>>>> the
>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
>>>>>> node
>>>>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>>>> node
>>>>>>>>>>>>>>> should be marked blocked.
>>>>>>>>>>>>>>> By design, the blocklist information will be shared among
>>>>>>> all
>>>>>>>>> jobs
>>>>>>>>>>>> in a
>>>>>>>>>>>>>>> cluster/session. The JM will sync blocklist information
>>>>>> with
>>>>>>>> RM.
>>>>>>>>>>>>>>> @Martijn
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree with Yang Wang on this.
>>>>>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
>>>>>>>>>>> MARK_BLOCKLISTED(Just
>>>>>>>>>>>>>> limits
>>>>>>>>>>>>>>> the load of the node and does not  kill all the processes
>>>>>> on
>>>>>>>> it)
>>>>>>>>>>> is
>>>>>>>>>>>>> also
>>>>>>>>>>>>>>> important, and we think that external systems (*yarn
>>>>>> rmadmin
>>>>>>>> or
>>>>>>>>>>>> kubectl
>>>>>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
>>>>>>>> only
>>>>>>>>>>>>>> *manually*.
>>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>>>>>> indeed
>>>>>>>>>>> super
>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>> to get right.
>>>>>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
>>>>>>>>> concern.
>>>>>>>>>>>>>>> However, we currently plan to not support auto-detection
>>>>>> in
>>>>>>>> this
>>>>>>>>>>>> FLIP,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
>>>>>> FLIP
>>>>>>>>>>> provides
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>>>>>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>>>>>>>>>>>>>>> the auto-detection may be done by external systems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>>>>>>> 19:04写道：
>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>>>>>> could
>>>>>>>> not
>>>>>>>>>>> see
>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>>>>>>>> approach(via
>>>>>>>>>>>> *yarn
>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree with Yang Wang on this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To me this sounds yet again like one of those magical
>>>>>>>>>>> mechanisms
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> will rarely work just right.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>>>>>> indeed
>>>>>>>>>>> super
>>>>>>>>>>>>> hard
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> get right.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Martijn
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>>>>>>>>>>>> liujiangangpeng@gmail.com
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
>>>>>>>>> decrease
>>>>>>>>>>>>> great
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>> for us. We have implemented the similar feature in our
>>>>>>>> inner
>>>>>>>>>>> flink
>>>>>>>>>>>>>>>>> version.
>>>>>>>>>>>>>>>>> Below is something that I care about:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    1. For auto-detecting, I wonder how to make the
>>>>>>> strategy
>>>>>>>>> and
>>>>>>>>>>>>> mark a
>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>    blocked? Sometimes the blocked node is hard to be
>>>>>>>>> detected,
>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> example,
>>>>>>>>>>>>>>>>>    the upper node or the down node will be blocked when
>>>>>>>>> network
>>>>>>>>>>>>>>>>> unreachable.
>>>>>>>>>>>>>>>>>    2. I see that the strategy is made in JobMaster
>>>>>> side.
>>>>>>>> How
>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>    implementing the similar logic in resource manager?
>>>>>> In
>>>>>>>>>>> session
>>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>>>>> multi
>>>>>>>>>>>>>>>>>    jobs can fail on the same bad node and the node
>>>>>> should
>>>>>>>> be
>>>>>>>>>>>> marked
>>>>>>>>>>>>>>>>> blocked.
>>>>>>>>>>>>>>>>>    If the job makes the strategy, the node may be not
>>>>>>>> marked
>>>>>>>>>>>> blocked
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>    fail times don't exceed the threshold.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you for all your feedback!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
>>>>>> some
>>>>>>> of
>>>>>>>>> my
>>>>>>>>>>>>>> thoughts:
>>>>>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
>>>>>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>>>>>>>>>>>>>>>>>> The automatical way should be something built upon
>>>>>> the
>>>>>>>>>>> blocklist
>>>>>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
>>>>>>>>>>>> configurable
>>>>>>>>>>>>>>>>>> blocklist strategy, but I think we can further
>>>>>> decouple
>>>>>>>> it
>>>>>>>>> by
>>>>>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
>>>>>>>> suggested,
>>>>>>>>>>> which
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
>>>>>>> detected.
>>>>>>>>>>>> However,
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> should be a separate FLIP with further dev
>>>>>> discussions
>>>>>>>> and
>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>>>>> from users. I also agree with Becket that different
>>>>>>> users
>>>>>>>>>>> have
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>> requirements, and we should listen to them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
>>>>>>>> externally
>>>>>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
>>>>>> way
>>>>>>> to
>>>>>>>>>>> avoid
>>>>>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
>>>>>>>> users
>>>>>>>>>>> may
>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>> want to limit the load of the node and do not want to
>>>>>>>> kill
>>>>>>>>>>> all
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> processes on it. Another case is the speculative
>>>>>>>>> execution[1]
>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> may also leverage this feature to avoid starting
>>>>>> mirror
>>>>>>>>>>> tasks on
>>>>>>>>>>>>>> slow
>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>>>>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>>>>>>>>> 15:56写道：
>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for your feedback.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
>>>>>> here
>>>>>>>>>>> because
>>>>>>>>>>>> it
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> affect the value and design of the blocklist
>>>>>> mechanism
>>>>>>>>>>> (perhaps
>>>>>>>>>>>> I
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
>>>>>>>> FLIP:
>>>>>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>>>>>>> node
>>>>>>>>> as
>>>>>>>>>>>>>> blocked.
>>>>>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
>>>>>>>> task
>>>>>>>>>>>> manager
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>> node.
>>>>>>>>>>>>>>>>>> But slots that are already allocated will not be
>>>>>>>> affected.
>>>>>>>>> A
>>>>>>>>>>>>> typical
>>>>>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
>>>>>>> In
>>>>>>>>> this
>>>>>>>>>>>>> case,
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> hope
>>>>>>>>>>>>>>>>>> that subsequent resource allocations will not be on
>>>>>> the
>>>>>>>> hot
>>>>>>>>>>>>> machine,
>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>> tasks currently running on it should not be affected.
>>>>>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>>>>>> task
>>>>>>>>>>> manager
>>>>>>>>>>>> or
>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
>>>>>> tasks
>>>>>>>> will
>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> restarted on
>>>>>>>>>>>>>>>>>> non-blocked task managers.
>>>>>>>>>>>>>>>>>>> For the above 2 actions, the former may more
>>>>>>> highlight
>>>>>>>>> the
>>>>>>>>>>>>> meaning
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> this FLIP, because the external system cannot do
>>>>>> that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
>>>>>> basically
>>>>>>>>> agree
>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> @Becket
>>>>>>>>>>>>>>>>>> Qin: different users have different answers. Not all
>>>>>>>> users’
>>>>>>>>>>>>>> deployment
>>>>>>>>>>>>>>>>>> environments have a special external system that can
>>>>>>>>> perform
>>>>>>>>>>> the
>>>>>>>>>>>>>>> anomaly
>>>>>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
>>>>>>>>>>> auto-detection
>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>>> require much extra work on top of manual
>>>>>> specification.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I will answer your other questions one by one.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Yangze
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
>>>>>>> the
>>>>>>> `cluster.resource-blocklist.item.timeout-check-interval`
>>>>>>>> to
>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
>>>>>>> separate
>>>>>>>>>>>> interface
>>>>>>>>>>>>>>>>> (maybe
>>>>>>>>>>>>>>>>>> BlocklistExceptionListener), and the
>>>>>>>>>>>>> ResourceManagerBlocklistHandler
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> implement it in the future.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Martijn
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
>>>>>> by
>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>> operators.
>>>>>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
>>>>>>>> according
>>>>>>>>>>> to
>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>> experience, users are often the first to perceive the
>>>>>>>>> machine
>>>>>>>>>>>>>> problems
>>>>>>>>>>>>>>>>>> (because of job failover or delay), and they will
>>>>>>> contact
>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>> operators
>>>>>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
>>>>>>>>>>> machine is
>>>>>>>>>>>>>>>>>> problematic. From this point of view, I think the
>>>>>>> people
>>>>>>>>> who
>>>>>>>>>>>>> really
>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
>>>>>>>> performed
>>>>>>>>> by
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Chesnay
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
>>>>>>> for
>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED
>>>>>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
>>>>>> when
>>>>>>>> the
>>>>>>>>>>> task
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>>>>>>>> release
>>>>>>>>>>> the
>>>>>>>>>>>>> slot
>>>>>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
>>>>>>> on
>>>>>>>>> it.
>>>>>>>>>>> If
>>>>>>>>>>>> we
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
>>>>>> needs
>>>>>>> to
>>>>>>>>>>>> retrieve
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
>>>>>>>> relatively
>>>>>>>>>>>> large,
>>>>>>>>>>>>>> so
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> think it's worth maintaining the blocklist
>>>>>> information
>>>>>>> on
>>>>>>>>>>> the JM
>>>>>>>>>>>>>> side
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> syncing them.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Роман
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     a) “Probably storing inside Zookeeper/Configmap
>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>>>>> helpful
>>>>>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
>>>>>>>>>>> understand
>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
>>>>>>>> special
>>>>>>>>>>>>>> treatment
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> required.
>>>>>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
>>>>>> sense,
>>>>>>> I
>>>>>>>>> will
>>>>>>>>>>>> add
>>>>>>>>>>>>> it
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Yang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
>>>>>> cannot
>>>>>>>>>>> support
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Looking forward to your further feedback.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>>>>>>>> 21:09写道：
>>>>>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
>>>>>>>>>>> Reliability
>>>>>>>>>>>>>>> Engineer)
>>>>>>>>>>>>>>>>>>>> already has many tools to detect the unstable
>>>>>> nodes,
>>>>>>>>> which
>>>>>>>>>>>>> could
>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> system logs/metrics into consideration.
>>>>>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
>>>>>> taint
>>>>>>>> in
>>>>>>>>>>> K8s
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> prevent
>>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>>>> allocations on these unstable nodes.
>>>>>>>>>>>>>>>>>>>> At last, they will evict all the containers and
>>>>>> pods
>>>>>>>>>>> running
>>>>>>>>>>>> on
>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
>>>>>>> So
>>>>>>>> I
>>>>>>>>> am
>>>>>>>>>>>>> afraid
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>>>>>>>> could
>>>>>>>>>>> not
>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>>>>>>>>>>> approach(via
>>>>>>>>>>>>>>> *yarn
>>>>>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>>>>>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
>>>>>>> which
>>>>>>>>>>> could
>>>>>>>>>>>>>> expose
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
>>>>>>> them
>>>>>>>> if
>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>>>> explicitly.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>>>>>>>> 16:36写道：
>>>>>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
>>>>>> and
>>>>>>>>>>> somewhat
>>>>>>>>>>>>>>> related
>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>> design principle about how people should operate
>>>>>>>>> Flink.
>>>>>>>>>>>>>>>>>>>>> I think there are three things involved in this
>>>>>>>> FLIP.
>>>>>>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
>>>>>>>>>>>>>>>>>>>>>      b) Collect the information of the unstable
>>>>>>> node
>>>>>>>>> and
>>>>>>>>>>>>> form a
>>>>>>>>>>>>>>>>>> blocklist.
>>>>>>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> My two cents:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
>>>>>>> should
>>>>>>>>> have
>>>>>>>>>>>> c).
>>>>>>>>>>>>> It
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> not only
>>>>>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
>>>>>> handy
>>>>>>>> for
>>>>>>>>>>> some
>>>>>>>>>>>>>>> planned
>>>>>>>>>>>>>>>>>>>>> maintenance.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
>>>>>> who
>>>>>>>>>>> should be
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> brain
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
>>>>>>>>> largely
>>>>>>>>>>>>> depends
>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> who we
>>>>>>>>>>>>>>>>>>>>> talk to. Different users would probably give
>>>>>>>> different
>>>>>>>>>>>>> answers.
>>>>>>>>>>>>>>> For
>>>>>>>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>>>>>> who do have a centralized node health management
>>>>>>>>>>> service,
>>>>>>>>>>>> let
>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>> do just
>>>>>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
>>>>>>>> Flink
>>>>>>>>>>> would
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
>>>>>> it
>>>>>>> to
>>>>>>>>>>> that
>>>>>>>>>>>>>>> service,
>>>>>>>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>>>>>>> take the command from that service to block the
>>>>>>>>>>> problematic
>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>> On
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
>>>>>>>> service,
>>>>>>>>>>>> simply
>>>>>>>>>>>>>>>>> letting
>>>>>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
>>>>>> nodes
>>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>>>>>> desired
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
>>>>>>>> pluggable /
>>>>>>>>>>>>>> optional.
>>>>>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
>>>>>>> have
>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>> pluggable
>>>>>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
>>>>>> unstable
>>>>>>>>> nodes
>>>>>>>>>>>>>>> actively.
>>>>>>>>>>>>>>>>> (A
>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
>>>>>>> which
>>>>>>>>>>> can be
>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> report
>>>>>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
>>>>>>> the
>>>>>>>>>>> scope
>>>>>>>>>>>>>>> relevant
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
>>>>>> OK
>>>>>>> to
>>>>>>>>>>> have a
>>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
>>>>>>> Flink
>>>>>>>> to
>>>>>>>>>>> take
>>>>>>>>>>>>>>> action
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> block
>>>>>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
>>>>>>>>> timeout.
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>>>>>>>>>>>>>> ro.v.boyko@gmail.com
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
>>>>>>>>>>> development.
>>>>>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
>>>>>>> we
>>>>>>>>>>> should
>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>>>> to move
>>>>>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
>>>>>> nodes/TMs
>>>>>>>>> from
>>>>>>>>>>>>> active
>>>>>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
>>>>>>>> storing
>>>>>>>>>>>> inside
>>>>>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
>>>>>>>> organizations
>>>>>>>>>>>> don't
>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> expose
>>>>>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
>>>>>>>> it's
>>>>>>>>>>>>> necessary
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
>>>>>>> for
>>>>>>>>>>>> avoiding
>>>>>>>>>>>>>>>>> incorrect
>>>>>>>>>>>>>>>>>>>>>> automatic behaviour.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>>>>>>>> would
>>>>>>>>> be
>>>>>>>>>>>>> better
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> extend
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>>>>>> *endTimestamp*
>>>>>>>>> field
>>>>>>>>>>> and
>>>>>>>>>>>>>> fill
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>>>>>>> to:
>>>>>>>>>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    Provide the ability to users to setup the
>>>>>>> exact
>>>>>>>>>>> time
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> blocklist end
>>>>>>>>>>>>>>>>>>>>>>    through RestAPI
>>>>>>>>>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    Not being tied to a single value of
>>>>>>>>>>>>>>>>>>>>>>    *cluster.resource-blacklist.item.timeout*
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>>>>>> <
>>>>>>>>>>>>>>>>> chesnay@apache.org>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>>>>>>>> lines
>>>>>>>>> a
>>>>>>>>>>>> bit.
>>>>>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>>>>>>>>> auto-detection
>>>>>>>>>>>> and
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>> have an
>>>>>>>>>>>>>>>>>>>>>>> opt-in mechanism
>>>>>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>>>>>> this
>>>>>>>>> sounds
>>>>>>>>>>>> yet
>>>>>>>>>>>>>>> again
>>>>>>>>>>>>>>>>>> like one
>>>>>>>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>>>>>> just
>>>>>>>>> right.
>>>>>>>>>>>>>>>>>>>>>>> An external system can leverage way more
>>>>>>>>> information
>>>>>>>>>>>>> after
>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>>>>>>>> complexity
>>>>>>>>>>> of
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>>>>>>> between
>>>>>>>>>>>>>> components;
>>>>>>>>>>>>>>>>>>>>> adjustments
>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>> slot and resource protocol.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>>>>>>> resource
>>>>>>>>>>>>> management
>>>>>>>>>>>>>>>>>> perspective,
>>>>>>>>>>>>>>>>>>>>>>> then there
>>>>>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>>>>>> protocol
>>>>>>>> at
>>>>>>>>>>> all
>>>>>>>>>>>> (or
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> fact
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>>>>>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>>>>>>> around
>>>>>>>>>>> blocked
>>>>>>>>>>>>> TMs
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>>>>> place.
>>>>>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>>>>>>> down
>>>>>>>>>>> after
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> RM
>>>>>>>>>>>>>>>>>>>>>>> because of
>>>>>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>>>>>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>>>>>>>> blocking
>>>>>>>>>>> the
>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>>>>>>>>>>> process(es).
>>>>>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>>>>>>>> back
>>>>>>>>>>> up.
>>>>>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>>>>>> the
>>>>>>>> same
>>>>>>>>>>> line
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> thinking to
>>>>>>>>>>>>>>>>>>>>>>> standalone resource management.
>>>>>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>>>>>> process/node
>>>>>>>>>>> manually
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> core
>>>>>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>>>>>>>> understand
>>>>>>>>>>> the
>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> I see
>>>>>>>>>>>>>>>>>>>>>>> value
>>>>>>>>>>>>>>>>>>>>>>>> in the automatic detection and
>>>>>>> blocklisting. I
>>>>>>>>> do
>>>>>>>>>>>> have
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> concerns
>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>>>>>>> blocked
>>>>>>>>>>>>> resources.
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>>>>>>>> concerns;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>>>>>>>>> separation
>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> concerns,
>>>>>>>>>>>>>>>>>>>>> meaning
>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>>>>>>>> managing a
>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> there's
>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>>>>>>>>>>> introduction of
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> mechanism,
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> latter group now can influence the
>>>>>>>>> responsibility
>>>>>>>>>>> of
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>>>>>>>> group
>>>>>>>>>>>> blocks
>>>>>>>>>>>>>>>>>> something,
>>>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>>>>>>> paging
>>>>>>>>>>>>> mechanism
>>>>>>>>>>>>>>>>>> triggering
>>>>>>>>>>>>>>>>>>>>>> etc)
>>>>>>>>>>>>>>>>>>>>>>>> which impacts the first group.
>>>>>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>>>>>> go
>>>>>>>>>>> through
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>>>>>> it
>>>>>>>>>>> should
>>>>>>>>>>>>> be? I
>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>>>>>>> make
>>>>>>>>>>> sense
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>>>>>>> really
>>>>>>>>>>> small
>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> Flink?
>>>>>>>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>>>>>>> such
>>>>>>>> a
>>>>>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>>>>>>> I'm
>>>>>>>>>>>> leaning
>>>>>>>>>>>>>>>>> towards
>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>> exposing
>>>>>>>>>>>>>>>>>>>>>>>> this to the user group but having this
>>>>>>>> available
>>>>>>>>>>>>> strictly
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>>>>>> their
>>>>>>>>>>>>>>>>>> paging/metrics/logging
>>>>>>>>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>>>>>>>>>> to take this into account.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>>>>>>>>>>>>>>> karmagyz@gmail.com
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>>>>>>> some
>>>>>>>>>>> cents
>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> user?
>>>>>>>>>>>>>>>>>>>>>>>>> I think the semantics of
>>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
>>>>>>> the
>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>>>>> mechanism is
>>>>>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
>>>>>> it
>>>>>>>>> will
>>>>>>>>>>> be
>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>>>>>> exception
>>>>>>>> of a
>>>>>>>>>>> task
>>>>>>>>>>>>>>>>> manager to
>>>>>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>>>>>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>>>>>> fail
>>>>>>>> in
>>>>>>>>>>> case
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>>>>>> I
>>>>>>>>> don't
>>>>>>>>>>>> mean
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> cover
>>>>>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>>>>>> also
>>>>>>>>> open a
>>>>>>>>>>>>>>>>>> `notifyException`
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
>>>>>>>>>>> ResourceManager,
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> slot of
>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
>>>>>>> released
>>>>>>>>> and
>>>>>>>>>>>>>>> allocated?
>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>> Yangze Guo
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>>>>>> Wang
>>>>>>> <
>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
>>>>>> response
>>>>>>>>> your 4
>>>>>>>>>>>>>> remarks:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>>>>>>>>> controversy. I
>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>> “BlockList”
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
>>>>>>>> good
>>>>>>>>>>> idea.
>>>>>>>>>>>>>> Based
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> above, I
>>>>>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>> <host>/blocklist/node/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>> <host>/blocklist/taskmanager/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
>>>>>>>> means
>>>>>>>>>>> that
>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>> managers
>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
>>>>>>>> these
>>>>>>>>>>> TMs
>>>>>>>>>>>> are
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>> available.
>>>>>>>>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
>>>>>> these
>>>>>>>> TMs
>>>>>>>>>>> are
>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>> lost,
>>>>>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
>>>>>> are
>>>>>>>>> still
>>>>>>>>>>>>>>> registered
>>>>>>>>>>>>>>>>>> in this
>>>>>>>>>>>>>>>>>>>>>>> flink
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
>>>>>> once
>>>>>>>> the
>>>>>>>>>>>>>>>>> corresponding
>>>>>>>>>>>>>>>>>>>>>> blocklist
>>>>>>>>>>>>>>>>>>>>>>>>> item
>>>>>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
>>>>>>>>>>>>>>> active/non-active
>>>>>>>>>>>>>>>>>>>>> clusters.
>>>>>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
>>>>>> TMs
>>>>>>>> may
>>>>>>>>> be
>>>>>>>>>>>>>> released
>>>>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> idle
>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
>>>>>> keep
>>>>>>>> it.
>>>>>>>>>>> The
>>>>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>>>>>>> adding
>>>>>>>> or
>>>>>>>>>>>>> removing
>>>>>>>>>>>>>>>>> items
>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>>>>>> it
>>>>>>>> to
>>>>>>>>>>>>>>>>> Long.MAX_VALUE .
>>>>>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>>>>>> a
>>>>>>>>>>> period of
>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>> (such as
>>>>>>>>>>>>>>>>>>>>>>> machine
>>>>>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>>>>>> prefer
>>>>>>>> that
>>>>>>>>>>>> Flink
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>>>>>>> user
>>>>>>>> to
>>>>>>>>>>> do
>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> manually.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
>>>>>>>>>>> 于2022年4月27日周三
>>>>>>>>>>>>>>>>> 19:23写道：
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>>>>>> only
>>>>>>>>>>> support
>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>> blocking
>>>>>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>>>>>>> different
>>>>>>>>>>>>> strategies
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> also be
>>>>>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
>>>>>>>> "bla.cklist"
>>>>>>>>>>> due
>>>>>>>>>>>> to
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> controversy
>>>>>>>>>>>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
>>>>>>> Ticket
>>>>>>>>>>> about
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> topic a
>>>>>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
>>>>>>>> avoid
>>>>>>>>>>> the
>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>> blacklist &
>>>>>>>>>>>>>>>>>>>>>>>>> whitelist
>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
>>>>>> "denylist"
>>>>>>> or
>>>>>>>>>>>>>>> "quarantined"
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
>>>>>>> slightly
>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>> design
>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> verbs
>>>>>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
>>>>>>> anti-pattern
>>>>>>>>> for
>>>>>>>>>>>> REST
>>>>>>>>>>>>>>> APIs.
>>>>>>>>>>>>>>>>>> POST
>>>>>>>>>>>>>>>>>>>>> on a
>>>>>>>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
>>>>>>>> items.
>>>>>>>>>>>> DELETE
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>> <host>/quarantine/items/<itemidentifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>>>>>>>>> taskmanagers
>>>>>>>>>>>> and
>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>>>>>>> opinion
>>>>>>>>> on
>>>>>>>>>>>> this?
>>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>>>>>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>> <host>/quarantine/nodes/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>>>>>> <host>/quarantine/taskmanager/<identifier>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>>>>>>>>>>> non-active
>>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>>>>>>>> managers,
>>>>>>>>>>>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>>>>>>>>>>> minimal,
>>>>>>>>>>>> do
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>>>>>>>> manually
>>>>>>>>>>> we
>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>> delegate
>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>>>>>>>>> behavior
>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Konstantin
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>>>>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
>>>>>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>>>>>>> schrieb
>>>>>>>>>>> Lijie
>>>>>>>>>>>>> Wang
>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>>>>>>>> cluster
>>>>>>>>>>> node
>>>>>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>>>>>> (insufficient
>>>>>>>>>>>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>>>>>>>> abnormalities).
>>>>>>>>>>>> Flink
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>> take care
>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>>>>>>> However,
>>>>>>>>> due
>>>>>>>>>>> to
>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>> locality
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>> limited
>>>>>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>>>>>> likely
>>>>>>>> to
>>>>>>>>> be
>>>>>>>>>>>>>>> redeployed
>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>>>>>>>> task
>>>>>>>>>>>>>>> abnormalities
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> affect
>>>>>>>>>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>>>>>>>>>>>> progress.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
>>>>>> manually
>>>>>>>>>>> identify
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> problematic
>>>>>>>>>>>>>>>>>>>>>>>>> node and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
>>>>>>> But
>>>>>>>>>>> this
>>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
>>>>>> heavy
>>>>>>>>>>> process.
>>>>>>>>>>>>>> Users
>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>> contact
>>>>>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
>>>>>>>>> operation
>>>>>>>>>>> can
>>>>>>>>>>>>>> even
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> dangerous
>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
>>>>>>>>> events.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
>>>>>> of
>>>>>>>>>>> problems
>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>>>>> slow
>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>>>>>>>> propose
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>>>>>>>>>> blacklist
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>>>>>>>>> problematic
>>>>>>>>>>>>>>> resources.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
>>>>>>> FLIP-224[1].
>>>>>>>>>>> Looking
>>>>>>>>>>>>>>> forward
>>>>>>>>>>>>>>>>>> to your
>>>>>>>>>>>>>>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Roman Boyko
>>>>>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> https://twitter.com/snntrable
>>>>>>>> https://github.com/knaufk
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> https://twitter.com/snntrable
>>>>>> https://github.com/knaufk
>>>>>>
>>>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi everyone,

I have started a vote for this FLIP [1]. Please cast your vote there or ask
additional questions here. [1]
https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg

Best, Lijie

Lijie Wang <wa...@gmail.com> 于2022年5月19日周四 17:34写道：

> Hi Konstantin,
>
> We found that Flink REST URL does not support the format ":merge" , which
> will be recognized as a parameter in the URL(due to start with a colon).
>
> We will keep the previous way, i.e.
>
> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> and the "id" and "merge" flag are put into the request body.
>
> Best,
> Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：
>
>> Hi Weihua,
>> thanks for feedback.
>>
>> 1. Yes, only *Manually* is supported in this FLIP, but it's the first
>> step towards auto-detection.
>> 2. We wii print the blocked nodes in logs. Maybe also put it into the
>> exception of insufficient resources.
>> 3. No. This FLIP won't change the WebUI. The blocklist information can be
>> obtained through REST API and metrics.
>>
>> Best,
>> Lijie
>>
>> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
>>
>>> Hi,
>>> Thanks for creating this FLIP.
>>> We have implemented an automatic blocklist detection mechanism
>>> internally, which is indeed very effective for handling node failures.
>>> Due to the large number of nodes, although SREs already support
>>> automatic offline failure nodes, the detection is not 100% accurate and
>>> there is some delay.
>>> So the blocklist mechanism can make flink job recover from failure much
>>> faster.
>>>
>>> Here are some of my thoughts:
>>> 1. In this FLIP, it needs users to locate machine failure manually,
>>> there is a certain cost of use
>>> 2. What happens if too many nodes are blocked, resulting in insufficient
>>> resources? Will there be a special Exception for the user?
>>> 3. Will we display the blocklist information in the WebUI? The blocklist
>>> is for cluster level, and if multiple users share a cluster, some users may
>>> be a little confused when resources are not enough, or when resources are
>>> applied for more.
>>>
>>> Also, Looking forward to the next FLIP on auto-detection.
>>>
>>> Best,
>>> Weihua
>>>
>>> > 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
>>> >
>>> > Hi Konstantin,
>>> >
>>> > Maybe change it to the following:
>>> >
>>> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>> > Merge is not allowed. If the {id} already exists, return error.
>>> Otherwise,
>>> > create a new item.
>>> >
>>> > 2. POST: http://
>>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>> > Merge is allowed. If the {id} already exists, merge. Otherwise, create
>>> a
>>> > new item.
>>> >
>>> > WDYT?
>>> >
>>> > Best,
>>> > Lijie
>>> >
>>> > Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
>>> >
>>> >> Hi Lijie,
>>> >>
>>> >> hm, maybe the following is more appropriate in that case
>>> >>
>>> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>> >>
>>> >> Best,
>>> >>
>>> >> Konstantin
>>> >>
>>> >> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>>> >> wangdachui9501@gmail.com>:
>>> >>
>>> >>> Hi Konstantin,
>>> >>> thanks for your feedback.
>>> >>>
>>> >>> From what I understand, PUT should be idempotent. However, we have a
>>> >>> *timeout* field in the request. This means that initiating the same
>>> >> request
>>> >>> at two different times will lead to different resource status
>>> (timestamps
>>> >>> of the items to be removed will be different).
>>> >>>
>>> >>> Should we use PUT in this case? WDYT?
>>> >>>
>>> >>> Best,
>>> >>> Lijie
>>> >>>
>>> >>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>>> >>>
>>> >>>> Hi Lijie,
>>> >>>>
>>> >>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT
>>> on
>>> >> the
>>> >>>> resource?
>>> >>>>
>>> >>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>> >>>>
>>> >>>> Best,
>>> >>>>
>>> >>>> Konstantin
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>>> >>>> wangdachui9501@gmail.com>:
>>> >>>>
>>> >>>>> Hi everyone,
>>> >>>>>
>>> >>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and
>>> made
>>> >>> the
>>> >>>>> following changes on REST API:
>>> >>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
>>> >>> one.
>>> >>>> If
>>> >>>>> both are specified, will return error.
>>> >>>>> 2.  If the specified item is already there, the *ADD* operation has
>>> >> two
>>> >>>>> behaviors:  *return error*(default value) or *merge/update*, and we
>>> >>> add a
>>> >>>>> flag to the request body to control it. You can find more details
>>> >>> "Public
>>> >>>>> Interface" section.
>>> >>>>>
>>> >>>>> If there is no more feedback, we will start the vote thread next
>>> >> week.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>> Lijie
>>> >>>>>
>>> >>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>>> >>>>>
>>> >>>>>> Hi Becket Qin,
>>> >>>>>>
>>> >>>>>> Thanks for your suggestions.  I have moved the description of
>>> >>>>>> configurations, metrics and REST API into "Public Interface"
>>> >> section,
>>> >>>> and
>>> >>>>>> made a few updates according to your suggestion.  And in this
>>> FLIP,
>>> >>>> there
>>> >>>>>> no public java Interfaces or pluggables that users need to
>>> >> implement
>>> >>> by
>>> >>>>>> themselves.
>>> >>>>>>
>>> >>>>>> Answers for you questions:
>>> >>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>>> >>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>>> >> items
>>> >>>> can
>>> >>>>>> only be added through the REST API, so these 2 action are
>>> mentioned
>>> >>> in
>>> >>>>> the
>>> >>>>>> REST API part (The REST API part has beed moved to public
>>> interface
>>> >>>> now).
>>> >>>>>> 2. I agree with you. I have changed the "Cause" field to String,
>>> >> and
>>> >>>>> allow
>>> >>>>>> users to specify it via REST API.
>>> >>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>>> >> above,
>>> >>>> we
>>> >>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the
>>> ADD
>>> >>>> REST
>>> >>>>>> API to specify when to remove the blocked item. These 2 fields are
>>> >>>>>> optional, if neither is specified, it means that the blocked item
>>> >> is
>>> >>>>>> permanent and will not be removed. If both are specified, the
>>> >> minimum
>>> >>>> of
>>> >>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as the
>>> >>> time
>>> >>>> to
>>> >>>>>> remove the blocked item. To keep the configurations more minimal,
>>> >> we
>>> >>>> have
>>> >>>>>> removed the *cluster.resource-blocklist.item.timeout*
>>> configuration
>>> >>>>>> option.
>>> >>>>>> 4. Yes, the block item will be overridden if the specified item
>>> >>> already
>>> >>>>>> exists. The ADD operation is *ADD or UPDATE*.
>>> >>>>>> 5. Yes. On JM/RM side, all the blocklist information is maintained
>>> >> in
>>> >>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>>> >>>>> abstracted
>>> >>>>>> to other interfaces) will be propagated to different components.
>>> >>>>>>
>>> >>>>>> Best,
>>> >>>>>> Lijie
>>> >>>>>>
>>> >>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>>> >>>>>>
>>> >>>>>>> Hi Lijie,
>>> >>>>>>>
>>> >>>>>>> Thanks for updating the FLIP. It looks like the public interface
>>> >>>> section
>>> >>>>>>> did not fully reflect all the user sensible behavior and API. Can
>>> >>> you
>>> >>>>> put
>>> >>>>>>> everything that users may be aware of there? That would include
>>> >> the
>>> >>>> REST
>>> >>>>>>> API, metrics, configurations, public java Interfaces or
>>> pluggables
>>> >>>> that
>>> >>>>>>> users may see or implement by themselves, as well as a brief
>>> >> summary
>>> >>>> of
>>> >>>>>>> the
>>> >>>>>>> behavior of the public API.
>>> >>>>>>>
>>> >>>>>>> Besides that, I have a few questions:
>>> >>>>>>>
>>> >>>>>>> 1. According to the conversation in the discussion thread, it
>>> >> looks
>>> >>>> like
>>> >>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
>>> >>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
>>> >> can
>>> >>>> you
>>> >>>>>>> add
>>> >>>>>>> that to the public interface as well?
>>> >>>>>>>
>>> >>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>>> >>> Throwable
>>> >>>>> and
>>> >>>>>>> is not reflected in the REST API. Should that be included in the
>>> >>> query
>>> >>>>>>> response? And should we change that field to be a String so users
>>> >>> may
>>> >>>>>>> specify the cause via the REST API when they block some nodes /
>>> >> TMs?
>>> >>>>>>>
>>> >>>>>>> 3. Would it be useful to allow users to have different timeouts
>>> >> for
>>> >>>>>>> different blocked items? So while there is a default timeout,
>>> >> users
>>> >>>> can
>>> >>>>>>> also override it via the REST API when they block an entity.
>>> >>>>>>>
>>> >>>>>>> 4. Regarding the ADD operation, if the specified item is already
>>> >>>> there,
>>> >>>>>>> will the block item be overridden? For example, if the user wants
>>> >> to
>>> >>>>>>> extend
>>> >>>>>>> the timeout of a blocked item, can they just  issue an ADD
>>> command
>>> >>>>> again?
>>> >>>>>>>
>>> >>>>>>> 5. I am not quite familiar with the details of this, but is there
>>> >> a
>>> >>>>> source
>>> >>>>>>> of truth for the blocked list? I think it might be good to have a
>>> >>>> single
>>> >>>>>>> source of truth for the blocked list and just propagate that list
>>> >> to
>>> >>>>>>> different components to take the action of actually blocking the
>>> >>>>> resource.
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>
>>> >>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>>> >> wangdachui9501@gmail.com
>>> >>>>
>>> >>>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi everyone,
>>> >>>>>>>>
>>> >>>>>>>> Based on the discussion in the mailing list, I updated the FLIP
>>> >>> doc,
>>> >>>>> the
>>> >>>>>>>> changes include:
>>> >>>>>>>> 1. Changed the description of the motivation section to more
>>> >>> clearly
>>> >>>>>>>> describe the problem this FLIP is trying to solve.
>>> >>>>>>>> 2. Only  *Manually* is supported.
>>> >>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>>> >>>>>>>>
>>> >>>>>>>> Best,
>>> >>>>>>>> Lijie
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>>> >>>>>>>>
>>> >>>>>>>>> Hi Lijie!
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
>>> >>>>>>> helpfulhere.”
>>> >>>>>>>>> Can you explain it in detail? I don't fully understand that.
>>> >> In
>>> >>>>>>>> myopinion,
>>> >>>>>>>>> non-active and active are the same, and no special treatment
>>> >>>>>>> isrequired.*
>>> >>>>>>>>>
>>> >>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
>>> >>> were
>>> >>>>>>> talking
>>> >>>>>>>>> about the HA mode (but not about Active and Standalone
>>> >>>>>>> ResourceManager).
>>> >>>>>>>>> And the original question was - how to handle the blacklisted
>>> >>>> nodes
>>> >>>>>>> list
>>> >>>>>>>> at
>>> >>>>>>>>> the moment of leader change? Should we simply forget about
>>> >> them
>>> >>> or
>>> >>>>>>> try to
>>> >>>>>>>>> pre-save that list on the remote storage?
>>> >>>>>>>>>
>>> >>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey.wy@gmail.com
>>> >>>
>>> >>>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>>> >> it
>>> >>> is
>>> >>>>>>> indeed
>>> >>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
>>> >>>> could
>>> >>>>>>> not
>>> >>>>>>>>>> support.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> Best,
>>> >>>>>>>>>> Yang
>>> >>>>>>>>>>
>>> >>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
>>> >>>>>>>>>>
>>> >>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> @Jiangang
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
>>> >> and
>>> >>>>> mark a
>>> >>>>>>>> node
>>> >>>>>>>>>>> blocked?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> In fact, we currently plan to not support auto-detection
>>> >> in
>>> >>>> this
>>> >>>>>>>> FLIP.
>>> >>>>>>>>>> The
>>> >>>>>>>>>>> part about auto-detection may be continued in a separate
>>> >>> FLIP
>>> >>>> in
>>> >>>>>>> the
>>> >>>>>>>>>>> future. Some guys have the same concerns as you, and the
>>> >>>>>>> correctness
>>> >>>>>>>>> and
>>> >>>>>>>>>>> necessity of auto-detection may require further discussion
>>> >>> in
>>> >>>>> the
>>> >>>>>>>>> future.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
>>> >> node
>>> >>>> and
>>> >>>>>>> the
>>> >>>>>>>>> node
>>> >>>>>>>>>>> should be marked blocked.
>>> >>>>>>>>>>> By design, the blocklist information will be shared among
>>> >>> all
>>> >>>>> jobs
>>> >>>>>>>> in a
>>> >>>>>>>>>>> cluster/session. The JM will sync blocklist information
>>> >> with
>>> >>>> RM.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> @Martijn
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> I agree with Yang Wang on this.
>>> >>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
>>> >>>>>>> MARK_BLOCKLISTED(Just
>>> >>>>>>>>>> limits
>>> >>>>>>>>>>> the load of the node and does not  kill all the processes
>>> >> on
>>> >>>> it)
>>> >>>>>>> is
>>> >>>>>>>>> also
>>> >>>>>>>>>>> important, and we think that external systems (*yarn
>>> >> rmadmin
>>> >>>> or
>>> >>>>>>>> kubectl
>>> >>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
>>> >>>> only
>>> >>>>>>>>>> *manually*.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>> >>> indeed
>>> >>>>>>> super
>>> >>>>>>>>> hard
>>> >>>>>>>>>>> to get right.
>>> >>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
>>> >>>>> concern.
>>> >>>>>>>>>>> However, we currently plan to not support auto-detection
>>> >> in
>>> >>>> this
>>> >>>>>>>> FLIP,
>>> >>>>>>>>>> and
>>> >>>>>>>>>>> only *manually*. In addition, I'd like to say that the
>>> >> FLIP
>>> >>>>>>> provides
>>> >>>>>>>> a
>>> >>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>>> >>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>>> >>>>>>>>>>> the auto-detection may be done by external systems.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best,
>>> >>>>>>>>>>> Lijie
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>>> >>> 19:04写道：
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>>> If we only support to block nodes manually, then I
>>> >> could
>>> >>>> not
>>> >>>>>>> see
>>> >>>>>>>>>>>> the obvious advantages compared with current SRE's
>>> >>>>> approach(via
>>> >>>>>>>> *yarn
>>> >>>>>>>>>>>> rmadmin or kubectl taint*).
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I agree with Yang Wang on this.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>> To me this sounds yet again like one of those magical
>>> >>>>>>> mechanisms
>>> >>>>>>>>>> that
>>> >>>>>>>>>>>> will rarely work just right.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>> >>> indeed
>>> >>>>>>> super
>>> >>>>>>>>> hard
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>>> get right.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Best regards,
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Martijn
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>>> >>>>>>>> liujiangangpeng@gmail.com
>>> >>>>>>>>>>
>>> >>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
>>> >>>>> decrease
>>> >>>>>>>>> great
>>> >>>>>>>>>>> work
>>> >>>>>>>>>>>>> for us. We have implemented the similar feature in our
>>> >>>> inner
>>> >>>>>>> flink
>>> >>>>>>>>>>>>> version.
>>> >>>>>>>>>>>>> Below is something that I care about:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>   1. For auto-detecting, I wonder how to make the
>>> >>> strategy
>>> >>>>> and
>>> >>>>>>>>> mark a
>>> >>>>>>>>>>>>> node
>>> >>>>>>>>>>>>>   blocked? Sometimes the blocked node is hard to be
>>> >>>>> detected,
>>> >>>>>>> for
>>> >>>>>>>>>>>>> example,
>>> >>>>>>>>>>>>>   the upper node or the down node will be blocked when
>>> >>>>> network
>>> >>>>>>>>>>>>> unreachable.
>>> >>>>>>>>>>>>>   2. I see that the strategy is made in JobMaster
>>> >> side.
>>> >>>> How
>>> >>>>>>> about
>>> >>>>>>>>>>>>>   implementing the similar logic in resource manager?
>>> >> In
>>> >>>>>>> session
>>> >>>>>>>>>> mode,
>>> >>>>>>>>>>>>> multi
>>> >>>>>>>>>>>>>   jobs can fail on the same bad node and the node
>>> >> should
>>> >>>> be
>>> >>>>>>>> marked
>>> >>>>>>>>>>>>> blocked.
>>> >>>>>>>>>>>>>   If the job makes the strategy, the node may be not
>>> >>>> marked
>>> >>>>>>>> blocked
>>> >>>>>>>>>> if
>>> >>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>   fail times don't exceed the threshold.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Thank you for all your feedback!
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
>>> >> some
>>> >>> of
>>> >>>>> my
>>> >>>>>>>>>> thoughts:
>>> >>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
>>> >>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>>> >>>>>>>>>>>>>> The automatical way should be something built upon
>>> >> the
>>> >>>>>>> blocklist
>>> >>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
>>> >>>>>>>> configurable
>>> >>>>>>>>>>>>>> blocklist strategy, but I think we can further
>>> >> decouple
>>> >>>> it
>>> >>>>> by
>>> >>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
>>> >>>> suggested,
>>> >>>>>>> which
>>> >>>>>>>>>> just
>>> >>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
>>> >>> detected.
>>> >>>>>>>> However,
>>> >>>>>>>>>> it
>>> >>>>>>>>>>>>>> should be a separate FLIP with further dev
>>> >> discussions
>>> >>>> and
>>> >>>>>>>>> feedback
>>> >>>>>>>>>>>>>> from users. I also agree with Becket that different
>>> >>> users
>>> >>>>>>> have
>>> >>>>>>>>>>> different
>>> >>>>>>>>>>>>>> requirements, and we should listen to them.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
>>> >>>> externally
>>> >>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
>>> >> way
>>> >>> to
>>> >>>>>>> avoid
>>> >>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
>>> >>>> users
>>> >>>>>>> may
>>> >>>>>>>>> just
>>> >>>>>>>>>>>>>> want to limit the load of the node and do not want to
>>> >>>> kill
>>> >>>>>>> all
>>> >>>>>>>> the
>>> >>>>>>>>>>>>>> processes on it. Another case is the speculative
>>> >>>>> execution[1]
>>> >>>>>>>>> which
>>> >>>>>>>>>>>>>> may also leverage this feature to avoid starting
>>> >> mirror
>>> >>>>>>> tasks on
>>> >>>>>>>>>> slow
>>> >>>>>>>>>>>>>> nodes.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>> Zhu
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> [1]
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>>> >>>>> 15:56写道：
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Hi everyone,
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Thanks for your feedback.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
>>> >> here
>>> >>>>>>> because
>>> >>>>>>>> it
>>> >>>>>>>>>> can
>>> >>>>>>>>>>>>>> affect the value and design of the blocklist
>>> >> mechanism
>>> >>>>>>> (perhaps
>>> >>>>>>>> I
>>> >>>>>>>>>>> should
>>> >>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
>>> >>>> FLIP:
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>>> >>> node
>>> >>>>> as
>>> >>>>>>>>>> blocked.
>>> >>>>>>>>>>>>>> Future slots should not be allocated from the blocked
>>> >>>> task
>>> >>>>>>>> manager
>>> >>>>>>>>>> or
>>> >>>>>>>>>>>>> node.
>>> >>>>>>>>>>>>>> But slots that are already allocated will not be
>>> >>>> affected.
>>> >>>>> A
>>> >>>>>>>>> typical
>>> >>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
>>> >>> In
>>> >>>>> this
>>> >>>>>>>>> case,
>>> >>>>>>>>>> we
>>> >>>>>>>>>>>>> hope
>>> >>>>>>>>>>>>>> that subsequent resource allocations will not be on
>>> >> the
>>> >>>> hot
>>> >>>>>>>>> machine,
>>> >>>>>>>>>>> but
>>> >>>>>>>>>>>>>> tasks currently running on it should not be affected.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>>> >> task
>>> >>>>>>> manager
>>> >>>>>>>> or
>>> >>>>>>>>>>> node
>>> >>>>>>>>>>>>> as
>>> >>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
>>> >> tasks
>>> >>>> will
>>> >>>>>>> be
>>> >>>>>>>>>>>>> restarted on
>>> >>>>>>>>>>>>>> non-blocked task managers.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> For the above 2 actions, the former may more
>>> >>> highlight
>>> >>>>> the
>>> >>>>>>>>> meaning
>>> >>>>>>>>>>> of
>>> >>>>>>>>>>>>>> this FLIP, because the external system cannot do
>>> >> that.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
>>> >> basically
>>> >>>>> agree
>>> >>>>>>>> with
>>> >>>>>>>>>>>>> @Becket
>>> >>>>>>>>>>>>>> Qin: different users have different answers. Not all
>>> >>>> users’
>>> >>>>>>>>>> deployment
>>> >>>>>>>>>>>>>> environments have a special external system that can
>>> >>>>> perform
>>> >>>>>>> the
>>> >>>>>>>>>>> anomaly
>>> >>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
>>> >>>>>>> auto-detection
>>> >>>>>>>>>>> doesn't
>>> >>>>>>>>>>>>>> require much extra work on top of manual
>>> >> specification.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> I will answer your other questions one by one.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Yangze
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
>>> >>> the
>>> >>>>>>>>>>>>>>
>>> >>> `cluster.resource-blocklist.item.timeout-check-interval`
>>> >>>> to
>>> >>>>>>>> users.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
>>> >>> separate
>>> >>>>>>>> interface
>>> >>>>>>>>>>>>> (maybe
>>> >>>>>>>>>>>>>> BlocklistExceptionListener), and the
>>> >>>>>>>>> ResourceManagerBlocklistHandler
>>> >>>>>>>>>>> can
>>> >>>>>>>>>>>>>> implement it in the future.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Martijn
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> a) I also think the manual blocking should be done
>>> >> by
>>> >>>>>>> cluster
>>> >>>>>>>>>>>>> operators.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
>>> >>>> according
>>> >>>>>>> to
>>> >>>>>>>> my
>>> >>>>>>>>>>>>>> experience, users are often the first to perceive the
>>> >>>>> machine
>>> >>>>>>>>>> problems
>>> >>>>>>>>>>>>>> (because of job failover or delay), and they will
>>> >>> contact
>>> >>>>>>>> cluster
>>> >>>>>>>>>>>>> operators
>>> >>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
>>> >>>>>>> machine is
>>> >>>>>>>>>>>>>> problematic. From this point of view, I think the
>>> >>> people
>>> >>>>> who
>>> >>>>>>>>> really
>>> >>>>>>>>>>> need
>>> >>>>>>>>>>>>>> the manual blocking are the users, and it’s just
>>> >>>> performed
>>> >>>>> by
>>> >>>>>>>> the
>>> >>>>>>>>>>>>> cluster
>>> >>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Chesnay
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
>>> >>> for
>>> >>>>>>>>>>>>> MARK_BLOCKLISTED
>>> >>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
>>> >> when
>>> >>>> the
>>> >>>>>>> task
>>> >>>>>>>> is
>>> >>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>>> >>>> release
>>> >>>>>>> the
>>> >>>>>>>>> slot
>>> >>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
>>> >>> on
>>> >>>>> it.
>>> >>>>>>> If
>>> >>>>>>>> we
>>> >>>>>>>>>>> only
>>> >>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
>>> >> needs
>>> >>> to
>>> >>>>>>>> retrieve
>>> >>>>>>>>>> it
>>> >>>>>>>>>>> by
>>> >>>>>>>>>>>>>> RPC. I think the performance overhead of that is
>>> >>>> relatively
>>> >>>>>>>> large,
>>> >>>>>>>>>> so
>>> >>>>>>>>>>> I
>>> >>>>>>>>>>>>>> think it's worth maintaining the blocklist
>>> >> information
>>> >>> on
>>> >>>>>>> the JM
>>> >>>>>>>>>> side
>>> >>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>> syncing them.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Роман
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>    a) “Probably storing inside Zookeeper/Configmap
>>> >>>> might
>>> >>>>>>> be
>>> >>>>>>>>>> helpful
>>> >>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
>>> >>>>>>> understand
>>> >>>>>>>>> that.
>>> >>>>>>>>>>> In
>>> >>>>>>>>>>>>> my
>>> >>>>>>>>>>>>>> opinion, non-active and active are the same, and no
>>> >>>> special
>>> >>>>>>>>>> treatment
>>> >>>>>>>>>>> is
>>> >>>>>>>>>>>>>> required.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
>>> >> sense,
>>> >>> I
>>> >>>>> will
>>> >>>>>>>> add
>>> >>>>>>>>> it
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>>>>> FLIP.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Yang
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
>>> >> cannot
>>> >>>>>>> support
>>> >>>>>>>> the
>>> >>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Looking forward to your further feedback.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Lijie
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>>> >>>> 21:09写道：
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>>> >>>>>>> operations.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
>>> >>>>>>> Reliability
>>> >>>>>>>>>>> Engineer)
>>> >>>>>>>>>>>>>>>> already has many tools to detect the unstable
>>> >> nodes,
>>> >>>>> which
>>> >>>>>>>>> could
>>> >>>>>>>>>>> take
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>> system logs/metrics into consideration.
>>> >>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
>>> >> taint
>>> >>>> in
>>> >>>>>>> K8s
>>> >>>>>>>> to
>>> >>>>>>>>>>>>> prevent
>>> >>>>>>>>>>>>>> new
>>> >>>>>>>>>>>>>>>> allocations on these unstable nodes.
>>> >>>>>>>>>>>>>>>> At last, they will evict all the containers and
>>> >> pods
>>> >>>>>>> running
>>> >>>>>>>> on
>>> >>>>>>>>>>> these
>>> >>>>>>>>>>>>>> nodes.
>>> >>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
>>> >>> So
>>> >>>> I
>>> >>>>> am
>>> >>>>>>>>> afraid
>>> >>>>>>>>>>>>> this
>>> >>>>>>>>>>>>>> is
>>> >>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>>> >>>> could
>>> >>>>>>> not
>>> >>>>>>>>> see
>>> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>> >>>>>>>> approach(via
>>> >>>>>>>>>>> *yarn
>>> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>>> >>>>>>>>>>>>>>>> At least, we need to have a pluggable component
>>> >>> which
>>> >>>>>>> could
>>> >>>>>>>>>> expose
>>> >>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>> potential unstable nodes automatically and block
>>> >>> them
>>> >>>> if
>>> >>>>>>>>> enabled
>>> >>>>>>>>>>>>>> explicitly.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>> Yang
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>>> >>>> 16:36写道：
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
>>> >> and
>>> >>>>>>> somewhat
>>> >>>>>>>>>>> related
>>> >>>>>>>>>>>>>> to the
>>> >>>>>>>>>>>>>>>>> design principle about how people should operate
>>> >>>>> Flink.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> I think there are three things involved in this
>>> >>>> FLIP.
>>> >>>>>>>>>>>>>>>>>     a) Detect and report the unstable node.
>>> >>>>>>>>>>>>>>>>>     b) Collect the information of the unstable
>>> >>> node
>>> >>>>> and
>>> >>>>>>>>> form a
>>> >>>>>>>>>>>>>> blocklist.
>>> >>>>>>>>>>>>>>>>>     c) Take the action to block nodes.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> My two cents:
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
>>> >>> should
>>> >>>>> have
>>> >>>>>>>> c).
>>> >>>>>>>>> It
>>> >>>>>>>>>>> is
>>> >>>>>>>>>>>>>> not only
>>> >>>>>>>>>>>>>>>>> useful for cases of node failures, but also
>>> >> handy
>>> >>>> for
>>> >>>>>>> some
>>> >>>>>>>>>>> planned
>>> >>>>>>>>>>>>>>>>> maintenance.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
>>> >> who
>>> >>>>>>> should be
>>> >>>>>>>>> the
>>> >>>>>>>>>>>>> brain
>>> >>>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>>>>> make the decision to block a node. I think this
>>> >>>>> largely
>>> >>>>>>>>> depends
>>> >>>>>>>>>>> on
>>> >>>>>>>>>>>>>> who we
>>> >>>>>>>>>>>>>>>>> talk to. Different users would probably give
>>> >>>> different
>>> >>>>>>>>> answers.
>>> >>>>>>>>>>> For
>>> >>>>>>>>>>>>>> people
>>> >>>>>>>>>>>>>>>>> who do have a centralized node health management
>>> >>>>>>> service,
>>> >>>>>>>> let
>>> >>>>>>>>>>> Flink
>>> >>>>>>>>>>>>>> do just
>>> >>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
>>> >>>> Flink
>>> >>>>>>> would
>>> >>>>>>>>> be
>>> >>>>>>>>>>> one
>>> >>>>>>>>>>>>> of
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
>>> >> it
>>> >>> to
>>> >>>>>>> that
>>> >>>>>>>>>>> service,
>>> >>>>>>>>>>>>>> and then
>>> >>>>>>>>>>>>>>>>> take the command from that service to block the
>>> >>>>>>> problematic
>>> >>>>>>>>>>> nodes.
>>> >>>>>>>>>>>>> On
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>> other hand, for users who do not have such a
>>> >>>> service,
>>> >>>>>>>> simply
>>> >>>>>>>>>>>>> letting
>>> >>>>>>>>>>>>>> Flink
>>> >>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
>>> >> nodes
>>> >>>>> might
>>> >>>>>>> be
>>> >>>>>>>>>>> desired
>>> >>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
>>> >>>> pluggable /
>>> >>>>>>>>>> optional.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
>>> >>> have
>>> >>>>>>>> something
>>> >>>>>>>>>>>>>> pluggable
>>> >>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
>>> >> unstable
>>> >>>>> nodes
>>> >>>>>>>>>>> actively.
>>> >>>>>>>>>>>>> (A
>>> >>>>>>>>>>>>>> more
>>> >>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
>>> >>> which
>>> >>>>>>> can be
>>> >>>>>>>>>> used
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>>>>> report
>>> >>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
>>> >>> the
>>> >>>>>>> scope
>>> >>>>>>>>>>> relevant
>>> >>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>> this
>>> >>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
>>> >> OK
>>> >>> to
>>> >>>>>>> have a
>>> >>>>>>>>>>> default
>>> >>>>>>>>>>>>>>>>> implementation of a reporter which just tells
>>> >>> Flink
>>> >>>> to
>>> >>>>>>> take
>>> >>>>>>>>>>> action
>>> >>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>> block
>>> >>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
>>> >>>>> timeout.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>>> >>>>>>>>>> ro.v.boyko@gmail.com
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
>>> >>>>>>> development.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
>>> >>> we
>>> >>>>>>> should
>>> >>>>>>>>>>> consider
>>> >>>>>>>>>>>>>> to move
>>> >>>>>>>>>>>>>>>>>> somehow the information of blocklisted
>>> >> nodes/TMs
>>> >>>>> from
>>> >>>>>>>>> active
>>> >>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
>>> >>>> storing
>>> >>>>>>>> inside
>>> >>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
>>> >>>> organizations
>>> >>>>>>>> don't
>>> >>>>>>>>>> want
>>> >>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>> expose
>>> >>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
>>> >>>> it's
>>> >>>>>>>>> necessary
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>>>>> have the
>>> >>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
>>> >>> for
>>> >>>>>>>> avoiding
>>> >>>>>>>>>>>>> incorrect
>>> >>>>>>>>>>>>>>>>>> automatic behaviour.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>>> >>>> would
>>> >>>>> be
>>> >>>>>>>>> better
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>>>>> extend
>>> >>>>>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>>> >> *endTimestamp*
>>> >>>>> field
>>> >>>>>>> and
>>> >>>>>>>>>> fill
>>> >>>>>>>>>>> it
>>> >>>>>>>>>>>>>> at the
>>> >>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>>> >>> to:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>   -
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>   Provide the ability to users to setup the
>>> >>> exact
>>> >>>>>>> time
>>> >>>>>>>> of
>>> >>>>>>>>>>>>>> blocklist end
>>> >>>>>>>>>>>>>>>>>>   through RestAPI
>>> >>>>>>>>>>>>>>>>>>   -
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>   Not being tied to a single value of
>>> >>>>>>>>>>>>>>>>>>   *cluster.resource-blacklist.item.timeout*
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>>> >> <
>>> >>>>>>>>>>>>> chesnay@apache.org>
>>> >>>>>>>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>>> >>>> lines
>>> >>>>> a
>>> >>>>>>>> bit.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>>> >>>>> auto-detection
>>> >>>>>>>> and
>>> >>>>>>>>>> only
>>> >>>>>>>>>>>>>> have an
>>> >>>>>>>>>>>>>>>>>>> opt-in mechanism
>>> >>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>>> >> this
>>> >>>>> sounds
>>> >>>>>>>> yet
>>> >>>>>>>>>>> again
>>> >>>>>>>>>>>>>> like one
>>> >>>>>>>>>>>>>>>>>>> of those
>>> >>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>>> >> just
>>> >>>>> right.
>>> >>>>>>>>>>>>>>>>>>> An external system can leverage way more
>>> >>>>> information
>>> >>>>>>>>> after
>>> >>>>>>>>>>> all.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>>> >>>> complexity
>>> >>>>>>> of
>>> >>>>>>>>> this
>>> >>>>>>>>>>>>>> proposal.
>>> >>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>>> >>> between
>>> >>>>>>>>>> components;
>>> >>>>>>>>>>>>>>>>> adjustments
>>> >>>>>>>>>>>>>>>>>>> to the
>>> >>>>>>>>>>>>>>>>>>> slot and resource protocol.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>>> >>> resource
>>> >>>>>>>>> management
>>> >>>>>>>>>>>>>> perspective,
>>> >>>>>>>>>>>>>>>>>>> then there
>>> >>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>>> >> protocol
>>> >>>> at
>>> >>>>>>> all
>>> >>>>>>>> (or
>>> >>>>>>>>>> in
>>> >>>>>>>>>>>>> fact
>>> >>>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>>> >>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>>> >>> around
>>> >>>>>>> blocked
>>> >>>>>>>>> TMs
>>> >>>>>>>>>>> in
>>> >>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>> first
>>> >>>>>>>>>>>>>>>>>>> place.
>>> >>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>>> >>> down
>>> >>>>>>> after
>>> >>>>>>>> a
>>> >>>>>>>>>>> while
>>> >>>>>>>>>>>>> by
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>> RM
>>> >>>>>>>>>>>>>>>>>>> because of
>>> >>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>>> >>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>>> >>>> blocking
>>> >>>>>>> the
>>> >>>>>>>>>> process
>>> >>>>>>>>>>>>> or
>>> >>>>>>>>>>>>>> node)
>>> >>>>>>>>>>>>>>>>> is
>>> >>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>>> >>>>>>> process(es).
>>> >>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>>> >>>> back
>>> >>>>>>> up.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>>> >> the
>>> >>>> same
>>> >>>>>>> line
>>> >>>>>>>>> of
>>> >>>>>>>>>>>>>> thinking to
>>> >>>>>>>>>>>>>>>>>>> standalone resource management.
>>> >>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>>> >> process/node
>>> >>>>>>> manually
>>> >>>>>>>>>>> should
>>> >>>>>>>>>>>>> be
>>> >>>>>>>>>>>>>> a
>>> >>>>>>>>>>>>>>>>> core
>>> >>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>> >>>>>>>>>>>>>>>>>>>> Hi everyone,
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>>> >>>> understand
>>> >>>>>>> the
>>> >>>>>>>>>> problem
>>> >>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>> I see
>>> >>>>>>>>>>>>>>>>>>> value
>>> >>>>>>>>>>>>>>>>>>>> in the automatic detection and
>>> >>> blocklisting. I
>>> >>>>> do
>>> >>>>>>>> have
>>> >>>>>>>>>> some
>>> >>>>>>>>>>>>>> concerns
>>> >>>>>>>>>>>>>>>>>> with
>>> >>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>>> >>> blocked
>>> >>>>>>>>> resources.
>>> >>>>>>>>>> I
>>> >>>>>>>>>>>>> have
>>> >>>>>>>>>>>>>> two
>>> >>>>>>>>>>>>>>>>>>>> concerns;
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>>> >>>>> separation
>>> >>>>>>> of
>>> >>>>>>>>>>>>> concerns,
>>> >>>>>>>>>>>>>>>>> meaning
>>> >>>>>>>>>>>>>>>>>>> that
>>> >>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>>> >>>> managing a
>>> >>>>>>>>> cluster
>>> >>>>>>>>>>> and
>>> >>>>>>>>>>>>>> there's
>>> >>>>>>>>>>>>>>>>> a
>>> >>>>>>>>>>>>>>>>>>> user
>>> >>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>>> >>>>>>> introduction of
>>> >>>>>>>>>> this
>>> >>>>>>>>>>>>>> mechanism,
>>> >>>>>>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>>>> latter group now can influence the
>>> >>>>> responsibility
>>> >>>>>>> of
>>> >>>>>>>>> the
>>> >>>>>>>>>>>>> first
>>> >>>>>>>>>>>>>> group.
>>> >>>>>>>>>>>>>>>>>> So
>>> >>>>>>>>>>>>>>>>>>> it
>>> >>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>>> >>>> group
>>> >>>>>>>> blocks
>>> >>>>>>>>>>>>>> something,
>>> >>>>>>>>>>>>>>>>>> which
>>> >>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>>> >>> paging
>>> >>>>>>>>> mechanism
>>> >>>>>>>>>>>>>> triggering
>>> >>>>>>>>>>>>>>>>>> etc)
>>> >>>>>>>>>>>>>>>>>>>> which impacts the first group.
>>> >>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>>> >> go
>>> >>>>>>> through
>>> >>>>>>>> the
>>> >>>>>>>>>>>>> process
>>> >>>>>>>>>>>>>> of
>>> >>>>>>>>>>>>>>>>>>> manually
>>> >>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>>> >> it
>>> >>>>>>> should
>>> >>>>>>>>> be? I
>>> >>>>>>>>>>> do
>>> >>>>>>>>>>>>>> think
>>> >>>>>>>>>>>>>>>>> this
>>> >>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>>> >>> make
>>> >>>>>>> sense
>>> >>>>>>>> to
>>> >>>>>>>>>>>>>> introduce
>>> >>>>>>>>>>>>>>>>> such
>>> >>>>>>>>>>>>>>>>>> a
>>> >>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>>> >>> really
>>> >>>>>>> small
>>> >>>>>>>>> user
>>> >>>>>>>>>>>>> group
>>> >>>>>>>>>>>>>> of
>>> >>>>>>>>>>>>>>>>>> Flink?
>>> >>>>>>>>>>>>>>>>>>> We
>>> >>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>>> >>> such
>>> >>>> a
>>> >>>>>>>>> feature.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>>> >>> I'm
>>> >>>>>>>> leaning
>>> >>>>>>>>>>>>> towards
>>> >>>>>>>>>>>>>> not
>>> >>>>>>>>>>>>>>>>>>> exposing
>>> >>>>>>>>>>>>>>>>>>>> this to the user group but having this
>>> >>>> available
>>> >>>>>>>>> strictly
>>> >>>>>>>>>>> for
>>> >>>>>>>>>>>>>> cluster
>>> >>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>>> >> their
>>> >>>>>>>>>>>>>> paging/metrics/logging
>>> >>>>>>>>>>>>>>>>>>> system
>>> >>>>>>>>>>>>>>>>>>>> to take this into account.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Best regards,
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Martijn Visser
>>> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>>> >>>>>>>>>>> karmagyz@gmail.com
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>>> >>> some
>>> >>>>>>> cents
>>> >>>>>>>>> here:
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
>>> >>>>>>>>>> to
>>> >>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>> user?
>>> >>>>>>>>>>>>>>>>>>>>> I think the semantics of
>>> >>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
>>> >>>>>>>>>>>>>>>>> is
>>> >>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
>>> >>> the
>>> >>>>>>>> timeout
>>> >>>>>>>>>>>>>> mechanism is
>>> >>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
>>> >> it
>>> >>>>> will
>>> >>>>>>> be
>>> >>>>>>>>> very
>>> >>>>>>>>>>>>>> confusing
>>> >>>>>>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>>> >> exception
>>> >>>> of a
>>> >>>>>>> task
>>> >>>>>>>>>>>>> manager to
>>> >>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>>> >>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>>> >> fail
>>> >>>> in
>>> >>>>>>> case
>>> >>>>>>>>> the
>>> >>>>>>>>>>>>> target
>>> >>>>>>>>>>>>>> task
>>> >>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>>> >> I
>>> >>>>> don't
>>> >>>>>>>> mean
>>> >>>>>>>>> we
>>> >>>>>>>>>>>>> need
>>> >>>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>>>>> cover
>>> >>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>>> >> also
>>> >>>>> open a
>>> >>>>>>>>>>>>>> `notifyException`
>>> >>>>>>>>>>>>>>>>> in
>>> >>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
>>> >>>>>>> ResourceManager,
>>> >>>>>>>>> will
>>> >>>>>>>>>>> the
>>> >>>>>>>>>>>>>> slot of
>>> >>>>>>>>>>>>>>>>> a
>>> >>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
>>> >>> released
>>> >>>>> and
>>> >>>>>>>>>>> allocated?
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>>>>>>> Yangze Guo
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>>> >> Wang
>>> >>> <
>>> >>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
>>> >>>>>>>>>>>>>>>>>>>>> wrote:
>>> >>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
>>> >> response
>>> >>>>> your 4
>>> >>>>>>>>>> remarks:
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>>> >>>>> controversy. I
>>> >>>>>>>>> think
>>> >>>>>>>>>>>>>> “BlockList”
>>> >>>>>>>>>>>>>>>>> is
>>> >>>>>>>>>>>>>>>>>>>>> good
>>> >>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
>>> >>>> good
>>> >>>>>>> idea.
>>> >>>>>>>>>> Based
>>> >>>>>>>>>>>>> on
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>>> above, I
>>> >>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> DELETE
>>> >> <host>/blocklist/node/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> DELETE
>>> >>>>> <host>/blocklist/taskmanager/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
>>> >>>> means
>>> >>>>>>> that
>>> >>>>>>>>> all
>>> >>>>>>>>>>>>> task
>>> >>>>>>>>>>>>>>>>> managers
>>> >>>>>>>>>>>>>>>>>>> on
>>> >>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
>>> >>>> these
>>> >>>>>>> TMs
>>> >>>>>>>> are
>>> >>>>>>>>>> not
>>> >>>>>>>>>>>>>>>>> available.
>>> >>>>>>>>>>>>>>>>>>> This
>>> >>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
>>> >> these
>>> >>>> TMs
>>> >>>>>>> are
>>> >>>>>>>>> not
>>> >>>>>>>>>>>>> really
>>> >>>>>>>>>>>>>> lost,
>>> >>>>>>>>>>>>>>>>>>> they
>>> >>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
>>> >> are
>>> >>>>> still
>>> >>>>>>>>>>> registered
>>> >>>>>>>>>>>>>> in this
>>> >>>>>>>>>>>>>>>>>>> flink
>>> >>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
>>> >> once
>>> >>>> the
>>> >>>>>>>>>>>>> corresponding
>>> >>>>>>>>>>>>>>>>>> blocklist
>>> >>>>>>>>>>>>>>>>>>>>> item
>>> >>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
>>> >>>>>>>>>>> active/non-active
>>> >>>>>>>>>>>>>>>>> clusters.
>>> >>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
>>> >> TMs
>>> >>>> may
>>> >>>>> be
>>> >>>>>>>>>> released
>>> >>>>>>>>>>>>> due
>>> >>>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>>>>> idle
>>> >>>>>>>>>>>>>>>>>>>>>> timeouts.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
>>> >> keep
>>> >>>> it.
>>> >>>>>>> The
>>> >>>>>>>>>> reasons
>>> >>>>>>>>>>>>> are
>>> >>>>>>>>>>>>>> as
>>> >>>>>>>>>>>>>>>>>>>>> following:
>>> >>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>>> >>> adding
>>> >>>> or
>>> >>>>>>>>> removing
>>> >>>>>>>>>>>>> items
>>> >>>>>>>>>>>>>> via
>>> >>>>>>>>>>>>>>>>>> REST
>>> >>>>>>>>>>>>>>>>>>>>> API,
>>> >>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>>> >> it
>>> >>>> to
>>> >>>>>>>>>>>>> Long.MAX_VALUE .
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>>> >> a
>>> >>>>>>> period of
>>> >>>>>>>>>> time
>>> >>>>>>>>>>>>>> (such as
>>> >>>>>>>>>>>>>>>>>>> machine
>>> >>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>>> >> prefer
>>> >>>> that
>>> >>>>>>>> Flink
>>> >>>>>>>>>> can
>>> >>>>>>>>>>> do
>>> >>>>>>>>>>>>>> this
>>> >>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>>> >>> user
>>> >>>> to
>>> >>>>>>> do
>>> >>>>>>>> it
>>> >>>>>>>>>>>>> manually.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Lijie
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
>>> >>>>>>> 于2022年4月27日周三
>>> >>>>>>>>>>>>> 19:23写道：
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>>> >> only
>>> >>>>>>> support
>>> >>>>>>>>>>> manually
>>> >>>>>>>>>>>>>> blocking
>>> >>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>>> >>> different
>>> >>>>>>>>> strategies
>>> >>>>>>>>>>> can
>>> >>>>>>>>>>>>>> also be
>>> >>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
>>> >>>> "bla.cklist"
>>> >>>>>>> due
>>> >>>>>>>> to
>>> >>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>> controversy
>>> >>>>>>>>>>>>>>>>>>>>> around
>>> >>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
>>> >>> Ticket
>>> >>>>>>> about
>>> >>>>>>>>> this
>>> >>>>>>>>>>>>> topic a
>>> >>>>>>>>>>>>>>>>> while
>>> >>>>>>>>>>>>>>>>>>>>> back
>>> >>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
>>> >>>> avoid
>>> >>>>>>> the
>>> >>>>>>>>> term
>>> >>>>>>>>>>>>>> blacklist &
>>> >>>>>>>>>>>>>>>>>>>>> whitelist
>>> >>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
>>> >> "denylist"
>>> >>> or
>>> >>>>>>>>>>> "quarantined"
>>> >>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
>>> >>> slightly
>>> >>>>>>>> different
>>> >>>>>>>>>>>>> design
>>> >>>>>>>>>>>>>> as
>>> >>>>>>>>>>>>>>>>> verbs
>>> >>>>>>>>>>>>>>>>>>>>> like
>>> >>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
>>> >>> anti-pattern
>>> >>>>> for
>>> >>>>>>>> REST
>>> >>>>>>>>>>> APIs.
>>> >>>>>>>>>>>>>> POST
>>> >>>>>>>>>>>>>>>>> on a
>>> >>>>>>>>>>>>>>>>>>>>> list
>>> >>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
>>> >>>> items.
>>> >>>>>>>> DELETE
>>> >>>>>>>>> on
>>> >>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>> individual
>>> >>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>>> >>>>> <host>/quarantine/items/<itemidentifier>
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>>> >>>>> taskmanagers
>>> >>>>>>>> and
>>> >>>>>>>>>>> nodes
>>> >>>>>>>>>>>>> in
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>> REST
>>> >>>>>>>>>>>>>>>>>>>>> API
>>> >>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>>> >>> opinion
>>> >>>>> on
>>> >>>>>>>> this?
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>>> >>> <host>/quarantine/nodes/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>>> >>>>>>> <host>/quarantine/taskmanager/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>>> >>>>>>> non-active
>>> >>>>>>>>>>>>> resource
>>> >>>>>>>>>>>>>>>>>> managers,
>>> >>>>>>>>>>>>>>>>>>>>> i.e.
>>> >>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>>> >>>>>>> minimal,
>>> >>>>>>>> do
>>> >>>>>>>>> we
>>> >>>>>>>>>>>>> need
>>> >>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>>> timeout
>>> >>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>>> >>>> manually
>>> >>>>>>> we
>>> >>>>>>>>> could
>>> >>>>>>>>>>>>>> delegate
>>> >>>>>>>>>>>>>>>>> this
>>> >>>>>>>>>>>>>>>>>>>>> to the
>>> >>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>>> >>>>> behavior
>>> >>>>>>>>> would
>>> >>>>>>>>>>>>> better
>>> >>>>>>>>>>>>>> fit
>>> >>>>>>>>>>>>>>>>>> into
>>> >>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Konstantin
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> [1]
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>> >>>>>>>>>>>>>>>>>>>>>>> [2]
>>> >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>>> >>> schrieb
>>> >>>>>>> Lijie
>>> >>>>>>>>> Wang
>>> >>>>>>>>>> <
>>> >>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>>> >>>> cluster
>>> >>>>>>> node
>>> >>>>>>>>>>> issues
>>> >>>>>>>>>>>>>>>>>>>>> (insufficient
>>> >>>>>>>>>>>>>>>>>>>>>>> disk
>>> >>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>>> >>>> abnormalities).
>>> >>>>>>>> Flink
>>> >>>>>>>>>> will
>>> >>>>>>>>>>>>>> take care
>>> >>>>>>>>>>>>>>>>>> of
>>> >>>>>>>>>>>>>>>>>>>>> the
>>> >>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>>> >>> However,
>>> >>>>> due
>>> >>>>>>> to
>>> >>>>>>>>> data
>>> >>>>>>>>>>>>>> locality
>>> >>>>>>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>>>>>>>>>>> limited
>>> >>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>>> >> likely
>>> >>>> to
>>> >>>>> be
>>> >>>>>>>>>>> redeployed
>>> >>>>>>>>>>>>>> to the
>>> >>>>>>>>>>>>>>>>>> same
>>> >>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>>> >>>> task
>>> >>>>>>>>>>> abnormalities
>>> >>>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>>>>>> affect
>>> >>>>>>>>>>>>>>>>>>>>> job
>>> >>>>>>>>>>>>>>>>>>>>>>>> progress.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
>>> >> manually
>>> >>>>>>> identify
>>> >>>>>>>>> the
>>> >>>>>>>>>>>>>> problematic
>>> >>>>>>>>>>>>>>>>>>>>> node and
>>> >>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
>>> >>> But
>>> >>>>>>> this
>>> >>>>>>>>>>> approach
>>> >>>>>>>>>>>>> has
>>> >>>>>>>>>>>>>>>>>>>>> following
>>> >>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
>>> >> heavy
>>> >>>>>>> process.
>>> >>>>>>>>>> Users
>>> >>>>>>>>>>>>> may
>>> >>>>>>>>>>>>>> need
>>> >>>>>>>>>>>>>>>>> to
>>> >>>>>>>>>>>>>>>>>>>>>>> contact
>>> >>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
>>> >>>>> operation
>>> >>>>>>> can
>>> >>>>>>>>>> even
>>> >>>>>>>>>>> be
>>> >>>>>>>>>>>>>>>>> dangerous
>>> >>>>>>>>>>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>>>>>>>>>>> not
>>> >>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
>>> >>>>> events.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
>>> >> of
>>> >>>>>>> problems
>>> >>>>>>>>>>> manually
>>> >>>>>>>>>>>>>> would
>>> >>>>>>>>>>>>>>>>> be
>>> >>>>>>>>>>>>>>>>>>>>> slow
>>> >>>>>>>>>>>>>>>>>>>>>>> and
>>> >>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>>> >>>> propose
>>> >>>>>>> to
>>> >>>>>>>>>>>>> introduce a
>>> >>>>>>>>>>>>>>>>>> blacklist
>>> >>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>>> >>>>> problematic
>>> >>>>>>>>>>> resources.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
>>> >>> FLIP-224[1].
>>> >>>>>>> Looking
>>> >>>>>>>>>>> forward
>>> >>>>>>>>>>>>>> to your
>>> >>>>>>>>>>>>>>>>>>>>>>> feedback.
>>> >>>>>>>>>>>>>>>>>>>>>>>> [1]
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> --
>>> >>>>>>>>> Best regards,
>>> >>>>>>>>> Roman Boyko
>>> >>>>>>>>> e.: ro.v.boyko@gmail.com
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> https://twitter.com/snntrable
>>> >>>> https://github.com/knaufk
>>> >>>>
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> https://twitter.com/snntrable
>>> >> https://github.com/knaufk
>>> >>
>>>
>>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi Konstantin,

We found that Flink REST URL does not support the format ":merge" , which
will be recognized as a parameter in the URL(due to start with a colon).

We will keep the previous way, i.e.

POST: http://{jm_rest_address:port}/blocklist/taskmanagers
and the "id" and "merge" flag are put into the request body.

Best,
Lijie

Lijie Wang <wa...@gmail.com> 于2022年5月18日周三 09:35写道：

> Hi Weihua,
> thanks for feedback.
>
> 1. Yes, only *Manually* is supported in this FLIP, but it's the first step
> towards auto-detection.
> 2. We wii print the blocked nodes in logs. Maybe also put it into the
> exception of insufficient resources.
> 3. No. This FLIP won't change the WebUI. The blocklist information can be
> obtained through REST API and metrics.
>
> Best,
> Lijie
>
> Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：
>
>> Hi,
>> Thanks for creating this FLIP.
>> We have implemented an automatic blocklist detection mechanism
>> internally, which is indeed very effective for handling node failures.
>> Due to the large number of nodes, although SREs already support automatic
>> offline failure nodes, the detection is not 100% accurate and there is some
>> delay.
>> So the blocklist mechanism can make flink job recover from failure much
>> faster.
>>
>> Here are some of my thoughts:
>> 1. In this FLIP, it needs users to locate machine failure manually, there
>> is a certain cost of use
>> 2. What happens if too many nodes are blocked, resulting in insufficient
>> resources? Will there be a special Exception for the user?
>> 3. Will we display the blocklist information in the WebUI? The blocklist
>> is for cluster level, and if multiple users share a cluster, some users may
>> be a little confused when resources are not enough, or when resources are
>> applied for more.
>>
>> Also, Looking forward to the next FLIP on auto-detection.
>>
>> Best,
>> Weihua
>>
>> > 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
>> >
>> > Hi Konstantin,
>> >
>> > Maybe change it to the following:
>> >
>> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> > Merge is not allowed. If the {id} already exists, return error.
>> Otherwise,
>> > create a new item.
>> >
>> > 2. POST: http://
>> {jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> > Merge is allowed. If the {id} already exists, merge. Otherwise, create a
>> > new item.
>> >
>> > WDYT?
>> >
>> > Best,
>> > Lijie
>> >
>> > Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
>> >
>> >> Hi Lijie,
>> >>
>> >> hm, maybe the following is more appropriate in that case
>> >>
>> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> >>
>> >> Best,
>> >>
>> >> Konstantin
>> >>
>> >> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>> >> wangdachui9501@gmail.com>:
>> >>
>> >>> Hi Konstantin,
>> >>> thanks for your feedback.
>> >>>
>> >>> From what I understand, PUT should be idempotent. However, we have a
>> >>> *timeout* field in the request. This means that initiating the same
>> >> request
>> >>> at two different times will lead to different resource status
>> (timestamps
>> >>> of the items to be removed will be different).
>> >>>
>> >>> Should we use PUT in this case? WDYT?
>> >>>
>> >>> Best,
>> >>> Lijie
>> >>>
>> >>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>> >>>
>> >>>> Hi Lijie,
>> >>>>
>> >>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT on
>> >> the
>> >>>> resource?
>> >>>>
>> >>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Konstantin
>> >>>>
>> >>>>
>> >>>>
>> >>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>> >>>> wangdachui9501@gmail.com>:
>> >>>>
>> >>>>> Hi everyone,
>> >>>>>
>> >>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and made
>> >>> the
>> >>>>> following changes on REST API:
>> >>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
>> >>> one.
>> >>>> If
>> >>>>> both are specified, will return error.
>> >>>>> 2.  If the specified item is already there, the *ADD* operation has
>> >> two
>> >>>>> behaviors:  *return error*(default value) or *merge/update*, and we
>> >>> add a
>> >>>>> flag to the request body to control it. You can find more details
>> >>> "Public
>> >>>>> Interface" section.
>> >>>>>
>> >>>>> If there is no more feedback, we will start the vote thread next
>> >> week.
>> >>>>>
>> >>>>> Best,
>> >>>>> Lijie
>> >>>>>
>> >>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>> >>>>>
>> >>>>>> Hi Becket Qin,
>> >>>>>>
>> >>>>>> Thanks for your suggestions.  I have moved the description of
>> >>>>>> configurations, metrics and REST API into "Public Interface"
>> >> section,
>> >>>> and
>> >>>>>> made a few updates according to your suggestion.  And in this FLIP,
>> >>>> there
>> >>>>>> no public java Interfaces or pluggables that users need to
>> >> implement
>> >>> by
>> >>>>>> themselves.
>> >>>>>>
>> >>>>>> Answers for you questions:
>> >>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>> >>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>> >> items
>> >>>> can
>> >>>>>> only be added through the REST API, so these 2 action are mentioned
>> >>> in
>> >>>>> the
>> >>>>>> REST API part (The REST API part has beed moved to public interface
>> >>>> now).
>> >>>>>> 2. I agree with you. I have changed the "Cause" field to String,
>> >> and
>> >>>>> allow
>> >>>>>> users to specify it via REST API.
>> >>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>> >> above,
>> >>>> we
>> >>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the ADD
>> >>>> REST
>> >>>>>> API to specify when to remove the blocked item. These 2 fields are
>> >>>>>> optional, if neither is specified, it means that the blocked item
>> >> is
>> >>>>>> permanent and will not be removed. If both are specified, the
>> >> minimum
>> >>>> of
>> >>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as the
>> >>> time
>> >>>> to
>> >>>>>> remove the blocked item. To keep the configurations more minimal,
>> >> we
>> >>>> have
>> >>>>>> removed the *cluster.resource-blocklist.item.timeout* configuration
>> >>>>>> option.
>> >>>>>> 4. Yes, the block item will be overridden if the specified item
>> >>> already
>> >>>>>> exists. The ADD operation is *ADD or UPDATE*.
>> >>>>>> 5. Yes. On JM/RM side, all the blocklist information is maintained
>> >> in
>> >>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>> >>>>> abstracted
>> >>>>>> to other interfaces) will be propagated to different components.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Lijie
>> >>>>>>
>> >>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>> >>>>>>
>> >>>>>>> Hi Lijie,
>> >>>>>>>
>> >>>>>>> Thanks for updating the FLIP. It looks like the public interface
>> >>>> section
>> >>>>>>> did not fully reflect all the user sensible behavior and API. Can
>> >>> you
>> >>>>> put
>> >>>>>>> everything that users may be aware of there? That would include
>> >> the
>> >>>> REST
>> >>>>>>> API, metrics, configurations, public java Interfaces or pluggables
>> >>>> that
>> >>>>>>> users may see or implement by themselves, as well as a brief
>> >> summary
>> >>>> of
>> >>>>>>> the
>> >>>>>>> behavior of the public API.
>> >>>>>>>
>> >>>>>>> Besides that, I have a few questions:
>> >>>>>>>
>> >>>>>>> 1. According to the conversation in the discussion thread, it
>> >> looks
>> >>>> like
>> >>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
>> >>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
>> >> can
>> >>>> you
>> >>>>>>> add
>> >>>>>>> that to the public interface as well?
>> >>>>>>>
>> >>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>> >>> Throwable
>> >>>>> and
>> >>>>>>> is not reflected in the REST API. Should that be included in the
>> >>> query
>> >>>>>>> response? And should we change that field to be a String so users
>> >>> may
>> >>>>>>> specify the cause via the REST API when they block some nodes /
>> >> TMs?
>> >>>>>>>
>> >>>>>>> 3. Would it be useful to allow users to have different timeouts
>> >> for
>> >>>>>>> different blocked items? So while there is a default timeout,
>> >> users
>> >>>> can
>> >>>>>>> also override it via the REST API when they block an entity.
>> >>>>>>>
>> >>>>>>> 4. Regarding the ADD operation, if the specified item is already
>> >>>> there,
>> >>>>>>> will the block item be overridden? For example, if the user wants
>> >> to
>> >>>>>>> extend
>> >>>>>>> the timeout of a blocked item, can they just  issue an ADD command
>> >>>>> again?
>> >>>>>>>
>> >>>>>>> 5. I am not quite familiar with the details of this, but is there
>> >> a
>> >>>>> source
>> >>>>>>> of truth for the blocked list? I think it might be good to have a
>> >>>> single
>> >>>>>>> source of truth for the blocked list and just propagate that list
>> >> to
>> >>>>>>> different components to take the action of actually blocking the
>> >>>>> resource.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>>
>> >>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>
>> >>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>> >> wangdachui9501@gmail.com
>> >>>>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Hi everyone,
>> >>>>>>>>
>> >>>>>>>> Based on the discussion in the mailing list, I updated the FLIP
>> >>> doc,
>> >>>>> the
>> >>>>>>>> changes include:
>> >>>>>>>> 1. Changed the description of the motivation section to more
>> >>> clearly
>> >>>>>>>> describe the problem this FLIP is trying to solve.
>> >>>>>>>> 2. Only  *Manually* is supported.
>> >>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Lijie
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>> >>>>>>>>
>> >>>>>>>>> Hi Lijie!
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
>> >>>>>>> helpfulhere.”
>> >>>>>>>>> Can you explain it in detail? I don't fully understand that.
>> >> In
>> >>>>>>>> myopinion,
>> >>>>>>>>> non-active and active are the same, and no special treatment
>> >>>>>>> isrequired.*
>> >>>>>>>>>
>> >>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
>> >>> were
>> >>>>>>> talking
>> >>>>>>>>> about the HA mode (but not about Active and Standalone
>> >>>>>>> ResourceManager).
>> >>>>>>>>> And the original question was - how to handle the blacklisted
>> >>>> nodes
>> >>>>>>> list
>> >>>>>>>> at
>> >>>>>>>>> the moment of leader change? Should we simply forget about
>> >> them
>> >>> or
>> >>>>>>> try to
>> >>>>>>>>> pre-save that list on the remote storage?
>> >>>>>>>>>
>> >>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey.wy@gmail.com
>> >>>
>> >>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>> >>>>>>>>>>
>> >>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>> >> it
>> >>> is
>> >>>>>>> indeed
>> >>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
>> >>>> could
>> >>>>>>> not
>> >>>>>>>>>> support.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Yang
>> >>>>>>>>>>
>> >>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
>> >>>>>>>>>>
>> >>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Jiangang
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
>> >> and
>> >>>>> mark a
>> >>>>>>>> node
>> >>>>>>>>>>> blocked?
>> >>>>>>>>>>>
>> >>>>>>>>>>> In fact, we currently plan to not support auto-detection
>> >> in
>> >>>> this
>> >>>>>>>> FLIP.
>> >>>>>>>>>> The
>> >>>>>>>>>>> part about auto-detection may be continued in a separate
>> >>> FLIP
>> >>>> in
>> >>>>>>> the
>> >>>>>>>>>>> future. Some guys have the same concerns as you, and the
>> >>>>>>> correctness
>> >>>>>>>>> and
>> >>>>>>>>>>> necessity of auto-detection may require further discussion
>> >>> in
>> >>>>> the
>> >>>>>>>>> future.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
>> >> node
>> >>>> and
>> >>>>>>> the
>> >>>>>>>>> node
>> >>>>>>>>>>> should be marked blocked.
>> >>>>>>>>>>> By design, the blocklist information will be shared among
>> >>> all
>> >>>>> jobs
>> >>>>>>>> in a
>> >>>>>>>>>>> cluster/session. The JM will sync blocklist information
>> >> with
>> >>>> RM.
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Martijn
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
>> >>>>>>> MARK_BLOCKLISTED(Just
>> >>>>>>>>>> limits
>> >>>>>>>>>>> the load of the node and does not  kill all the processes
>> >> on
>> >>>> it)
>> >>>>>>> is
>> >>>>>>>>> also
>> >>>>>>>>>>> important, and we think that external systems (*yarn
>> >> rmadmin
>> >>>> or
>> >>>>>>>> kubectl
>> >>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
>> >>>> only
>> >>>>>>>>>> *manually*.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>> >>> indeed
>> >>>>>>> super
>> >>>>>>>>> hard
>> >>>>>>>>>>> to get right.
>> >>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
>> >>>>> concern.
>> >>>>>>>>>>> However, we currently plan to not support auto-detection
>> >> in
>> >>>> this
>> >>>>>>>> FLIP,
>> >>>>>>>>>> and
>> >>>>>>>>>>> only *manually*. In addition, I'd like to say that the
>> >> FLIP
>> >>>>>>> provides
>> >>>>>>>> a
>> >>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>> >>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>> >>>>>>>>>>> the auto-detection may be done by external systems.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Lijie
>> >>>>>>>>>>>
>> >>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>> >>> 19:04写道：
>> >>>>>>>>>>>
>> >>>>>>>>>>>>> If we only support to block nodes manually, then I
>> >> could
>> >>>> not
>> >>>>>>> see
>> >>>>>>>>>>>> the obvious advantages compared with current SRE's
>> >>>>> approach(via
>> >>>>>>>> *yarn
>> >>>>>>>>>>>> rmadmin or kubectl taint*).
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> To me this sounds yet again like one of those magical
>> >>>>>>> mechanisms
>> >>>>>>>>>> that
>> >>>>>>>>>>>> will rarely work just right.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>> >>> indeed
>> >>>>>>> super
>> >>>>>>>>> hard
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> get right.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Martijn
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>> >>>>>>>> liujiangangpeng@gmail.com
>> >>>>>>>>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
>> >>>>> decrease
>> >>>>>>>>> great
>> >>>>>>>>>>> work
>> >>>>>>>>>>>>> for us. We have implemented the similar feature in our
>> >>>> inner
>> >>>>>>> flink
>> >>>>>>>>>>>>> version.
>> >>>>>>>>>>>>> Below is something that I care about:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>   1. For auto-detecting, I wonder how to make the
>> >>> strategy
>> >>>>> and
>> >>>>>>>>> mark a
>> >>>>>>>>>>>>> node
>> >>>>>>>>>>>>>   blocked? Sometimes the blocked node is hard to be
>> >>>>> detected,
>> >>>>>>> for
>> >>>>>>>>>>>>> example,
>> >>>>>>>>>>>>>   the upper node or the down node will be blocked when
>> >>>>> network
>> >>>>>>>>>>>>> unreachable.
>> >>>>>>>>>>>>>   2. I see that the strategy is made in JobMaster
>> >> side.
>> >>>> How
>> >>>>>>> about
>> >>>>>>>>>>>>>   implementing the similar logic in resource manager?
>> >> In
>> >>>>>>> session
>> >>>>>>>>>> mode,
>> >>>>>>>>>>>>> multi
>> >>>>>>>>>>>>>   jobs can fail on the same bad node and the node
>> >> should
>> >>>> be
>> >>>>>>>> marked
>> >>>>>>>>>>>>> blocked.
>> >>>>>>>>>>>>>   If the job makes the strategy, the node may be not
>> >>>> marked
>> >>>>>>>> blocked
>> >>>>>>>>>> if
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>   fail times don't exceed the threshold.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thank you for all your feedback!
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
>> >> some
>> >>> of
>> >>>>> my
>> >>>>>>>>>> thoughts:
>> >>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
>> >>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>> >>>>>>>>>>>>>> The automatical way should be something built upon
>> >> the
>> >>>>>>> blocklist
>> >>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
>> >>>>>>>> configurable
>> >>>>>>>>>>>>>> blocklist strategy, but I think we can further
>> >> decouple
>> >>>> it
>> >>>>> by
>> >>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
>> >>>> suggested,
>> >>>>>>> which
>> >>>>>>>>>> just
>> >>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
>> >>> detected.
>> >>>>>>>> However,
>> >>>>>>>>>> it
>> >>>>>>>>>>>>>> should be a separate FLIP with further dev
>> >> discussions
>> >>>> and
>> >>>>>>>>> feedback
>> >>>>>>>>>>>>>> from users. I also agree with Becket that different
>> >>> users
>> >>>>>>> have
>> >>>>>>>>>>> different
>> >>>>>>>>>>>>>> requirements, and we should listen to them.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
>> >>>> externally
>> >>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
>> >> way
>> >>> to
>> >>>>>>> avoid
>> >>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
>> >>>> users
>> >>>>>>> may
>> >>>>>>>>> just
>> >>>>>>>>>>>>>> want to limit the load of the node and do not want to
>> >>>> kill
>> >>>>>>> all
>> >>>>>>>> the
>> >>>>>>>>>>>>>> processes on it. Another case is the speculative
>> >>>>> execution[1]
>> >>>>>>>>> which
>> >>>>>>>>>>>>>> may also leverage this feature to avoid starting
>> >> mirror
>> >>>>>>> tasks on
>> >>>>>>>>>> slow
>> >>>>>>>>>>>>>> nodes.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>> Zhu
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>> >>>>> 15:56写道：
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks for your feedback.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
>> >> here
>> >>>>>>> because
>> >>>>>>>> it
>> >>>>>>>>>> can
>> >>>>>>>>>>>>>> affect the value and design of the blocklist
>> >> mechanism
>> >>>>>>> (perhaps
>> >>>>>>>> I
>> >>>>>>>>>>> should
>> >>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
>> >>>> FLIP:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>> >>> node
>> >>>>> as
>> >>>>>>>>>> blocked.
>> >>>>>>>>>>>>>> Future slots should not be allocated from the blocked
>> >>>> task
>> >>>>>>>> manager
>> >>>>>>>>>> or
>> >>>>>>>>>>>>> node.
>> >>>>>>>>>>>>>> But slots that are already allocated will not be
>> >>>> affected.
>> >>>>> A
>> >>>>>>>>> typical
>> >>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
>> >>> In
>> >>>>> this
>> >>>>>>>>> case,
>> >>>>>>>>>> we
>> >>>>>>>>>>>>> hope
>> >>>>>>>>>>>>>> that subsequent resource allocations will not be on
>> >> the
>> >>>> hot
>> >>>>>>>>> machine,
>> >>>>>>>>>>> but
>> >>>>>>>>>>>>>> tasks currently running on it should not be affected.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>> >> task
>> >>>>>>> manager
>> >>>>>>>> or
>> >>>>>>>>>>> node
>> >>>>>>>>>>>>> as
>> >>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
>> >> tasks
>> >>>> will
>> >>>>>>> be
>> >>>>>>>>>>>>> restarted on
>> >>>>>>>>>>>>>> non-blocked task managers.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> For the above 2 actions, the former may more
>> >>> highlight
>> >>>>> the
>> >>>>>>>>> meaning
>> >>>>>>>>>>> of
>> >>>>>>>>>>>>>> this FLIP, because the external system cannot do
>> >> that.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
>> >> basically
>> >>>>> agree
>> >>>>>>>> with
>> >>>>>>>>>>>>> @Becket
>> >>>>>>>>>>>>>> Qin: different users have different answers. Not all
>> >>>> users’
>> >>>>>>>>>> deployment
>> >>>>>>>>>>>>>> environments have a special external system that can
>> >>>>> perform
>> >>>>>>> the
>> >>>>>>>>>>> anomaly
>> >>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
>> >>>>>>> auto-detection
>> >>>>>>>>>>> doesn't
>> >>>>>>>>>>>>>> require much extra work on top of manual
>> >> specification.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I will answer your other questions one by one.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Yangze
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
>> >>> the
>> >>>>>>>>>>>>>>
>> >>> `cluster.resource-blocklist.item.timeout-check-interval`
>> >>>> to
>> >>>>>>>> users.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
>> >>> separate
>> >>>>>>>> interface
>> >>>>>>>>>>>>> (maybe
>> >>>>>>>>>>>>>> BlocklistExceptionListener), and the
>> >>>>>>>>> ResourceManagerBlocklistHandler
>> >>>>>>>>>>> can
>> >>>>>>>>>>>>>> implement it in the future.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Martijn
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> a) I also think the manual blocking should be done
>> >> by
>> >>>>>>> cluster
>> >>>>>>>>>>>>> operators.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
>> >>>> according
>> >>>>>>> to
>> >>>>>>>> my
>> >>>>>>>>>>>>>> experience, users are often the first to perceive the
>> >>>>> machine
>> >>>>>>>>>> problems
>> >>>>>>>>>>>>>> (because of job failover or delay), and they will
>> >>> contact
>> >>>>>>>> cluster
>> >>>>>>>>>>>>> operators
>> >>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
>> >>>>>>> machine is
>> >>>>>>>>>>>>>> problematic. From this point of view, I think the
>> >>> people
>> >>>>> who
>> >>>>>>>>> really
>> >>>>>>>>>>> need
>> >>>>>>>>>>>>>> the manual blocking are the users, and it’s just
>> >>>> performed
>> >>>>> by
>> >>>>>>>> the
>> >>>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Chesnay
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
>> >>> for
>> >>>>>>>>>>>>> MARK_BLOCKLISTED
>> >>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
>> >> when
>> >>>> the
>> >>>>>>> task
>> >>>>>>>> is
>> >>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>> >>>> release
>> >>>>>>> the
>> >>>>>>>>> slot
>> >>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
>> >>> on
>> >>>>> it.
>> >>>>>>> If
>> >>>>>>>> we
>> >>>>>>>>>>> only
>> >>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
>> >> needs
>> >>> to
>> >>>>>>>> retrieve
>> >>>>>>>>>> it
>> >>>>>>>>>>> by
>> >>>>>>>>>>>>>> RPC. I think the performance overhead of that is
>> >>>> relatively
>> >>>>>>>> large,
>> >>>>>>>>>> so
>> >>>>>>>>>>> I
>> >>>>>>>>>>>>>> think it's worth maintaining the blocklist
>> >> information
>> >>> on
>> >>>>>>> the JM
>> >>>>>>>>>> side
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>> syncing them.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Роман
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>    a) “Probably storing inside Zookeeper/Configmap
>> >>>> might
>> >>>>>>> be
>> >>>>>>>>>> helpful
>> >>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
>> >>>>>>> understand
>> >>>>>>>>> that.
>> >>>>>>>>>>> In
>> >>>>>>>>>>>>> my
>> >>>>>>>>>>>>>> opinion, non-active and active are the same, and no
>> >>>> special
>> >>>>>>>>>> treatment
>> >>>>>>>>>>> is
>> >>>>>>>>>>>>>> required.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
>> >> sense,
>> >>> I
>> >>>>> will
>> >>>>>>>> add
>> >>>>>>>>> it
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> FLIP.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Yang
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
>> >> cannot
>> >>>>>>> support
>> >>>>>>>> the
>> >>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Looking forward to your further feedback.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>> >>>> 21:09写道：
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>> >>>>>>> operations.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
>> >>>>>>> Reliability
>> >>>>>>>>>>> Engineer)
>> >>>>>>>>>>>>>>>> already has many tools to detect the unstable
>> >> nodes,
>> >>>>> which
>> >>>>>>>>> could
>> >>>>>>>>>>> take
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> system logs/metrics into consideration.
>> >>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
>> >> taint
>> >>>> in
>> >>>>>>> K8s
>> >>>>>>>> to
>> >>>>>>>>>>>>> prevent
>> >>>>>>>>>>>>>> new
>> >>>>>>>>>>>>>>>> allocations on these unstable nodes.
>> >>>>>>>>>>>>>>>> At last, they will evict all the containers and
>> >> pods
>> >>>>>>> running
>> >>>>>>>> on
>> >>>>>>>>>>> these
>> >>>>>>>>>>>>>> nodes.
>> >>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
>> >>> So
>> >>>> I
>> >>>>> am
>> >>>>>>>>> afraid
>> >>>>>>>>>>>>> this
>> >>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>> >>>> could
>> >>>>>>> not
>> >>>>>>>>> see
>> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>> >>>>>>>> approach(via
>> >>>>>>>>>>> *yarn
>> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>> >>>>>>>>>>>>>>>> At least, we need to have a pluggable component
>> >>> which
>> >>>>>>> could
>> >>>>>>>>>> expose
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> potential unstable nodes automatically and block
>> >>> them
>> >>>> if
>> >>>>>>>>> enabled
>> >>>>>>>>>>>>>> explicitly.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>> Yang
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>> >>>> 16:36写道：
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
>> >> and
>> >>>>>>> somewhat
>> >>>>>>>>>>> related
>> >>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>> design principle about how people should operate
>> >>>>> Flink.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I think there are three things involved in this
>> >>>> FLIP.
>> >>>>>>>>>>>>>>>>>     a) Detect and report the unstable node.
>> >>>>>>>>>>>>>>>>>     b) Collect the information of the unstable
>> >>> node
>> >>>>> and
>> >>>>>>>>> form a
>> >>>>>>>>>>>>>> blocklist.
>> >>>>>>>>>>>>>>>>>     c) Take the action to block nodes.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> My two cents:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
>> >>> should
>> >>>>> have
>> >>>>>>>> c).
>> >>>>>>>>> It
>> >>>>>>>>>>> is
>> >>>>>>>>>>>>>> not only
>> >>>>>>>>>>>>>>>>> useful for cases of node failures, but also
>> >> handy
>> >>>> for
>> >>>>>>> some
>> >>>>>>>>>>> planned
>> >>>>>>>>>>>>>>>>> maintenance.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
>> >> who
>> >>>>>>> should be
>> >>>>>>>>> the
>> >>>>>>>>>>>>> brain
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> make the decision to block a node. I think this
>> >>>>> largely
>> >>>>>>>>> depends
>> >>>>>>>>>>> on
>> >>>>>>>>>>>>>> who we
>> >>>>>>>>>>>>>>>>> talk to. Different users would probably give
>> >>>> different
>> >>>>>>>>> answers.
>> >>>>>>>>>>> For
>> >>>>>>>>>>>>>> people
>> >>>>>>>>>>>>>>>>> who do have a centralized node health management
>> >>>>>>> service,
>> >>>>>>>> let
>> >>>>>>>>>>> Flink
>> >>>>>>>>>>>>>> do just
>> >>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
>> >>>> Flink
>> >>>>>>> would
>> >>>>>>>>> be
>> >>>>>>>>>>> one
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
>> >> it
>> >>> to
>> >>>>>>> that
>> >>>>>>>>>>> service,
>> >>>>>>>>>>>>>> and then
>> >>>>>>>>>>>>>>>>> take the command from that service to block the
>> >>>>>>> problematic
>> >>>>>>>>>>> nodes.
>> >>>>>>>>>>>>> On
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> other hand, for users who do not have such a
>> >>>> service,
>> >>>>>>>> simply
>> >>>>>>>>>>>>> letting
>> >>>>>>>>>>>>>> Flink
>> >>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
>> >> nodes
>> >>>>> might
>> >>>>>>> be
>> >>>>>>>>>>> desired
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
>> >>>> pluggable /
>> >>>>>>>>>> optional.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
>> >>> have
>> >>>>>>>> something
>> >>>>>>>>>>>>>> pluggable
>> >>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
>> >> unstable
>> >>>>> nodes
>> >>>>>>>>>>> actively.
>> >>>>>>>>>>>>> (A
>> >>>>>>>>>>>>>> more
>> >>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
>> >>> which
>> >>>>>>> can be
>> >>>>>>>>>> used
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> report
>> >>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
>> >>> the
>> >>>>>>> scope
>> >>>>>>>>>>> relevant
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
>> >> OK
>> >>> to
>> >>>>>>> have a
>> >>>>>>>>>>> default
>> >>>>>>>>>>>>>>>>> implementation of a reporter which just tells
>> >>> Flink
>> >>>> to
>> >>>>>>> take
>> >>>>>>>>>>> action
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>> block
>> >>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
>> >>>>> timeout.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>> >>>>>>>>>> ro.v.boyko@gmail.com
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
>> >>>>>>> development.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
>> >>> we
>> >>>>>>> should
>> >>>>>>>>>>> consider
>> >>>>>>>>>>>>>> to move
>> >>>>>>>>>>>>>>>>>> somehow the information of blocklisted
>> >> nodes/TMs
>> >>>>> from
>> >>>>>>>>> active
>> >>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
>> >>>> storing
>> >>>>>>>> inside
>> >>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
>> >>>> organizations
>> >>>>>>>> don't
>> >>>>>>>>>> want
>> >>>>>>>>>>>>> to
>> >>>>>>>>>>>>>> expose
>> >>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
>> >>>> it's
>> >>>>>>>>> necessary
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> have the
>> >>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
>> >>> for
>> >>>>>>>> avoiding
>> >>>>>>>>>>>>> incorrect
>> >>>>>>>>>>>>>>>>>> automatic behaviour.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>> >>>> would
>> >>>>> be
>> >>>>>>>>> better
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> extend
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>> >> *endTimestamp*
>> >>>>> field
>> >>>>>>> and
>> >>>>>>>>>> fill
>> >>>>>>>>>>> it
>> >>>>>>>>>>>>>> at the
>> >>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>> >>> to:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>   -
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>   Provide the ability to users to setup the
>> >>> exact
>> >>>>>>> time
>> >>>>>>>> of
>> >>>>>>>>>>>>>> blocklist end
>> >>>>>>>>>>>>>>>>>>   through RestAPI
>> >>>>>>>>>>>>>>>>>>   -
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>   Not being tied to a single value of
>> >>>>>>>>>>>>>>>>>>   *cluster.resource-blacklist.item.timeout*
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>> >> <
>> >>>>>>>>>>>>> chesnay@apache.org>
>> >>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>> >>>> lines
>> >>>>> a
>> >>>>>>>> bit.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>> >>>>> auto-detection
>> >>>>>>>> and
>> >>>>>>>>>> only
>> >>>>>>>>>>>>>> have an
>> >>>>>>>>>>>>>>>>>>> opt-in mechanism
>> >>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>> >> this
>> >>>>> sounds
>> >>>>>>>> yet
>> >>>>>>>>>>> again
>> >>>>>>>>>>>>>> like one
>> >>>>>>>>>>>>>>>>>>> of those
>> >>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>> >> just
>> >>>>> right.
>> >>>>>>>>>>>>>>>>>>> An external system can leverage way more
>> >>>>> information
>> >>>>>>>>> after
>> >>>>>>>>>>> all.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>> >>>> complexity
>> >>>>>>> of
>> >>>>>>>>> this
>> >>>>>>>>>>>>>> proposal.
>> >>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>> >>> between
>> >>>>>>>>>> components;
>> >>>>>>>>>>>>>>>>> adjustments
>> >>>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>> slot and resource protocol.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>> >>> resource
>> >>>>>>>>> management
>> >>>>>>>>>>>>>> perspective,
>> >>>>>>>>>>>>>>>>>>> then there
>> >>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>> >> protocol
>> >>>> at
>> >>>>>>> all
>> >>>>>>>> (or
>> >>>>>>>>>> in
>> >>>>>>>>>>>>> fact
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>> >>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>> >>> around
>> >>>>>>> blocked
>> >>>>>>>>> TMs
>> >>>>>>>>>>> in
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> first
>> >>>>>>>>>>>>>>>>>>> place.
>> >>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>> >>> down
>> >>>>>>> after
>> >>>>>>>> a
>> >>>>>>>>>>> while
>> >>>>>>>>>>>>> by
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> RM
>> >>>>>>>>>>>>>>>>>>> because of
>> >>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>> >>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>> >>>> blocking
>> >>>>>>> the
>> >>>>>>>>>> process
>> >>>>>>>>>>>>> or
>> >>>>>>>>>>>>>> node)
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>> >>>>>>> process(es).
>> >>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>> >>>> back
>> >>>>>>> up.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>> >> the
>> >>>> same
>> >>>>>>> line
>> >>>>>>>>> of
>> >>>>>>>>>>>>>> thinking to
>> >>>>>>>>>>>>>>>>>>> standalone resource management.
>> >>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>> >> process/node
>> >>>>>>> manually
>> >>>>>>>>>>> should
>> >>>>>>>>>>>>> be
>> >>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>> core
>> >>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>> >>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>> >>>> understand
>> >>>>>>> the
>> >>>>>>>>>> problem
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>> I see
>> >>>>>>>>>>>>>>>>>>> value
>> >>>>>>>>>>>>>>>>>>>> in the automatic detection and
>> >>> blocklisting. I
>> >>>>> do
>> >>>>>>>> have
>> >>>>>>>>>> some
>> >>>>>>>>>>>>>> concerns
>> >>>>>>>>>>>>>>>>>> with
>> >>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>> >>> blocked
>> >>>>>>>>> resources.
>> >>>>>>>>>> I
>> >>>>>>>>>>>>> have
>> >>>>>>>>>>>>>> two
>> >>>>>>>>>>>>>>>>>>>> concerns;
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>> >>>>> separation
>> >>>>>>> of
>> >>>>>>>>>>>>> concerns,
>> >>>>>>>>>>>>>>>>> meaning
>> >>>>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>> >>>> managing a
>> >>>>>>>>> cluster
>> >>>>>>>>>>> and
>> >>>>>>>>>>>>>> there's
>> >>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>> >>>>>>> introduction of
>> >>>>>>>>>> this
>> >>>>>>>>>>>>>> mechanism,
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>> latter group now can influence the
>> >>>>> responsibility
>> >>>>>>> of
>> >>>>>>>>> the
>> >>>>>>>>>>>>> first
>> >>>>>>>>>>>>>> group.
>> >>>>>>>>>>>>>>>>>> So
>> >>>>>>>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>> >>>> group
>> >>>>>>>> blocks
>> >>>>>>>>>>>>>> something,
>> >>>>>>>>>>>>>>>>>> which
>> >>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>> >>> paging
>> >>>>>>>>> mechanism
>> >>>>>>>>>>>>>> triggering
>> >>>>>>>>>>>>>>>>>> etc)
>> >>>>>>>>>>>>>>>>>>>> which impacts the first group.
>> >>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>> >> go
>> >>>>>>> through
>> >>>>>>>> the
>> >>>>>>>>>>>>> process
>> >>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>> manually
>> >>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>> >> it
>> >>>>>>> should
>> >>>>>>>>> be? I
>> >>>>>>>>>>> do
>> >>>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>> >>> make
>> >>>>>>> sense
>> >>>>>>>> to
>> >>>>>>>>>>>>>> introduce
>> >>>>>>>>>>>>>>>>> such
>> >>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>> >>> really
>> >>>>>>> small
>> >>>>>>>>> user
>> >>>>>>>>>>>>> group
>> >>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> Flink?
>> >>>>>>>>>>>>>>>>>>> We
>> >>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>> >>> such
>> >>>> a
>> >>>>>>>>> feature.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>> >>> I'm
>> >>>>>>>> leaning
>> >>>>>>>>>>>>> towards
>> >>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>> exposing
>> >>>>>>>>>>>>>>>>>>>> this to the user group but having this
>> >>>> available
>> >>>>>>>>> strictly
>> >>>>>>>>>>> for
>> >>>>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>> >> their
>> >>>>>>>>>>>>>> paging/metrics/logging
>> >>>>>>>>>>>>>>>>>>> system
>> >>>>>>>>>>>>>>>>>>>> to take this into account.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Martijn Visser
>> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>> >>>>>>>>>>> karmagyz@gmail.com
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>> >>> some
>> >>>>>>> cents
>> >>>>>>>>> here:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
>> >>>>>>>>>> to
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>> user?
>> >>>>>>>>>>>>>>>>>>>>> I think the semantics of
>> >>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
>> >>> the
>> >>>>>>>> timeout
>> >>>>>>>>>>>>>> mechanism is
>> >>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
>> >> it
>> >>>>> will
>> >>>>>>> be
>> >>>>>>>>> very
>> >>>>>>>>>>>>>> confusing
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>> >> exception
>> >>>> of a
>> >>>>>>> task
>> >>>>>>>>>>>>> manager to
>> >>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>> >>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>> >> fail
>> >>>> in
>> >>>>>>> case
>> >>>>>>>>> the
>> >>>>>>>>>>>>> target
>> >>>>>>>>>>>>>> task
>> >>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>> >> I
>> >>>>> don't
>> >>>>>>>> mean
>> >>>>>>>>> we
>> >>>>>>>>>>>>> need
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> cover
>> >>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>> >> also
>> >>>>> open a
>> >>>>>>>>>>>>>> `notifyException`
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
>> >>>>>>> ResourceManager,
>> >>>>>>>>> will
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>> slot of
>> >>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
>> >>> released
>> >>>>> and
>> >>>>>>>>>>> allocated?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>> Yangze Guo
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>> >> Wang
>> >>> <
>> >>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
>> >> response
>> >>>>> your 4
>> >>>>>>>>>> remarks:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>> >>>>> controversy. I
>> >>>>>>>>> think
>> >>>>>>>>>>>>>> “BlockList”
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>> good
>> >>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
>> >>>> good
>> >>>>>>> idea.
>> >>>>>>>>>> Based
>> >>>>>>>>>>>>> on
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>> above, I
>> >>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> DELETE
>> >> <host>/blocklist/node/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>> <host>/blocklist/taskmanager/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
>> >>>> means
>> >>>>>>> that
>> >>>>>>>>> all
>> >>>>>>>>>>>>> task
>> >>>>>>>>>>>>>>>>> managers
>> >>>>>>>>>>>>>>>>>>> on
>> >>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
>> >>>> these
>> >>>>>>> TMs
>> >>>>>>>> are
>> >>>>>>>>>> not
>> >>>>>>>>>>>>>>>>> available.
>> >>>>>>>>>>>>>>>>>>> This
>> >>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
>> >> these
>> >>>> TMs
>> >>>>>>> are
>> >>>>>>>>> not
>> >>>>>>>>>>>>> really
>> >>>>>>>>>>>>>> lost,
>> >>>>>>>>>>>>>>>>>>> they
>> >>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
>> >> are
>> >>>>> still
>> >>>>>>>>>>> registered
>> >>>>>>>>>>>>>> in this
>> >>>>>>>>>>>>>>>>>>> flink
>> >>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
>> >> once
>> >>>> the
>> >>>>>>>>>>>>> corresponding
>> >>>>>>>>>>>>>>>>>> blocklist
>> >>>>>>>>>>>>>>>>>>>>> item
>> >>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
>> >>>>>>>>>>> active/non-active
>> >>>>>>>>>>>>>>>>> clusters.
>> >>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
>> >> TMs
>> >>>> may
>> >>>>> be
>> >>>>>>>>>> released
>> >>>>>>>>>>>>> due
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> idle
>> >>>>>>>>>>>>>>>>>>>>>> timeouts.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
>> >> keep
>> >>>> it.
>> >>>>>>> The
>> >>>>>>>>>> reasons
>> >>>>>>>>>>>>> are
>> >>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>>>>>> following:
>> >>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>> >>> adding
>> >>>> or
>> >>>>>>>>> removing
>> >>>>>>>>>>>>> items
>> >>>>>>>>>>>>>> via
>> >>>>>>>>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>>>>>>>> API,
>> >>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>> >> it
>> >>>> to
>> >>>>>>>>>>>>> Long.MAX_VALUE .
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>> >> a
>> >>>>>>> period of
>> >>>>>>>>>> time
>> >>>>>>>>>>>>>> (such as
>> >>>>>>>>>>>>>>>>>>> machine
>> >>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>> >> prefer
>> >>>> that
>> >>>>>>>> Flink
>> >>>>>>>>>> can
>> >>>>>>>>>>> do
>> >>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>> >>> user
>> >>>> to
>> >>>>>>> do
>> >>>>>>>> it
>> >>>>>>>>>>>>> manually.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
>> >>>>>>> 于2022年4月27日周三
>> >>>>>>>>>>>>> 19:23写道：
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>> >> only
>> >>>>>>> support
>> >>>>>>>>>>> manually
>> >>>>>>>>>>>>>> blocking
>> >>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>> >>> different
>> >>>>>>>>> strategies
>> >>>>>>>>>>> can
>> >>>>>>>>>>>>>> also be
>> >>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
>> >>>> "bla.cklist"
>> >>>>>>> due
>> >>>>>>>> to
>> >>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> controversy
>> >>>>>>>>>>>>>>>>>>>>> around
>> >>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
>> >>> Ticket
>> >>>>>>> about
>> >>>>>>>>> this
>> >>>>>>>>>>>>> topic a
>> >>>>>>>>>>>>>>>>> while
>> >>>>>>>>>>>>>>>>>>>>> back
>> >>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
>> >>>> avoid
>> >>>>>>> the
>> >>>>>>>>> term
>> >>>>>>>>>>>>>> blacklist &
>> >>>>>>>>>>>>>>>>>>>>> whitelist
>> >>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
>> >> "denylist"
>> >>> or
>> >>>>>>>>>>> "quarantined"
>> >>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
>> >>> slightly
>> >>>>>>>> different
>> >>>>>>>>>>>>> design
>> >>>>>>>>>>>>>> as
>> >>>>>>>>>>>>>>>>> verbs
>> >>>>>>>>>>>>>>>>>>>>> like
>> >>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
>> >>> anti-pattern
>> >>>>> for
>> >>>>>>>> REST
>> >>>>>>>>>>> APIs.
>> >>>>>>>>>>>>>> POST
>> >>>>>>>>>>>>>>>>> on a
>> >>>>>>>>>>>>>>>>>>>>> list
>> >>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
>> >>>> items.
>> >>>>>>>> DELETE
>> >>>>>>>>> on
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> individual
>> >>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>> <host>/quarantine/items/<itemidentifier>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>> >>>>> taskmanagers
>> >>>>>>>> and
>> >>>>>>>>>>> nodes
>> >>>>>>>>>>>>> in
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>>>>>>>> API
>> >>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>> >>> opinion
>> >>>>> on
>> >>>>>>>> this?
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>> <host>/quarantine/nodes/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>>> <host>/quarantine/taskmanager/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>> >>>>>>> non-active
>> >>>>>>>>>>>>> resource
>> >>>>>>>>>>>>>>>>>> managers,
>> >>>>>>>>>>>>>>>>>>>>> i.e.
>> >>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>> >>>>>>> minimal,
>> >>>>>>>> do
>> >>>>>>>>> we
>> >>>>>>>>>>>>> need
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>> timeout
>> >>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>> >>>> manually
>> >>>>>>> we
>> >>>>>>>>> could
>> >>>>>>>>>>>>>> delegate
>> >>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>> >>>>> behavior
>> >>>>>>>>> would
>> >>>>>>>>>>>>> better
>> >>>>>>>>>>>>>> fit
>> >>>>>>>>>>>>>>>>>> into
>> >>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Konstantin
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>> >>>>>>>>>>>>>>>>>>>>>>> [2]
>> >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>> >>> schrieb
>> >>>>>>> Lijie
>> >>>>>>>>> Wang
>> >>>>>>>>>> <
>> >>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>> >>>> cluster
>> >>>>>>> node
>> >>>>>>>>>>> issues
>> >>>>>>>>>>>>>>>>>>>>> (insufficient
>> >>>>>>>>>>>>>>>>>>>>>>> disk
>> >>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>> >>>> abnormalities).
>> >>>>>>>> Flink
>> >>>>>>>>>> will
>> >>>>>>>>>>>>>> take care
>> >>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>> >>> However,
>> >>>>> due
>> >>>>>>> to
>> >>>>>>>>> data
>> >>>>>>>>>>>>>> locality
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>> limited
>> >>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>> >> likely
>> >>>> to
>> >>>>> be
>> >>>>>>>>>>> redeployed
>> >>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>> same
>> >>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>> >>>> task
>> >>>>>>>>>>> abnormalities
>> >>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> affect
>> >>>>>>>>>>>>>>>>>>>>> job
>> >>>>>>>>>>>>>>>>>>>>>>>> progress.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
>> >> manually
>> >>>>>>> identify
>> >>>>>>>>> the
>> >>>>>>>>>>>>>> problematic
>> >>>>>>>>>>>>>>>>>>>>> node and
>> >>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
>> >>> But
>> >>>>>>> this
>> >>>>>>>>>>> approach
>> >>>>>>>>>>>>> has
>> >>>>>>>>>>>>>>>>>>>>> following
>> >>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
>> >> heavy
>> >>>>>>> process.
>> >>>>>>>>>> Users
>> >>>>>>>>>>>>> may
>> >>>>>>>>>>>>>> need
>> >>>>>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>>>>>> contact
>> >>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
>> >>>>> operation
>> >>>>>>> can
>> >>>>>>>>>> even
>> >>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>> dangerous
>> >>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
>> >>>>> events.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
>> >> of
>> >>>>>>> problems
>> >>>>>>>>>>> manually
>> >>>>>>>>>>>>>> would
>> >>>>>>>>>>>>>>>>> be
>> >>>>>>>>>>>>>>>>>>>>> slow
>> >>>>>>>>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>> >>>> propose
>> >>>>>>> to
>> >>>>>>>>>>>>> introduce a
>> >>>>>>>>>>>>>>>>>> blacklist
>> >>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>> >>>>> problematic
>> >>>>>>>>>>> resources.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
>> >>> FLIP-224[1].
>> >>>>>>> Looking
>> >>>>>>>>>>> forward
>> >>>>>>>>>>>>>> to your
>> >>>>>>>>>>>>>>>>>>>>>>> feedback.
>> >>>>>>>>>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best regards,
>> >>>>>>>>> Roman Boyko
>> >>>>>>>>> e.: ro.v.boyko@gmail.com
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> https://twitter.com/snntrable
>> >>>> https://github.com/knaufk
>> >>>>
>> >>>
>> >>
>> >>
>> >> --
>> >> https://twitter.com/snntrable
>> >> https://github.com/knaufk
>> >>
>>
>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi Weihua,
thanks for feedback.

1. Yes, only *Manually* is supported in this FLIP, but it's the first step
towards auto-detection.
2. We wii print the blocked nodes in logs. Maybe also put it into the
exception of insufficient resources.
3. No. This FLIP won't change the WebUI. The blocklist information can be
obtained through REST API and metrics.

Best,
Lijie

Weihua Hu <hu...@gmail.com> 于2022年5月17日周二 21:41写道：

> Hi,
> Thanks for creating this FLIP.
> We have implemented an automatic blocklist detection mechanism internally,
> which is indeed very effective for handling node failures.
> Due to the large number of nodes, although SREs already support automatic
> offline failure nodes, the detection is not 100% accurate and there is some
> delay.
> So the blocklist mechanism can make flink job recover from failure much
> faster.
>
> Here are some of my thoughts:
> 1. In this FLIP, it needs users to locate machine failure manually, there
> is a certain cost of use
> 2. What happens if too many nodes are blocked, resulting in insufficient
> resources? Will there be a special Exception for the user?
> 3. Will we display the blocklist information in the WebUI? The blocklist
> is for cluster level, and if multiple users share a cluster, some users may
> be a little confused when resources are not enough, or when resources are
> applied for more.
>
> Also, Looking forward to the next FLIP on auto-detection.
>
> Best,
> Weihua
>
> > 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> >
> > Hi Konstantin,
> >
> > Maybe change it to the following:
> >
> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > Merge is not allowed. If the {id} already exists, return error.
> Otherwise,
> > create a new item.
> >
> > 2. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> > Merge is allowed. If the {id} already exists, merge. Otherwise, create a
> > new item.
> >
> > WDYT?
> >
> > Best,
> > Lijie
> >
> > Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> >
> >> Hi Lijie,
> >>
> >> hm, maybe the following is more appropriate in that case
> >>
> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> >>
> >> Best,
> >>
> >> Konstantin
> >>
> >> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> >> wangdachui9501@gmail.com>:
> >>
> >>> Hi Konstantin,
> >>> thanks for your feedback.
> >>>
> >>> From what I understand, PUT should be idempotent. However, we have a
> >>> *timeout* field in the request. This means that initiating the same
> >> request
> >>> at two different times will lead to different resource status
> (timestamps
> >>> of the items to be removed will be different).
> >>>
> >>> Should we use PUT in this case? WDYT?
> >>>
> >>> Best,
> >>> Lijie
> >>>
> >>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> >>>
> >>>> Hi Lijie,
> >>>>
> >>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT on
> >> the
> >>>> resource?
> >>>>
> >>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >>>>
> >>>> Best,
> >>>>
> >>>> Konstantin
> >>>>
> >>>>
> >>>>
> >>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> >>>> wangdachui9501@gmail.com>:
> >>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and made
> >>> the
> >>>>> following changes on REST API:
> >>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
> >>> one.
> >>>> If
> >>>>> both are specified, will return error.
> >>>>> 2.  If the specified item is already there, the *ADD* operation has
> >> two
> >>>>> behaviors:  *return error*(default value) or *merge/update*, and we
> >>> add a
> >>>>> flag to the request body to control it. You can find more details
> >>> "Public
> >>>>> Interface" section.
> >>>>>
> >>>>> If there is no more feedback, we will start the vote thread next
> >> week.
> >>>>>
> >>>>> Best,
> >>>>> Lijie
> >>>>>
> >>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> >>>>>
> >>>>>> Hi Becket Qin,
> >>>>>>
> >>>>>> Thanks for your suggestions.  I have moved the description of
> >>>>>> configurations, metrics and REST API into "Public Interface"
> >> section,
> >>>> and
> >>>>>> made a few updates according to your suggestion.  And in this FLIP,
> >>>> there
> >>>>>> no public java Interfaces or pluggables that users need to
> >> implement
> >>> by
> >>>>>> themselves.
> >>>>>>
> >>>>>> Answers for you questions:
> >>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> >>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> >> items
> >>>> can
> >>>>>> only be added through the REST API, so these 2 action are mentioned
> >>> in
> >>>>> the
> >>>>>> REST API part (The REST API part has beed moved to public interface
> >>>> now).
> >>>>>> 2. I agree with you. I have changed the "Cause" field to String,
> >> and
> >>>>> allow
> >>>>>> users to specify it via REST API.
> >>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
> >> above,
> >>>> we
> >>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the ADD
> >>>> REST
> >>>>>> API to specify when to remove the blocked item. These 2 fields are
> >>>>>> optional, if neither is specified, it means that the blocked item
> >> is
> >>>>>> permanent and will not be removed. If both are specified, the
> >> minimum
> >>>> of
> >>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as the
> >>> time
> >>>> to
> >>>>>> remove the blocked item. To keep the configurations more minimal,
> >> we
> >>>> have
> >>>>>> removed the *cluster.resource-blocklist.item.timeout* configuration
> >>>>>> option.
> >>>>>> 4. Yes, the block item will be overridden if the specified item
> >>> already
> >>>>>> exists. The ADD operation is *ADD or UPDATE*.
> >>>>>> 5. Yes. On JM/RM side, all the blocklist information is maintained
> >> in
> >>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> >>>>> abstracted
> >>>>>> to other interfaces) will be propagated to different components.
> >>>>>>
> >>>>>> Best,
> >>>>>> Lijie
> >>>>>>
> >>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> >>>>>>
> >>>>>>> Hi Lijie,
> >>>>>>>
> >>>>>>> Thanks for updating the FLIP. It looks like the public interface
> >>>> section
> >>>>>>> did not fully reflect all the user sensible behavior and API. Can
> >>> you
> >>>>> put
> >>>>>>> everything that users may be aware of there? That would include
> >> the
> >>>> REST
> >>>>>>> API, metrics, configurations, public java Interfaces or pluggables
> >>>> that
> >>>>>>> users may see or implement by themselves, as well as a brief
> >> summary
> >>>> of
> >>>>>>> the
> >>>>>>> behavior of the public API.
> >>>>>>>
> >>>>>>> Besides that, I have a few questions:
> >>>>>>>
> >>>>>>> 1. According to the conversation in the discussion thread, it
> >> looks
> >>>> like
> >>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
> >>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
> >> can
> >>>> you
> >>>>>>> add
> >>>>>>> that to the public interface as well?
> >>>>>>>
> >>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
> >>> Throwable
> >>>>> and
> >>>>>>> is not reflected in the REST API. Should that be included in the
> >>> query
> >>>>>>> response? And should we change that field to be a String so users
> >>> may
> >>>>>>> specify the cause via the REST API when they block some nodes /
> >> TMs?
> >>>>>>>
> >>>>>>> 3. Would it be useful to allow users to have different timeouts
> >> for
> >>>>>>> different blocked items? So while there is a default timeout,
> >> users
> >>>> can
> >>>>>>> also override it via the REST API when they block an entity.
> >>>>>>>
> >>>>>>> 4. Regarding the ADD operation, if the specified item is already
> >>>> there,
> >>>>>>> will the block item be overridden? For example, if the user wants
> >> to
> >>>>>>> extend
> >>>>>>> the timeout of a blocked item, can they just  issue an ADD command
> >>>>> again?
> >>>>>>>
> >>>>>>> 5. I am not quite familiar with the details of this, but is there
> >> a
> >>>>> source
> >>>>>>> of truth for the blocked list? I think it might be good to have a
> >>>> single
> >>>>>>> source of truth for the blocked list and just propagate that list
> >> to
> >>>>>>> different components to take the action of actually blocking the
> >>>>> resource.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jiangjie (Becket) Qin
> >>>>>>>
> >>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> >> wangdachui9501@gmail.com
> >>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> Based on the discussion in the mailing list, I updated the FLIP
> >>> doc,
> >>>>> the
> >>>>>>>> changes include:
> >>>>>>>> 1. Changed the description of the motivation section to more
> >>> clearly
> >>>>>>>> describe the problem this FLIP is trying to solve.
> >>>>>>>> 2. Only  *Manually* is supported.
> >>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Lijie
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >>>>>>>>
> >>>>>>>>> Hi Lijie!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
> >>>>>>> helpfulhere.”
> >>>>>>>>> Can you explain it in detail? I don't fully understand that.
> >> In
> >>>>>>>> myopinion,
> >>>>>>>>> non-active and active are the same, and no special treatment
> >>>>>>> isrequired.*
> >>>>>>>>>
> >>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
> >>> were
> >>>>>>> talking
> >>>>>>>>> about the HA mode (but not about Active and Standalone
> >>>>>>> ResourceManager).
> >>>>>>>>> And the original question was - how to handle the blacklisted
> >>>> nodes
> >>>>>>> list
> >>>>>>>> at
> >>>>>>>>> the moment of leader change? Should we simply forget about
> >> them
> >>> or
> >>>>>>> try to
> >>>>>>>>> pre-save that list on the remote storage?
> >>>>>>>>>
> >>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey.wy@gmail.com
> >>>
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
> >>>>>>>>>>
> >>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> >> it
> >>> is
> >>>>>>> indeed
> >>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
> >>>> could
> >>>>>>> not
> >>>>>>>>>> support.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Yang
> >>>>>>>>>>
> >>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> >>>>>>>>>>
> >>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
> >>>>>>>>>>>
> >>>>>>>>>>> @Jiangang
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
> >> and
> >>>>> mark a
> >>>>>>>> node
> >>>>>>>>>>> blocked?
> >>>>>>>>>>>
> >>>>>>>>>>> In fact, we currently plan to not support auto-detection
> >> in
> >>>> this
> >>>>>>>> FLIP.
> >>>>>>>>>> The
> >>>>>>>>>>> part about auto-detection may be continued in a separate
> >>> FLIP
> >>>> in
> >>>>>>> the
> >>>>>>>>>>> future. Some guys have the same concerns as you, and the
> >>>>>>> correctness
> >>>>>>>>> and
> >>>>>>>>>>> necessity of auto-detection may require further discussion
> >>> in
> >>>>> the
> >>>>>>>>> future.
> >>>>>>>>>>>
> >>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
> >> node
> >>>> and
> >>>>>>> the
> >>>>>>>>> node
> >>>>>>>>>>> should be marked blocked.
> >>>>>>>>>>> By design, the blocklist information will be shared among
> >>> all
> >>>>> jobs
> >>>>>>>> in a
> >>>>>>>>>>> cluster/session. The JM will sync blocklist information
> >> with
> >>>> RM.
> >>>>>>>>>>>
> >>>>>>>>>>> @Martijn
> >>>>>>>>>>>
> >>>>>>>>>>>> I agree with Yang Wang on this.
> >>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
> >>>>>>> MARK_BLOCKLISTED(Just
> >>>>>>>>>> limits
> >>>>>>>>>>> the load of the node and does not  kill all the processes
> >> on
> >>>> it)
> >>>>>>> is
> >>>>>>>>> also
> >>>>>>>>>>> important, and we think that external systems (*yarn
> >> rmadmin
> >>>> or
> >>>>>>>> kubectl
> >>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
> >>>> only
> >>>>>>>>>> *manually*.
> >>>>>>>>>>>
> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> >>> indeed
> >>>>>>> super
> >>>>>>>>> hard
> >>>>>>>>>>> to get right.
> >>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
> >>>>> concern.
> >>>>>>>>>>> However, we currently plan to not support auto-detection
> >> in
> >>>> this
> >>>>>>>> FLIP,
> >>>>>>>>>> and
> >>>>>>>>>>> only *manually*. In addition, I'd like to say that the
> >> FLIP
> >>>>>>> provides
> >>>>>>>> a
> >>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
> >>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> >>>>>>>>>>> the auto-detection may be done by external systems.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Lijie
> >>>>>>>>>>>
> >>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> >>> 19:04写道：
> >>>>>>>>>>>
> >>>>>>>>>>>>> If we only support to block nodes manually, then I
> >> could
> >>>> not
> >>>>>>> see
> >>>>>>>>>>>> the obvious advantages compared with current SRE's
> >>>>> approach(via
> >>>>>>>> *yarn
> >>>>>>>>>>>> rmadmin or kubectl taint*).
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree with Yang Wang on this.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> To me this sounds yet again like one of those magical
> >>>>>>> mechanisms
> >>>>>>>>>> that
> >>>>>>>>>>>> will rarely work just right.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
> >>> indeed
> >>>>>>> super
> >>>>>>>>> hard
> >>>>>>>>>>> to
> >>>>>>>>>>>> get right.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Martijn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> >>>>>>>> liujiangangpeng@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
> >>>>> decrease
> >>>>>>>>> great
> >>>>>>>>>>> work
> >>>>>>>>>>>>> for us. We have implemented the similar feature in our
> >>>> inner
> >>>>>>> flink
> >>>>>>>>>>>>> version.
> >>>>>>>>>>>>> Below is something that I care about:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   1. For auto-detecting, I wonder how to make the
> >>> strategy
> >>>>> and
> >>>>>>>>> mark a
> >>>>>>>>>>>>> node
> >>>>>>>>>>>>>   blocked? Sometimes the blocked node is hard to be
> >>>>> detected,
> >>>>>>> for
> >>>>>>>>>>>>> example,
> >>>>>>>>>>>>>   the upper node or the down node will be blocked when
> >>>>> network
> >>>>>>>>>>>>> unreachable.
> >>>>>>>>>>>>>   2. I see that the strategy is made in JobMaster
> >> side.
> >>>> How
> >>>>>>> about
> >>>>>>>>>>>>>   implementing the similar logic in resource manager?
> >> In
> >>>>>>> session
> >>>>>>>>>> mode,
> >>>>>>>>>>>>> multi
> >>>>>>>>>>>>>   jobs can fail on the same bad node and the node
> >> should
> >>>> be
> >>>>>>>> marked
> >>>>>>>>>>>>> blocked.
> >>>>>>>>>>>>>   If the job makes the strategy, the node may be not
> >>>> marked
> >>>>>>>> blocked
> >>>>>>>>>> if
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>   fail times don't exceed the threshold.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thank you for all your feedback!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
> >> some
> >>> of
> >>>>> my
> >>>>>>>>>> thoughts:
> >>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
> >>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
> >>>>>>>>>>>>>> The automatical way should be something built upon
> >> the
> >>>>>>> blocklist
> >>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
> >>>>>>>> configurable
> >>>>>>>>>>>>>> blocklist strategy, but I think we can further
> >> decouple
> >>>> it
> >>>>> by
> >>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
> >>>> suggested,
> >>>>>>> which
> >>>>>>>>>> just
> >>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
> >>> detected.
> >>>>>>>> However,
> >>>>>>>>>> it
> >>>>>>>>>>>>>> should be a separate FLIP with further dev
> >> discussions
> >>>> and
> >>>>>>>>> feedback
> >>>>>>>>>>>>>> from users. I also agree with Becket that different
> >>> users
> >>>>>>> have
> >>>>>>>>>>> different
> >>>>>>>>>>>>>> requirements, and we should listen to them.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
> >>>> externally
> >>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
> >> way
> >>> to
> >>>>>>> avoid
> >>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
> >>>> users
> >>>>>>> may
> >>>>>>>>> just
> >>>>>>>>>>>>>> want to limit the load of the node and do not want to
> >>>> kill
> >>>>>>> all
> >>>>>>>> the
> >>>>>>>>>>>>>> processes on it. Another case is the speculative
> >>>>> execution[1]
> >>>>>>>>> which
> >>>>>>>>>>>>>> may also leverage this feature to avoid starting
> >> mirror
> >>>>>>> tasks on
> >>>>>>>>>> slow
> >>>>>>>>>>>>>> nodes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Zhu
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> >>>>> 15:56写道：
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for your feedback.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
> >> here
> >>>>>>> because
> >>>>>>>> it
> >>>>>>>>>> can
> >>>>>>>>>>>>>> affect the value and design of the blocklist
> >> mechanism
> >>>>>>> (perhaps
> >>>>>>>> I
> >>>>>>>>>>> should
> >>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
> >>>> FLIP:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
> >>> node
> >>>>> as
> >>>>>>>>>> blocked.
> >>>>>>>>>>>>>> Future slots should not be allocated from the blocked
> >>>> task
> >>>>>>>> manager
> >>>>>>>>>> or
> >>>>>>>>>>>>> node.
> >>>>>>>>>>>>>> But slots that are already allocated will not be
> >>>> affected.
> >>>>> A
> >>>>>>>>> typical
> >>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
> >>> In
> >>>>> this
> >>>>>>>>> case,
> >>>>>>>>>> we
> >>>>>>>>>>>>> hope
> >>>>>>>>>>>>>> that subsequent resource allocations will not be on
> >> the
> >>>> hot
> >>>>>>>>> machine,
> >>>>>>>>>>> but
> >>>>>>>>>>>>>> tasks currently running on it should not be affected.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> >> task
> >>>>>>> manager
> >>>>>>>> or
> >>>>>>>>>>> node
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
> >> tasks
> >>>> will
> >>>>>>> be
> >>>>>>>>>>>>> restarted on
> >>>>>>>>>>>>>> non-blocked task managers.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For the above 2 actions, the former may more
> >>> highlight
> >>>>> the
> >>>>>>>>> meaning
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> this FLIP, because the external system cannot do
> >> that.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
> >> basically
> >>>>> agree
> >>>>>>>> with
> >>>>>>>>>>>>> @Becket
> >>>>>>>>>>>>>> Qin: different users have different answers. Not all
> >>>> users’
> >>>>>>>>>> deployment
> >>>>>>>>>>>>>> environments have a special external system that can
> >>>>> perform
> >>>>>>> the
> >>>>>>>>>>> anomaly
> >>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
> >>>>>>> auto-detection
> >>>>>>>>>>> doesn't
> >>>>>>>>>>>>>> require much extra work on top of manual
> >> specification.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I will answer your other questions one by one.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Yangze
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
> >>> the
> >>>>>>>>>>>>>>
> >>> `cluster.resource-blocklist.item.timeout-check-interval`
> >>>> to
> >>>>>>>> users.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
> >>> separate
> >>>>>>>> interface
> >>>>>>>>>>>>> (maybe
> >>>>>>>>>>>>>> BlocklistExceptionListener), and the
> >>>>>>>>> ResourceManagerBlocklistHandler
> >>>>>>>>>>> can
> >>>>>>>>>>>>>> implement it in the future.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Martijn
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a) I also think the manual blocking should be done
> >> by
> >>>>>>> cluster
> >>>>>>>>>>>>> operators.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
> >>>> according
> >>>>>>> to
> >>>>>>>> my
> >>>>>>>>>>>>>> experience, users are often the first to perceive the
> >>>>> machine
> >>>>>>>>>> problems
> >>>>>>>>>>>>>> (because of job failover or delay), and they will
> >>> contact
> >>>>>>>> cluster
> >>>>>>>>>>>>> operators
> >>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
> >>>>>>> machine is
> >>>>>>>>>>>>>> problematic. From this point of view, I think the
> >>> people
> >>>>> who
> >>>>>>>>> really
> >>>>>>>>>>> need
> >>>>>>>>>>>>>> the manual blocking are the users, and it’s just
> >>>> performed
> >>>>> by
> >>>>>>>> the
> >>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Chesnay
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
> >>> for
> >>>>>>>>>>>>> MARK_BLOCKLISTED
> >>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
> >> when
> >>>> the
> >>>>>>> task
> >>>>>>>> is
> >>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> >>>> release
> >>>>>>> the
> >>>>>>>>> slot
> >>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
> >>> on
> >>>>> it.
> >>>>>>> If
> >>>>>>>> we
> >>>>>>>>>>> only
> >>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
> >> needs
> >>> to
> >>>>>>>> retrieve
> >>>>>>>>>> it
> >>>>>>>>>>> by
> >>>>>>>>>>>>>> RPC. I think the performance overhead of that is
> >>>> relatively
> >>>>>>>> large,
> >>>>>>>>>> so
> >>>>>>>>>>> I
> >>>>>>>>>>>>>> think it's worth maintaining the blocklist
> >> information
> >>> on
> >>>>>>> the JM
> >>>>>>>>>> side
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>> syncing them.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Роман
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>    a) “Probably storing inside Zookeeper/Configmap
> >>>> might
> >>>>>>> be
> >>>>>>>>>> helpful
> >>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
> >>>>>>> understand
> >>>>>>>>> that.
> >>>>>>>>>>> In
> >>>>>>>>>>>>> my
> >>>>>>>>>>>>>> opinion, non-active and active are the same, and no
> >>>> special
> >>>>>>>>>> treatment
> >>>>>>>>>>> is
> >>>>>>>>>>>>>> required.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
> >> sense,
> >>> I
> >>>>> will
> >>>>>>>> add
> >>>>>>>>> it
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Yang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
> >> cannot
> >>>>>>> support
> >>>>>>>> the
> >>>>>>>>>>>>>> MARK_BLOCKLISTED action.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Looking forward to your further feedback.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
> >>>> 21:09写道：
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
> >>>>>>> operations.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
> >>>>>>> Reliability
> >>>>>>>>>>> Engineer)
> >>>>>>>>>>>>>>>> already has many tools to detect the unstable
> >> nodes,
> >>>>> which
> >>>>>>>>> could
> >>>>>>>>>>> take
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> system logs/metrics into consideration.
> >>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
> >> taint
> >>>> in
> >>>>>>> K8s
> >>>>>>>> to
> >>>>>>>>>>>>> prevent
> >>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>> allocations on these unstable nodes.
> >>>>>>>>>>>>>>>> At last, they will evict all the containers and
> >> pods
> >>>>>>> running
> >>>>>>>> on
> >>>>>>>>>>> these
> >>>>>>>>>>>>>> nodes.
> >>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
> >>> So
> >>>> I
> >>>>> am
> >>>>>>>>> afraid
> >>>>>>>>>>>>> this
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
> >>>> could
> >>>>>>> not
> >>>>>>>>> see
> >>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
> >>>>>>>> approach(via
> >>>>>>>>>>> *yarn
> >>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
> >>>>>>>>>>>>>>>> At least, we need to have a pluggable component
> >>> which
> >>>>>>> could
> >>>>>>>>>> expose
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> potential unstable nodes automatically and block
> >>> them
> >>>> if
> >>>>>>>>> enabled
> >>>>>>>>>>>>>> explicitly.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Yang
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> >>>> 16:36写道：
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
> >> and
> >>>>>>> somewhat
> >>>>>>>>>>> related
> >>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>> design principle about how people should operate
> >>>>> Flink.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think there are three things involved in this
> >>>> FLIP.
> >>>>>>>>>>>>>>>>>     a) Detect and report the unstable node.
> >>>>>>>>>>>>>>>>>     b) Collect the information of the unstable
> >>> node
> >>>>> and
> >>>>>>>>> form a
> >>>>>>>>>>>>>> blocklist.
> >>>>>>>>>>>>>>>>>     c) Take the action to block nodes.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> My two cents:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
> >>> should
> >>>>> have
> >>>>>>>> c).
> >>>>>>>>> It
> >>>>>>>>>>> is
> >>>>>>>>>>>>>> not only
> >>>>>>>>>>>>>>>>> useful for cases of node failures, but also
> >> handy
> >>>> for
> >>>>>>> some
> >>>>>>>>>>> planned
> >>>>>>>>>>>>>>>>> maintenance.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
> >> who
> >>>>>>> should be
> >>>>>>>>> the
> >>>>>>>>>>>>> brain
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> make the decision to block a node. I think this
> >>>>> largely
> >>>>>>>>> depends
> >>>>>>>>>>> on
> >>>>>>>>>>>>>> who we
> >>>>>>>>>>>>>>>>> talk to. Different users would probably give
> >>>> different
> >>>>>>>>> answers.
> >>>>>>>>>>> For
> >>>>>>>>>>>>>> people
> >>>>>>>>>>>>>>>>> who do have a centralized node health management
> >>>>>>> service,
> >>>>>>>> let
> >>>>>>>>>>> Flink
> >>>>>>>>>>>>>> do just
> >>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
> >>>> Flink
> >>>>>>> would
> >>>>>>>>> be
> >>>>>>>>>>> one
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
> >> it
> >>> to
> >>>>>>> that
> >>>>>>>>>>> service,
> >>>>>>>>>>>>>> and then
> >>>>>>>>>>>>>>>>> take the command from that service to block the
> >>>>>>> problematic
> >>>>>>>>>>> nodes.
> >>>>>>>>>>>>> On
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> other hand, for users who do not have such a
> >>>> service,
> >>>>>>>> simply
> >>>>>>>>>>>>> letting
> >>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
> >> nodes
> >>>>> might
> >>>>>>> be
> >>>>>>>>>>> desired
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
> >>>> pluggable /
> >>>>>>>>>> optional.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
> >>> have
> >>>>>>>> something
> >>>>>>>>>>>>>> pluggable
> >>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
> >> unstable
> >>>>> nodes
> >>>>>>>>>>> actively.
> >>>>>>>>>>>>> (A
> >>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
> >>> which
> >>>>>>> can be
> >>>>>>>>>> used
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> report
> >>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
> >>> the
> >>>>>>> scope
> >>>>>>>>>>> relevant
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
> >> OK
> >>> to
> >>>>>>> have a
> >>>>>>>>>>> default
> >>>>>>>>>>>>>>>>> implementation of a reporter which just tells
> >>> Flink
> >>>> to
> >>>>>>> take
> >>>>>>>>>>> action
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> block
> >>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
> >>>>> timeout.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> >>>>>>>>>> ro.v.boyko@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
> >>>>>>> development.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
> >>> we
> >>>>>>> should
> >>>>>>>>>>> consider
> >>>>>>>>>>>>>> to move
> >>>>>>>>>>>>>>>>>> somehow the information of blocklisted
> >> nodes/TMs
> >>>>> from
> >>>>>>>>> active
> >>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
> >>>> storing
> >>>>>>>> inside
> >>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
> >>>> organizations
> >>>>>>>> don't
> >>>>>>>>>> want
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> expose
> >>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
> >>>> it's
> >>>>>>>>> necessary
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> have the
> >>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
> >>> for
> >>>>>>>> avoiding
> >>>>>>>>>>>>> incorrect
> >>>>>>>>>>>>>>>>>> automatic behaviour.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
> >>>> would
> >>>>> be
> >>>>>>>>> better
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> extend
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
> >> *endTimestamp*
> >>>>> field
> >>>>>>> and
> >>>>>>>>>> fill
> >>>>>>>>>>> it
> >>>>>>>>>>>>>> at the
> >>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
> >>> to:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>   -
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>   Provide the ability to users to setup the
> >>> exact
> >>>>>>> time
> >>>>>>>> of
> >>>>>>>>>>>>>> blocklist end
> >>>>>>>>>>>>>>>>>>   through RestAPI
> >>>>>>>>>>>>>>>>>>   -
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>   Not being tied to a single value of
> >>>>>>>>>>>>>>>>>>   *cluster.resource-blacklist.item.timeout*
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> >> <
> >>>>>>>>>>>>> chesnay@apache.org>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
> >>>> lines
> >>>>> a
> >>>>>>>> bit.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
> >>>>> auto-detection
> >>>>>>>> and
> >>>>>>>>>> only
> >>>>>>>>>>>>>> have an
> >>>>>>>>>>>>>>>>>>> opt-in mechanism
> >>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
> >> this
> >>>>> sounds
> >>>>>>>> yet
> >>>>>>>>>>> again
> >>>>>>>>>>>>>> like one
> >>>>>>>>>>>>>>>>>>> of those
> >>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
> >> just
> >>>>> right.
> >>>>>>>>>>>>>>>>>>> An external system can leverage way more
> >>>>> information
> >>>>>>>>> after
> >>>>>>>>>>> all.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
> >>>> complexity
> >>>>>>> of
> >>>>>>>>> this
> >>>>>>>>>>>>>> proposal.
> >>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
> >>> between
> >>>>>>>>>> components;
> >>>>>>>>>>>>>>>>> adjustments
> >>>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>> slot and resource protocol.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> If we look at it purely from an active
> >>> resource
> >>>>>>>>> management
> >>>>>>>>>>>>>> perspective,
> >>>>>>>>>>>>>>>>>>> then there
> >>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
> >> protocol
> >>>> at
> >>>>>>> all
> >>>>>>>> (or
> >>>>>>>>>> in
> >>>>>>>>>>>>> fact
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> anything in the JobMaster),
> >>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
> >>> around
> >>>>>>> blocked
> >>>>>>>>> TMs
> >>>>>>>>>>> in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> first
> >>>>>>>>>>>>>>>>>>> place.
> >>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
> >>> down
> >>>>>>> after
> >>>>>>>> a
> >>>>>>>>>>> while
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> RM
> >>>>>>>>>>>>>>>>>>> because of
> >>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
> >>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
> >>>> blocking
> >>>>>>> the
> >>>>>>>>>> process
> >>>>>>>>>>>>> or
> >>>>>>>>>>>>>> node)
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
> >>>>>>> process(es).
> >>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
> >>>> back
> >>>>>>> up.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
> >> the
> >>>> same
> >>>>>>> line
> >>>>>>>>> of
> >>>>>>>>>>>>>> thinking to
> >>>>>>>>>>>>>>>>>>> standalone resource management.
> >>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
> >> process/node
> >>>>>>> manually
> >>>>>>>>>>> should
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> core
> >>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
> >>>>>>>>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
> >>>> understand
> >>>>>>> the
> >>>>>>>>>> problem
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>> I see
> >>>>>>>>>>>>>>>>>>> value
> >>>>>>>>>>>>>>>>>>>> in the automatic detection and
> >>> blocklisting. I
> >>>>> do
> >>>>>>>> have
> >>>>>>>>>> some
> >>>>>>>>>>>>>> concerns
> >>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
> >>> blocked
> >>>>>>>>> resources.
> >>>>>>>>>> I
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>>>>> concerns;
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
> >>>>> separation
> >>>>>>> of
> >>>>>>>>>>>>> concerns,
> >>>>>>>>>>>>>>>>> meaning
> >>>>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
> >>>> managing a
> >>>>>>>>> cluster
> >>>>>>>>>>> and
> >>>>>>>>>>>>>> there's
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
> >>>>>>> introduction of
> >>>>>>>>>> this
> >>>>>>>>>>>>>> mechanism,
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> latter group now can influence the
> >>>>> responsibility
> >>>>>>> of
> >>>>>>>>> the
> >>>>>>>>>>>>> first
> >>>>>>>>>>>>>> group.
> >>>>>>>>>>>>>>>>>> So
> >>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
> >>>> group
> >>>>>>>> blocks
> >>>>>>>>>>>>>> something,
> >>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
> >>> paging
> >>>>>>>>> mechanism
> >>>>>>>>>>>>>> triggering
> >>>>>>>>>>>>>>>>>> etc)
> >>>>>>>>>>>>>>>>>>>> which impacts the first group.
> >>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
> >> go
> >>>>>>> through
> >>>>>>>> the
> >>>>>>>>>>>>> process
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> manually
> >>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
> >> it
> >>>>>>> should
> >>>>>>>>> be? I
> >>>>>>>>>>> do
> >>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
> >>> make
> >>>>>>> sense
> >>>>>>>> to
> >>>>>>>>>>>>>> introduce
> >>>>>>>>>>>>>>>>> such
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
> >>> really
> >>>>>>> small
> >>>>>>>>> user
> >>>>>>>>>>>>> group
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> Flink?
> >>>>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
> >>> such
> >>>> a
> >>>>>>>>> feature.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
> >>> I'm
> >>>>>>>> leaning
> >>>>>>>>>>>>> towards
> >>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>> exposing
> >>>>>>>>>>>>>>>>>>>> this to the user group but having this
> >>>> available
> >>>>>>>>> strictly
> >>>>>>>>>>> for
> >>>>>>>>>>>>>> cluster
> >>>>>>>>>>>>>>>>>>>> operators. They could then also set up
> >> their
> >>>>>>>>>>>>>> paging/metrics/logging
> >>>>>>>>>>>>>>>>>>> system
> >>>>>>>>>>>>>>>>>>>> to take this into account.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Martijn Visser
> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> >>>>>>>>>>> karmagyz@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
> >>> some
> >>>>>>> cents
> >>>>>>>>> here:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
> >>>>>>>>>> to
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> user?
> >>>>>>>>>>>>>>>>>>>>> I think the semantics of
> >>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
> >>> the
> >>>>>>>> timeout
> >>>>>>>>>>>>>> mechanism is
> >>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
> >> it
> >>>>> will
> >>>>>>> be
> >>>>>>>>> very
> >>>>>>>>>>>>>> confusing
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
> >> exception
> >>>> of a
> >>>>>>> task
> >>>>>>>>>>>>> manager to
> >>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
> >>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
> >> fail
> >>>> in
> >>>>>>> case
> >>>>>>>>> the
> >>>>>>>>>>>>> target
> >>>>>>>>>>>>>> task
> >>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
> >> I
> >>>>> don't
> >>>>>>>> mean
> >>>>>>>>> we
> >>>>>>>>>>>>> need
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> cover
> >>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
> >> also
> >>>>> open a
> >>>>>>>>>>>>>> `notifyException`
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
> >>>>>>> ResourceManager,
> >>>>>>>>> will
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> slot of
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
> >>> released
> >>>>> and
> >>>>>>>>>>> allocated?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>> Yangze Guo
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> >> Wang
> >>> <
> >>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
> >>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
> >> response
> >>>>> your 4
> >>>>>>>>>> remarks:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
> >>>>> controversy. I
> >>>>>>>>> think
> >>>>>>>>>>>>>> “BlockList”
> >>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
> >>>> good
> >>>>>>> idea.
> >>>>>>>>>> Based
> >>>>>>>>>>>>> on
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> above, I
> >>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> DELETE
> >> <host>/blocklist/node/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>> <host>/blocklist/taskmanager/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
> >>>> means
> >>>>>>> that
> >>>>>>>>> all
> >>>>>>>>>>>>> task
> >>>>>>>>>>>>>>>>> managers
> >>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
> >>>> these
> >>>>>>> TMs
> >>>>>>>> are
> >>>>>>>>>> not
> >>>>>>>>>>>>>>>>> available.
> >>>>>>>>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
> >> these
> >>>> TMs
> >>>>>>> are
> >>>>>>>>> not
> >>>>>>>>>>>>> really
> >>>>>>>>>>>>>> lost,
> >>>>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
> >> are
> >>>>> still
> >>>>>>>>>>> registered
> >>>>>>>>>>>>>> in this
> >>>>>>>>>>>>>>>>>>> flink
> >>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
> >> once
> >>>> the
> >>>>>>>>>>>>> corresponding
> >>>>>>>>>>>>>>>>>> blocklist
> >>>>>>>>>>>>>>>>>>>>> item
> >>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
> >>>>>>>>>>> active/non-active
> >>>>>>>>>>>>>>>>> clusters.
> >>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
> >> TMs
> >>>> may
> >>>>> be
> >>>>>>>>>> released
> >>>>>>>>>>>>> due
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> idle
> >>>>>>>>>>>>>>>>>>>>>> timeouts.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
> >> keep
> >>>> it.
> >>>>>>> The
> >>>>>>>>>> reasons
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>> following:
> >>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
> >>> adding
> >>>> or
> >>>>>>>>> removing
> >>>>>>>>>>>>> items
> >>>>>>>>>>>>>> via
> >>>>>>>>>>>>>>>>>> REST
> >>>>>>>>>>>>>>>>>>>>> API,
> >>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
> >> it
> >>>> to
> >>>>>>>>>>>>> Long.MAX_VALUE .
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
> >> a
> >>>>>>> period of
> >>>>>>>>>> time
> >>>>>>>>>>>>>> (such as
> >>>>>>>>>>>>>>>>>>> machine
> >>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
> >> prefer
> >>>> that
> >>>>>>>> Flink
> >>>>>>>>>> can
> >>>>>>>>>>> do
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
> >>> user
> >>>> to
> >>>>>>> do
> >>>>>>>> it
> >>>>>>>>>>>>> manually.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
> >>>>>>> 于2022年4月27日周三
> >>>>>>>>>>>>> 19:23写道：
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
> >> only
> >>>>>>> support
> >>>>>>>>>>> manually
> >>>>>>>>>>>>>> blocking
> >>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
> >>> different
> >>>>>>>>> strategies
> >>>>>>>>>>> can
> >>>>>>>>>>>>>> also be
> >>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> A few remarks:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
> >>>> "bla.cklist"
> >>>>>>> due
> >>>>>>>> to
> >>>>>>>>>> the
> >>>>>>>>>>>>>>>>> controversy
> >>>>>>>>>>>>>>>>>>>>> around
> >>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
> >>> Ticket
> >>>>>>> about
> >>>>>>>>> this
> >>>>>>>>>>>>> topic a
> >>>>>>>>>>>>>>>>> while
> >>>>>>>>>>>>>>>>>>>>> back
> >>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
> >>>> avoid
> >>>>>>> the
> >>>>>>>>> term
> >>>>>>>>>>>>>> blacklist &
> >>>>>>>>>>>>>>>>>>>>> whitelist
> >>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
> >> "denylist"
> >>> or
> >>>>>>>>>>> "quarantined"
> >>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
> >>> slightly
> >>>>>>>> different
> >>>>>>>>>>>>> design
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>> verbs
> >>>>>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
> >>> anti-pattern
> >>>>> for
> >>>>>>>> REST
> >>>>>>>>>>> APIs.
> >>>>>>>>>>>>>> POST
> >>>>>>>>>>>>>>>>> on a
> >>>>>>>>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
> >>>> items.
> >>>>>>>> DELETE
> >>>>>>>>> on
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> individual
> >>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
> >>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>> <host>/quarantine/items/<itemidentifier>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
> >>>>> taskmanagers
> >>>>>>>> and
> >>>>>>>>>>> nodes
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> REST
> >>>>>>>>>>>>>>>>>>>>> API
> >>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
> >>> opinion
> >>>>> on
> >>>>>>>> this?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
> >>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>> <host>/quarantine/nodes/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>> DELETE
> >>>>>>> <host>/quarantine/taskmanager/<identifier>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
> >>>>>>> non-active
> >>>>>>>>>>>>> resource
> >>>>>>>>>>>>>>>>>> managers,
> >>>>>>>>>>>>>>>>>>>>> i.e.
> >>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
> >>>>>>> minimal,
> >>>>>>>> do
> >>>>>>>>> we
> >>>>>>>>>>>>> need
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> timeout
> >>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
> >>>> manually
> >>>>>>> we
> >>>>>>>>> could
> >>>>>>>>>>>>>> delegate
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
> >>>>> behavior
> >>>>>>>>> would
> >>>>>>>>>>>>> better
> >>>>>>>>>>>>>> fit
> >>>>>>>>>>>>>>>>>> into
> >>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Konstantin
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >>>>>>>>>>>>>>>>>>>>>>> [2]
> >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> >>> schrieb
> >>>>>>> Lijie
> >>>>>>>>> Wang
> >>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
> >>>> cluster
> >>>>>>> node
> >>>>>>>>>>> issues
> >>>>>>>>>>>>>>>>>>>>> (insufficient
> >>>>>>>>>>>>>>>>>>>>>>> disk
> >>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
> >>>> abnormalities).
> >>>>>>>> Flink
> >>>>>>>>>> will
> >>>>>>>>>>>>>> take care
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
> >>> However,
> >>>>> due
> >>>>>>> to
> >>>>>>>>> data
> >>>>>>>>>>>>>> locality
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> limited
> >>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
> >> likely
> >>>> to
> >>>>> be
> >>>>>>>>>>> redeployed
> >>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
> >>>> task
> >>>>>>>>>>> abnormalities
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> affect
> >>>>>>>>>>>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>>>>>>>>> progress.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
> >> manually
> >>>>>>> identify
> >>>>>>>>> the
> >>>>>>>>>>>>>> problematic
> >>>>>>>>>>>>>>>>>>>>> node and
> >>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
> >>> But
> >>>>>>> this
> >>>>>>>>>>> approach
> >>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>>>>> following
> >>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
> >> heavy
> >>>>>>> process.
> >>>>>>>>>> Users
> >>>>>>>>>>>>> may
> >>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> contact
> >>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
> >>>>> operation
> >>>>>>> can
> >>>>>>>>>> even
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> dangerous
> >>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
> >>>>> events.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
> >> of
> >>>>>>> problems
> >>>>>>>>>>> manually
> >>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>> slow
> >>>>>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
> >>>> propose
> >>>>>>> to
> >>>>>>>>>>>>> introduce a
> >>>>>>>>>>>>>>>>>> blacklist
> >>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
> >>>>> problematic
> >>>>>>>>>>> resources.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
> >>> FLIP-224[1].
> >>>>>>> Looking
> >>>>>>>>>>> forward
> >>>>>>>>>>>>>> to your
> >>>>>>>>>>>>>>>>>>>>>>> feedback.
> >>>>>>>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Lijie
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best regards,
> >>>>>>>>> Roman Boyko
> >>>>>>>>> e.: ro.v.boyko@gmail.com
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> https://twitter.com/snntrable
> >>>> https://github.com/knaufk
> >>>>
> >>>
> >>
> >>
> >> --
> >> https://twitter.com/snntrable
> >> https://github.com/knaufk
> >>
>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Weihua Hu <hu...@gmail.com>.

Hi,
Thanks for creating this FLIP.
We have implemented an automatic blocklist detection mechanism internally, which is indeed very effective for handling node failures. 
Due to the large number of nodes, although SREs already support automatic offline failure nodes, the detection is not 100% accurate and there is some delay.
So the blocklist mechanism can make flink job recover from failure much faster.

Here are some of my thoughts:
1. In this FLIP, it needs users to locate machine failure manually, there is a certain cost of use
2. What happens if too many nodes are blocked, resulting in insufficient resources? Will there be a special Exception for the user?
3. Will we display the blocklist information in the WebUI? The blocklist is for cluster level, and if multiple users share a cluster, some users may be a little confused when resources are not enough, or when resources are applied for more.

Also, Looking forward to the next FLIP on auto-detection.

Best,
Weihua

> 2022年5月16日 下午11:22，Lijie Wang <wa...@gmail.com> 写道：
> 
> Hi Konstantin,
> 
> Maybe change it to the following:
> 
> 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> Merge is not allowed. If the {id} already exists, return error. Otherwise,
> create a new item.
> 
> 2. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
> Merge is allowed. If the {id} already exists, merge. Otherwise, create a
> new item.
> 
> WDYT?
> 
> Best,
> Lijie
> 
> Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：
> 
>> Hi Lijie,
>> 
>> hm, maybe the following is more appropriate in that case
>> 
>> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> 
>> Best,
>> 
>> Konstantin
>> 
>> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
>> wangdachui9501@gmail.com>:
>> 
>>> Hi Konstantin,
>>> thanks for your feedback.
>>> 
>>> From what I understand, PUT should be idempotent. However, we have a
>>> *timeout* field in the request. This means that initiating the same
>> request
>>> at two different times will lead to different resource status (timestamps
>>> of the items to be removed will be different).
>>> 
>>> Should we use PUT in this case? WDYT?
>>> 
>>> Best,
>>> Lijie
>>> 
>>> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>>> 
>>>> Hi Lijie,
>>>> 
>>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT on
>> the
>>>> resource?
>>>> 
>>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>>> 
>>>> Best,
>>>> 
>>>> Konstantin
>>>> 
>>>> 
>>>> 
>>>> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
>>>> wangdachui9501@gmail.com>:
>>>> 
>>>>> Hi everyone,
>>>>> 
>>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and made
>>> the
>>>>> following changes on REST API:
>>>>> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
>>> one.
>>>> If
>>>>> both are specified, will return error.
>>>>> 2.  If the specified item is already there, the *ADD* operation has
>> two
>>>>> behaviors:  *return error*(default value) or *merge/update*, and we
>>> add a
>>>>> flag to the request body to control it. You can find more details
>>> "Public
>>>>> Interface" section.
>>>>> 
>>>>> If there is no more feedback, we will start the vote thread next
>> week.
>>>>> 
>>>>> Best,
>>>>> Lijie
>>>>> 
>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>>>>> 
>>>>>> Hi Becket Qin,
>>>>>> 
>>>>>> Thanks for your suggestions.  I have moved the description of
>>>>>> configurations, metrics and REST API into "Public Interface"
>> section,
>>>> and
>>>>>> made a few updates according to your suggestion.  And in this FLIP,
>>>> there
>>>>>> no public java Interfaces or pluggables that users need to
>> implement
>>> by
>>>>>> themselves.
>>>>>> 
>>>>>> Answers for you questions:
>>>>>> 1. Yes, there 2 block actions: MARK_BLOCKED and.
>>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
>> items
>>>> can
>>>>>> only be added through the REST API, so these 2 action are mentioned
>>> in
>>>>> the
>>>>>> REST API part (The REST API part has beed moved to public interface
>>>> now).
>>>>>> 2. I agree with you. I have changed the "Cause" field to String,
>> and
>>>>> allow
>>>>>> users to specify it via REST API.
>>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned
>> above,
>>>> we
>>>>>> will introduce 2 fields : *timeout* and *endTimestamp* into the ADD
>>>> REST
>>>>>> API to specify when to remove the blocked item. These 2 fields are
>>>>>> optional, if neither is specified, it means that the blocked item
>> is
>>>>>> permanent and will not be removed. If both are specified, the
>> minimum
>>>> of
>>>>>> *currentTimestamp+tiemout *and* endTimestamp* will be used as the
>>> time
>>>> to
>>>>>> remove the blocked item. To keep the configurations more minimal,
>> we
>>>> have
>>>>>> removed the *cluster.resource-blocklist.item.timeout* configuration
>>>>>> option.
>>>>>> 4. Yes, the block item will be overridden if the specified item
>>> already
>>>>>> exists. The ADD operation is *ADD or UPDATE*.
>>>>>> 5. Yes. On JM/RM side, all the blocklist information is maintained
>> in
>>>>>> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
>>>>> abstracted
>>>>>> to other interfaces) will be propagated to different components.
>>>>>> 
>>>>>> Best,
>>>>>> Lijie
>>>>>> 
>>>>>> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>>>>>> 
>>>>>>> Hi Lijie,
>>>>>>> 
>>>>>>> Thanks for updating the FLIP. It looks like the public interface
>>>> section
>>>>>>> did not fully reflect all the user sensible behavior and API. Can
>>> you
>>>>> put
>>>>>>> everything that users may be aware of there? That would include
>> the
>>>> REST
>>>>>>> API, metrics, configurations, public java Interfaces or pluggables
>>>> that
>>>>>>> users may see or implement by themselves, as well as a brief
>> summary
>>>> of
>>>>>>> the
>>>>>>> behavior of the public API.
>>>>>>> 
>>>>>>> Besides that, I have a few questions:
>>>>>>> 
>>>>>>> 1. According to the conversation in the discussion thread, it
>> looks
>>>> like
>>>>>>> the BlockAction will have "MARK_BLOCKLISTED" and
>>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
>> can
>>>> you
>>>>>>> add
>>>>>>> that to the public interface as well?
>>>>>>> 
>>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>>> Throwable
>>>>> and
>>>>>>> is not reflected in the REST API. Should that be included in the
>>> query
>>>>>>> response? And should we change that field to be a String so users
>>> may
>>>>>>> specify the cause via the REST API when they block some nodes /
>> TMs?
>>>>>>> 
>>>>>>> 3. Would it be useful to allow users to have different timeouts
>> for
>>>>>>> different blocked items? So while there is a default timeout,
>> users
>>>> can
>>>>>>> also override it via the REST API when they block an entity.
>>>>>>> 
>>>>>>> 4. Regarding the ADD operation, if the specified item is already
>>>> there,
>>>>>>> will the block item be overridden? For example, if the user wants
>> to
>>>>>>> extend
>>>>>>> the timeout of a blocked item, can they just  issue an ADD command
>>>>> again?
>>>>>>> 
>>>>>>> 5. I am not quite familiar with the details of this, but is there
>> a
>>>>> source
>>>>>>> of truth for the blocked list? I think it might be good to have a
>>>> single
>>>>>>> source of truth for the blocked list and just propagate that list
>> to
>>>>>>> different components to take the action of actually blocking the
>>>>> resource.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Jiangjie (Becket) Qin
>>>>>>> 
>>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
>> wangdachui9501@gmail.com
>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi everyone,
>>>>>>>> 
>>>>>>>> Based on the discussion in the mailing list, I updated the FLIP
>>> doc,
>>>>> the
>>>>>>>> changes include:
>>>>>>>> 1. Changed the description of the motivation section to more
>>> clearly
>>>>>>>> describe the problem this FLIP is trying to solve.
>>>>>>>> 2. Only  *Manually* is supported.
>>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Lijie
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>>>>>>>> 
>>>>>>>>> Hi Lijie!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be
>>>>>>> helpfulhere.”
>>>>>>>>> Can you explain it in detail? I don't fully understand that.
>> In
>>>>>>>> myopinion,
>>>>>>>>> non-active and active are the same, and no special treatment
>>>>>>> isrequired.*
>>>>>>>>> 
>>>>>>>>> Sorry this was a misunderstanding from my side. I thought we
>>> were
>>>>>>> talking
>>>>>>>>> about the HA mode (but not about Active and Standalone
>>>>>>> ResourceManager).
>>>>>>>>> And the original question was - how to handle the blacklisted
>>>> nodes
>>>>>>> list
>>>>>>>> at
>>>>>>>>> the moment of leader change? Should we simply forget about
>> them
>>> or
>>>>>>> try to
>>>>>>>>> pre-save that list on the remote storage?
>>>>>>>>> 
>>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey.wy@gmail.com
>>> 
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>>>>>>>>>> 
>>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level,
>> it
>>> is
>>>>>>> indeed
>>>>>>>>>> some functionalities the external tools(e.g. kubectl taint)
>>>> could
>>>>>>> not
>>>>>>>>>> support.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Yang
>>>>>>>>>> 
>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
>>>>>>>>>> 
>>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>>>>>>>>>>> 
>>>>>>>>>>> @Jiangang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy
>> and
>>>>> mark a
>>>>>>>> node
>>>>>>>>>>> blocked?
>>>>>>>>>>> 
>>>>>>>>>>> In fact, we currently plan to not support auto-detection
>> in
>>>> this
>>>>>>>> FLIP.
>>>>>>>>>> The
>>>>>>>>>>> part about auto-detection may be continued in a separate
>>> FLIP
>>>> in
>>>>>>> the
>>>>>>>>>>> future. Some guys have the same concerns as you, and the
>>>>>>> correctness
>>>>>>>>> and
>>>>>>>>>>> necessity of auto-detection may require further discussion
>>> in
>>>>> the
>>>>>>>>> future.
>>>>>>>>>>> 
>>>>>>>>>>>> In session mode, multi jobs can fail on the same bad
>> node
>>>> and
>>>>>>> the
>>>>>>>>> node
>>>>>>>>>>> should be marked blocked.
>>>>>>>>>>> By design, the blocklist information will be shared among
>>> all
>>>>> jobs
>>>>>>>> in a
>>>>>>>>>>> cluster/session. The JM will sync blocklist information
>> with
>>>> RM.
>>>>>>>>>>> 
>>>>>>>>>>> @Martijn
>>>>>>>>>>> 
>>>>>>>>>>>> I agree with Yang Wang on this.
>>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the
>>>>>>> MARK_BLOCKLISTED(Just
>>>>>>>>>> limits
>>>>>>>>>>> the load of the node and does not  kill all the processes
>> on
>>>> it)
>>>>>>> is
>>>>>>>>> also
>>>>>>>>>>> important, and we think that external systems (*yarn
>> rmadmin
>>>> or
>>>>>>>> kubectl
>>>>>>>>>>> taint*) cannot support it. So we think it makes sense even
>>>> only
>>>>>>>>>> *manually*.
>>>>>>>>>>> 
>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>> indeed
>>>>>>> super
>>>>>>>>> hard
>>>>>>>>>>> to get right.
>>>>>>>>>>> Yes, as you see, Jiangang(and a few others) have the same
>>>>> concern.
>>>>>>>>>>> However, we currently plan to not support auto-detection
>> in
>>>> this
>>>>>>>> FLIP,
>>>>>>>>>> and
>>>>>>>>>>> only *manually*. In addition, I'd like to say that the
>> FLIP
>>>>>>> provides
>>>>>>>> a
>>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>>>>>>>>>>> the auto-detection may be done by external systems.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Lijie
>>>>>>>>>>> 
>>>>>>>>>>> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
>>> 19:04写道：
>>>>>>>>>>> 
>>>>>>>>>>>>> If we only support to block nodes manually, then I
>> could
>>>> not
>>>>>>> see
>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>>>> approach(via
>>>>>>>> *yarn
>>>>>>>>>>>> rmadmin or kubectl taint*).
>>>>>>>>>>>> 
>>>>>>>>>>>> I agree with Yang Wang on this.
>>>>>>>>>>>> 
>>>>>>>>>>>>> To me this sounds yet again like one of those magical
>>>>>>> mechanisms
>>>>>>>>>> that
>>>>>>>>>>>> will rarely work just right.
>>>>>>>>>>>> 
>>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are
>>> indeed
>>>>>>> super
>>>>>>>>> hard
>>>>>>>>>>> to
>>>>>>>>>>>> get right.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> Martijn
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>>>>>>>> liujiangangpeng@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can
>>>>> decrease
>>>>>>>>> great
>>>>>>>>>>> work
>>>>>>>>>>>>> for us. We have implemented the similar feature in our
>>>> inner
>>>>>>> flink
>>>>>>>>>>>>> version.
>>>>>>>>>>>>> Below is something that I care about:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>   1. For auto-detecting, I wonder how to make the
>>> strategy
>>>>> and
>>>>>>>>> mark a
>>>>>>>>>>>>> node
>>>>>>>>>>>>>   blocked? Sometimes the blocked node is hard to be
>>>>> detected,
>>>>>>> for
>>>>>>>>>>>>> example,
>>>>>>>>>>>>>   the upper node or the down node will be blocked when
>>>>> network
>>>>>>>>>>>>> unreachable.
>>>>>>>>>>>>>   2. I see that the strategy is made in JobMaster
>> side.
>>>> How
>>>>>>> about
>>>>>>>>>>>>>   implementing the similar logic in resource manager?
>> In
>>>>>>> session
>>>>>>>>>> mode,
>>>>>>>>>>>>> multi
>>>>>>>>>>>>>   jobs can fail on the same bad node and the node
>> should
>>>> be
>>>>>>>> marked
>>>>>>>>>>>>> blocked.
>>>>>>>>>>>>>   If the job makes the strategy, the node may be not
>>>> marked
>>>>>>>> blocked
>>>>>>>>>> if
>>>>>>>>>>>>> the
>>>>>>>>>>>>>   fail times don't exceed the threshold.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank you for all your feedback!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share
>> some
>>> of
>>>>> my
>>>>>>>>>> thoughts:
>>>>>>>>>>>>>> 1. Whether to enable automatical blocklist
>>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>>>>>>>>>>>>>> The automatical way should be something built upon
>> the
>>>>>>> blocklist
>>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a
>>>>>>>> configurable
>>>>>>>>>>>>>> blocklist strategy, but I think we can further
>> decouple
>>>> it
>>>>> by
>>>>>>>>>>>>>> introducing a abnormal node detector, as Becket
>>>> suggested,
>>>>>>> which
>>>>>>>>>> just
>>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are
>>> detected.
>>>>>>>> However,
>>>>>>>>>> it
>>>>>>>>>>>>>> should be a separate FLIP with further dev
>> discussions
>>>> and
>>>>>>>>> feedback
>>>>>>>>>>>>>> from users. I also agree with Becket that different
>>> users
>>>>>>> have
>>>>>>>>>>> different
>>>>>>>>>>>>>> requirements, and we should listen to them.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes
>>>> externally
>>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a
>> way
>>> to
>>>>>>> avoid
>>>>>>>>>>>>>> deploying tasks to temporary hot nodes. In this case,
>>>> users
>>>>>>> may
>>>>>>>>> just
>>>>>>>>>>>>>> want to limit the load of the node and do not want to
>>>> kill
>>>>>>> all
>>>>>>>> the
>>>>>>>>>>>>>> processes on it. Another case is the speculative
>>>>> execution[1]
>>>>>>>>> which
>>>>>>>>>>>>>> may also leverage this feature to avoid starting
>> mirror
>>>>>>> tasks on
>>>>>>>>>> slow
>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
>>>>> 15:56写道：
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for your feedback.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize
>> here
>>>>>>> because
>>>>>>>> it
>>>>>>>>>> can
>>>>>>>>>>>>>> affect the value and design of the blocklist
>> mechanism
>>>>>>> (perhaps
>>>>>>>> I
>>>>>>>>>>> should
>>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in
>>>> FLIP:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or
>>> node
>>>>> as
>>>>>>>>>> blocked.
>>>>>>>>>>>>>> Future slots should not be allocated from the blocked
>>>> task
>>>>>>>> manager
>>>>>>>>>> or
>>>>>>>>>>>>> node.
>>>>>>>>>>>>>> But slots that are already allocated will not be
>>>> affected.
>>>>> A
>>>>>>>>> typical
>>>>>>>>>>>>>> application scenario is to mitigate machine hotspots.
>>> In
>>>>> this
>>>>>>>>> case,
>>>>>>>>>> we
>>>>>>>>>>>>> hope
>>>>>>>>>>>>>> that subsequent resource allocations will not be on
>> the
>>>> hot
>>>>>>>>> machine,
>>>>>>>>>>> but
>>>>>>>>>>>>>> tasks currently running on it should not be affected.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
>> task
>>>>>>> manager
>>>>>>>> or
>>>>>>>>>>> node
>>>>>>>>>>>>> as
>>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated
>> tasks
>>>> will
>>>>>>> be
>>>>>>>>>>>>> restarted on
>>>>>>>>>>>>>> non-blocked task managers.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For the above 2 actions, the former may more
>>> highlight
>>>>> the
>>>>>>>>> meaning
>>>>>>>>>>> of
>>>>>>>>>>>>>> this FLIP, because the external system cannot do
>> that.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I
>> basically
>>>>> agree
>>>>>>>> with
>>>>>>>>>>>>> @Becket
>>>>>>>>>>>>>> Qin: different users have different answers. Not all
>>>> users’
>>>>>>>>>> deployment
>>>>>>>>>>>>>> environments have a special external system that can
>>>>> perform
>>>>>>> the
>>>>>>>>>>> anomaly
>>>>>>>>>>>>>> detection. In addition, adding pluggable/optional
>>>>>>> auto-detection
>>>>>>>>>>> doesn't
>>>>>>>>>>>>>> require much extra work on top of manual
>> specification.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I will answer your other questions one by one.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> @Yangze
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> a) I think you are right, we do not need to expose
>>> the
>>>>>>>>>>>>>> 
>>> `cluster.resource-blocklist.item.timeout-check-interval`
>>>> to
>>>>>>>> users.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a
>>> separate
>>>>>>>> interface
>>>>>>>>>>>>> (maybe
>>>>>>>>>>>>>> BlocklistExceptionListener), and the
>>>>>>>>> ResourceManagerBlocklistHandler
>>>>>>>>>>> can
>>>>>>>>>>>>>> implement it in the future.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> @Martijn
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> a) I also think the manual blocking should be done
>> by
>>>>>>> cluster
>>>>>>>>>>>>> operators.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> b) I think manual blocking makes sense, because
>>>> according
>>>>>>> to
>>>>>>>> my
>>>>>>>>>>>>>> experience, users are often the first to perceive the
>>>>> machine
>>>>>>>>>> problems
>>>>>>>>>>>>>> (because of job failover or delay), and they will
>>> contact
>>>>>>>> cluster
>>>>>>>>>>>>> operators
>>>>>>>>>>>>>> to solve it, or even tell the cluster operators which
>>>>>>> machine is
>>>>>>>>>>>>>> problematic. From this point of view, I think the
>>> people
>>>>> who
>>>>>>>>> really
>>>>>>>>>>> need
>>>>>>>>>>>>>> the manual blocking are the users, and it’s just
>>>> performed
>>>>> by
>>>>>>>> the
>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>> operator, so I think the manual blocking makes sense.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> @Chesnay
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because
>>> for
>>>>>>>>>>>>> MARK_BLOCKLISTED
>>>>>>>>>>>>>> , we need to know whether the slot is blocklisted
>> when
>>>> the
>>>>>>> task
>>>>>>>> is
>>>>>>>>>>>>>> FINISHED/CANCELLED/FAILED. If so,  SlotPool should
>>>> release
>>>>>>> the
>>>>>>>>> slot
>>>>>>>>>>>>>> directly to avoid assigning other tasks (of this job)
>>> on
>>>>> it.
>>>>>>> If
>>>>>>>> we
>>>>>>>>>>> only
>>>>>>>>>>>>>> maintain the blocklist information on the RM, JM
>> needs
>>> to
>>>>>>>> retrieve
>>>>>>>>>> it
>>>>>>>>>>> by
>>>>>>>>>>>>>> RPC. I think the performance overhead of that is
>>>> relatively
>>>>>>>> large,
>>>>>>>>>> so
>>>>>>>>>>> I
>>>>>>>>>>>>>> think it's worth maintaining the blocklist
>> information
>>> on
>>>>>>> the JM
>>>>>>>>>> side
>>>>>>>>>>>>> and
>>>>>>>>>>>>>> syncing them.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> @Роман
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>    a) “Probably storing inside Zookeeper/Configmap
>>>> might
>>>>>>> be
>>>>>>>>>> helpful
>>>>>>>>>>>>>> here.”  Can you explain it in detail? I don't fully
>>>>>>> understand
>>>>>>>>> that.
>>>>>>>>>>> In
>>>>>>>>>>>>> my
>>>>>>>>>>>>>> opinion, non-active and active are the same, and no
>>>> special
>>>>>>>>>> treatment
>>>>>>>>>>> is
>>>>>>>>>>>>>> required.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes
>> sense,
>>> I
>>>>> will
>>>>>>>> add
>>>>>>>>> it
>>>>>>>>>>> to
>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> @Yang
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As mentioned above, AFAK, the external system
>> cannot
>>>>>>> support
>>>>>>>> the
>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Looking forward to your further feedback.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yang Wang <da...@gmail.com> 于2022年5月3日周二
>>>> 21:09写道：
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>>>>>>> operations.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In the production environment, the SRE(aka Site
>>>>>>> Reliability
>>>>>>>>>>> Engineer)
>>>>>>>>>>>>>>>> already has many tools to detect the unstable
>> nodes,
>>>>> which
>>>>>>>>> could
>>>>>>>>>>> take
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> system logs/metrics into consideration.
>>>>>>>>>>>>>>>> Then they use graceful-decomission in YARN and
>> taint
>>>> in
>>>>>>> K8s
>>>>>>>> to
>>>>>>>>>>>>> prevent
>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>> allocations on these unstable nodes.
>>>>>>>>>>>>>>>> At last, they will evict all the containers and
>> pods
>>>>>>> running
>>>>>>>> on
>>>>>>>>>>> these
>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>> This mechanism also works for planned maintenance.
>>> So
>>>> I
>>>>> am
>>>>>>>>> afraid
>>>>>>>>>>>>> this
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> not the typical use case for FLIP-224.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If we only support to block nodes manually, then I
>>>> could
>>>>>>> not
>>>>>>>>> see
>>>>>>>>>>>>>>>> the obvious advantages compared with current SRE's
>>>>>>>> approach(via
>>>>>>>>>>> *yarn
>>>>>>>>>>>>>>>> rmadmin or kubectl taint*).
>>>>>>>>>>>>>>>> At least, we need to have a pluggable component
>>> which
>>>>>>> could
>>>>>>>>>> expose
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> potential unstable nodes automatically and block
>>> them
>>>> if
>>>>>>>>> enabled
>>>>>>>>>>>>>> explicitly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Becket Qin <be...@gmail.com> 于2022年5月2日周一
>>>> 16:36写道：
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This is an interesting feature and discussion,
>> and
>>>>>>> somewhat
>>>>>>>>>>> related
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>> design principle about how people should operate
>>>>> Flink.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I think there are three things involved in this
>>>> FLIP.
>>>>>>>>>>>>>>>>>     a) Detect and report the unstable node.
>>>>>>>>>>>>>>>>>     b) Collect the information of the unstable
>>> node
>>>>> and
>>>>>>>>> form a
>>>>>>>>>>>>>> blocklist.
>>>>>>>>>>>>>>>>>     c) Take the action to block nodes.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> My two cents:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink
>>> should
>>>>> have
>>>>>>>> c).
>>>>>>>>> It
>>>>>>>>>>> is
>>>>>>>>>>>>>> not only
>>>>>>>>>>>>>>>>> useful for cases of node failures, but also
>> handy
>>>> for
>>>>>>> some
>>>>>>>>>>> planned
>>>>>>>>>>>>>>>>> maintenance.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e.
>> who
>>>>>>> should be
>>>>>>>>> the
>>>>>>>>>>>>> brain
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> make the decision to block a node. I think this
>>>>> largely
>>>>>>>>> depends
>>>>>>>>>>> on
>>>>>>>>>>>>>> who we
>>>>>>>>>>>>>>>>> talk to. Different users would probably give
>>>> different
>>>>>>>>> answers.
>>>>>>>>>>> For
>>>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>> who do have a centralized node health management
>>>>>>> service,
>>>>>>>> let
>>>>>>>>>>> Flink
>>>>>>>>>>>>>> do just
>>>>>>>>>>>>>>>>> do a) and c) would be preferred. So essentially
>>>> Flink
>>>>>>> would
>>>>>>>>> be
>>>>>>>>>>> one
>>>>>>>>>>>>> of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report
>> it
>>> to
>>>>>>> that
>>>>>>>>>>> service,
>>>>>>>>>>>>>> and then
>>>>>>>>>>>>>>>>> take the command from that service to block the
>>>>>>> problematic
>>>>>>>>>>> nodes.
>>>>>>>>>>>>> On
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> other hand, for users who do not have such a
>>>> service,
>>>>>>>> simply
>>>>>>>>>>>>> letting
>>>>>>>>>>>>>> Flink
>>>>>>>>>>>>>>>>> be clever by itself to block the suspicious
>> nodes
>>>>> might
>>>>>>> be
>>>>>>>>>>> desired
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> ensure the jobs are running smoothly.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> So that indicates a) and b) here should be
>>>> pluggable /
>>>>>>>>>> optional.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to
>>> have
>>>>>>>> something
>>>>>>>>>>>>>> pluggable
>>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes
>> unstable
>>>>> nodes
>>>>>>>>>>> actively.
>>>>>>>>>>>>> (A
>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T>
>>> which
>>>>>>> can be
>>>>>>>>>> used
>>>>>>>>>>> to
>>>>>>>>>>>>>> report
>>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep
>>> the
>>>>>>> scope
>>>>>>>>>>> relevant
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is
>> OK
>>> to
>>>>>>> have a
>>>>>>>>>>> default
>>>>>>>>>>>>>>>>> implementation of a reporter which just tells
>>> Flink
>>>> to
>>>>>>> take
>>>>>>>>>>> action
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> block
>>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after
>>>>> timeout.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>>>>>>>>>> ro.v.boyko@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in
>>>>>>> development.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply -
>>> we
>>>>>>> should
>>>>>>>>>>> consider
>>>>>>>>>>>>>> to move
>>>>>>>>>>>>>>>>>> somehow the information of blocklisted
>> nodes/TMs
>>>>> from
>>>>>>>>> active
>>>>>>>>>>>>>>>>>> ResourceManager to non-active ones. Probably
>>>> storing
>>>>>>>> inside
>>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of
>>>> organizations
>>>>>>>> don't
>>>>>>>>>> want
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> expose
>>>>>>>>>>>>>>>>>> such API for a cluster user group. But I think
>>>> it's
>>>>>>>>> necessary
>>>>>>>>>>> to
>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>>>>> mechanism for unblocking the nodes/TMs anyway
>>> for
>>>>>>>> avoiding
>>>>>>>>>>>>> incorrect
>>>>>>>>>>>>>>>>>> automatic behaviour.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>>>> would
>>>>> be
>>>>>>>>> better
>>>>>>>>>>> to
>>>>>>>>>>>>>> extend
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>> *endTimestamp*
>>>>> field
>>>>>>> and
>>>>>>>>>> fill
>>>>>>>>>>> it
>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>>> to:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   -
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   Provide the ability to users to setup the
>>> exact
>>>>>>> time
>>>>>>>> of
>>>>>>>>>>>>>> blocklist end
>>>>>>>>>>>>>>>>>>   through RestAPI
>>>>>>>>>>>>>>>>>>   -
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   Not being tied to a single value of
>>>>>>>>>>>>>>>>>>   *cluster.resource-blacklist.item.timeout*
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>> <
>>>>>>>>>>>>> chesnay@apache.org>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>>>> lines
>>>>> a
>>>>>>>> bit.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>>>>> auto-detection
>>>>>>>> and
>>>>>>>>>> only
>>>>>>>>>>>>>> have an
>>>>>>>>>>>>>>>>>>> opt-in mechanism
>>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>> this
>>>>> sounds
>>>>>>>> yet
>>>>>>>>>>> again
>>>>>>>>>>>>>> like one
>>>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>> just
>>>>> right.
>>>>>>>>>>>>>>>>>>> An external system can leverage way more
>>>>> information
>>>>>>>>> after
>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>>>> complexity
>>>>>>> of
>>>>>>>>> this
>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>>> between
>>>>>>>>>> components;
>>>>>>>>>>>>>>>>> adjustments
>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>> slot and resource protocol.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>>> resource
>>>>>>>>> management
>>>>>>>>>>>>>> perspective,
>>>>>>>>>>>>>>>>>>> then there
>>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>> protocol
>>>> at
>>>>>>> all
>>>>>>>> (or
>>>>>>>>>> in
>>>>>>>>>>>>> fact
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>>> around
>>>>>>> blocked
>>>>>>>>> TMs
>>>>>>>>>>> in
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>> place.
>>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>>> down
>>>>>>> after
>>>>>>>> a
>>>>>>>>>>> while
>>>>>>>>>>>>> by
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> RM
>>>>>>>>>>>>>>>>>>> because of
>>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>>>> blocking
>>>>>>> the
>>>>>>>>>> process
>>>>>>>>>>>>> or
>>>>>>>>>>>>>> node)
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>>>>>>> process(es).
>>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>>>> back
>>>>>>> up.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>> the
>>>> same
>>>>>>> line
>>>>>>>>> of
>>>>>>>>>>>>>> thinking to
>>>>>>>>>>>>>>>>>>> standalone resource management.
>>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>> process/node
>>>>>>> manually
>>>>>>>>>>> should
>>>>>>>>>>>>> be
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> core
>>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>>>> understand
>>>>>>> the
>>>>>>>>>> problem
>>>>>>>>>>>>> and
>>>>>>>>>>>>>> I see
>>>>>>>>>>>>>>>>>>> value
>>>>>>>>>>>>>>>>>>>> in the automatic detection and
>>> blocklisting. I
>>>>> do
>>>>>>>> have
>>>>>>>>>> some
>>>>>>>>>>>>>> concerns
>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>>> blocked
>>>>>>>>> resources.
>>>>>>>>>> I
>>>>>>>>>>>>> have
>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>>>> concerns;
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>>>>> separation
>>>>>>> of
>>>>>>>>>>>>> concerns,
>>>>>>>>>>>>>>>>> meaning
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>>>> managing a
>>>>>>>>> cluster
>>>>>>>>>>> and
>>>>>>>>>>>>>> there's
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>>>>>>> introduction of
>>>>>>>>>> this
>>>>>>>>>>>>>> mechanism,
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> latter group now can influence the
>>>>> responsibility
>>>>>>> of
>>>>>>>>> the
>>>>>>>>>>>>> first
>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>>>> group
>>>>>>>> blocks
>>>>>>>>>>>>>> something,
>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>>> paging
>>>>>>>>> mechanism
>>>>>>>>>>>>>> triggering
>>>>>>>>>>>>>>>>>> etc)
>>>>>>>>>>>>>>>>>>>> which impacts the first group.
>>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>> go
>>>>>>> through
>>>>>>>> the
>>>>>>>>>>>>> process
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> manually
>>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>> it
>>>>>>> should
>>>>>>>>> be? I
>>>>>>>>>>> do
>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>>> make
>>>>>>> sense
>>>>>>>> to
>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>>> really
>>>>>>> small
>>>>>>>>> user
>>>>>>>>>>>>> group
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> Flink?
>>>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>>> such
>>>> a
>>>>>>>>> feature.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>>> I'm
>>>>>>>> leaning
>>>>>>>>>>>>> towards
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> exposing
>>>>>>>>>>>>>>>>>>>> this to the user group but having this
>>>> available
>>>>>>>>> strictly
>>>>>>>>>>> for
>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>> their
>>>>>>>>>>>>>> paging/metrics/logging
>>>>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>>>>>> to take this into account.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Martijn Visser
>>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>>>>>>>>>>> karmagyz@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>>> some
>>>>>>> cents
>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>> cluster.resource-blacklist.item.timeout-check-interval
>>>>>>>>>> to
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> user?
>>>>>>>>>>>>>>>>>>>>> I think the semantics of
>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout`
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> sufficient for the user. How to guarantee
>>> the
>>>>>>>> timeout
>>>>>>>>>>>>>> mechanism is
>>>>>>>>>>>>>>>>>>>>> Flink's internal implementation. I think
>> it
>>>>> will
>>>>>>> be
>>>>>>>>> very
>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> we do not need to expose it to users.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>> exception
>>>> of a
>>>>>>> task
>>>>>>>>>>>>> manager to
>>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>> fail
>>>> in
>>>>>>> case
>>>>>>>>> the
>>>>>>>>>>>>> target
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>> I
>>>>> don't
>>>>>>>> mean
>>>>>>>>> we
>>>>>>>>>>>>> need
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> cover
>>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>> also
>>>>> open a
>>>>>>>>>>>>>> `notifyException`
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to
>>>>>>> ResourceManager,
>>>>>>>>> will
>>>>>>>>>>> the
>>>>>>>>>>>>>> slot of
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> blocked task manager continues to be
>>> released
>>>>> and
>>>>>>>>>>> allocated?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Yangze Guo
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>> Wang
>>> <
>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will
>> response
>>>>> your 4
>>>>>>>>>> remarks:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>>>>> controversy. I
>>>>>>>>> think
>>>>>>>>>>>>>> “BlockList”
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a
>>>> good
>>>>>>> idea.
>>>>>>>>>> Based
>>>>>>>>>>>>> on
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> above, I
>>>>>>>>>>>>>>>>>>>>>> would change REST API as following:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> DELETE
>> <host>/blocklist/node/<identifier>
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>> <host>/blocklist/taskmanager/<identifier>
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocking/blocklisted, it
>>>> means
>>>>>>> that
>>>>>>>>> all
>>>>>>>>>>>>> task
>>>>>>>>>>>>>>>>> managers
>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on
>>>> these
>>>>>>> TMs
>>>>>>>> are
>>>>>>>>>> not
>>>>>>>>>>>>>>>>> available.
>>>>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>>>>> is actually a bit like TM losts, but
>> these
>>>> TMs
>>>>>>> are
>>>>>>>>> not
>>>>>>>>>>>>> really
>>>>>>>>>>>>>> lost,
>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>>> are in an unavailable status, and they
>> are
>>>>> still
>>>>>>>>>>> registered
>>>>>>>>>>>>>> in this
>>>>>>>>>>>>>>>>>>> flink
>>>>>>>>>>>>>>>>>>>>>> cluster. They will be available again
>> once
>>>> the
>>>>>>>>>>>>> corresponding
>>>>>>>>>>>>>>>>>> blocklist
>>>>>>>>>>>>>>>>>>>>> item
>>>>>>>>>>>>>>>>>>>>>> is removed. This behavior is the same in
>>>>>>>>>>> active/non-active
>>>>>>>>>>>>>>>>> clusters.
>>>>>>>>>>>>>>>>>>>>>> However in the active clusters, these
>> TMs
>>>> may
>>>>> be
>>>>>>>>>> released
>>>>>>>>>>>>> due
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> idle
>>>>>>>>>>>>>>>>>>>>>> timeouts.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to
>> keep
>>>> it.
>>>>>>> The
>>>>>>>>>> reasons
>>>>>>>>>>>>> are
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>>> adding
>>>> or
>>>>>>>>> removing
>>>>>>>>>>>>> items
>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>> API,
>>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>> it
>>>> to
>>>>>>>>>>>>> Long.MAX_VALUE .
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>> a
>>>>>>> period of
>>>>>>>>>> time
>>>>>>>>>>>>>> (such as
>>>>>>>>>>>>>>>>>>> machine
>>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>> prefer
>>>> that
>>>>>>>> Flink
>>>>>>>>>> can
>>>>>>>>>>> do
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>>> user
>>>> to
>>>>>>> do
>>>>>>>> it
>>>>>>>>>>>>> manually.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Konstantin Knauf <kn...@apache.org>
>>>>>>> 于2022年4月27日周三
>>>>>>>>>>>>> 19:23写道：
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>> only
>>>>>>> support
>>>>>>>>>>> manually
>>>>>>>>>>>>>> blocking
>>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>>> different
>>>>>>>>> strategies
>>>>>>>>>>> can
>>>>>>>>>>>>>> also be
>>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than
>>>> "bla.cklist"
>>>>>>> due
>>>>>>>> to
>>>>>>>>>> the
>>>>>>>>>>>>>>>>> controversy
>>>>>>>>>>>>>>>>>>>>> around
>>>>>>>>>>>>>>>>>>>>>>> the term? [1] There was also a Jira
>>> Ticket
>>>>>>> about
>>>>>>>>> this
>>>>>>>>>>>>> topic a
>>>>>>>>>>>>>>>>> while
>>>>>>>>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>>>>>>>> and there was generally a consensus to
>>>> avoid
>>>>>>> the
>>>>>>>>> term
>>>>>>>>>>>>>> blacklist &
>>>>>>>>>>>>>>>>>>>>> whitelist
>>>>>>>>>>>>>>>>>>>>>>> [2]? We could use "blocklist"
>> "denylist"
>>> or
>>>>>>>>>>> "quarantined"
>>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a
>>> slightly
>>>>>>>> different
>>>>>>>>>>>>> design
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> verbs
>>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>>>>> add/remove often considered an
>>> anti-pattern
>>>>> for
>>>>>>>> REST
>>>>>>>>>>> APIs.
>>>>>>>>>>>>>> POST
>>>>>>>>>>>>>>>>> on a
>>>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>>>>> item is generally the standard to add
>>>> items.
>>>>>>>> DELETE
>>>>>>>>> on
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>>>>>>>>>> resource is standard to remove an item.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>> <host>/quarantine/items/<itemidentifier>
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>>>>> taskmanagers
>>>>>>>> and
>>>>>>>>>>> nodes
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> REST
>>>>>>>>>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>>> opinion
>>>>> on
>>>>>>>> this?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>> <host>/quarantine/nodes/<identifier>
>>>>>>>>>>>>>>>>>>>>>>> DELETE
>>>>>>> <host>/quarantine/taskmanager/<identifier>
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>>>>>>> non-active
>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>>>> managers,
>>>>>>>>>>>>>>>>>>>>> i.e.
>>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>>>>>>> minimal,
>>>>>>>> do
>>>>>>>>> we
>>>>>>>>>>>>> need
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> timeout
>>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>>>> manually
>>>>>>> we
>>>>>>>>> could
>>>>>>>>>>>>>> delegate
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>>>>> behavior
>>>>>>>>> would
>>>>>>>>>>>>> better
>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Konstantin
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-18209
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
>>> schrieb
>>>>>>> Lijie
>>>>>>>>> Wang
>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> wangdachui9501@gmail.com>:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>>>> cluster
>>>>>>> node
>>>>>>>>>>> issues
>>>>>>>>>>>>>>>>>>>>> (insufficient
>>>>>>>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>>>> abnormalities).
>>>>>>>> Flink
>>>>>>>>>> will
>>>>>>>>>>>>>> take care
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>>> However,
>>>>> due
>>>>>>> to
>>>>>>>>> data
>>>>>>>>>>>>>> locality
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> limited
>>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>> likely
>>>> to
>>>>> be
>>>>>>>>>>> redeployed
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>>>> task
>>>>>>>>>>> abnormalities
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> affect
>>>>>>>>>>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>>>>>>>>> progress.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to
>> manually
>>>>>>> identify
>>>>>>>>> the
>>>>>>>>>>>>>> problematic
>>>>>>>>>>>>>>>>>>>>> node and
>>>>>>>>>>>>>>>>>>>>>>>> take it offline to solve this problem.
>>> But
>>>>>>> this
>>>>>>>>>>> approach
>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>>>>>>>> disadvantages:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a
>> heavy
>>>>>>> process.
>>>>>>>>>> Users
>>>>>>>>>>>>> may
>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> contact
>>>>>>>>>>>>>>>>>>>>>>>> cluster administors to do this. The
>>>>> operation
>>>>>>> can
>>>>>>>>>> even
>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> dangerous
>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>>>>>> allowed during some important business
>>>>> events.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind
>> of
>>>>>>> problems
>>>>>>>>>>> manually
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> slow
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> a waste of human resources.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>>>> propose
>>>>>>> to
>>>>>>>>>>>>> introduce a
>>>>>>>>>>>>>>>>>> blacklist
>>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>>>>> problematic
>>>>>>>>>>> resources.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> You can find more details in
>>> FLIP-224[1].
>>>>>>> Looking
>>>>>>>>>>> forward
>>>>>>>>>>>>>> to your
>>>>>>>>>>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Roman Boyko
>>>>>>>>> e.: ro.v.boyko@gmail.com
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> https://twitter.com/snntrable
>>>> https://github.com/knaufk
>>>> 
>>> 
>> 
>> 
>> --
>> https://twitter.com/snntrable
>> https://github.com/knaufk
>>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi Konstantin,

Maybe change it to the following:

1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
Merge is not allowed. If the {id} already exists, return error. Otherwise,
create a new item.

2. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
Merge is allowed. If the {id} already exists, merge. Otherwise, create a
new item.

WDYT?

Best,
Lijie

Konstantin Knauf <kn...@apache.org> 于2022年5月16日周一 20:07写道：

> Hi Lijie,
>
> hm, maybe the following is more appropriate in that case
>
> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>
> Best,
>
> Konstantin
>
> Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
> wangdachui9501@gmail.com>:
>
> > Hi Konstantin,
> > thanks for your feedback.
> >
> > From what I understand, PUT should be idempotent. However, we have a
> > *timeout* field in the request. This means that initiating the same
> request
> > at two different times will lead to different resource status (timestamps
> > of the items to be removed will be different).
> >
> > Should we use PUT in this case? WDYT?
> >
> > Best,
> > Lijie
> >
> > Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
> >
> > > Hi Lijie,
> > >
> > > wouldn't the REST API-idiomatic way for an update/replace be a PUT on
> the
> > > resource?
> > >
> > > PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> > >
> > > Best,
> > >
> > > Konstantin
> > >
> > >
> > >
> > > Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> > > wangdachui9501@gmail.com>:
> > >
> > > > Hi everyone,
> > > >
> > > > I've had an offline discussion with Becket Qin and Zhu Zhu, and made
> > the
> > > > following changes on REST API:
> > > > 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
> > one.
> > > If
> > > > both are specified, will return error.
> > > > 2.  If the specified item is already there, the *ADD* operation has
> two
> > > > behaviors:  *return error*(default value) or *merge/update*, and we
> > add a
> > > > flag to the request body to control it. You can find more details
> > "Public
> > > > Interface" section.
> > > >
> > > > If there is no more feedback, we will start the vote thread next
> week.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> > > >
> > > > > Hi Becket Qin,
> > > > >
> > > > > Thanks for your suggestions.  I have moved the description of
> > > > > configurations, metrics and REST API into "Public Interface"
> section,
> > > and
> > > > > made a few updates according to your suggestion.  And in this FLIP,
> > > there
> > > > > no public java Interfaces or pluggables that users need to
> implement
> > by
> > > > > themselves.
> > > > >
> > > > > Answers for you questions:
> > > > > 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > > > > MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block
> items
> > > can
> > > > > only be added through the REST API, so these 2 action are mentioned
> > in
> > > > the
> > > > > REST API part (The REST API part has beed moved to public interface
> > > now).
> > > > > 2. I agree with you. I have changed the "Cause" field to String,
> and
> > > > allow
> > > > > users to specify it via REST API.
> > > > > 3. Yes, it is useful to allow different timeouts. As mentioned
> above,
> > > we
> > > > > will introduce 2 fields : *timeout* and *endTimestamp* into the ADD
> > > REST
> > > > > API to specify when to remove the blocked item. These 2 fields are
> > > > > optional, if neither is specified, it means that the blocked item
> is
> > > > > permanent and will not be removed. If both are specified, the
> minimum
> > > of
> > > > > *currentTimestamp+tiemout *and* endTimestamp* will be used as the
> > time
> > > to
> > > > > remove the blocked item. To keep the configurations more minimal,
> we
> > > have
> > > > > removed the *cluster.resource-blocklist.item.timeout* configuration
> > > > > option.
> > > > > 4. Yes, the block item will be overridden if the specified item
> > already
> > > > > exists. The ADD operation is *ADD or UPDATE*.
> > > > > 5. Yes. On JM/RM side, all the blocklist information is maintained
> in
> > > > > JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> > > > abstracted
> > > > > to other interfaces) will be propagated to different components.
> > > > >
> > > > > Best,
> > > > > Lijie
> > > > >
> > > > > Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> > > > >
> > > > >> Hi Lijie,
> > > > >>
> > > > >> Thanks for updating the FLIP. It looks like the public interface
> > > section
> > > > >> did not fully reflect all the user sensible behavior and API. Can
> > you
> > > > put
> > > > >> everything that users may be aware of there? That would include
> the
> > > REST
> > > > >> API, metrics, configurations, public java Interfaces or pluggables
> > > that
> > > > >> users may see or implement by themselves, as well as a brief
> summary
> > > of
> > > > >> the
> > > > >> behavior of the public API.
> > > > >>
> > > > >> Besides that, I have a few questions:
> > > > >>
> > > > >> 1. According to the conversation in the discussion thread, it
> looks
> > > like
> > > > >> the BlockAction will have "MARK_BLOCKLISTED" and
> > > > >> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so,
> can
> > > you
> > > > >> add
> > > > >> that to the public interface as well?
> > > > >>
> > > > >> 2. At this point, the "Cause" field in the BlockingItem is a
> > Throwable
> > > > and
> > > > >> is not reflected in the REST API. Should that be included in the
> > query
> > > > >> response? And should we change that field to be a String so users
> > may
> > > > >> specify the cause via the REST API when they block some nodes /
> TMs?
> > > > >>
> > > > >> 3. Would it be useful to allow users to have different timeouts
> for
> > > > >> different blocked items? So while there is a default timeout,
> users
> > > can
> > > > >> also override it via the REST API when they block an entity.
> > > > >>
> > > > >> 4. Regarding the ADD operation, if the specified item is already
> > > there,
> > > > >> will the block item be overridden? For example, if the user wants
> to
> > > > >> extend
> > > > >> the timeout of a blocked item, can they just  issue an ADD command
> > > > again?
> > > > >>
> > > > >> 5. I am not quite familiar with the details of this, but is there
> a
> > > > source
> > > > >> of truth for the blocked list? I think it might be good to have a
> > > single
> > > > >> source of truth for the blocked list and just propagate that list
> to
> > > > >> different components to take the action of actually blocking the
> > > > resource.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Jiangjie (Becket) Qin
> > > > >>
> > > > >> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <
> wangdachui9501@gmail.com
> > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hi everyone,
> > > > >> >
> > > > >> > Based on the discussion in the mailing list, I updated the FLIP
> > doc,
> > > > the
> > > > >> > changes include:
> > > > >> > 1. Changed the description of the motivation section to more
> > clearly
> > > > >> > describe the problem this FLIP is trying to solve.
> > > > >> > 2. Only  *Manually* is supported.
> > > > >> > 3. Adopted some suggestions, such as *endTimestamp*.
> > > > >> >
> > > > >> > Best,
> > > > >> > Lijie
> > > > >> >
> > > > >> >
> > > > >> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> > > > >> >
> > > > >> > > Hi Lijie!
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > *a) “Probably storing inside Zookeeper/Configmap might be
> > > > >> helpfulhere.”
> > > > >> > > Can you explain it in detail? I don't fully understand that.
> In
> > > > >> > myopinion,
> > > > >> > > non-active and active are the same, and no special treatment
> > > > >> isrequired.*
> > > > >> > >
> > > > >> > > Sorry this was a misunderstanding from my side. I thought we
> > were
> > > > >> talking
> > > > >> > > about the HA mode (but not about Active and Standalone
> > > > >> ResourceManager).
> > > > >> > > And the original question was - how to handle the blacklisted
> > > nodes
> > > > >> list
> > > > >> > at
> > > > >> > > the moment of leader change? Should we simply forget about
> them
> > or
> > > > >> try to
> > > > >> > > pre-save that list on the remote storage?
> > > > >> > >
> > > > >> > > On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey.wy@gmail.com
> >
> > > > wrote:
> > > > >> > >
> > > > >> > > > Thanks Lijie and ZhuZhu for the explanation.
> > > > >> > > >
> > > > >> > > > I just overlooked the "MARK_BLOCKLISTED". For tasks level,
> it
> > is
> > > > >> indeed
> > > > >> > > > some functionalities the external tools(e.g. kubectl taint)
> > > could
> > > > >> not
> > > > >> > > > support.
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Yang
> > > > >> > > >
> > > > >> > > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> > > > >> > > >
> > > > >> > > > > Thanks for your feedback, Jiangang and Martijn.
> > > > >> > > > >
> > > > >> > > > > @Jiangang
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > > For auto-detecting, I wonder how to make the strategy
> and
> > > > mark a
> > > > >> > node
> > > > >> > > > > blocked?
> > > > >> > > > >
> > > > >> > > > > In fact, we currently plan to not support auto-detection
> in
> > > this
> > > > >> > FLIP.
> > > > >> > > > The
> > > > >> > > > > part about auto-detection may be continued in a separate
> > FLIP
> > > in
> > > > >> the
> > > > >> > > > > future. Some guys have the same concerns as you, and the
> > > > >> correctness
> > > > >> > > and
> > > > >> > > > > necessity of auto-detection may require further discussion
> > in
> > > > the
> > > > >> > > future.
> > > > >> > > > >
> > > > >> > > > > > In session mode, multi jobs can fail on the same bad
> node
> > > and
> > > > >> the
> > > > >> > > node
> > > > >> > > > > should be marked blocked.
> > > > >> > > > > By design, the blocklist information will be shared among
> > all
> > > > jobs
> > > > >> > in a
> > > > >> > > > > cluster/session. The JM will sync blocklist information
> with
> > > RM.
> > > > >> > > > >
> > > > >> > > > > @Martijn
> > > > >> > > > >
> > > > >> > > > > > I agree with Yang Wang on this.
> > > > >> > > > > As Zhu Zhu and I mentioned above, we think the
> > > > >> MARK_BLOCKLISTED(Just
> > > > >> > > > limits
> > > > >> > > > > the load of the node and does not  kill all the processes
> on
> > > it)
> > > > >> is
> > > > >> > > also
> > > > >> > > > > important, and we think that external systems (*yarn
> rmadmin
> > > or
> > > > >> > kubectl
> > > > >> > > > > taint*) cannot support it. So we think it makes sense even
> > > only
> > > > >> > > > *manually*.
> > > > >> > > > >
> > > > >> > > > > > I also agree with Chesnay that magical mechanisms are
> > indeed
> > > > >> super
> > > > >> > > hard
> > > > >> > > > > to get right.
> > > > >> > > > > Yes, as you see, Jiangang(and a few others) have the same
> > > > concern.
> > > > >> > > > > However, we currently plan to not support auto-detection
> in
> > > this
> > > > >> > FLIP,
> > > > >> > > > and
> > > > >> > > > > only *manually*. In addition, I'd like to say that the
> FLIP
> > > > >> provides
> > > > >> > a
> > > > >> > > > > mechanism to support MARK_BLOCKLISTED and
> > > > >> > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > > > >> > > > > the auto-detection may be done by external systems.
> > > > >> > > > >
> > > > >> > > > > Best,
> > > > >> > > > > Lijie
> > > > >> > > > >
> > > > >> > > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> > 19:04写道：
> > > > >> > > > >
> > > > >> > > > > > > If we only support to block nodes manually, then I
> could
> > > not
> > > > >> see
> > > > >> > > > > > the obvious advantages compared with current SRE's
> > > > approach(via
> > > > >> > *yarn
> > > > >> > > > > > rmadmin or kubectl taint*).
> > > > >> > > > > >
> > > > >> > > > > > I agree with Yang Wang on this.
> > > > >> > > > > >
> > > > >> > > > > > >  To me this sounds yet again like one of those magical
> > > > >> mechanisms
> > > > >> > > > that
> > > > >> > > > > > will rarely work just right.
> > > > >> > > > > >
> > > > >> > > > > > I also agree with Chesnay that magical mechanisms are
> > indeed
> > > > >> super
> > > > >> > > hard
> > > > >> > > > > to
> > > > >> > > > > > get right.
> > > > >> > > > > >
> > > > >> > > > > > Best regards,
> > > > >> > > > > >
> > > > >> > > > > > Martijn
> > > > >> > > > > >
> > > > >> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > > > >> > liujiangangpeng@gmail.com
> > > > >> > > >
> > > > >> > > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > >> Thanks for the valuable design. The auto-detecting can
> > > > decrease
> > > > >> > > great
> > > > >> > > > > work
> > > > >> > > > > >> for us. We have implemented the similar feature in our
> > > inner
> > > > >> flink
> > > > >> > > > > >> version.
> > > > >> > > > > >> Below is something that I care about:
> > > > >> > > > > >>
> > > > >> > > > > >>    1. For auto-detecting, I wonder how to make the
> > strategy
> > > > and
> > > > >> > > mark a
> > > > >> > > > > >> node
> > > > >> > > > > >>    blocked? Sometimes the blocked node is hard to be
> > > > detected,
> > > > >> for
> > > > >> > > > > >> example,
> > > > >> > > > > >>    the upper node or the down node will be blocked when
> > > > network
> > > > >> > > > > >> unreachable.
> > > > >> > > > > >>    2. I see that the strategy is made in JobMaster
> side.
> > > How
> > > > >> about
> > > > >> > > > > >>    implementing the similar logic in resource manager?
> In
> > > > >> session
> > > > >> > > > mode,
> > > > >> > > > > >> multi
> > > > >> > > > > >>    jobs can fail on the same bad node and the node
> should
> > > be
> > > > >> > marked
> > > > >> > > > > >> blocked.
> > > > >> > > > > >>    If the job makes the strategy, the node may be not
> > > marked
> > > > >> > blocked
> > > > >> > > > if
> > > > >> > > > > >> the
> > > > >> > > > > >>    fail times don't exceed the threshold.
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > > > >> > > > > >>
> > > > >> > > > > >> > Thank you for all your feedback!
> > > > >> > > > > >> >
> > > > >> > > > > >> > Besides the answers from Lijie, I'd like to share
> some
> > of
> > > > my
> > > > >> > > > thoughts:
> > > > >> > > > > >> > 1. Whether to enable automatical blocklist
> > > > >> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
> > > > >> > > > > >> > The automatical way should be something built upon
> the
> > > > >> blocklist
> > > > >> > > > > >> > mechanism and well decoupled. It was designed to be a
> > > > >> > configurable
> > > > >> > > > > >> > blocklist strategy, but I think we can further
> decouple
> > > it
> > > > by
> > > > >> > > > > >> > introducing a abnormal node detector, as Becket
> > > suggested,
> > > > >> which
> > > > >> > > > just
> > > > >> > > > > >> > uses the blocklist mechanism once bad nodes are
> > detected.
> > > > >> > However,
> > > > >> > > > it
> > > > >> > > > > >> > should be a separate FLIP with further dev
> discussions
> > > and
> > > > >> > > feedback
> > > > >> > > > > >> > from users. I also agree with Becket that different
> > users
> > > > >> have
> > > > >> > > > > different
> > > > >> > > > > >> > requirements, and we should listen to them.
> > > > >> > > > > >> >
> > > > >> > > > > >> > 2. Is it enough to just take away abnormal nodes
> > > externally
> > > > >> > > > > >> > My answer is no. As Lijie has mentioned, we need a
> way
> > to
> > > > >> avoid
> > > > >> > > > > >> > deploying tasks to temporary hot nodes. In this case,
> > > users
> > > > >> may
> > > > >> > > just
> > > > >> > > > > >> > want to limit the load of the node and do not want to
> > > kill
> > > > >> all
> > > > >> > the
> > > > >> > > > > >> > processes on it. Another case is the speculative
> > > > execution[1]
> > > > >> > > which
> > > > >> > > > > >> > may also leverage this feature to avoid starting
> mirror
> > > > >> tasks on
> > > > >> > > > slow
> > > > >> > > > > >> > nodes.
> > > > >> > > > > >> >
> > > > >> > > > > >> > Thanks,
> > > > >> > > > > >> > Zhu
> > > > >> > > > > >> >
> > > > >> > > > > >> > [1]
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > > >> > > > > >> >
> > > > >> > > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> > > > 15:56写道：
> > > > >> > > > > >> >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Hi everyone,
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Thanks for your feedback.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > There's one detail that I'd like to re-emphasize
> here
> > > > >> because
> > > > >> > it
> > > > >> > > > can
> > > > >> > > > > >> > affect the value and design of the blocklist
> mechanism
> > > > >> (perhaps
> > > > >> > I
> > > > >> > > > > should
> > > > >> > > > > >> > highlight it in the FLIP). We propose two actions in
> > > FLIP:
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or
> > node
> > > > as
> > > > >> > > > blocked.
> > > > >> > > > > >> > Future slots should not be allocated from the blocked
> > > task
> > > > >> > manager
> > > > >> > > > or
> > > > >> > > > > >> node.
> > > > >> > > > > >> > But slots that are already allocated will not be
> > > affected.
> > > > A
> > > > >> > > typical
> > > > >> > > > > >> > application scenario is to mitigate machine hotspots.
> > In
> > > > this
> > > > >> > > case,
> > > > >> > > > we
> > > > >> > > > > >> hope
> > > > >> > > > > >> > that subsequent resource allocations will not be on
> the
> > > hot
> > > > >> > > machine,
> > > > >> > > > > but
> > > > >> > > > > >> > tasks currently running on it should not be affected.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the
> task
> > > > >> manager
> > > > >> > or
> > > > >> > > > > node
> > > > >> > > > > >> as
> > > > >> > > > > >> > blocked, and evacuate all tasks on it. Evacuated
> tasks
> > > will
> > > > >> be
> > > > >> > > > > >> restarted on
> > > > >> > > > > >> > non-blocked task managers.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > For the above 2 actions, the former may more
> > highlight
> > > > the
> > > > >> > > meaning
> > > > >> > > > > of
> > > > >> > > > > >> > this FLIP, because the external system cannot do
> that.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Regarding *Manually* and *Automatically*, I
> basically
> > > > agree
> > > > >> > with
> > > > >> > > > > >> @Becket
> > > > >> > > > > >> > Qin: different users have different answers. Not all
> > > users’
> > > > >> > > > deployment
> > > > >> > > > > >> > environments have a special external system that can
> > > > perform
> > > > >> the
> > > > >> > > > > anomaly
> > > > >> > > > > >> > detection. In addition, adding pluggable/optional
> > > > >> auto-detection
> > > > >> > > > > doesn't
> > > > >> > > > > >> > require much extra work on top of manual
> specification.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > I will answer your other questions one by one.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > @Yangze
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > a) I think you are right, we do not need to expose
> > the
> > > > >> > > > > >> >
> > `cluster.resource-blocklist.item.timeout-check-interval`
> > > to
> > > > >> > users.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > b) We can abstract the `notifyException` to a
> > separate
> > > > >> > interface
> > > > >> > > > > >> (maybe
> > > > >> > > > > >> > BlocklistExceptionListener), and the
> > > > >> > > ResourceManagerBlocklistHandler
> > > > >> > > > > can
> > > > >> > > > > >> > implement it in the future.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > @Martijn
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > a) I also think the manual blocking should be done
> by
> > > > >> cluster
> > > > >> > > > > >> operators.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > b) I think manual blocking makes sense, because
> > > according
> > > > >> to
> > > > >> > my
> > > > >> > > > > >> > experience, users are often the first to perceive the
> > > > machine
> > > > >> > > > problems
> > > > >> > > > > >> > (because of job failover or delay), and they will
> > contact
> > > > >> > cluster
> > > > >> > > > > >> operators
> > > > >> > > > > >> > to solve it, or even tell the cluster operators which
> > > > >> machine is
> > > > >> > > > > >> > problematic. From this point of view, I think the
> > people
> > > > who
> > > > >> > > really
> > > > >> > > > > need
> > > > >> > > > > >> > the manual blocking are the users, and it’s just
> > > performed
> > > > by
> > > > >> > the
> > > > >> > > > > >> cluster
> > > > >> > > > > >> > operator, so I think the manual blocking makes sense.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > @Chesnay
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > We need to touch the logic of JM/SlotPool, because
> > for
> > > > >> > > > > >> MARK_BLOCKLISTED
> > > > >> > > > > >> > , we need to know whether the slot is blocklisted
> when
> > > the
> > > > >> task
> > > > >> > is
> > > > >> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> > > release
> > > > >> the
> > > > >> > > slot
> > > > >> > > > > >> > directly to avoid assigning other tasks (of this job)
> > on
> > > > it.
> > > > >> If
> > > > >> > we
> > > > >> > > > > only
> > > > >> > > > > >> > maintain the blocklist information on the RM, JM
> needs
> > to
> > > > >> > retrieve
> > > > >> > > > it
> > > > >> > > > > by
> > > > >> > > > > >> > RPC. I think the performance overhead of that is
> > > relatively
> > > > >> > large,
> > > > >> > > > so
> > > > >> > > > > I
> > > > >> > > > > >> > think it's worth maintaining the blocklist
> information
> > on
> > > > >> the JM
> > > > >> > > > side
> > > > >> > > > > >> and
> > > > >> > > > > >> > syncing them.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > @Роман
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap
> > > might
> > > > >> be
> > > > >> > > > helpful
> > > > >> > > > > >> > here.”  Can you explain it in detail? I don't fully
> > > > >> understand
> > > > >> > > that.
> > > > >> > > > > In
> > > > >> > > > > >> my
> > > > >> > > > > >> > opinion, non-active and active are the same, and no
> > > special
> > > > >> > > > treatment
> > > > >> > > > > is
> > > > >> > > > > >> > required.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > b) I agree with you, the `endTimestamp` makes
> sense,
> > I
> > > > will
> > > > >> > add
> > > > >> > > it
> > > > >> > > > > to
> > > > >> > > > > >> > FLIP.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > @Yang
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > As mentioned above, AFAK, the external system
> cannot
> > > > >> support
> > > > >> > the
> > > > >> > > > > >> > MARK_BLOCKLISTED action.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Looking forward to your further feedback.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Best,
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Lijie
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二
> > > 21:09写道：
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> I want to share some thoughts about Flink cluster
> > > > >> operations.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> In the production environment, the SRE(aka Site
> > > > >> Reliability
> > > > >> > > > > Engineer)
> > > > >> > > > > >> > >> already has many tools to detect the unstable
> nodes,
> > > > which
> > > > >> > > could
> > > > >> > > > > take
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> system logs/metrics into consideration.
> > > > >> > > > > >> > >> Then they use graceful-decomission in YARN and
> taint
> > > in
> > > > >> K8s
> > > > >> > to
> > > > >> > > > > >> prevent
> > > > >> > > > > >> > new
> > > > >> > > > > >> > >> allocations on these unstable nodes.
> > > > >> > > > > >> > >> At last, they will evict all the containers and
> pods
> > > > >> running
> > > > >> > on
> > > > >> > > > > these
> > > > >> > > > > >> > nodes.
> > > > >> > > > > >> > >> This mechanism also works for planned maintenance.
> > So
> > > I
> > > > am
> > > > >> > > afraid
> > > > >> > > > > >> this
> > > > >> > > > > >> > is
> > > > >> > > > > >> > >> not the typical use case for FLIP-224.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> If we only support to block nodes manually, then I
> > > could
> > > > >> not
> > > > >> > > see
> > > > >> > > > > >> > >> the obvious advantages compared with current SRE's
> > > > >> > approach(via
> > > > >> > > > > *yarn
> > > > >> > > > > >> > >> rmadmin or kubectl taint*).
> > > > >> > > > > >> > >> At least, we need to have a pluggable component
> > which
> > > > >> could
> > > > >> > > > expose
> > > > >> > > > > >> the
> > > > >> > > > > >> > >> potential unstable nodes automatically and block
> > them
> > > if
> > > > >> > > enabled
> > > > >> > > > > >> > explicitly.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Best,
> > > > >> > > > > >> > >> Yang
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> > > 16:36写道：
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> > Thanks for the proposal, Lijie.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > This is an interesting feature and discussion,
> and
> > > > >> somewhat
> > > > >> > > > > related
> > > > >> > > > > >> > to the
> > > > >> > > > > >> > >> > design principle about how people should operate
> > > > Flink.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > I think there are three things involved in this
> > > FLIP.
> > > > >> > > > > >> > >> >      a) Detect and report the unstable node.
> > > > >> > > > > >> > >> >      b) Collect the information of the unstable
> > node
> > > > and
> > > > >> > > form a
> > > > >> > > > > >> > blocklist.
> > > > >> > > > > >> > >> >      c) Take the action to block nodes.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > My two cents:
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > 1. It looks like people all agree that Flink
> > should
> > > > have
> > > > >> > c).
> > > > >> > > It
> > > > >> > > > > is
> > > > >> > > > > >> > not only
> > > > >> > > > > >> > >> > useful for cases of node failures, but also
> handy
> > > for
> > > > >> some
> > > > >> > > > > planned
> > > > >> > > > > >> > >> > maintenance.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > 2. People have different opinions on b), i.e.
> who
> > > > >> should be
> > > > >> > > the
> > > > >> > > > > >> brain
> > > > >> > > > > >> > to
> > > > >> > > > > >> > >> > make the decision to block a node. I think this
> > > > largely
> > > > >> > > depends
> > > > >> > > > > on
> > > > >> > > > > >> > who we
> > > > >> > > > > >> > >> > talk to. Different users would probably give
> > > different
> > > > >> > > answers.
> > > > >> > > > > For
> > > > >> > > > > >> > people
> > > > >> > > > > >> > >> > who do have a centralized node health management
> > > > >> service,
> > > > >> > let
> > > > >> > > > > Flink
> > > > >> > > > > >> > do just
> > > > >> > > > > >> > >> > do a) and c) would be preferred. So essentially
> > > Flink
> > > > >> would
> > > > >> > > be
> > > > >> > > > > one
> > > > >> > > > > >> of
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> > sources that may detect unstable nodes, report
> it
> > to
> > > > >> that
> > > > >> > > > > service,
> > > > >> > > > > >> > and then
> > > > >> > > > > >> > >> > take the command from that service to block the
> > > > >> problematic
> > > > >> > > > > nodes.
> > > > >> > > > > >> On
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> > other hand, for users who do not have such a
> > > service,
> > > > >> > simply
> > > > >> > > > > >> letting
> > > > >> > > > > >> > Flink
> > > > >> > > > > >> > >> > be clever by itself to block the suspicious
> nodes
> > > > might
> > > > >> be
> > > > >> > > > > desired
> > > > >> > > > > >> to
> > > > >> > > > > >> > >> > ensure the jobs are running smoothly.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > So that indicates a) and b) here should be
> > > pluggable /
> > > > >> > > > optional.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > In light of this, maybe it would make sense to
> > have
> > > > >> > something
> > > > >> > > > > >> > pluggable
> > > > >> > > > > >> > >> > like a UnstableNodeReporter which exposes
> unstable
> > > > nodes
> > > > >> > > > > actively.
> > > > >> > > > > >> (A
> > > > >> > > > > >> > more
> > > > >> > > > > >> > >> > general interface should be JobInfoReporter<T>
> > which
> > > > >> can be
> > > > >> > > > used
> > > > >> > > > > to
> > > > >> > > > > >> > report
> > > > >> > > > > >> > >> > any information of type <T>. But I'll just keep
> > the
> > > > >> scope
> > > > >> > > > > relevant
> > > > >> > > > > >> to
> > > > >> > > > > >> > this
> > > > >> > > > > >> > >> > FLIP here). Personally speaking, I think it is
> OK
> > to
> > > > >> have a
> > > > >> > > > > default
> > > > >> > > > > >> > >> > implementation of a reporter which just tells
> > Flink
> > > to
> > > > >> take
> > > > >> > > > > action
> > > > >> > > > > >> to
> > > > >> > > > > >> > block
> > > > >> > > > > >> > >> > problematic nodes and also unblocks them after
> > > > timeout.
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > Thanks,
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > Jiangjie (Becket) Qin
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > > > >> > > > ro.v.boyko@gmail.com
> > > > >> > > > > >
> > > > >> > > > > >> > wrote:
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > > If it's possible I'd like to participate in
> > > > >> development.
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply -
> > we
> > > > >> should
> > > > >> > > > > consider
> > > > >> > > > > >> > to move
> > > > >> > > > > >> > >> > > somehow the information of blocklisted
> nodes/TMs
> > > > from
> > > > >> > > active
> > > > >> > > > > >> > >> > > ResourceManager to non-active ones. Probably
> > > storing
> > > > >> > inside
> > > > >> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > > And I agree with Martijn that a lot of
> > > organizations
> > > > >> > don't
> > > > >> > > > want
> > > > >> > > > > >> to
> > > > >> > > > > >> > expose
> > > > >> > > > > >> > >> > > such API for a cluster user group. But I think
> > > it's
> > > > >> > > necessary
> > > > >> > > > > to
> > > > >> > > > > >> > have the
> > > > >> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway
> > for
> > > > >> > avoiding
> > > > >> > > > > >> incorrect
> > > > >> > > > > >> > >> > > automatic behaviour.
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > > And another one small suggestion - I think it
> > > would
> > > > be
> > > > >> > > better
> > > > >> > > > > to
> > > > >> > > > > >> > extend
> > > > >> > > > > >> > >> > the
> > > > >> > > > > >> > >> > > *BlocklistedItem* class with the
> *endTimestamp*
> > > > field
> > > > >> and
> > > > >> > > > fill
> > > > >> > > > > it
> > > > >> > > > > >> > at the
> > > > >> > > > > >> > >> > > item creation. This simple addition will allow
> > to:
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > >    -
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > >    Provide the ability to users to setup the
> > exact
> > > > >> time
> > > > >> > of
> > > > >> > > > > >> > blocklist end
> > > > >> > > > > >> > >> > >    through RestAPI
> > > > >> > > > > >> > >> > >    -
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > >    Not being tied to a single value of
> > > > >> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler
> <
> > > > >> > > > > >> chesnay@apache.org>
> > > > >> > > > > >> > >> > wrote:
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> > > > I do share the concern between blurring the
> > > lines
> > > > a
> > > > >> > bit.
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > > That said, I'd prefer to not have any
> > > > auto-detection
> > > > >> > and
> > > > >> > > > only
> > > > >> > > > > >> > have an
> > > > >> > > > > >> > >> > > > opt-in mechanism
> > > > >> > > > > >> > >> > > > to manually block processes/nodes. To me
> this
> > > > sounds
> > > > >> > yet
> > > > >> > > > > again
> > > > >> > > > > >> > like one
> > > > >> > > > > >> > >> > > > of those
> > > > >> > > > > >> > >> > > > magical mechanisms that will rarely work
> just
> > > > right.
> > > > >> > > > > >> > >> > > > An external system can leverage way more
> > > > information
> > > > >> > > after
> > > > >> > > > > all.
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > > Moreover, I'm quite concerned about the
> > > complexity
> > > > >> of
> > > > >> > > this
> > > > >> > > > > >> > proposal.
> > > > >> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing
> > between
> > > > >> > > > components;
> > > > >> > > > > >> > >> > adjustments
> > > > >> > > > > >> > >> > > > to the
> > > > >> > > > > >> > >> > > > slot and resource protocol.
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > > In a way it seems overly complicated.
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > > If we look at it purely from an active
> > resource
> > > > >> > > management
> > > > >> > > > > >> > perspective,
> > > > >> > > > > >> > >> > > > then there
> > > > >> > > > > >> > >> > > > isn't really a need to touch the slot
> protocol
> > > at
> > > > >> all
> > > > >> > (or
> > > > >> > > > in
> > > > >> > > > > >> fact
> > > > >> > > > > >> > to
> > > > >> > > > > >> > >> > > > anything in the JobMaster),
> > > > >> > > > > >> > >> > > > because there isn't any point in keeping
> > around
> > > > >> blocked
> > > > >> > > TMs
> > > > >> > > > > in
> > > > >> > > > > >> the
> > > > >> > > > > >> > >> > first
> > > > >> > > > > >> > >> > > > place.
> > > > >> > > > > >> > >> > > > They'd just be idling, potentially shutting
> > down
> > > > >> after
> > > > >> > a
> > > > >> > > > > while
> > > > >> > > > > >> by
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> > RM
> > > > >> > > > > >> > >> > > > because of
> > > > >> > > > > >> > >> > > > it (unless we _also_ touch that logic).
> > > > >> > > > > >> > >> > > > Here the blocking of a process (be it by
> > > blocking
> > > > >> the
> > > > >> > > > process
> > > > >> > > > > >> or
> > > > >> > > > > >> > node)
> > > > >> > > > > >> > >> > is
> > > > >> > > > > >> > >> > > > equivalent with shutting down the blocked
> > > > >> process(es).
> > > > >> > > > > >> > >> > > > Once the block is lifted we can just spin it
> > > back
> > > > >> up.
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > > And I do wonder whether we couldn't apply
> the
> > > same
> > > > >> line
> > > > >> > > of
> > > > >> > > > > >> > thinking to
> > > > >> > > > > >> > >> > > > standalone resource management.
> > > > >> > > > > >> > >> > > > Here being able to stop/restart a
> process/node
> > > > >> manually
> > > > >> > > > > should
> > > > >> > > > > >> be
> > > > >> > > > > >> > a
> > > > >> > > > > >> > >> > core
> > > > >> > > > > >> > >> > > > requirement for a Flink deployment anyway.
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > > >> > > > > >> > >> > > > > Hi everyone,
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > > Thanks for creating this FLIP. I can
> > > understand
> > > > >> the
> > > > >> > > > problem
> > > > >> > > > > >> and
> > > > >> > > > > >> > I see
> > > > >> > > > > >> > >> > > > value
> > > > >> > > > > >> > >> > > > > in the automatic detection and
> > blocklisting. I
> > > > do
> > > > >> > have
> > > > >> > > > some
> > > > >> > > > > >> > concerns
> > > > >> > > > > >> > >> > > with
> > > > >> > > > > >> > >> > > > > the ability to manually specify to be
> > blocked
> > > > >> > > resources.
> > > > >> > > > I
> > > > >> > > > > >> have
> > > > >> > > > > >> > two
> > > > >> > > > > >> > >> > > > > concerns;
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > > * Most organizations explicitly have a
> > > > separation
> > > > >> of
> > > > >> > > > > >> concerns,
> > > > >> > > > > >> > >> > meaning
> > > > >> > > > > >> > >> > > > that
> > > > >> > > > > >> > >> > > > > there's a group who's responsible for
> > > managing a
> > > > >> > > cluster
> > > > >> > > > > and
> > > > >> > > > > >> > there's
> > > > >> > > > > >> > >> > a
> > > > >> > > > > >> > >> > > > user
> > > > >> > > > > >> > >> > > > > group who uses that cluster. With the
> > > > >> introduction of
> > > > >> > > > this
> > > > >> > > > > >> > mechanism,
> > > > >> > > > > >> > >> > > the
> > > > >> > > > > >> > >> > > > > latter group now can influence the
> > > > responsibility
> > > > >> of
> > > > >> > > the
> > > > >> > > > > >> first
> > > > >> > > > > >> > group.
> > > > >> > > > > >> > >> > > So
> > > > >> > > > > >> > >> > > > it
> > > > >> > > > > >> > >> > > > > can be possible that someone from the user
> > > group
> > > > >> > blocks
> > > > >> > > > > >> > something,
> > > > >> > > > > >> > >> > > which
> > > > >> > > > > >> > >> > > > > causes an outage (which could result in
> > paging
> > > > >> > > mechanism
> > > > >> > > > > >> > triggering
> > > > >> > > > > >> > >> > > etc)
> > > > >> > > > > >> > >> > > > > which impacts the first group.
> > > > >> > > > > >> > >> > > > > * How big is the group of people who can
> go
> > > > >> through
> > > > >> > the
> > > > >> > > > > >> process
> > > > >> > > > > >> > of
> > > > >> > > > > >> > >> > > > manually
> > > > >> > > > > >> > >> > > > > identifying a node that isn't behaving as
> it
> > > > >> should
> > > > >> > > be? I
> > > > >> > > > > do
> > > > >> > > > > >> > think
> > > > >> > > > > >> > >> > this
> > > > >> > > > > >> > >> > > > > group is relatively limited. Does it then
> > make
> > > > >> sense
> > > > >> > to
> > > > >> > > > > >> > introduce
> > > > >> > > > > >> > >> > such
> > > > >> > > > > >> > >> > > a
> > > > >> > > > > >> > >> > > > > feature, which would only be used by a
> > really
> > > > >> small
> > > > >> > > user
> > > > >> > > > > >> group
> > > > >> > > > > >> > of
> > > > >> > > > > >> > >> > > Flink?
> > > > >> > > > > >> > >> > > > We
> > > > >> > > > > >> > >> > > > > still have to maintain, test and support
> > such
> > > a
> > > > >> > > feature.
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > > I'm +1 for the autodetection features, but
> > I'm
> > > > >> > leaning
> > > > >> > > > > >> towards
> > > > >> > > > > >> > not
> > > > >> > > > > >> > >> > > > exposing
> > > > >> > > > > >> > >> > > > > this to the user group but having this
> > > available
> > > > >> > > strictly
> > > > >> > > > > for
> > > > >> > > > > >> > cluster
> > > > >> > > > > >> > >> > > > > operators. They could then also set up
> their
> > > > >> > > > > >> > paging/metrics/logging
> > > > >> > > > > >> > >> > > > system
> > > > >> > > > > >> > >> > > > > to take this into account.
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > > Best regards,
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > > Martijn Visser
> > > > >> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > > > >> > > > > >> > >> > > > > https://github.com/MartijnVisser
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > > >> > > > > karmagyz@gmail.com
> > > > >> > > > > >> >
> > > > >> > > > > >> > wrote:
> > > > >> > > > > >> > >> > > > >
> > > > >> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > > >> +1 for the overall proposal. Just share
> > some
> > > > >> cents
> > > > >> > > here:
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > > >> - Why do we need to expose
> > > > >> > > > > >> > >> > > > >>
> > > > >> > cluster.resource-blacklist.item.timeout-check-interval
> > > > >> > > > to
> > > > >> > > > > >> the
> > > > >> > > > > >> > user?
> > > > >> > > > > >> > >> > > > >> I think the semantics of
> > > > >> > > > > >> > `cluster.resource-blacklist.item.timeout`
> > > > >> > > > > >> > >> > is
> > > > >> > > > > >> > >> > > > >> sufficient for the user. How to guarantee
> > the
> > > > >> > timeout
> > > > >> > > > > >> > mechanism is
> > > > >> > > > > >> > >> > > > >> Flink's internal implementation. I think
> it
> > > > will
> > > > >> be
> > > > >> > > very
> > > > >> > > > > >> > confusing
> > > > >> > > > > >> > >> > and
> > > > >> > > > > >> > >> > > > >> we do not need to expose it to users.
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > > >> - ResourceManager can notify the
> exception
> > > of a
> > > > >> task
> > > > >> > > > > >> manager to
> > > > >> > > > > >> > >> > > > >> `BlacklistHandler` as well.
> > > > >> > > > > >> > >> > > > >> For example, the slot allocation might
> fail
> > > in
> > > > >> case
> > > > >> > > the
> > > > >> > > > > >> target
> > > > >> > > > > >> > task
> > > > >> > > > > >> > >> > > > >> manager is busy or has a network jitter.
> I
> > > > don't
> > > > >> > mean
> > > > >> > > we
> > > > >> > > > > >> need
> > > > >> > > > > >> > to
> > > > >> > > > > >> > >> > cover
> > > > >> > > > > >> > >> > > > >> this case in this version, but we can
> also
> > > > open a
> > > > >> > > > > >> > `notifyException`
> > > > >> > > > > >> > >> > in
> > > > >> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > > >> - Before we sync the blocklist to
> > > > >> ResourceManager,
> > > > >> > > will
> > > > >> > > > > the
> > > > >> > > > > >> > slot of
> > > > >> > > > > >> > >> > a
> > > > >> > > > > >> > >> > > > >> blocked task manager continues to be
> > released
> > > > and
> > > > >> > > > > allocated?
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > > >> Best,
> > > > >> > > > > >> > >> > > > >> Yangze Guo
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie
> Wang
> > <
> > > > >> > > > > >> > >> > wangdachui9501@gmail.com>
> > > > >> > > > > >> > >> > > > >> wrote:
> > > > >> > > > > >> > >> > > > >>> Hi Konstantin,
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> Thanks for your feedback. I will
> response
> > > > your 4
> > > > >> > > > remarks:
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the
> > > > controversy. I
> > > > >> > > think
> > > > >> > > > > >> > “BlockList”
> > > > >> > > > > >> > >> > is
> > > > >> > > > > >> > >> > > > >> good
> > > > >> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a
> > > good
> > > > >> idea.
> > > > >> > > > Based
> > > > >> > > > > >> on
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> > > > above, I
> > > > >> > > > > >> > >> > > > >>> would change REST API as following:
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> DELETE
> <host>/blocklist/node/<identifier>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> DELETE
> > > > <host>/blocklist/taskmanager/<identifier>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it
> > > means
> > > > >> that
> > > > >> > > all
> > > > >> > > > > >> task
> > > > >> > > > > >> > >> > managers
> > > > >> > > > > >> > >> > > > on
> > > > >> > > > > >> > >> > > > >>> this node are blocklisted. All slots on
> > > these
> > > > >> TMs
> > > > >> > are
> > > > >> > > > not
> > > > >> > > > > >> > >> > available.
> > > > >> > > > > >> > >> > > > This
> > > > >> > > > > >> > >> > > > >>> is actually a bit like TM losts, but
> these
> > > TMs
> > > > >> are
> > > > >> > > not
> > > > >> > > > > >> really
> > > > >> > > > > >> > lost,
> > > > >> > > > > >> > >> > > > they
> > > > >> > > > > >> > >> > > > >>> are in an unavailable status, and they
> are
> > > > still
> > > > >> > > > > registered
> > > > >> > > > > >> > in this
> > > > >> > > > > >> > >> > > > flink
> > > > >> > > > > >> > >> > > > >>> cluster. They will be available again
> once
> > > the
> > > > >> > > > > >> corresponding
> > > > >> > > > > >> > >> > > blocklist
> > > > >> > > > > >> > >> > > > >> item
> > > > >> > > > > >> > >> > > > >>> is removed. This behavior is the same in
> > > > >> > > > > active/non-active
> > > > >> > > > > >> > >> > clusters.
> > > > >> > > > > >> > >> > > > >>> However in the active clusters, these
> TMs
> > > may
> > > > be
> > > > >> > > > released
> > > > >> > > > > >> due
> > > > >> > > > > >> > to
> > > > >> > > > > >> > >> > idle
> > > > >> > > > > >> > >> > > > >>> timeouts.
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to
> keep
> > > it.
> > > > >> The
> > > > >> > > > reasons
> > > > >> > > > > >> are
> > > > >> > > > > >> > as
> > > > >> > > > > >> > >> > > > >> following:
> > > > >> > > > > >> > >> > > > >>> a) The timeout will not affect users
> > adding
> > > or
> > > > >> > > removing
> > > > >> > > > > >> items
> > > > >> > > > > >> > via
> > > > >> > > > > >> > >> > > REST
> > > > >> > > > > >> > >> > > > >> API,
> > > > >> > > > > >> > >> > > > >>> and users can disable it by configuring
> it
> > > to
> > > > >> > > > > >> Long.MAX_VALUE .
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> b) Some node problems can recover after
> a
> > > > >> period of
> > > > >> > > > time
> > > > >> > > > > >> > (such as
> > > > >> > > > > >> > >> > > > machine
> > > > >> > > > > >> > >> > > > >>> hotspots), in which case users may
> prefer
> > > that
> > > > >> > Flink
> > > > >> > > > can
> > > > >> > > > > do
> > > > >> > > > > >> > this
> > > > >> > > > > >> > >> > > > >>> automatically instead of requiring the
> > user
> > > to
> > > > >> do
> > > > >> > it
> > > > >> > > > > >> manually.
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> Best,
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> Lijie
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org>
> > > > >> 于2022年4月27日周三
> > > > >> > > > > >> 19:23写道：
> > > > >> > > > > >> > >> > > > >>>
> > > > >> > > > > >> > >> > > > >>>> Hi Lijie,
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> I think, this makes sense and +1 to
> only
> > > > >> support
> > > > >> > > > > manually
> > > > >> > > > > >> > blocking
> > > > >> > > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the
> > different
> > > > >> > > strategies
> > > > >> > > > > can
> > > > >> > > > > >> > also be
> > > > >> > > > > >> > >> > > > >>>> maintained outside of Apache Flink.
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> A few remarks:
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> 1) Can we use another term than
> > > "bla.cklist"
> > > > >> due
> > > > >> > to
> > > > >> > > > the
> > > > >> > > > > >> > >> > controversy
> > > > >> > > > > >> > >> > > > >> around
> > > > >> > > > > >> > >> > > > >>>> the term? [1] There was also a Jira
> > Ticket
> > > > >> about
> > > > >> > > this
> > > > >> > > > > >> topic a
> > > > >> > > > > >> > >> > while
> > > > >> > > > > >> > >> > > > >> back
> > > > >> > > > > >> > >> > > > >>>> and there was generally a consensus to
> > > avoid
> > > > >> the
> > > > >> > > term
> > > > >> > > > > >> > blacklist &
> > > > >> > > > > >> > >> > > > >> whitelist
> > > > >> > > > > >> > >> > > > >>>> [2]? We could use "blocklist"
> "denylist"
> > or
> > > > >> > > > > "quarantined"
> > > > >> > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a
> > slightly
> > > > >> > different
> > > > >> > > > > >> design
> > > > >> > > > > >> > as
> > > > >> > > > > >> > >> > verbs
> > > > >> > > > > >> > >> > > > >> like
> > > > >> > > > > >> > >> > > > >>>> add/remove often considered an
> > anti-pattern
> > > > for
> > > > >> > REST
> > > > >> > > > > APIs.
> > > > >> > > > > >> > POST
> > > > >> > > > > >> > >> > on a
> > > > >> > > > > >> > >> > > > >> list
> > > > >> > > > > >> > >> > > > >>>> item is generally the standard to add
> > > items.
> > > > >> > DELETE
> > > > >> > > on
> > > > >> > > > > the
> > > > >> > > > > >> > >> > > individual
> > > > >> > > > > >> > >> > > > >>>> resource is standard to remove an item.
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
> > > > >> > > > > >> > >> > > > >>>> DELETE
> > > > <host>/quarantine/items/<itemidentifier>
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> We could also consider to separate
> > > > taskmanagers
> > > > >> > and
> > > > >> > > > > nodes
> > > > >> > > > > >> in
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> > > REST
> > > > >> > > > > >> > >> > > > >> API
> > > > >> > > > > >> > >> > > > >>>> (and internal data structures). Any
> > opinion
> > > > on
> > > > >> > this?
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > > > >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > > >> > > > > >> > >> > > > >>>> DELETE
> > <host>/quarantine/nodes/<identifier>
> > > > >> > > > > >> > >> > > > >>>> DELETE
> > > > >> <host>/quarantine/taskmanager/<identifier>
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with
> > > > >> non-active
> > > > >> > > > > >> resource
> > > > >> > > > > >> > >> > > managers,
> > > > >> > > > > >> > >> > > > >> i.e.
> > > > >> > > > > >> > >> > > > >>>> standalone or reactive mode?
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> 4) To keep the implementation even more
> > > > >> minimal,
> > > > >> > do
> > > > >> > > we
> > > > >> > > > > >> need
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> > > > timeout
> > > > >> > > > > >> > >> > > > >>>> behavior? If items are added/removed
> > > manually
> > > > >> we
> > > > >> > > could
> > > > >> > > > > >> > delegate
> > > > >> > > > > >> > >> > this
> > > > >> > > > > >> > >> > > > >> to the
> > > > >> > > > > >> > >> > > > >>>> user easily. In my opinion the timeout
> > > > behavior
> > > > >> > > would
> > > > >> > > > > >> better
> > > > >> > > > > >> > fit
> > > > >> > > > > >> > >> > > into
> > > > >> > > > > >> > >> > > > >>>> specific strategies at a later point.
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> Cheers and thank you,
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> Konstantin
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> [1]
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > > >> > > > > >> > >> > > > >>>> [2]
> > > > >> > > https://issues.apache.org/jira/browse/FLINK-18209
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> > schrieb
> > > > >> Lijie
> > > > >> > > Wang
> > > > >> > > > <
> > > > >> > > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > > > >> > > > > >> > >> > > > >>>>
> > > > >> > > > > >> > >> > > > >>>>> Hi all,
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> Flink job failures may happen due to
> > > cluster
> > > > >> node
> > > > >> > > > > issues
> > > > >> > > > > >> > >> > > > >> (insufficient
> > > > >> > > > > >> > >> > > > >>>> disk
> > > > >> > > > > >> > >> > > > >>>>> space, bad hardware, network
> > > abnormalities).
> > > > >> > Flink
> > > > >> > > > will
> > > > >> > > > > >> > take care
> > > > >> > > > > >> > >> > > of
> > > > >> > > > > >> > >> > > > >> the
> > > > >> > > > > >> > >> > > > >>>>> failures and redeploy the tasks.
> > However,
> > > > due
> > > > >> to
> > > > >> > > data
> > > > >> > > > > >> > locality
> > > > >> > > > > >> > >> > and
> > > > >> > > > > >> > >> > > > >>>> limited
> > > > >> > > > > >> > >> > > > >>>>> resources, the new tasks are very
> likely
> > > to
> > > > be
> > > > >> > > > > redeployed
> > > > >> > > > > >> > to the
> > > > >> > > > > >> > >> > > same
> > > > >> > > > > >> > >> > > > >>>>> nodes, which will result in continuous
> > > task
> > > > >> > > > > abnormalities
> > > > >> > > > > >> > and
> > > > >> > > > > >> > >> > > affect
> > > > >> > > > > >> > >> > > > >> job
> > > > >> > > > > >> > >> > > > >>>>> progress.
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> Currently, Flink users need to
> manually
> > > > >> identify
> > > > >> > > the
> > > > >> > > > > >> > problematic
> > > > >> > > > > >> > >> > > > >> node and
> > > > >> > > > > >> > >> > > > >>>>> take it offline to solve this problem.
> > But
> > > > >> this
> > > > >> > > > > approach
> > > > >> > > > > >> has
> > > > >> > > > > >> > >> > > > >> following
> > > > >> > > > > >> > >> > > > >>>>> disadvantages:
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a
> heavy
> > > > >> process.
> > > > >> > > > Users
> > > > >> > > > > >> may
> > > > >> > > > > >> > need
> > > > >> > > > > >> > >> > to
> > > > >> > > > > >> > >> > > > >>>> contact
> > > > >> > > > > >> > >> > > > >>>>> cluster administors to do this. The
> > > > operation
> > > > >> can
> > > > >> > > > even
> > > > >> > > > > be
> > > > >> > > > > >> > >> > dangerous
> > > > >> > > > > >> > >> > > > >> and
> > > > >> > > > > >> > >> > > > >>>> not
> > > > >> > > > > >> > >> > > > >>>>> allowed during some important business
> > > > events.
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind
> of
> > > > >> problems
> > > > >> > > > > manually
> > > > >> > > > > >> > would
> > > > >> > > > > >> > >> > be
> > > > >> > > > > >> > >> > > > >> slow
> > > > >> > > > > >> > >> > > > >>>> and
> > > > >> > > > > >> > >> > > > >>>>> a waste of human resources.
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I
> > > propose
> > > > >> to
> > > > >> > > > > >> introduce a
> > > > >> > > > > >> > >> > > blacklist
> > > > >> > > > > >> > >> > > > >>>>> mechanism for Flink to filter out
> > > > problematic
> > > > >> > > > > resources.
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> You can find more details in
> > FLIP-224[1].
> > > > >> Looking
> > > > >> > > > > forward
> > > > >> > > > > >> > to your
> > > > >> > > > > >> > >> > > > >>>> feedback.
> > > > >> > > > > >> > >> > > > >>>>> [1]
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> Best,
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > > >>>>> Lijie
> > > > >> > > > > >> > >> > > > >>>>>
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > > >
> > > > >> > > > > >> > >> > >
> > > > >> > > > > >> > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> > > --
> > > > >> > > Best regards,
> > > > >> > > Roman Boyko
> > > > >> > > e.: ro.v.boyko@gmail.com
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > https://twitter.com/snntrable
> > > https://github.com/knaufk
> > >
> >
>
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Konstantin Knauf <kn...@apache.org>.

Hi Lijie,

hm, maybe the following is more appropriate in that case

POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge

Best,

Konstantin

Am Mo., 16. Mai 2022 um 07:05 Uhr schrieb Lijie Wang <
wangdachui9501@gmail.com>:

> Hi Konstantin,
> thanks for your feedback.
>
> From what I understand, PUT should be idempotent. However, we have a
> *timeout* field in the request. This means that initiating the same request
> at two different times will lead to different resource status (timestamps
> of the items to be removed will be different).
>
> Should we use PUT in this case? WDYT?
>
> Best,
> Lijie
>
> Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：
>
> > Hi Lijie,
> >
> > wouldn't the REST API-idiomatic way for an update/replace be a PUT on the
> > resource?
> >
> > PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
> >
> > Best,
> >
> > Konstantin
> >
> >
> >
> > Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> > wangdachui9501@gmail.com>:
> >
> > > Hi everyone,
> > >
> > > I've had an offline discussion with Becket Qin and Zhu Zhu, and made
> the
> > > following changes on REST API:
> > > 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose
> one.
> > If
> > > both are specified, will return error.
> > > 2.  If the specified item is already there, the *ADD* operation has two
> > > behaviors:  *return error*(default value) or *merge/update*, and we
> add a
> > > flag to the request body to control it. You can find more details
> "Public
> > > Interface" section.
> > >
> > > If there is no more feedback, we will start the vote thread next week.
> > >
> > > Best,
> > > Lijie
> > >
> > > Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> > >
> > > > Hi Becket Qin,
> > > >
> > > > Thanks for your suggestions.  I have moved the description of
> > > > configurations, metrics and REST API into "Public Interface" section,
> > and
> > > > made a few updates according to your suggestion.  And in this FLIP,
> > there
> > > > no public java Interfaces or pluggables that users need to implement
> by
> > > > themselves.
> > > >
> > > > Answers for you questions:
> > > > 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > > > MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block items
> > can
> > > > only be added through the REST API, so these 2 action are mentioned
> in
> > > the
> > > > REST API part (The REST API part has beed moved to public interface
> > now).
> > > > 2. I agree with you. I have changed the "Cause" field to String, and
> > > allow
> > > > users to specify it via REST API.
> > > > 3. Yes, it is useful to allow different timeouts. As mentioned above,
> > we
> > > > will introduce 2 fields : *timeout* and *endTimestamp* into the ADD
> > REST
> > > > API to specify when to remove the blocked item. These 2 fields are
> > > > optional, if neither is specified, it means that the blocked item is
> > > > permanent and will not be removed. If both are specified, the minimum
> > of
> > > > *currentTimestamp+tiemout *and* endTimestamp* will be used as the
> time
> > to
> > > > remove the blocked item. To keep the configurations more minimal, we
> > have
> > > > removed the *cluster.resource-blocklist.item.timeout* configuration
> > > > option.
> > > > 4. Yes, the block item will be overridden if the specified item
> already
> > > > exists. The ADD operation is *ADD or UPDATE*.
> > > > 5. Yes. On JM/RM side, all the blocklist information is maintained in
> > > > JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> > > abstracted
> > > > to other interfaces) will be propagated to different components.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> > > >
> > > >> Hi Lijie,
> > > >>
> > > >> Thanks for updating the FLIP. It looks like the public interface
> > section
> > > >> did not fully reflect all the user sensible behavior and API. Can
> you
> > > put
> > > >> everything that users may be aware of there? That would include the
> > REST
> > > >> API, metrics, configurations, public java Interfaces or pluggables
> > that
> > > >> users may see or implement by themselves, as well as a brief summary
> > of
> > > >> the
> > > >> behavior of the public API.
> > > >>
> > > >> Besides that, I have a few questions:
> > > >>
> > > >> 1. According to the conversation in the discussion thread, it looks
> > like
> > > >> the BlockAction will have "MARK_BLOCKLISTED" and
> > > >> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can
> > you
> > > >> add
> > > >> that to the public interface as well?
> > > >>
> > > >> 2. At this point, the "Cause" field in the BlockingItem is a
> Throwable
> > > and
> > > >> is not reflected in the REST API. Should that be included in the
> query
> > > >> response? And should we change that field to be a String so users
> may
> > > >> specify the cause via the REST API when they block some nodes / TMs?
> > > >>
> > > >> 3. Would it be useful to allow users to have different timeouts for
> > > >> different blocked items? So while there is a default timeout, users
> > can
> > > >> also override it via the REST API when they block an entity.
> > > >>
> > > >> 4. Regarding the ADD operation, if the specified item is already
> > there,
> > > >> will the block item be overridden? For example, if the user wants to
> > > >> extend
> > > >> the timeout of a blocked item, can they just  issue an ADD command
> > > again?
> > > >>
> > > >> 5. I am not quite familiar with the details of this, but is there a
> > > source
> > > >> of truth for the blocked list? I think it might be good to have a
> > single
> > > >> source of truth for the blocked list and just propagate that list to
> > > >> different components to take the action of actually blocking the
> > > resource.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wangdachui9501@gmail.com
> >
> > > >> wrote:
> > > >>
> > > >> > Hi everyone,
> > > >> >
> > > >> > Based on the discussion in the mailing list, I updated the FLIP
> doc,
> > > the
> > > >> > changes include:
> > > >> > 1. Changed the description of the motivation section to more
> clearly
> > > >> > describe the problem this FLIP is trying to solve.
> > > >> > 2. Only  *Manually* is supported.
> > > >> > 3. Adopted some suggestions, such as *endTimestamp*.
> > > >> >
> > > >> > Best,
> > > >> > Lijie
> > > >> >
> > > >> >
> > > >> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> > > >> >
> > > >> > > Hi Lijie!
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > *a) “Probably storing inside Zookeeper/Configmap might be
> > > >> helpfulhere.”
> > > >> > > Can you explain it in detail? I don't fully understand that. In
> > > >> > myopinion,
> > > >> > > non-active and active are the same, and no special treatment
> > > >> isrequired.*
> > > >> > >
> > > >> > > Sorry this was a misunderstanding from my side. I thought we
> were
> > > >> talking
> > > >> > > about the HA mode (but not about Active and Standalone
> > > >> ResourceManager).
> > > >> > > And the original question was - how to handle the blacklisted
> > nodes
> > > >> list
> > > >> > at
> > > >> > > the moment of leader change? Should we simply forget about them
> or
> > > >> try to
> > > >> > > pre-save that list on the remote storage?
> > > >> > >
> > > >> > > On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com>
> > > wrote:
> > > >> > >
> > > >> > > > Thanks Lijie and ZhuZhu for the explanation.
> > > >> > > >
> > > >> > > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it
> is
> > > >> indeed
> > > >> > > > some functionalities the external tools(e.g. kubectl taint)
> > could
> > > >> not
> > > >> > > > support.
> > > >> > > >
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Yang
> > > >> > > >
> > > >> > > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> > > >> > > >
> > > >> > > > > Thanks for your feedback, Jiangang and Martijn.
> > > >> > > > >
> > > >> > > > > @Jiangang
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > > For auto-detecting, I wonder how to make the strategy and
> > > mark a
> > > >> > node
> > > >> > > > > blocked?
> > > >> > > > >
> > > >> > > > > In fact, we currently plan to not support auto-detection in
> > this
> > > >> > FLIP.
> > > >> > > > The
> > > >> > > > > part about auto-detection may be continued in a separate
> FLIP
> > in
> > > >> the
> > > >> > > > > future. Some guys have the same concerns as you, and the
> > > >> correctness
> > > >> > > and
> > > >> > > > > necessity of auto-detection may require further discussion
> in
> > > the
> > > >> > > future.
> > > >> > > > >
> > > >> > > > > > In session mode, multi jobs can fail on the same bad node
> > and
> > > >> the
> > > >> > > node
> > > >> > > > > should be marked blocked.
> > > >> > > > > By design, the blocklist information will be shared among
> all
> > > jobs
> > > >> > in a
> > > >> > > > > cluster/session. The JM will sync blocklist information with
> > RM.
> > > >> > > > >
> > > >> > > > > @Martijn
> > > >> > > > >
> > > >> > > > > > I agree with Yang Wang on this.
> > > >> > > > > As Zhu Zhu and I mentioned above, we think the
> > > >> MARK_BLOCKLISTED(Just
> > > >> > > > limits
> > > >> > > > > the load of the node and does not  kill all the processes on
> > it)
> > > >> is
> > > >> > > also
> > > >> > > > > important, and we think that external systems (*yarn rmadmin
> > or
> > > >> > kubectl
> > > >> > > > > taint*) cannot support it. So we think it makes sense even
> > only
> > > >> > > > *manually*.
> > > >> > > > >
> > > >> > > > > > I also agree with Chesnay that magical mechanisms are
> indeed
> > > >> super
> > > >> > > hard
> > > >> > > > > to get right.
> > > >> > > > > Yes, as you see, Jiangang(and a few others) have the same
> > > concern.
> > > >> > > > > However, we currently plan to not support auto-detection in
> > this
> > > >> > FLIP,
> > > >> > > > and
> > > >> > > > > only *manually*. In addition, I'd like to say that the FLIP
> > > >> provides
> > > >> > a
> > > >> > > > > mechanism to support MARK_BLOCKLISTED and
> > > >> > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > > >> > > > > the auto-detection may be done by external systems.
> > > >> > > > >
> > > >> > > > > Best,
> > > >> > > > > Lijie
> > > >> > > > >
> > > >> > > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五
> 19:04写道：
> > > >> > > > >
> > > >> > > > > > > If we only support to block nodes manually, then I could
> > not
> > > >> see
> > > >> > > > > > the obvious advantages compared with current SRE's
> > > approach(via
> > > >> > *yarn
> > > >> > > > > > rmadmin or kubectl taint*).
> > > >> > > > > >
> > > >> > > > > > I agree with Yang Wang on this.
> > > >> > > > > >
> > > >> > > > > > >  To me this sounds yet again like one of those magical
> > > >> mechanisms
> > > >> > > > that
> > > >> > > > > > will rarely work just right.
> > > >> > > > > >
> > > >> > > > > > I also agree with Chesnay that magical mechanisms are
> indeed
> > > >> super
> > > >> > > hard
> > > >> > > > > to
> > > >> > > > > > get right.
> > > >> > > > > >
> > > >> > > > > > Best regards,
> > > >> > > > > >
> > > >> > > > > > Martijn
> > > >> > > > > >
> > > >> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > > >> > liujiangangpeng@gmail.com
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > >> Thanks for the valuable design. The auto-detecting can
> > > decrease
> > > >> > > great
> > > >> > > > > work
> > > >> > > > > >> for us. We have implemented the similar feature in our
> > inner
> > > >> flink
> > > >> > > > > >> version.
> > > >> > > > > >> Below is something that I care about:
> > > >> > > > > >>
> > > >> > > > > >>    1. For auto-detecting, I wonder how to make the
> strategy
> > > and
> > > >> > > mark a
> > > >> > > > > >> node
> > > >> > > > > >>    blocked? Sometimes the blocked node is hard to be
> > > detected,
> > > >> for
> > > >> > > > > >> example,
> > > >> > > > > >>    the upper node or the down node will be blocked when
> > > network
> > > >> > > > > >> unreachable.
> > > >> > > > > >>    2. I see that the strategy is made in JobMaster side.
> > How
> > > >> about
> > > >> > > > > >>    implementing the similar logic in resource manager? In
> > > >> session
> > > >> > > > mode,
> > > >> > > > > >> multi
> > > >> > > > > >>    jobs can fail on the same bad node and the node should
> > be
> > > >> > marked
> > > >> > > > > >> blocked.
> > > >> > > > > >>    If the job makes the strategy, the node may be not
> > marked
> > > >> > blocked
> > > >> > > > if
> > > >> > > > > >> the
> > > >> > > > > >>    fail times don't exceed the threshold.
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > > >> > > > > >>
> > > >> > > > > >> > Thank you for all your feedback!
> > > >> > > > > >> >
> > > >> > > > > >> > Besides the answers from Lijie, I'd like to share some
> of
> > > my
> > > >> > > > thoughts:
> > > >> > > > > >> > 1. Whether to enable automatical blocklist
> > > >> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
> > > >> > > > > >> > The automatical way should be something built upon the
> > > >> blocklist
> > > >> > > > > >> > mechanism and well decoupled. It was designed to be a
> > > >> > configurable
> > > >> > > > > >> > blocklist strategy, but I think we can further decouple
> > it
> > > by
> > > >> > > > > >> > introducing a abnormal node detector, as Becket
> > suggested,
> > > >> which
> > > >> > > > just
> > > >> > > > > >> > uses the blocklist mechanism once bad nodes are
> detected.
> > > >> > However,
> > > >> > > > it
> > > >> > > > > >> > should be a separate FLIP with further dev discussions
> > and
> > > >> > > feedback
> > > >> > > > > >> > from users. I also agree with Becket that different
> users
> > > >> have
> > > >> > > > > different
> > > >> > > > > >> > requirements, and we should listen to them.
> > > >> > > > > >> >
> > > >> > > > > >> > 2. Is it enough to just take away abnormal nodes
> > externally
> > > >> > > > > >> > My answer is no. As Lijie has mentioned, we need a way
> to
> > > >> avoid
> > > >> > > > > >> > deploying tasks to temporary hot nodes. In this case,
> > users
> > > >> may
> > > >> > > just
> > > >> > > > > >> > want to limit the load of the node and do not want to
> > kill
> > > >> all
> > > >> > the
> > > >> > > > > >> > processes on it. Another case is the speculative
> > > execution[1]
> > > >> > > which
> > > >> > > > > >> > may also leverage this feature to avoid starting mirror
> > > >> tasks on
> > > >> > > > slow
> > > >> > > > > >> > nodes.
> > > >> > > > > >> >
> > > >> > > > > >> > Thanks,
> > > >> > > > > >> > Zhu
> > > >> > > > > >> >
> > > >> > > > > >> > [1]
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > >> > > > > >> >
> > > >> > > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> > > 15:56写道：
> > > >> > > > > >> >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Hi everyone,
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Thanks for your feedback.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > There's one detail that I'd like to re-emphasize here
> > > >> because
> > > >> > it
> > > >> > > > can
> > > >> > > > > >> > affect the value and design of the blocklist mechanism
> > > >> (perhaps
> > > >> > I
> > > >> > > > > should
> > > >> > > > > >> > highlight it in the FLIP). We propose two actions in
> > FLIP:
> > > >> > > > > >> > >
> > > >> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or
> node
> > > as
> > > >> > > > blocked.
> > > >> > > > > >> > Future slots should not be allocated from the blocked
> > task
> > > >> > manager
> > > >> > > > or
> > > >> > > > > >> node.
> > > >> > > > > >> > But slots that are already allocated will not be
> > affected.
> > > A
> > > >> > > typical
> > > >> > > > > >> > application scenario is to mitigate machine hotspots.
> In
> > > this
> > > >> > > case,
> > > >> > > > we
> > > >> > > > > >> hope
> > > >> > > > > >> > that subsequent resource allocations will not be on the
> > hot
> > > >> > > machine,
> > > >> > > > > but
> > > >> > > > > >> > tasks currently running on it should not be affected.
> > > >> > > > > >> > >
> > > >> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
> > > >> manager
> > > >> > or
> > > >> > > > > node
> > > >> > > > > >> as
> > > >> > > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks
> > will
> > > >> be
> > > >> > > > > >> restarted on
> > > >> > > > > >> > non-blocked task managers.
> > > >> > > > > >> > >
> > > >> > > > > >> > > For the above 2 actions, the former may more
> highlight
> > > the
> > > >> > > meaning
> > > >> > > > > of
> > > >> > > > > >> > this FLIP, because the external system cannot do that.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Regarding *Manually* and *Automatically*, I basically
> > > agree
> > > >> > with
> > > >> > > > > >> @Becket
> > > >> > > > > >> > Qin: different users have different answers. Not all
> > users’
> > > >> > > > deployment
> > > >> > > > > >> > environments have a special external system that can
> > > perform
> > > >> the
> > > >> > > > > anomaly
> > > >> > > > > >> > detection. In addition, adding pluggable/optional
> > > >> auto-detection
> > > >> > > > > doesn't
> > > >> > > > > >> > require much extra work on top of manual specification.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > I will answer your other questions one by one.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > @Yangze
> > > >> > > > > >> > >
> > > >> > > > > >> > > a) I think you are right, we do not need to expose
> the
> > > >> > > > > >> >
> `cluster.resource-blocklist.item.timeout-check-interval`
> > to
> > > >> > users.
> > > >> > > > > >> > >
> > > >> > > > > >> > > b) We can abstract the `notifyException` to a
> separate
> > > >> > interface
> > > >> > > > > >> (maybe
> > > >> > > > > >> > BlocklistExceptionListener), and the
> > > >> > > ResourceManagerBlocklistHandler
> > > >> > > > > can
> > > >> > > > > >> > implement it in the future.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > @Martijn
> > > >> > > > > >> > >
> > > >> > > > > >> > > a) I also think the manual blocking should be done by
> > > >> cluster
> > > >> > > > > >> operators.
> > > >> > > > > >> > >
> > > >> > > > > >> > > b) I think manual blocking makes sense, because
> > according
> > > >> to
> > > >> > my
> > > >> > > > > >> > experience, users are often the first to perceive the
> > > machine
> > > >> > > > problems
> > > >> > > > > >> > (because of job failover or delay), and they will
> contact
> > > >> > cluster
> > > >> > > > > >> operators
> > > >> > > > > >> > to solve it, or even tell the cluster operators which
> > > >> machine is
> > > >> > > > > >> > problematic. From this point of view, I think the
> people
> > > who
> > > >> > > really
> > > >> > > > > need
> > > >> > > > > >> > the manual blocking are the users, and it’s just
> > performed
> > > by
> > > >> > the
> > > >> > > > > >> cluster
> > > >> > > > > >> > operator, so I think the manual blocking makes sense.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > @Chesnay
> > > >> > > > > >> > >
> > > >> > > > > >> > > We need to touch the logic of JM/SlotPool, because
> for
> > > >> > > > > >> MARK_BLOCKLISTED
> > > >> > > > > >> > , we need to know whether the slot is blocklisted when
> > the
> > > >> task
> > > >> > is
> > > >> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> > release
> > > >> the
> > > >> > > slot
> > > >> > > > > >> > directly to avoid assigning other tasks (of this job)
> on
> > > it.
> > > >> If
> > > >> > we
> > > >> > > > > only
> > > >> > > > > >> > maintain the blocklist information on the RM, JM needs
> to
> > > >> > retrieve
> > > >> > > > it
> > > >> > > > > by
> > > >> > > > > >> > RPC. I think the performance overhead of that is
> > relatively
> > > >> > large,
> > > >> > > > so
> > > >> > > > > I
> > > >> > > > > >> > think it's worth maintaining the blocklist information
> on
> > > >> the JM
> > > >> > > > side
> > > >> > > > > >> and
> > > >> > > > > >> > syncing them.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > @Роман
> > > >> > > > > >> > >
> > > >> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap
> > might
> > > >> be
> > > >> > > > helpful
> > > >> > > > > >> > here.”  Can you explain it in detail? I don't fully
> > > >> understand
> > > >> > > that.
> > > >> > > > > In
> > > >> > > > > >> my
> > > >> > > > > >> > opinion, non-active and active are the same, and no
> > special
> > > >> > > > treatment
> > > >> > > > > is
> > > >> > > > > >> > required.
> > > >> > > > > >> > >
> > > >> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense,
> I
> > > will
> > > >> > add
> > > >> > > it
> > > >> > > > > to
> > > >> > > > > >> > FLIP.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > @Yang
> > > >> > > > > >> > >
> > > >> > > > > >> > > As mentioned above, AFAK, the external system cannot
> > > >> support
> > > >> > the
> > > >> > > > > >> > MARK_BLOCKLISTED action.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Looking forward to your further feedback.
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Best,
> > > >> > > > > >> > >
> > > >> > > > > >> > > Lijie
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二
> > 21:09写道：
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> I want to share some thoughts about Flink cluster
> > > >> operations.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> In the production environment, the SRE(aka Site
> > > >> Reliability
> > > >> > > > > Engineer)
> > > >> > > > > >> > >> already has many tools to detect the unstable nodes,
> > > which
> > > >> > > could
> > > >> > > > > take
> > > >> > > > > >> > the
> > > >> > > > > >> > >> system logs/metrics into consideration.
> > > >> > > > > >> > >> Then they use graceful-decomission in YARN and taint
> > in
> > > >> K8s
> > > >> > to
> > > >> > > > > >> prevent
> > > >> > > > > >> > new
> > > >> > > > > >> > >> allocations on these unstable nodes.
> > > >> > > > > >> > >> At last, they will evict all the containers and pods
> > > >> running
> > > >> > on
> > > >> > > > > these
> > > >> > > > > >> > nodes.
> > > >> > > > > >> > >> This mechanism also works for planned maintenance.
> So
> > I
> > > am
> > > >> > > afraid
> > > >> > > > > >> this
> > > >> > > > > >> > is
> > > >> > > > > >> > >> not the typical use case for FLIP-224.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> If we only support to block nodes manually, then I
> > could
> > > >> not
> > > >> > > see
> > > >> > > > > >> > >> the obvious advantages compared with current SRE's
> > > >> > approach(via
> > > >> > > > > *yarn
> > > >> > > > > >> > >> rmadmin or kubectl taint*).
> > > >> > > > > >> > >> At least, we need to have a pluggable component
> which
> > > >> could
> > > >> > > > expose
> > > >> > > > > >> the
> > > >> > > > > >> > >> potential unstable nodes automatically and block
> them
> > if
> > > >> > > enabled
> > > >> > > > > >> > explicitly.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Best,
> > > >> > > > > >> > >> Yang
> > > >> > > > > >> > >>
> > > >> > > > > >> > >>
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> > 16:36写道：
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> > Thanks for the proposal, Lijie.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > This is an interesting feature and discussion, and
> > > >> somewhat
> > > >> > > > > related
> > > >> > > > > >> > to the
> > > >> > > > > >> > >> > design principle about how people should operate
> > > Flink.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > I think there are three things involved in this
> > FLIP.
> > > >> > > > > >> > >> >      a) Detect and report the unstable node.
> > > >> > > > > >> > >> >      b) Collect the information of the unstable
> node
> > > and
> > > >> > > form a
> > > >> > > > > >> > blocklist.
> > > >> > > > > >> > >> >      c) Take the action to block nodes.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > My two cents:
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > 1. It looks like people all agree that Flink
> should
> > > have
> > > >> > c).
> > > >> > > It
> > > >> > > > > is
> > > >> > > > > >> > not only
> > > >> > > > > >> > >> > useful for cases of node failures, but also handy
> > for
> > > >> some
> > > >> > > > > planned
> > > >> > > > > >> > >> > maintenance.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > 2. People have different opinions on b), i.e. who
> > > >> should be
> > > >> > > the
> > > >> > > > > >> brain
> > > >> > > > > >> > to
> > > >> > > > > >> > >> > make the decision to block a node. I think this
> > > largely
> > > >> > > depends
> > > >> > > > > on
> > > >> > > > > >> > who we
> > > >> > > > > >> > >> > talk to. Different users would probably give
> > different
> > > >> > > answers.
> > > >> > > > > For
> > > >> > > > > >> > people
> > > >> > > > > >> > >> > who do have a centralized node health management
> > > >> service,
> > > >> > let
> > > >> > > > > Flink
> > > >> > > > > >> > do just
> > > >> > > > > >> > >> > do a) and c) would be preferred. So essentially
> > Flink
> > > >> would
> > > >> > > be
> > > >> > > > > one
> > > >> > > > > >> of
> > > >> > > > > >> > the
> > > >> > > > > >> > >> > sources that may detect unstable nodes, report it
> to
> > > >> that
> > > >> > > > > service,
> > > >> > > > > >> > and then
> > > >> > > > > >> > >> > take the command from that service to block the
> > > >> problematic
> > > >> > > > > nodes.
> > > >> > > > > >> On
> > > >> > > > > >> > the
> > > >> > > > > >> > >> > other hand, for users who do not have such a
> > service,
> > > >> > simply
> > > >> > > > > >> letting
> > > >> > > > > >> > Flink
> > > >> > > > > >> > >> > be clever by itself to block the suspicious nodes
> > > might
> > > >> be
> > > >> > > > > desired
> > > >> > > > > >> to
> > > >> > > > > >> > >> > ensure the jobs are running smoothly.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > So that indicates a) and b) here should be
> > pluggable /
> > > >> > > > optional.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > In light of this, maybe it would make sense to
> have
> > > >> > something
> > > >> > > > > >> > pluggable
> > > >> > > > > >> > >> > like a UnstableNodeReporter which exposes unstable
> > > nodes
> > > >> > > > > actively.
> > > >> > > > > >> (A
> > > >> > > > > >> > more
> > > >> > > > > >> > >> > general interface should be JobInfoReporter<T>
> which
> > > >> can be
> > > >> > > > used
> > > >> > > > > to
> > > >> > > > > >> > report
> > > >> > > > > >> > >> > any information of type <T>. But I'll just keep
> the
> > > >> scope
> > > >> > > > > relevant
> > > >> > > > > >> to
> > > >> > > > > >> > this
> > > >> > > > > >> > >> > FLIP here). Personally speaking, I think it is OK
> to
> > > >> have a
> > > >> > > > > default
> > > >> > > > > >> > >> > implementation of a reporter which just tells
> Flink
> > to
> > > >> take
> > > >> > > > > action
> > > >> > > > > >> to
> > > >> > > > > >> > block
> > > >> > > > > >> > >> > problematic nodes and also unblocks them after
> > > timeout.
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > Thanks,
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > Jiangjie (Becket) Qin
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > > >> > > > ro.v.boyko@gmail.com
> > > >> > > > > >
> > > >> > > > > >> > wrote:
> > > >> > > > > >> > >> >
> > > >> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > > If it's possible I'd like to participate in
> > > >> development.
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply -
> we
> > > >> should
> > > >> > > > > consider
> > > >> > > > > >> > to move
> > > >> > > > > >> > >> > > somehow the information of blocklisted nodes/TMs
> > > from
> > > >> > > active
> > > >> > > > > >> > >> > > ResourceManager to non-active ones. Probably
> > storing
> > > >> > inside
> > > >> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > > And I agree with Martijn that a lot of
> > organizations
> > > >> > don't
> > > >> > > > want
> > > >> > > > > >> to
> > > >> > > > > >> > expose
> > > >> > > > > >> > >> > > such API for a cluster user group. But I think
> > it's
> > > >> > > necessary
> > > >> > > > > to
> > > >> > > > > >> > have the
> > > >> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway
> for
> > > >> > avoiding
> > > >> > > > > >> incorrect
> > > >> > > > > >> > >> > > automatic behaviour.
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > > And another one small suggestion - I think it
> > would
> > > be
> > > >> > > better
> > > >> > > > > to
> > > >> > > > > >> > extend
> > > >> > > > > >> > >> > the
> > > >> > > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp*
> > > field
> > > >> and
> > > >> > > > fill
> > > >> > > > > it
> > > >> > > > > >> > at the
> > > >> > > > > >> > >> > > item creation. This simple addition will allow
> to:
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > >    -
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > >    Provide the ability to users to setup the
> exact
> > > >> time
> > > >> > of
> > > >> > > > > >> > blocklist end
> > > >> > > > > >> > >> > >    through RestAPI
> > > >> > > > > >> > >> > >    -
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > >    Not being tied to a single value of
> > > >> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > > >> > > > > >> chesnay@apache.org>
> > > >> > > > > >> > >> > wrote:
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> > > > I do share the concern between blurring the
> > lines
> > > a
> > > >> > bit.
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > > That said, I'd prefer to not have any
> > > auto-detection
> > > >> > and
> > > >> > > > only
> > > >> > > > > >> > have an
> > > >> > > > > >> > >> > > > opt-in mechanism
> > > >> > > > > >> > >> > > > to manually block processes/nodes. To me this
> > > sounds
> > > >> > yet
> > > >> > > > > again
> > > >> > > > > >> > like one
> > > >> > > > > >> > >> > > > of those
> > > >> > > > > >> > >> > > > magical mechanisms that will rarely work just
> > > right.
> > > >> > > > > >> > >> > > > An external system can leverage way more
> > > information
> > > >> > > after
> > > >> > > > > all.
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > > Moreover, I'm quite concerned about the
> > complexity
> > > >> of
> > > >> > > this
> > > >> > > > > >> > proposal.
> > > >> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing
> between
> > > >> > > > components;
> > > >> > > > > >> > >> > adjustments
> > > >> > > > > >> > >> > > > to the
> > > >> > > > > >> > >> > > > slot and resource protocol.
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > > In a way it seems overly complicated.
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > > If we look at it purely from an active
> resource
> > > >> > > management
> > > >> > > > > >> > perspective,
> > > >> > > > > >> > >> > > > then there
> > > >> > > > > >> > >> > > > isn't really a need to touch the slot protocol
> > at
> > > >> all
> > > >> > (or
> > > >> > > > in
> > > >> > > > > >> fact
> > > >> > > > > >> > to
> > > >> > > > > >> > >> > > > anything in the JobMaster),
> > > >> > > > > >> > >> > > > because there isn't any point in keeping
> around
> > > >> blocked
> > > >> > > TMs
> > > >> > > > > in
> > > >> > > > > >> the
> > > >> > > > > >> > >> > first
> > > >> > > > > >> > >> > > > place.
> > > >> > > > > >> > >> > > > They'd just be idling, potentially shutting
> down
> > > >> after
> > > >> > a
> > > >> > > > > while
> > > >> > > > > >> by
> > > >> > > > > >> > the
> > > >> > > > > >> > >> > RM
> > > >> > > > > >> > >> > > > because of
> > > >> > > > > >> > >> > > > it (unless we _also_ touch that logic).
> > > >> > > > > >> > >> > > > Here the blocking of a process (be it by
> > blocking
> > > >> the
> > > >> > > > process
> > > >> > > > > >> or
> > > >> > > > > >> > node)
> > > >> > > > > >> > >> > is
> > > >> > > > > >> > >> > > > equivalent with shutting down the blocked
> > > >> process(es).
> > > >> > > > > >> > >> > > > Once the block is lifted we can just spin it
> > back
> > > >> up.
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > > And I do wonder whether we couldn't apply the
> > same
> > > >> line
> > > >> > > of
> > > >> > > > > >> > thinking to
> > > >> > > > > >> > >> > > > standalone resource management.
> > > >> > > > > >> > >> > > > Here being able to stop/restart a process/node
> > > >> manually
> > > >> > > > > should
> > > >> > > > > >> be
> > > >> > > > > >> > a
> > > >> > > > > >> > >> > core
> > > >> > > > > >> > >> > > > requirement for a Flink deployment anyway.
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > >> > > > > >> > >> > > > > Hi everyone,
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > > Thanks for creating this FLIP. I can
> > understand
> > > >> the
> > > >> > > > problem
> > > >> > > > > >> and
> > > >> > > > > >> > I see
> > > >> > > > > >> > >> > > > value
> > > >> > > > > >> > >> > > > > in the automatic detection and
> blocklisting. I
> > > do
> > > >> > have
> > > >> > > > some
> > > >> > > > > >> > concerns
> > > >> > > > > >> > >> > > with
> > > >> > > > > >> > >> > > > > the ability to manually specify to be
> blocked
> > > >> > > resources.
> > > >> > > > I
> > > >> > > > > >> have
> > > >> > > > > >> > two
> > > >> > > > > >> > >> > > > > concerns;
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > > * Most organizations explicitly have a
> > > separation
> > > >> of
> > > >> > > > > >> concerns,
> > > >> > > > > >> > >> > meaning
> > > >> > > > > >> > >> > > > that
> > > >> > > > > >> > >> > > > > there's a group who's responsible for
> > managing a
> > > >> > > cluster
> > > >> > > > > and
> > > >> > > > > >> > there's
> > > >> > > > > >> > >> > a
> > > >> > > > > >> > >> > > > user
> > > >> > > > > >> > >> > > > > group who uses that cluster. With the
> > > >> introduction of
> > > >> > > > this
> > > >> > > > > >> > mechanism,
> > > >> > > > > >> > >> > > the
> > > >> > > > > >> > >> > > > > latter group now can influence the
> > > responsibility
> > > >> of
> > > >> > > the
> > > >> > > > > >> first
> > > >> > > > > >> > group.
> > > >> > > > > >> > >> > > So
> > > >> > > > > >> > >> > > > it
> > > >> > > > > >> > >> > > > > can be possible that someone from the user
> > group
> > > >> > blocks
> > > >> > > > > >> > something,
> > > >> > > > > >> > >> > > which
> > > >> > > > > >> > >> > > > > causes an outage (which could result in
> paging
> > > >> > > mechanism
> > > >> > > > > >> > triggering
> > > >> > > > > >> > >> > > etc)
> > > >> > > > > >> > >> > > > > which impacts the first group.
> > > >> > > > > >> > >> > > > > * How big is the group of people who can go
> > > >> through
> > > >> > the
> > > >> > > > > >> process
> > > >> > > > > >> > of
> > > >> > > > > >> > >> > > > manually
> > > >> > > > > >> > >> > > > > identifying a node that isn't behaving as it
> > > >> should
> > > >> > > be? I
> > > >> > > > > do
> > > >> > > > > >> > think
> > > >> > > > > >> > >> > this
> > > >> > > > > >> > >> > > > > group is relatively limited. Does it then
> make
> > > >> sense
> > > >> > to
> > > >> > > > > >> > introduce
> > > >> > > > > >> > >> > such
> > > >> > > > > >> > >> > > a
> > > >> > > > > >> > >> > > > > feature, which would only be used by a
> really
> > > >> small
> > > >> > > user
> > > >> > > > > >> group
> > > >> > > > > >> > of
> > > >> > > > > >> > >> > > Flink?
> > > >> > > > > >> > >> > > > We
> > > >> > > > > >> > >> > > > > still have to maintain, test and support
> such
> > a
> > > >> > > feature.
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > > I'm +1 for the autodetection features, but
> I'm
> > > >> > leaning
> > > >> > > > > >> towards
> > > >> > > > > >> > not
> > > >> > > > > >> > >> > > > exposing
> > > >> > > > > >> > >> > > > > this to the user group but having this
> > available
> > > >> > > strictly
> > > >> > > > > for
> > > >> > > > > >> > cluster
> > > >> > > > > >> > >> > > > > operators. They could then also set up their
> > > >> > > > > >> > paging/metrics/logging
> > > >> > > > > >> > >> > > > system
> > > >> > > > > >> > >> > > > > to take this into account.
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > > Best regards,
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > > Martijn Visser
> > > >> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > > >> > > > > >> > >> > > > > https://github.com/MartijnVisser
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > >> > > > > karmagyz@gmail.com
> > > >> > > > > >> >
> > > >> > > > > >> > wrote:
> > > >> > > > > >> > >> > > > >
> > > >> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > > >> +1 for the overall proposal. Just share
> some
> > > >> cents
> > > >> > > here:
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > > >> - Why do we need to expose
> > > >> > > > > >> > >> > > > >>
> > > >> > cluster.resource-blacklist.item.timeout-check-interval
> > > >> > > > to
> > > >> > > > > >> the
> > > >> > > > > >> > user?
> > > >> > > > > >> > >> > > > >> I think the semantics of
> > > >> > > > > >> > `cluster.resource-blacklist.item.timeout`
> > > >> > > > > >> > >> > is
> > > >> > > > > >> > >> > > > >> sufficient for the user. How to guarantee
> the
> > > >> > timeout
> > > >> > > > > >> > mechanism is
> > > >> > > > > >> > >> > > > >> Flink's internal implementation. I think it
> > > will
> > > >> be
> > > >> > > very
> > > >> > > > > >> > confusing
> > > >> > > > > >> > >> > and
> > > >> > > > > >> > >> > > > >> we do not need to expose it to users.
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > > >> - ResourceManager can notify the exception
> > of a
> > > >> task
> > > >> > > > > >> manager to
> > > >> > > > > >> > >> > > > >> `BlacklistHandler` as well.
> > > >> > > > > >> > >> > > > >> For example, the slot allocation might fail
> > in
> > > >> case
> > > >> > > the
> > > >> > > > > >> target
> > > >> > > > > >> > task
> > > >> > > > > >> > >> > > > >> manager is busy or has a network jitter. I
> > > don't
> > > >> > mean
> > > >> > > we
> > > >> > > > > >> need
> > > >> > > > > >> > to
> > > >> > > > > >> > >> > cover
> > > >> > > > > >> > >> > > > >> this case in this version, but we can also
> > > open a
> > > >> > > > > >> > `notifyException`
> > > >> > > > > >> > >> > in
> > > >> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > > >> - Before we sync the blocklist to
> > > >> ResourceManager,
> > > >> > > will
> > > >> > > > > the
> > > >> > > > > >> > slot of
> > > >> > > > > >> > >> > a
> > > >> > > > > >> > >> > > > >> blocked task manager continues to be
> released
> > > and
> > > >> > > > > allocated?
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > > >> Best,
> > > >> > > > > >> > >> > > > >> Yangze Guo
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang
> <
> > > >> > > > > >> > >> > wangdachui9501@gmail.com>
> > > >> > > > > >> > >> > > > >> wrote:
> > > >> > > > > >> > >> > > > >>> Hi Konstantin,
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> Thanks for your feedback. I will response
> > > your 4
> > > >> > > > remarks:
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the
> > > controversy. I
> > > >> > > think
> > > >> > > > > >> > “BlockList”
> > > >> > > > > >> > >> > is
> > > >> > > > > >> > >> > > > >> good
> > > >> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a
> > good
> > > >> idea.
> > > >> > > > Based
> > > >> > > > > >> on
> > > >> > > > > >> > the
> > > >> > > > > >> > >> > > > above, I
> > > >> > > > > >> > >> > > > >>> would change REST API as following:
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> DELETE
> > > <host>/blocklist/taskmanager/<identifier>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it
> > means
> > > >> that
> > > >> > > all
> > > >> > > > > >> task
> > > >> > > > > >> > >> > managers
> > > >> > > > > >> > >> > > > on
> > > >> > > > > >> > >> > > > >>> this node are blocklisted. All slots on
> > these
> > > >> TMs
> > > >> > are
> > > >> > > > not
> > > >> > > > > >> > >> > available.
> > > >> > > > > >> > >> > > > This
> > > >> > > > > >> > >> > > > >>> is actually a bit like TM losts, but these
> > TMs
> > > >> are
> > > >> > > not
> > > >> > > > > >> really
> > > >> > > > > >> > lost,
> > > >> > > > > >> > >> > > > they
> > > >> > > > > >> > >> > > > >>> are in an unavailable status, and they are
> > > still
> > > >> > > > > registered
> > > >> > > > > >> > in this
> > > >> > > > > >> > >> > > > flink
> > > >> > > > > >> > >> > > > >>> cluster. They will be available again once
> > the
> > > >> > > > > >> corresponding
> > > >> > > > > >> > >> > > blocklist
> > > >> > > > > >> > >> > > > >> item
> > > >> > > > > >> > >> > > > >>> is removed. This behavior is the same in
> > > >> > > > > active/non-active
> > > >> > > > > >> > >> > clusters.
> > > >> > > > > >> > >> > > > >>> However in the active clusters, these TMs
> > may
> > > be
> > > >> > > > released
> > > >> > > > > >> due
> > > >> > > > > >> > to
> > > >> > > > > >> > >> > idle
> > > >> > > > > >> > >> > > > >>> timeouts.
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep
> > it.
> > > >> The
> > > >> > > > reasons
> > > >> > > > > >> are
> > > >> > > > > >> > as
> > > >> > > > > >> > >> > > > >> following:
> > > >> > > > > >> > >> > > > >>> a) The timeout will not affect users
> adding
> > or
> > > >> > > removing
> > > >> > > > > >> items
> > > >> > > > > >> > via
> > > >> > > > > >> > >> > > REST
> > > >> > > > > >> > >> > > > >> API,
> > > >> > > > > >> > >> > > > >>> and users can disable it by configuring it
> > to
> > > >> > > > > >> Long.MAX_VALUE .
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> b) Some node problems can recover after a
> > > >> period of
> > > >> > > > time
> > > >> > > > > >> > (such as
> > > >> > > > > >> > >> > > > machine
> > > >> > > > > >> > >> > > > >>> hotspots), in which case users may prefer
> > that
> > > >> > Flink
> > > >> > > > can
> > > >> > > > > do
> > > >> > > > > >> > this
> > > >> > > > > >> > >> > > > >>> automatically instead of requiring the
> user
> > to
> > > >> do
> > > >> > it
> > > >> > > > > >> manually.
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> Best,
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> Lijie
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org>
> > > >> 于2022年4月27日周三
> > > >> > > > > >> 19:23写道：
> > > >> > > > > >> > >> > > > >>>
> > > >> > > > > >> > >> > > > >>>> Hi Lijie,
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> I think, this makes sense and +1 to only
> > > >> support
> > > >> > > > > manually
> > > >> > > > > >> > blocking
> > > >> > > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the
> different
> > > >> > > strategies
> > > >> > > > > can
> > > >> > > > > >> > also be
> > > >> > > > > >> > >> > > > >>>> maintained outside of Apache Flink.
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> A few remarks:
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> 1) Can we use another term than
> > "bla.cklist"
> > > >> due
> > > >> > to
> > > >> > > > the
> > > >> > > > > >> > >> > controversy
> > > >> > > > > >> > >> > > > >> around
> > > >> > > > > >> > >> > > > >>>> the term? [1] There was also a Jira
> Ticket
> > > >> about
> > > >> > > this
> > > >> > > > > >> topic a
> > > >> > > > > >> > >> > while
> > > >> > > > > >> > >> > > > >> back
> > > >> > > > > >> > >> > > > >>>> and there was generally a consensus to
> > avoid
> > > >> the
> > > >> > > term
> > > >> > > > > >> > blacklist &
> > > >> > > > > >> > >> > > > >> whitelist
> > > >> > > > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist"
> or
> > > >> > > > > "quarantined"
> > > >> > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a
> slightly
> > > >> > different
> > > >> > > > > >> design
> > > >> > > > > >> > as
> > > >> > > > > >> > >> > verbs
> > > >> > > > > >> > >> > > > >> like
> > > >> > > > > >> > >> > > > >>>> add/remove often considered an
> anti-pattern
> > > for
> > > >> > REST
> > > >> > > > > APIs.
> > > >> > > > > >> > POST
> > > >> > > > > >> > >> > on a
> > > >> > > > > >> > >> > > > >> list
> > > >> > > > > >> > >> > > > >>>> item is generally the standard to add
> > items.
> > > >> > DELETE
> > > >> > > on
> > > >> > > > > the
> > > >> > > > > >> > >> > > individual
> > > >> > > > > >> > >> > > > >>>> resource is standard to remove an item.
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
> > > >> > > > > >> > >> > > > >>>> DELETE
> > > <host>/quarantine/items/<itemidentifier>
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> We could also consider to separate
> > > taskmanagers
> > > >> > and
> > > >> > > > > nodes
> > > >> > > > > >> in
> > > >> > > > > >> > the
> > > >> > > > > >> > >> > > REST
> > > >> > > > > >> > >> > > > >> API
> > > >> > > > > >> > >> > > > >>>> (and internal data structures). Any
> opinion
> > > on
> > > >> > this?
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > > >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > >> > > > > >> > >> > > > >>>> DELETE
> <host>/quarantine/nodes/<identifier>
> > > >> > > > > >> > >> > > > >>>> DELETE
> > > >> <host>/quarantine/taskmanager/<identifier>
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with
> > > >> non-active
> > > >> > > > > >> resource
> > > >> > > > > >> > >> > > managers,
> > > >> > > > > >> > >> > > > >> i.e.
> > > >> > > > > >> > >> > > > >>>> standalone or reactive mode?
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> 4) To keep the implementation even more
> > > >> minimal,
> > > >> > do
> > > >> > > we
> > > >> > > > > >> need
> > > >> > > > > >> > the
> > > >> > > > > >> > >> > > > timeout
> > > >> > > > > >> > >> > > > >>>> behavior? If items are added/removed
> > manually
> > > >> we
> > > >> > > could
> > > >> > > > > >> > delegate
> > > >> > > > > >> > >> > this
> > > >> > > > > >> > >> > > > >> to the
> > > >> > > > > >> > >> > > > >>>> user easily. In my opinion the timeout
> > > behavior
> > > >> > > would
> > > >> > > > > >> better
> > > >> > > > > >> > fit
> > > >> > > > > >> > >> > > into
> > > >> > > > > >> > >> > > > >>>> specific strategies at a later point.
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> Cheers and thank you,
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> Konstantin
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> [1]
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > >> > > > > >> > >> > > > >>>> [2]
> > > >> > > https://issues.apache.org/jira/browse/FLINK-18209
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr
> schrieb
> > > >> Lijie
> > > >> > > Wang
> > > >> > > > <
> > > >> > > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > > >> > > > > >> > >> > > > >>>>
> > > >> > > > > >> > >> > > > >>>>> Hi all,
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> Flink job failures may happen due to
> > cluster
> > > >> node
> > > >> > > > > issues
> > > >> > > > > >> > >> > > > >> (insufficient
> > > >> > > > > >> > >> > > > >>>> disk
> > > >> > > > > >> > >> > > > >>>>> space, bad hardware, network
> > abnormalities).
> > > >> > Flink
> > > >> > > > will
> > > >> > > > > >> > take care
> > > >> > > > > >> > >> > > of
> > > >> > > > > >> > >> > > > >> the
> > > >> > > > > >> > >> > > > >>>>> failures and redeploy the tasks.
> However,
> > > due
> > > >> to
> > > >> > > data
> > > >> > > > > >> > locality
> > > >> > > > > >> > >> > and
> > > >> > > > > >> > >> > > > >>>> limited
> > > >> > > > > >> > >> > > > >>>>> resources, the new tasks are very likely
> > to
> > > be
> > > >> > > > > redeployed
> > > >> > > > > >> > to the
> > > >> > > > > >> > >> > > same
> > > >> > > > > >> > >> > > > >>>>> nodes, which will result in continuous
> > task
> > > >> > > > > abnormalities
> > > >> > > > > >> > and
> > > >> > > > > >> > >> > > affect
> > > >> > > > > >> > >> > > > >> job
> > > >> > > > > >> > >> > > > >>>>> progress.
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> Currently, Flink users need to manually
> > > >> identify
> > > >> > > the
> > > >> > > > > >> > problematic
> > > >> > > > > >> > >> > > > >> node and
> > > >> > > > > >> > >> > > > >>>>> take it offline to solve this problem.
> But
> > > >> this
> > > >> > > > > approach
> > > >> > > > > >> has
> > > >> > > > > >> > >> > > > >> following
> > > >> > > > > >> > >> > > > >>>>> disadvantages:
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy
> > > >> process.
> > > >> > > > Users
> > > >> > > > > >> may
> > > >> > > > > >> > need
> > > >> > > > > >> > >> > to
> > > >> > > > > >> > >> > > > >>>> contact
> > > >> > > > > >> > >> > > > >>>>> cluster administors to do this. The
> > > operation
> > > >> can
> > > >> > > > even
> > > >> > > > > be
> > > >> > > > > >> > >> > dangerous
> > > >> > > > > >> > >> > > > >> and
> > > >> > > > > >> > >> > > > >>>> not
> > > >> > > > > >> > >> > > > >>>>> allowed during some important business
> > > events.
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of
> > > >> problems
> > > >> > > > > manually
> > > >> > > > > >> > would
> > > >> > > > > >> > >> > be
> > > >> > > > > >> > >> > > > >> slow
> > > >> > > > > >> > >> > > > >>>> and
> > > >> > > > > >> > >> > > > >>>>> a waste of human resources.
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I
> > propose
> > > >> to
> > > >> > > > > >> introduce a
> > > >> > > > > >> > >> > > blacklist
> > > >> > > > > >> > >> > > > >>>>> mechanism for Flink to filter out
> > > problematic
> > > >> > > > > resources.
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> You can find more details in
> FLIP-224[1].
> > > >> Looking
> > > >> > > > > forward
> > > >> > > > > >> > to your
> > > >> > > > > >> > >> > > > >>>> feedback.
> > > >> > > > > >> > >> > > > >>>>> [1]
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> Best,
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > > >>>>> Lijie
> > > >> > > > > >> > >> > > > >>>>>
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > > >
> > > >> > > > > >> > >> > >
> > > >> > > > > >> > >> >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Best regards,
> > > >> > > Roman Boyko
> > > >> > > e.: ro.v.boyko@gmail.com
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
> >
> > --
> > https://twitter.com/snntrable
> > https://github.com/knaufk
> >
>


-- 
https://twitter.com/snntrable
https://github.com/knaufk

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi Konstantin,
thanks for your feedback.

From what I understand, PUT should be idempotent. However, we have a
*timeout* field in the request. This means that initiating the same request
at two different times will lead to different resource status (timestamps
of the items to be removed will be different).

Should we use PUT in this case? WDYT?

Best,
Lijie

Konstantin Knauf <kn...@apache.org> 于2022年5月13日周五 17:20写道：

> Hi Lijie,
>
> wouldn't the REST API-idiomatic way for an update/replace be a PUT on the
> resource?
>
> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>
> Best,
>
> Konstantin
>
>
>
> Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
> wangdachui9501@gmail.com>:
>
> > Hi everyone,
> >
> > I've had an offline discussion with Becket Qin and Zhu Zhu, and made the
> > following changes on REST API:
> > 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose one.
> If
> > both are specified, will return error.
> > 2.  If the specified item is already there, the *ADD* operation has two
> > behaviors:  *return error*(default value) or *merge/update*, and we add a
> > flag to the request body to control it. You can find more details "Public
> > Interface" section.
> >
> > If there is no more feedback, we will start the vote thread next week.
> >
> > Best,
> > Lijie
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
> >
> > > Hi Becket Qin,
> > >
> > > Thanks for your suggestions.  I have moved the description of
> > > configurations, metrics and REST API into "Public Interface" section,
> and
> > > made a few updates according to your suggestion.  And in this FLIP,
> there
> > > no public java Interfaces or pluggables that users need to implement by
> > > themselves.
> > >
> > > Answers for you questions:
> > > 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > > MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block items
> can
> > > only be added through the REST API, so these 2 action are mentioned in
> > the
> > > REST API part (The REST API part has beed moved to public interface
> now).
> > > 2. I agree with you. I have changed the "Cause" field to String, and
> > allow
> > > users to specify it via REST API.
> > > 3. Yes, it is useful to allow different timeouts. As mentioned above,
> we
> > > will introduce 2 fields : *timeout* and *endTimestamp* into the ADD
> REST
> > > API to specify when to remove the blocked item. These 2 fields are
> > > optional, if neither is specified, it means that the blocked item is
> > > permanent and will not be removed. If both are specified, the minimum
> of
> > > *currentTimestamp+tiemout *and* endTimestamp* will be used as the time
> to
> > > remove the blocked item. To keep the configurations more minimal, we
> have
> > > removed the *cluster.resource-blocklist.item.timeout* configuration
> > > option.
> > > 4. Yes, the block item will be overridden if the specified item already
> > > exists. The ADD operation is *ADD or UPDATE*.
> > > 5. Yes. On JM/RM side, all the blocklist information is maintained in
> > > JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> > abstracted
> > > to other interfaces) will be propagated to different components.
> > >
> > > Best,
> > > Lijie
> > >
> > > Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> > >
> > >> Hi Lijie,
> > >>
> > >> Thanks for updating the FLIP. It looks like the public interface
> section
> > >> did not fully reflect all the user sensible behavior and API. Can you
> > put
> > >> everything that users may be aware of there? That would include the
> REST
> > >> API, metrics, configurations, public java Interfaces or pluggables
> that
> > >> users may see or implement by themselves, as well as a brief summary
> of
> > >> the
> > >> behavior of the public API.
> > >>
> > >> Besides that, I have a few questions:
> > >>
> > >> 1. According to the conversation in the discussion thread, it looks
> like
> > >> the BlockAction will have "MARK_BLOCKLISTED" and
> > >> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can
> you
> > >> add
> > >> that to the public interface as well?
> > >>
> > >> 2. At this point, the "Cause" field in the BlockingItem is a Throwable
> > and
> > >> is not reflected in the REST API. Should that be included in the query
> > >> response? And should we change that field to be a String so users may
> > >> specify the cause via the REST API when they block some nodes / TMs?
> > >>
> > >> 3. Would it be useful to allow users to have different timeouts for
> > >> different blocked items? So while there is a default timeout, users
> can
> > >> also override it via the REST API when they block an entity.
> > >>
> > >> 4. Regarding the ADD operation, if the specified item is already
> there,
> > >> will the block item be overridden? For example, if the user wants to
> > >> extend
> > >> the timeout of a blocked item, can they just  issue an ADD command
> > again?
> > >>
> > >> 5. I am not quite familiar with the details of this, but is there a
> > source
> > >> of truth for the blocked list? I think it might be good to have a
> single
> > >> source of truth for the blocked list and just propagate that list to
> > >> different components to take the action of actually blocking the
> > resource.
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wa...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi everyone,
> > >> >
> > >> > Based on the discussion in the mailing list, I updated the FLIP doc,
> > the
> > >> > changes include:
> > >> > 1. Changed the description of the motivation section to more clearly
> > >> > describe the problem this FLIP is trying to solve.
> > >> > 2. Only  *Manually* is supported.
> > >> > 3. Adopted some suggestions, such as *endTimestamp*.
> > >> >
> > >> > Best,
> > >> > Lijie
> > >> >
> > >> >
> > >> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> > >> >
> > >> > > Hi Lijie!
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > *a) “Probably storing inside Zookeeper/Configmap might be
> > >> helpfulhere.”
> > >> > > Can you explain it in detail? I don't fully understand that. In
> > >> > myopinion,
> > >> > > non-active and active are the same, and no special treatment
> > >> isrequired.*
> > >> > >
> > >> > > Sorry this was a misunderstanding from my side. I thought we were
> > >> talking
> > >> > > about the HA mode (but not about Active and Standalone
> > >> ResourceManager).
> > >> > > And the original question was - how to handle the blacklisted
> nodes
> > >> list
> > >> > at
> > >> > > the moment of leader change? Should we simply forget about them or
> > >> try to
> > >> > > pre-save that list on the remote storage?
> > >> > >
> > >> > > On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com>
> > wrote:
> > >> > >
> > >> > > > Thanks Lijie and ZhuZhu for the explanation.
> > >> > > >
> > >> > > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is
> > >> indeed
> > >> > > > some functionalities the external tools(e.g. kubectl taint)
> could
> > >> not
> > >> > > > support.
> > >> > > >
> > >> > > >
> > >> > > > Best,
> > >> > > > Yang
> > >> > > >
> > >> > > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> > >> > > >
> > >> > > > > Thanks for your feedback, Jiangang and Martijn.
> > >> > > > >
> > >> > > > > @Jiangang
> > >> > > > >
> > >> > > > >
> > >> > > > > > For auto-detecting, I wonder how to make the strategy and
> > mark a
> > >> > node
> > >> > > > > blocked?
> > >> > > > >
> > >> > > > > In fact, we currently plan to not support auto-detection in
> this
> > >> > FLIP.
> > >> > > > The
> > >> > > > > part about auto-detection may be continued in a separate FLIP
> in
> > >> the
> > >> > > > > future. Some guys have the same concerns as you, and the
> > >> correctness
> > >> > > and
> > >> > > > > necessity of auto-detection may require further discussion in
> > the
> > >> > > future.
> > >> > > > >
> > >> > > > > > In session mode, multi jobs can fail on the same bad node
> and
> > >> the
> > >> > > node
> > >> > > > > should be marked blocked.
> > >> > > > > By design, the blocklist information will be shared among all
> > jobs
> > >> > in a
> > >> > > > > cluster/session. The JM will sync blocklist information with
> RM.
> > >> > > > >
> > >> > > > > @Martijn
> > >> > > > >
> > >> > > > > > I agree with Yang Wang on this.
> > >> > > > > As Zhu Zhu and I mentioned above, we think the
> > >> MARK_BLOCKLISTED(Just
> > >> > > > limits
> > >> > > > > the load of the node and does not  kill all the processes on
> it)
> > >> is
> > >> > > also
> > >> > > > > important, and we think that external systems (*yarn rmadmin
> or
> > >> > kubectl
> > >> > > > > taint*) cannot support it. So we think it makes sense even
> only
> > >> > > > *manually*.
> > >> > > > >
> > >> > > > > > I also agree with Chesnay that magical mechanisms are indeed
> > >> super
> > >> > > hard
> > >> > > > > to get right.
> > >> > > > > Yes, as you see, Jiangang(and a few others) have the same
> > concern.
> > >> > > > > However, we currently plan to not support auto-detection in
> this
> > >> > FLIP,
> > >> > > > and
> > >> > > > > only *manually*. In addition, I'd like to say that the FLIP
> > >> provides
> > >> > a
> > >> > > > > mechanism to support MARK_BLOCKLISTED and
> > >> > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > >> > > > > the auto-detection may be done by external systems.
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > Lijie
> > >> > > > >
> > >> > > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
> > >> > > > >
> > >> > > > > > > If we only support to block nodes manually, then I could
> not
> > >> see
> > >> > > > > > the obvious advantages compared with current SRE's
> > approach(via
> > >> > *yarn
> > >> > > > > > rmadmin or kubectl taint*).
> > >> > > > > >
> > >> > > > > > I agree with Yang Wang on this.
> > >> > > > > >
> > >> > > > > > >  To me this sounds yet again like one of those magical
> > >> mechanisms
> > >> > > > that
> > >> > > > > > will rarely work just right.
> > >> > > > > >
> > >> > > > > > I also agree with Chesnay that magical mechanisms are indeed
> > >> super
> > >> > > hard
> > >> > > > > to
> > >> > > > > > get right.
> > >> > > > > >
> > >> > > > > > Best regards,
> > >> > > > > >
> > >> > > > > > Martijn
> > >> > > > > >
> > >> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > >> > liujiangangpeng@gmail.com
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > >> Thanks for the valuable design. The auto-detecting can
> > decrease
> > >> > > great
> > >> > > > > work
> > >> > > > > >> for us. We have implemented the similar feature in our
> inner
> > >> flink
> > >> > > > > >> version.
> > >> > > > > >> Below is something that I care about:
> > >> > > > > >>
> > >> > > > > >>    1. For auto-detecting, I wonder how to make the strategy
> > and
> > >> > > mark a
> > >> > > > > >> node
> > >> > > > > >>    blocked? Sometimes the blocked node is hard to be
> > detected,
> > >> for
> > >> > > > > >> example,
> > >> > > > > >>    the upper node or the down node will be blocked when
> > network
> > >> > > > > >> unreachable.
> > >> > > > > >>    2. I see that the strategy is made in JobMaster side.
> How
> > >> about
> > >> > > > > >>    implementing the similar logic in resource manager? In
> > >> session
> > >> > > > mode,
> > >> > > > > >> multi
> > >> > > > > >>    jobs can fail on the same bad node and the node should
> be
> > >> > marked
> > >> > > > > >> blocked.
> > >> > > > > >>    If the job makes the strategy, the node may be not
> marked
> > >> > blocked
> > >> > > > if
> > >> > > > > >> the
> > >> > > > > >>    fail times don't exceed the threshold.
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > >> > > > > >>
> > >> > > > > >> > Thank you for all your feedback!
> > >> > > > > >> >
> > >> > > > > >> > Besides the answers from Lijie, I'd like to share some of
> > my
> > >> > > > thoughts:
> > >> > > > > >> > 1. Whether to enable automatical blocklist
> > >> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
> > >> > > > > >> > The automatical way should be something built upon the
> > >> blocklist
> > >> > > > > >> > mechanism and well decoupled. It was designed to be a
> > >> > configurable
> > >> > > > > >> > blocklist strategy, but I think we can further decouple
> it
> > by
> > >> > > > > >> > introducing a abnormal node detector, as Becket
> suggested,
> > >> which
> > >> > > > just
> > >> > > > > >> > uses the blocklist mechanism once bad nodes are detected.
> > >> > However,
> > >> > > > it
> > >> > > > > >> > should be a separate FLIP with further dev discussions
> and
> > >> > > feedback
> > >> > > > > >> > from users. I also agree with Becket that different users
> > >> have
> > >> > > > > different
> > >> > > > > >> > requirements, and we should listen to them.
> > >> > > > > >> >
> > >> > > > > >> > 2. Is it enough to just take away abnormal nodes
> externally
> > >> > > > > >> > My answer is no. As Lijie has mentioned, we need a way to
> > >> avoid
> > >> > > > > >> > deploying tasks to temporary hot nodes. In this case,
> users
> > >> may
> > >> > > just
> > >> > > > > >> > want to limit the load of the node and do not want to
> kill
> > >> all
> > >> > the
> > >> > > > > >> > processes on it. Another case is the speculative
> > execution[1]
> > >> > > which
> > >> > > > > >> > may also leverage this feature to avoid starting mirror
> > >> tasks on
> > >> > > > slow
> > >> > > > > >> > nodes.
> > >> > > > > >> >
> > >> > > > > >> > Thanks,
> > >> > > > > >> > Zhu
> > >> > > > > >> >
> > >> > > > > >> > [1]
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > >> > > > > >> >
> > >> > > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> > 15:56写道：
> > >> > > > > >> >
> > >> > > > > >> > >
> > >> > > > > >> > > Hi everyone,
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > Thanks for your feedback.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > There's one detail that I'd like to re-emphasize here
> > >> because
> > >> > it
> > >> > > > can
> > >> > > > > >> > affect the value and design of the blocklist mechanism
> > >> (perhaps
> > >> > I
> > >> > > > > should
> > >> > > > > >> > highlight it in the FLIP). We propose two actions in
> FLIP:
> > >> > > > > >> > >
> > >> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node
> > as
> > >> > > > blocked.
> > >> > > > > >> > Future slots should not be allocated from the blocked
> task
> > >> > manager
> > >> > > > or
> > >> > > > > >> node.
> > >> > > > > >> > But slots that are already allocated will not be
> affected.
> > A
> > >> > > typical
> > >> > > > > >> > application scenario is to mitigate machine hotspots. In
> > this
> > >> > > case,
> > >> > > > we
> > >> > > > > >> hope
> > >> > > > > >> > that subsequent resource allocations will not be on the
> hot
> > >> > > machine,
> > >> > > > > but
> > >> > > > > >> > tasks currently running on it should not be affected.
> > >> > > > > >> > >
> > >> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
> > >> manager
> > >> > or
> > >> > > > > node
> > >> > > > > >> as
> > >> > > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks
> will
> > >> be
> > >> > > > > >> restarted on
> > >> > > > > >> > non-blocked task managers.
> > >> > > > > >> > >
> > >> > > > > >> > > For the above 2 actions, the former may more highlight
> > the
> > >> > > meaning
> > >> > > > > of
> > >> > > > > >> > this FLIP, because the external system cannot do that.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > Regarding *Manually* and *Automatically*, I basically
> > agree
> > >> > with
> > >> > > > > >> @Becket
> > >> > > > > >> > Qin: different users have different answers. Not all
> users’
> > >> > > > deployment
> > >> > > > > >> > environments have a special external system that can
> > perform
> > >> the
> > >> > > > > anomaly
> > >> > > > > >> > detection. In addition, adding pluggable/optional
> > >> auto-detection
> > >> > > > > doesn't
> > >> > > > > >> > require much extra work on top of manual specification.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > I will answer your other questions one by one.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > @Yangze
> > >> > > > > >> > >
> > >> > > > > >> > > a) I think you are right, we do not need to expose the
> > >> > > > > >> > `cluster.resource-blocklist.item.timeout-check-interval`
> to
> > >> > users.
> > >> > > > > >> > >
> > >> > > > > >> > > b) We can abstract the `notifyException` to a separate
> > >> > interface
> > >> > > > > >> (maybe
> > >> > > > > >> > BlocklistExceptionListener), and the
> > >> > > ResourceManagerBlocklistHandler
> > >> > > > > can
> > >> > > > > >> > implement it in the future.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > @Martijn
> > >> > > > > >> > >
> > >> > > > > >> > > a) I also think the manual blocking should be done by
> > >> cluster
> > >> > > > > >> operators.
> > >> > > > > >> > >
> > >> > > > > >> > > b) I think manual blocking makes sense, because
> according
> > >> to
> > >> > my
> > >> > > > > >> > experience, users are often the first to perceive the
> > machine
> > >> > > > problems
> > >> > > > > >> > (because of job failover or delay), and they will contact
> > >> > cluster
> > >> > > > > >> operators
> > >> > > > > >> > to solve it, or even tell the cluster operators which
> > >> machine is
> > >> > > > > >> > problematic. From this point of view, I think the people
> > who
> > >> > > really
> > >> > > > > need
> > >> > > > > >> > the manual blocking are the users, and it’s just
> performed
> > by
> > >> > the
> > >> > > > > >> cluster
> > >> > > > > >> > operator, so I think the manual blocking makes sense.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > @Chesnay
> > >> > > > > >> > >
> > >> > > > > >> > > We need to touch the logic of JM/SlotPool, because for
> > >> > > > > >> MARK_BLOCKLISTED
> > >> > > > > >> > , we need to know whether the slot is blocklisted when
> the
> > >> task
> > >> > is
> > >> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should
> release
> > >> the
> > >> > > slot
> > >> > > > > >> > directly to avoid assigning other tasks (of this job) on
> > it.
> > >> If
> > >> > we
> > >> > > > > only
> > >> > > > > >> > maintain the blocklist information on the RM, JM needs to
> > >> > retrieve
> > >> > > > it
> > >> > > > > by
> > >> > > > > >> > RPC. I think the performance overhead of that is
> relatively
> > >> > large,
> > >> > > > so
> > >> > > > > I
> > >> > > > > >> > think it's worth maintaining the blocklist information on
> > >> the JM
> > >> > > > side
> > >> > > > > >> and
> > >> > > > > >> > syncing them.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > @Роман
> > >> > > > > >> > >
> > >> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap
> might
> > >> be
> > >> > > > helpful
> > >> > > > > >> > here.”  Can you explain it in detail? I don't fully
> > >> understand
> > >> > > that.
> > >> > > > > In
> > >> > > > > >> my
> > >> > > > > >> > opinion, non-active and active are the same, and no
> special
> > >> > > > treatment
> > >> > > > > is
> > >> > > > > >> > required.
> > >> > > > > >> > >
> > >> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I
> > will
> > >> > add
> > >> > > it
> > >> > > > > to
> > >> > > > > >> > FLIP.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > @Yang
> > >> > > > > >> > >
> > >> > > > > >> > > As mentioned above, AFAK, the external system cannot
> > >> support
> > >> > the
> > >> > > > > >> > MARK_BLOCKLISTED action.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > Looking forward to your further feedback.
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > Best,
> > >> > > > > >> > >
> > >> > > > > >> > > Lijie
> > >> > > > > >> > >
> > >> > > > > >> > >
> > >> > > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二
> 21:09写道：
> > >> > > > > >> > >>
> > >> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > >> > > > > >> > >>
> > >> > > > > >> > >> I want to share some thoughts about Flink cluster
> > >> operations.
> > >> > > > > >> > >>
> > >> > > > > >> > >> In the production environment, the SRE(aka Site
> > >> Reliability
> > >> > > > > Engineer)
> > >> > > > > >> > >> already has many tools to detect the unstable nodes,
> > which
> > >> > > could
> > >> > > > > take
> > >> > > > > >> > the
> > >> > > > > >> > >> system logs/metrics into consideration.
> > >> > > > > >> > >> Then they use graceful-decomission in YARN and taint
> in
> > >> K8s
> > >> > to
> > >> > > > > >> prevent
> > >> > > > > >> > new
> > >> > > > > >> > >> allocations on these unstable nodes.
> > >> > > > > >> > >> At last, they will evict all the containers and pods
> > >> running
> > >> > on
> > >> > > > > these
> > >> > > > > >> > nodes.
> > >> > > > > >> > >> This mechanism also works for planned maintenance. So
> I
> > am
> > >> > > afraid
> > >> > > > > >> this
> > >> > > > > >> > is
> > >> > > > > >> > >> not the typical use case for FLIP-224.
> > >> > > > > >> > >>
> > >> > > > > >> > >> If we only support to block nodes manually, then I
> could
> > >> not
> > >> > > see
> > >> > > > > >> > >> the obvious advantages compared with current SRE's
> > >> > approach(via
> > >> > > > > *yarn
> > >> > > > > >> > >> rmadmin or kubectl taint*).
> > >> > > > > >> > >> At least, we need to have a pluggable component which
> > >> could
> > >> > > > expose
> > >> > > > > >> the
> > >> > > > > >> > >> potential unstable nodes automatically and block them
> if
> > >> > > enabled
> > >> > > > > >> > explicitly.
> > >> > > > > >> > >>
> > >> > > > > >> > >>
> > >> > > > > >> > >> Best,
> > >> > > > > >> > >> Yang
> > >> > > > > >> > >>
> > >> > > > > >> > >>
> > >> > > > > >> > >>
> > >> > > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一
> 16:36写道：
> > >> > > > > >> > >>
> > >> > > > > >> > >> > Thanks for the proposal, Lijie.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > This is an interesting feature and discussion, and
> > >> somewhat
> > >> > > > > related
> > >> > > > > >> > to the
> > >> > > > > >> > >> > design principle about how people should operate
> > Flink.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > I think there are three things involved in this
> FLIP.
> > >> > > > > >> > >> >      a) Detect and report the unstable node.
> > >> > > > > >> > >> >      b) Collect the information of the unstable node
> > and
> > >> > > form a
> > >> > > > > >> > blocklist.
> > >> > > > > >> > >> >      c) Take the action to block nodes.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > My two cents:
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > 1. It looks like people all agree that Flink should
> > have
> > >> > c).
> > >> > > It
> > >> > > > > is
> > >> > > > > >> > not only
> > >> > > > > >> > >> > useful for cases of node failures, but also handy
> for
> > >> some
> > >> > > > > planned
> > >> > > > > >> > >> > maintenance.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > 2. People have different opinions on b), i.e. who
> > >> should be
> > >> > > the
> > >> > > > > >> brain
> > >> > > > > >> > to
> > >> > > > > >> > >> > make the decision to block a node. I think this
> > largely
> > >> > > depends
> > >> > > > > on
> > >> > > > > >> > who we
> > >> > > > > >> > >> > talk to. Different users would probably give
> different
> > >> > > answers.
> > >> > > > > For
> > >> > > > > >> > people
> > >> > > > > >> > >> > who do have a centralized node health management
> > >> service,
> > >> > let
> > >> > > > > Flink
> > >> > > > > >> > do just
> > >> > > > > >> > >> > do a) and c) would be preferred. So essentially
> Flink
> > >> would
> > >> > > be
> > >> > > > > one
> > >> > > > > >> of
> > >> > > > > >> > the
> > >> > > > > >> > >> > sources that may detect unstable nodes, report it to
> > >> that
> > >> > > > > service,
> > >> > > > > >> > and then
> > >> > > > > >> > >> > take the command from that service to block the
> > >> problematic
> > >> > > > > nodes.
> > >> > > > > >> On
> > >> > > > > >> > the
> > >> > > > > >> > >> > other hand, for users who do not have such a
> service,
> > >> > simply
> > >> > > > > >> letting
> > >> > > > > >> > Flink
> > >> > > > > >> > >> > be clever by itself to block the suspicious nodes
> > might
> > >> be
> > >> > > > > desired
> > >> > > > > >> to
> > >> > > > > >> > >> > ensure the jobs are running smoothly.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > So that indicates a) and b) here should be
> pluggable /
> > >> > > > optional.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > In light of this, maybe it would make sense to have
> > >> > something
> > >> > > > > >> > pluggable
> > >> > > > > >> > >> > like a UnstableNodeReporter which exposes unstable
> > nodes
> > >> > > > > actively.
> > >> > > > > >> (A
> > >> > > > > >> > more
> > >> > > > > >> > >> > general interface should be JobInfoReporter<T> which
> > >> can be
> > >> > > > used
> > >> > > > > to
> > >> > > > > >> > report
> > >> > > > > >> > >> > any information of type <T>. But I'll just keep the
> > >> scope
> > >> > > > > relevant
> > >> > > > > >> to
> > >> > > > > >> > this
> > >> > > > > >> > >> > FLIP here). Personally speaking, I think it is OK to
> > >> have a
> > >> > > > > default
> > >> > > > > >> > >> > implementation of a reporter which just tells Flink
> to
> > >> take
> > >> > > > > action
> > >> > > > > >> to
> > >> > > > > >> > block
> > >> > > > > >> > >> > problematic nodes and also unblocks them after
> > timeout.
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > Thanks,
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > Jiangjie (Becket) Qin
> > >> > > > > >> > >> >
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > >> > > > ro.v.boyko@gmail.com
> > >> > > > > >
> > >> > > > > >> > wrote:
> > >> > > > > >> > >> >
> > >> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > > If it's possible I'd like to participate in
> > >> development.
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we
> > >> should
> > >> > > > > consider
> > >> > > > > >> > to move
> > >> > > > > >> > >> > > somehow the information of blocklisted nodes/TMs
> > from
> > >> > > active
> > >> > > > > >> > >> > > ResourceManager to non-active ones. Probably
> storing
> > >> > inside
> > >> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > > And I agree with Martijn that a lot of
> organizations
> > >> > don't
> > >> > > > want
> > >> > > > > >> to
> > >> > > > > >> > expose
> > >> > > > > >> > >> > > such API for a cluster user group. But I think
> it's
> > >> > > necessary
> > >> > > > > to
> > >> > > > > >> > have the
> > >> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for
> > >> > avoiding
> > >> > > > > >> incorrect
> > >> > > > > >> > >> > > automatic behaviour.
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > > And another one small suggestion - I think it
> would
> > be
> > >> > > better
> > >> > > > > to
> > >> > > > > >> > extend
> > >> > > > > >> > >> > the
> > >> > > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp*
> > field
> > >> and
> > >> > > > fill
> > >> > > > > it
> > >> > > > > >> > at the
> > >> > > > > >> > >> > > item creation. This simple addition will allow to:
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > >    -
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > >    Provide the ability to users to setup the exact
> > >> time
> > >> > of
> > >> > > > > >> > blocklist end
> > >> > > > > >> > >> > >    through RestAPI
> > >> > > > > >> > >> > >    -
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > >    Not being tied to a single value of
> > >> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > >> > > > > >> chesnay@apache.org>
> > >> > > > > >> > >> > wrote:
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> > > > I do share the concern between blurring the
> lines
> > a
> > >> > bit.
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > > That said, I'd prefer to not have any
> > auto-detection
> > >> > and
> > >> > > > only
> > >> > > > > >> > have an
> > >> > > > > >> > >> > > > opt-in mechanism
> > >> > > > > >> > >> > > > to manually block processes/nodes. To me this
> > sounds
> > >> > yet
> > >> > > > > again
> > >> > > > > >> > like one
> > >> > > > > >> > >> > > > of those
> > >> > > > > >> > >> > > > magical mechanisms that will rarely work just
> > right.
> > >> > > > > >> > >> > > > An external system can leverage way more
> > information
> > >> > > after
> > >> > > > > all.
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > > Moreover, I'm quite concerned about the
> complexity
> > >> of
> > >> > > this
> > >> > > > > >> > proposal.
> > >> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing between
> > >> > > > components;
> > >> > > > > >> > >> > adjustments
> > >> > > > > >> > >> > > > to the
> > >> > > > > >> > >> > > > slot and resource protocol.
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > > In a way it seems overly complicated.
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > > If we look at it purely from an active resource
> > >> > > management
> > >> > > > > >> > perspective,
> > >> > > > > >> > >> > > > then there
> > >> > > > > >> > >> > > > isn't really a need to touch the slot protocol
> at
> > >> all
> > >> > (or
> > >> > > > in
> > >> > > > > >> fact
> > >> > > > > >> > to
> > >> > > > > >> > >> > > > anything in the JobMaster),
> > >> > > > > >> > >> > > > because there isn't any point in keeping around
> > >> blocked
> > >> > > TMs
> > >> > > > > in
> > >> > > > > >> the
> > >> > > > > >> > >> > first
> > >> > > > > >> > >> > > > place.
> > >> > > > > >> > >> > > > They'd just be idling, potentially shutting down
> > >> after
> > >> > a
> > >> > > > > while
> > >> > > > > >> by
> > >> > > > > >> > the
> > >> > > > > >> > >> > RM
> > >> > > > > >> > >> > > > because of
> > >> > > > > >> > >> > > > it (unless we _also_ touch that logic).
> > >> > > > > >> > >> > > > Here the blocking of a process (be it by
> blocking
> > >> the
> > >> > > > process
> > >> > > > > >> or
> > >> > > > > >> > node)
> > >> > > > > >> > >> > is
> > >> > > > > >> > >> > > > equivalent with shutting down the blocked
> > >> process(es).
> > >> > > > > >> > >> > > > Once the block is lifted we can just spin it
> back
> > >> up.
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > > And I do wonder whether we couldn't apply the
> same
> > >> line
> > >> > > of
> > >> > > > > >> > thinking to
> > >> > > > > >> > >> > > > standalone resource management.
> > >> > > > > >> > >> > > > Here being able to stop/restart a process/node
> > >> manually
> > >> > > > > should
> > >> > > > > >> be
> > >> > > > > >> > a
> > >> > > > > >> > >> > core
> > >> > > > > >> > >> > > > requirement for a Flink deployment anyway.
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > >> > > > > >> > >> > > > > Hi everyone,
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > > Thanks for creating this FLIP. I can
> understand
> > >> the
> > >> > > > problem
> > >> > > > > >> and
> > >> > > > > >> > I see
> > >> > > > > >> > >> > > > value
> > >> > > > > >> > >> > > > > in the automatic detection and blocklisting. I
> > do
> > >> > have
> > >> > > > some
> > >> > > > > >> > concerns
> > >> > > > > >> > >> > > with
> > >> > > > > >> > >> > > > > the ability to manually specify to be blocked
> > >> > > resources.
> > >> > > > I
> > >> > > > > >> have
> > >> > > > > >> > two
> > >> > > > > >> > >> > > > > concerns;
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > > * Most organizations explicitly have a
> > separation
> > >> of
> > >> > > > > >> concerns,
> > >> > > > > >> > >> > meaning
> > >> > > > > >> > >> > > > that
> > >> > > > > >> > >> > > > > there's a group who's responsible for
> managing a
> > >> > > cluster
> > >> > > > > and
> > >> > > > > >> > there's
> > >> > > > > >> > >> > a
> > >> > > > > >> > >> > > > user
> > >> > > > > >> > >> > > > > group who uses that cluster. With the
> > >> introduction of
> > >> > > > this
> > >> > > > > >> > mechanism,
> > >> > > > > >> > >> > > the
> > >> > > > > >> > >> > > > > latter group now can influence the
> > responsibility
> > >> of
> > >> > > the
> > >> > > > > >> first
> > >> > > > > >> > group.
> > >> > > > > >> > >> > > So
> > >> > > > > >> > >> > > > it
> > >> > > > > >> > >> > > > > can be possible that someone from the user
> group
> > >> > blocks
> > >> > > > > >> > something,
> > >> > > > > >> > >> > > which
> > >> > > > > >> > >> > > > > causes an outage (which could result in paging
> > >> > > mechanism
> > >> > > > > >> > triggering
> > >> > > > > >> > >> > > etc)
> > >> > > > > >> > >> > > > > which impacts the first group.
> > >> > > > > >> > >> > > > > * How big is the group of people who can go
> > >> through
> > >> > the
> > >> > > > > >> process
> > >> > > > > >> > of
> > >> > > > > >> > >> > > > manually
> > >> > > > > >> > >> > > > > identifying a node that isn't behaving as it
> > >> should
> > >> > > be? I
> > >> > > > > do
> > >> > > > > >> > think
> > >> > > > > >> > >> > this
> > >> > > > > >> > >> > > > > group is relatively limited. Does it then make
> > >> sense
> > >> > to
> > >> > > > > >> > introduce
> > >> > > > > >> > >> > such
> > >> > > > > >> > >> > > a
> > >> > > > > >> > >> > > > > feature, which would only be used by a really
> > >> small
> > >> > > user
> > >> > > > > >> group
> > >> > > > > >> > of
> > >> > > > > >> > >> > > Flink?
> > >> > > > > >> > >> > > > We
> > >> > > > > >> > >> > > > > still have to maintain, test and support such
> a
> > >> > > feature.
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm
> > >> > leaning
> > >> > > > > >> towards
> > >> > > > > >> > not
> > >> > > > > >> > >> > > > exposing
> > >> > > > > >> > >> > > > > this to the user group but having this
> available
> > >> > > strictly
> > >> > > > > for
> > >> > > > > >> > cluster
> > >> > > > > >> > >> > > > > operators. They could then also set up their
> > >> > > > > >> > paging/metrics/logging
> > >> > > > > >> > >> > > > system
> > >> > > > > >> > >> > > > > to take this into account.
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > > Best regards,
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > > Martijn Visser
> > >> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > >> > > > > >> > >> > > > > https://github.com/MartijnVisser
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > >> > > > > karmagyz@gmail.com
> > >> > > > > >> >
> > >> > > > > >> > wrote:
> > >> > > > > >> > >> > > > >
> > >> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > > >> +1 for the overall proposal. Just share some
> > >> cents
> > >> > > here:
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > > >> - Why do we need to expose
> > >> > > > > >> > >> > > > >>
> > >> > cluster.resource-blacklist.item.timeout-check-interval
> > >> > > > to
> > >> > > > > >> the
> > >> > > > > >> > user?
> > >> > > > > >> > >> > > > >> I think the semantics of
> > >> > > > > >> > `cluster.resource-blacklist.item.timeout`
> > >> > > > > >> > >> > is
> > >> > > > > >> > >> > > > >> sufficient for the user. How to guarantee the
> > >> > timeout
> > >> > > > > >> > mechanism is
> > >> > > > > >> > >> > > > >> Flink's internal implementation. I think it
> > will
> > >> be
> > >> > > very
> > >> > > > > >> > confusing
> > >> > > > > >> > >> > and
> > >> > > > > >> > >> > > > >> we do not need to expose it to users.
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > > >> - ResourceManager can notify the exception
> of a
> > >> task
> > >> > > > > >> manager to
> > >> > > > > >> > >> > > > >> `BlacklistHandler` as well.
> > >> > > > > >> > >> > > > >> For example, the slot allocation might fail
> in
> > >> case
> > >> > > the
> > >> > > > > >> target
> > >> > > > > >> > task
> > >> > > > > >> > >> > > > >> manager is busy or has a network jitter. I
> > don't
> > >> > mean
> > >> > > we
> > >> > > > > >> need
> > >> > > > > >> > to
> > >> > > > > >> > >> > cover
> > >> > > > > >> > >> > > > >> this case in this version, but we can also
> > open a
> > >> > > > > >> > `notifyException`
> > >> > > > > >> > >> > in
> > >> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > > >> - Before we sync the blocklist to
> > >> ResourceManager,
> > >> > > will
> > >> > > > > the
> > >> > > > > >> > slot of
> > >> > > > > >> > >> > a
> > >> > > > > >> > >> > > > >> blocked task manager continues to be released
> > and
> > >> > > > > allocated?
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > > >> Best,
> > >> > > > > >> > >> > > > >> Yangze Guo
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > >> > > > > >> > >> > wangdachui9501@gmail.com>
> > >> > > > > >> > >> > > > >> wrote:
> > >> > > > > >> > >> > > > >>> Hi Konstantin,
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> Thanks for your feedback. I will response
> > your 4
> > >> > > > remarks:
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the
> > controversy. I
> > >> > > think
> > >> > > > > >> > “BlockList”
> > >> > > > > >> > >> > is
> > >> > > > > >> > >> > > > >> good
> > >> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a
> good
> > >> idea.
> > >> > > > Based
> > >> > > > > >> on
> > >> > > > > >> > the
> > >> > > > > >> > >> > > > above, I
> > >> > > > > >> > >> > > > >>> would change REST API as following:
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> DELETE
> > <host>/blocklist/taskmanager/<identifier>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it
> means
> > >> that
> > >> > > all
> > >> > > > > >> task
> > >> > > > > >> > >> > managers
> > >> > > > > >> > >> > > > on
> > >> > > > > >> > >> > > > >>> this node are blocklisted. All slots on
> these
> > >> TMs
> > >> > are
> > >> > > > not
> > >> > > > > >> > >> > available.
> > >> > > > > >> > >> > > > This
> > >> > > > > >> > >> > > > >>> is actually a bit like TM losts, but these
> TMs
> > >> are
> > >> > > not
> > >> > > > > >> really
> > >> > > > > >> > lost,
> > >> > > > > >> > >> > > > they
> > >> > > > > >> > >> > > > >>> are in an unavailable status, and they are
> > still
> > >> > > > > registered
> > >> > > > > >> > in this
> > >> > > > > >> > >> > > > flink
> > >> > > > > >> > >> > > > >>> cluster. They will be available again once
> the
> > >> > > > > >> corresponding
> > >> > > > > >> > >> > > blocklist
> > >> > > > > >> > >> > > > >> item
> > >> > > > > >> > >> > > > >>> is removed. This behavior is the same in
> > >> > > > > active/non-active
> > >> > > > > >> > >> > clusters.
> > >> > > > > >> > >> > > > >>> However in the active clusters, these TMs
> may
> > be
> > >> > > > released
> > >> > > > > >> due
> > >> > > > > >> > to
> > >> > > > > >> > >> > idle
> > >> > > > > >> > >> > > > >>> timeouts.
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep
> it.
> > >> The
> > >> > > > reasons
> > >> > > > > >> are
> > >> > > > > >> > as
> > >> > > > > >> > >> > > > >> following:
> > >> > > > > >> > >> > > > >>> a) The timeout will not affect users adding
> or
> > >> > > removing
> > >> > > > > >> items
> > >> > > > > >> > via
> > >> > > > > >> > >> > > REST
> > >> > > > > >> > >> > > > >> API,
> > >> > > > > >> > >> > > > >>> and users can disable it by configuring it
> to
> > >> > > > > >> Long.MAX_VALUE .
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> b) Some node problems can recover after a
> > >> period of
> > >> > > > time
> > >> > > > > >> > (such as
> > >> > > > > >> > >> > > > machine
> > >> > > > > >> > >> > > > >>> hotspots), in which case users may prefer
> that
> > >> > Flink
> > >> > > > can
> > >> > > > > do
> > >> > > > > >> > this
> > >> > > > > >> > >> > > > >>> automatically instead of requiring the user
> to
> > >> do
> > >> > it
> > >> > > > > >> manually.
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> Best,
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> Lijie
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org>
> > >> 于2022年4月27日周三
> > >> > > > > >> 19:23写道：
> > >> > > > > >> > >> > > > >>>
> > >> > > > > >> > >> > > > >>>> Hi Lijie,
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> I think, this makes sense and +1 to only
> > >> support
> > >> > > > > manually
> > >> > > > > >> > blocking
> > >> > > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different
> > >> > > strategies
> > >> > > > > can
> > >> > > > > >> > also be
> > >> > > > > >> > >> > > > >>>> maintained outside of Apache Flink.
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> A few remarks:
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> 1) Can we use another term than
> "bla.cklist"
> > >> due
> > >> > to
> > >> > > > the
> > >> > > > > >> > >> > controversy
> > >> > > > > >> > >> > > > >> around
> > >> > > > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket
> > >> about
> > >> > > this
> > >> > > > > >> topic a
> > >> > > > > >> > >> > while
> > >> > > > > >> > >> > > > >> back
> > >> > > > > >> > >> > > > >>>> and there was generally a consensus to
> avoid
> > >> the
> > >> > > term
> > >> > > > > >> > blacklist &
> > >> > > > > >> > >> > > > >> whitelist
> > >> > > > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> > >> > > > > "quarantined"
> > >> > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly
> > >> > different
> > >> > > > > >> design
> > >> > > > > >> > as
> > >> > > > > >> > >> > verbs
> > >> > > > > >> > >> > > > >> like
> > >> > > > > >> > >> > > > >>>> add/remove often considered an anti-pattern
> > for
> > >> > REST
> > >> > > > > APIs.
> > >> > > > > >> > POST
> > >> > > > > >> > >> > on a
> > >> > > > > >> > >> > > > >> list
> > >> > > > > >> > >> > > > >>>> item is generally the standard to add
> items.
> > >> > DELETE
> > >> > > on
> > >> > > > > the
> > >> > > > > >> > >> > > individual
> > >> > > > > >> > >> > > > >>>> resource is standard to remove an item.
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
> > >> > > > > >> > >> > > > >>>> DELETE
> > <host>/quarantine/items/<itemidentifier>
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> We could also consider to separate
> > taskmanagers
> > >> > and
> > >> > > > > nodes
> > >> > > > > >> in
> > >> > > > > >> > the
> > >> > > > > >> > >> > > REST
> > >> > > > > >> > >> > > > >> API
> > >> > > > > >> > >> > > > >>>> (and internal data structures). Any opinion
> > on
> > >> > this?
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > >> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > >> > > > > >> > >> > > > >>>> DELETE
> > >> <host>/quarantine/taskmanager/<identifier>
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with
> > >> non-active
> > >> > > > > >> resource
> > >> > > > > >> > >> > > managers,
> > >> > > > > >> > >> > > > >> i.e.
> > >> > > > > >> > >> > > > >>>> standalone or reactive mode?
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> 4) To keep the implementation even more
> > >> minimal,
> > >> > do
> > >> > > we
> > >> > > > > >> need
> > >> > > > > >> > the
> > >> > > > > >> > >> > > > timeout
> > >> > > > > >> > >> > > > >>>> behavior? If items are added/removed
> manually
> > >> we
> > >> > > could
> > >> > > > > >> > delegate
> > >> > > > > >> > >> > this
> > >> > > > > >> > >> > > > >> to the
> > >> > > > > >> > >> > > > >>>> user easily. In my opinion the timeout
> > behavior
> > >> > > would
> > >> > > > > >> better
> > >> > > > > >> > fit
> > >> > > > > >> > >> > > into
> > >> > > > > >> > >> > > > >>>> specific strategies at a later point.
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> Cheers and thank you,
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> Konstantin
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> [1]
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > >> > > > > >> > >> > > > >>>> [2]
> > >> > > https://issues.apache.org/jira/browse/FLINK-18209
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb
> > >> Lijie
> > >> > > Wang
> > >> > > > <
> > >> > > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > >> > > > > >> > >> > > > >>>>
> > >> > > > > >> > >> > > > >>>>> Hi all,
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> Flink job failures may happen due to
> cluster
> > >> node
> > >> > > > > issues
> > >> > > > > >> > >> > > > >> (insufficient
> > >> > > > > >> > >> > > > >>>> disk
> > >> > > > > >> > >> > > > >>>>> space, bad hardware, network
> abnormalities).
> > >> > Flink
> > >> > > > will
> > >> > > > > >> > take care
> > >> > > > > >> > >> > > of
> > >> > > > > >> > >> > > > >> the
> > >> > > > > >> > >> > > > >>>>> failures and redeploy the tasks. However,
> > due
> > >> to
> > >> > > data
> > >> > > > > >> > locality
> > >> > > > > >> > >> > and
> > >> > > > > >> > >> > > > >>>> limited
> > >> > > > > >> > >> > > > >>>>> resources, the new tasks are very likely
> to
> > be
> > >> > > > > redeployed
> > >> > > > > >> > to the
> > >> > > > > >> > >> > > same
> > >> > > > > >> > >> > > > >>>>> nodes, which will result in continuous
> task
> > >> > > > > abnormalities
> > >> > > > > >> > and
> > >> > > > > >> > >> > > affect
> > >> > > > > >> > >> > > > >> job
> > >> > > > > >> > >> > > > >>>>> progress.
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> Currently, Flink users need to manually
> > >> identify
> > >> > > the
> > >> > > > > >> > problematic
> > >> > > > > >> > >> > > > >> node and
> > >> > > > > >> > >> > > > >>>>> take it offline to solve this problem. But
> > >> this
> > >> > > > > approach
> > >> > > > > >> has
> > >> > > > > >> > >> > > > >> following
> > >> > > > > >> > >> > > > >>>>> disadvantages:
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy
> > >> process.
> > >> > > > Users
> > >> > > > > >> may
> > >> > > > > >> > need
> > >> > > > > >> > >> > to
> > >> > > > > >> > >> > > > >>>> contact
> > >> > > > > >> > >> > > > >>>>> cluster administors to do this. The
> > operation
> > >> can
> > >> > > > even
> > >> > > > > be
> > >> > > > > >> > >> > dangerous
> > >> > > > > >> > >> > > > >> and
> > >> > > > > >> > >> > > > >>>> not
> > >> > > > > >> > >> > > > >>>>> allowed during some important business
> > events.
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of
> > >> problems
> > >> > > > > manually
> > >> > > > > >> > would
> > >> > > > > >> > >> > be
> > >> > > > > >> > >> > > > >> slow
> > >> > > > > >> > >> > > > >>>> and
> > >> > > > > >> > >> > > > >>>>> a waste of human resources.
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I
> propose
> > >> to
> > >> > > > > >> introduce a
> > >> > > > > >> > >> > > blacklist
> > >> > > > > >> > >> > > > >>>>> mechanism for Flink to filter out
> > problematic
> > >> > > > > resources.
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> You can find more details in FLIP-224[1].
> > >> Looking
> > >> > > > > forward
> > >> > > > > >> > to your
> > >> > > > > >> > >> > > > >>>> feedback.
> > >> > > > > >> > >> > > > >>>>> [1]
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> Best,
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > > >>>>> Lijie
> > >> > > > > >> > >> > > > >>>>>
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > > >
> > >> > > > > >> > >> > >
> > >> > > > > >> > >> >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Best regards,
> > >> > > Roman Boyko
> > >> > > e.: ro.v.boyko@gmail.com
> > >> > >
> > >> >
> > >>
> > >
> >
>
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Konstantin Knauf <kn...@apache.org>.

Hi Lijie,

wouldn't the REST API-idiomatic way for an update/replace be a PUT on the
resource?

PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}

Best,

Konstantin



Am Fr., 13. Mai 2022 um 11:01 Uhr schrieb Lijie Wang <
wangdachui9501@gmail.com>:

> Hi everyone,
>
> I've had an offline discussion with Becket Qin and Zhu Zhu, and made the
> following changes on REST API:
> 1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose one. If
> both are specified, will return error.
> 2.  If the specified item is already there, the *ADD* operation has two
> behaviors:  *return error*(default value) or *merge/update*, and we add a
> flag to the request body to control it. You can find more details "Public
> Interface" section.
>
> If there is no more feedback, we will start the vote thread next week.
>
> Best,
> Lijie
>
> Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：
>
> > Hi Becket Qin,
> >
> > Thanks for your suggestions.  I have moved the description of
> > configurations, metrics and REST API into "Public Interface" section, and
> > made a few updates according to your suggestion.  And in this FLIP, there
> > no public java Interfaces or pluggables that users need to implement by
> > themselves.
> >
> > Answers for you questions:
> > 1. Yes, there 2 block actions: MARK_BLOCKED and.
> > MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block items can
> > only be added through the REST API, so these 2 action are mentioned in
> the
> > REST API part (The REST API part has beed moved to public interface now).
> > 2. I agree with you. I have changed the "Cause" field to String, and
> allow
> > users to specify it via REST API.
> > 3. Yes, it is useful to allow different timeouts. As mentioned above, we
> > will introduce 2 fields : *timeout* and *endTimestamp* into the ADD REST
> > API to specify when to remove the blocked item. These 2 fields are
> > optional, if neither is specified, it means that the blocked item is
> > permanent and will not be removed. If both are specified, the minimum of
> > *currentTimestamp+tiemout *and* endTimestamp* will be used as the time to
> > remove the blocked item. To keep the configurations more minimal, we have
> > removed the *cluster.resource-blocklist.item.timeout* configuration
> > option.
> > 4. Yes, the block item will be overridden if the specified item already
> > exists. The ADD operation is *ADD or UPDATE*.
> > 5. Yes. On JM/RM side, all the blocklist information is maintained in
> > JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or
> abstracted
> > to other interfaces) will be propagated to different components.
> >
> > Best,
> > Lijie
> >
> > Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
> >
> >> Hi Lijie,
> >>
> >> Thanks for updating the FLIP. It looks like the public interface section
> >> did not fully reflect all the user sensible behavior and API. Can you
> put
> >> everything that users may be aware of there? That would include the REST
> >> API, metrics, configurations, public java Interfaces or pluggables that
> >> users may see or implement by themselves, as well as a brief summary of
> >> the
> >> behavior of the public API.
> >>
> >> Besides that, I have a few questions:
> >>
> >> 1. According to the conversation in the discussion thread, it looks like
> >> the BlockAction will have "MARK_BLOCKLISTED" and
> >> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you
> >> add
> >> that to the public interface as well?
> >>
> >> 2. At this point, the "Cause" field in the BlockingItem is a Throwable
> and
> >> is not reflected in the REST API. Should that be included in the query
> >> response? And should we change that field to be a String so users may
> >> specify the cause via the REST API when they block some nodes / TMs?
> >>
> >> 3. Would it be useful to allow users to have different timeouts for
> >> different blocked items? So while there is a default timeout, users can
> >> also override it via the REST API when they block an entity.
> >>
> >> 4. Regarding the ADD operation, if the specified item is already there,
> >> will the block item be overridden? For example, if the user wants to
> >> extend
> >> the timeout of a blocked item, can they just  issue an ADD command
> again?
> >>
> >> 5. I am not quite familiar with the details of this, but is there a
> source
> >> of truth for the blocked list? I think it might be good to have a single
> >> source of truth for the blocked list and just propagate that list to
> >> different components to take the action of actually blocking the
> resource.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wa...@gmail.com>
> >> wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > Based on the discussion in the mailing list, I updated the FLIP doc,
> the
> >> > changes include:
> >> > 1. Changed the description of the motivation section to more clearly
> >> > describe the problem this FLIP is trying to solve.
> >> > 2. Only  *Manually* is supported.
> >> > 3. Adopted some suggestions, such as *endTimestamp*.
> >> >
> >> > Best,
> >> > Lijie
> >> >
> >> >
> >> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >> >
> >> > > Hi Lijie!
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > *a) “Probably storing inside Zookeeper/Configmap might be
> >> helpfulhere.”
> >> > > Can you explain it in detail? I don't fully understand that. In
> >> > myopinion,
> >> > > non-active and active are the same, and no special treatment
> >> isrequired.*
> >> > >
> >> > > Sorry this was a misunderstanding from my side. I thought we were
> >> talking
> >> > > about the HA mode (but not about Active and Standalone
> >> ResourceManager).
> >> > > And the original question was - how to handle the blacklisted nodes
> >> list
> >> > at
> >> > > the moment of leader change? Should we simply forget about them or
> >> try to
> >> > > pre-save that list on the remote storage?
> >> > >
> >> > > On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com>
> wrote:
> >> > >
> >> > > > Thanks Lijie and ZhuZhu for the explanation.
> >> > > >
> >> > > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is
> >> indeed
> >> > > > some functionalities the external tools(e.g. kubectl taint) could
> >> not
> >> > > > support.
> >> > > >
> >> > > >
> >> > > > Best,
> >> > > > Yang
> >> > > >
> >> > > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> >> > > >
> >> > > > > Thanks for your feedback, Jiangang and Martijn.
> >> > > > >
> >> > > > > @Jiangang
> >> > > > >
> >> > > > >
> >> > > > > > For auto-detecting, I wonder how to make the strategy and
> mark a
> >> > node
> >> > > > > blocked?
> >> > > > >
> >> > > > > In fact, we currently plan to not support auto-detection in this
> >> > FLIP.
> >> > > > The
> >> > > > > part about auto-detection may be continued in a separate FLIP in
> >> the
> >> > > > > future. Some guys have the same concerns as you, and the
> >> correctness
> >> > > and
> >> > > > > necessity of auto-detection may require further discussion in
> the
> >> > > future.
> >> > > > >
> >> > > > > > In session mode, multi jobs can fail on the same bad node and
> >> the
> >> > > node
> >> > > > > should be marked blocked.
> >> > > > > By design, the blocklist information will be shared among all
> jobs
> >> > in a
> >> > > > > cluster/session. The JM will sync blocklist information with RM.
> >> > > > >
> >> > > > > @Martijn
> >> > > > >
> >> > > > > > I agree with Yang Wang on this.
> >> > > > > As Zhu Zhu and I mentioned above, we think the
> >> MARK_BLOCKLISTED(Just
> >> > > > limits
> >> > > > > the load of the node and does not  kill all the processes on it)
> >> is
> >> > > also
> >> > > > > important, and we think that external systems (*yarn rmadmin or
> >> > kubectl
> >> > > > > taint*) cannot support it. So we think it makes sense even only
> >> > > > *manually*.
> >> > > > >
> >> > > > > > I also agree with Chesnay that magical mechanisms are indeed
> >> super
> >> > > hard
> >> > > > > to get right.
> >> > > > > Yes, as you see, Jiangang(and a few others) have the same
> concern.
> >> > > > > However, we currently plan to not support auto-detection in this
> >> > FLIP,
> >> > > > and
> >> > > > > only *manually*. In addition, I'd like to say that the FLIP
> >> provides
> >> > a
> >> > > > > mechanism to support MARK_BLOCKLISTED and
> >> > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> >> > > > > the auto-detection may be done by external systems.
> >> > > > >
> >> > > > > Best,
> >> > > > > Lijie
> >> > > > >
> >> > > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
> >> > > > >
> >> > > > > > > If we only support to block nodes manually, then I could not
> >> see
> >> > > > > > the obvious advantages compared with current SRE's
> approach(via
> >> > *yarn
> >> > > > > > rmadmin or kubectl taint*).
> >> > > > > >
> >> > > > > > I agree with Yang Wang on this.
> >> > > > > >
> >> > > > > > >  To me this sounds yet again like one of those magical
> >> mechanisms
> >> > > > that
> >> > > > > > will rarely work just right.
> >> > > > > >
> >> > > > > > I also agree with Chesnay that magical mechanisms are indeed
> >> super
> >> > > hard
> >> > > > > to
> >> > > > > > get right.
> >> > > > > >
> >> > > > > > Best regards,
> >> > > > > >
> >> > > > > > Martijn
> >> > > > > >
> >> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> >> > liujiangangpeng@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > >> Thanks for the valuable design. The auto-detecting can
> decrease
> >> > > great
> >> > > > > work
> >> > > > > >> for us. We have implemented the similar feature in our inner
> >> flink
> >> > > > > >> version.
> >> > > > > >> Below is something that I care about:
> >> > > > > >>
> >> > > > > >>    1. For auto-detecting, I wonder how to make the strategy
> and
> >> > > mark a
> >> > > > > >> node
> >> > > > > >>    blocked? Sometimes the blocked node is hard to be
> detected,
> >> for
> >> > > > > >> example,
> >> > > > > >>    the upper node or the down node will be blocked when
> network
> >> > > > > >> unreachable.
> >> > > > > >>    2. I see that the strategy is made in JobMaster side. How
> >> about
> >> > > > > >>    implementing the similar logic in resource manager? In
> >> session
> >> > > > mode,
> >> > > > > >> multi
> >> > > > > >>    jobs can fail on the same bad node and the node should be
> >> > marked
> >> > > > > >> blocked.
> >> > > > > >>    If the job makes the strategy, the node may be not marked
> >> > blocked
> >> > > > if
> >> > > > > >> the
> >> > > > > >>    fail times don't exceed the threshold.
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >> > > > > >>
> >> > > > > >> > Thank you for all your feedback!
> >> > > > > >> >
> >> > > > > >> > Besides the answers from Lijie, I'd like to share some of
> my
> >> > > > thoughts:
> >> > > > > >> > 1. Whether to enable automatical blocklist
> >> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
> >> > > > > >> > The automatical way should be something built upon the
> >> blocklist
> >> > > > > >> > mechanism and well decoupled. It was designed to be a
> >> > configurable
> >> > > > > >> > blocklist strategy, but I think we can further decouple it
> by
> >> > > > > >> > introducing a abnormal node detector, as Becket suggested,
> >> which
> >> > > > just
> >> > > > > >> > uses the blocklist mechanism once bad nodes are detected.
> >> > However,
> >> > > > it
> >> > > > > >> > should be a separate FLIP with further dev discussions and
> >> > > feedback
> >> > > > > >> > from users. I also agree with Becket that different users
> >> have
> >> > > > > different
> >> > > > > >> > requirements, and we should listen to them.
> >> > > > > >> >
> >> > > > > >> > 2. Is it enough to just take away abnormal nodes externally
> >> > > > > >> > My answer is no. As Lijie has mentioned, we need a way to
> >> avoid
> >> > > > > >> > deploying tasks to temporary hot nodes. In this case, users
> >> may
> >> > > just
> >> > > > > >> > want to limit the load of the node and do not want to kill
> >> all
> >> > the
> >> > > > > >> > processes on it. Another case is the speculative
> execution[1]
> >> > > which
> >> > > > > >> > may also leverage this feature to avoid starting mirror
> >> tasks on
> >> > > > slow
> >> > > > > >> > nodes.
> >> > > > > >> >
> >> > > > > >> > Thanks,
> >> > > > > >> > Zhu
> >> > > > > >> >
> >> > > > > >> > [1]
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> > > > > >> >
> >> > > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四
> 15:56写道：
> >> > > > > >> >
> >> > > > > >> > >
> >> > > > > >> > > Hi everyone,
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > Thanks for your feedback.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > There's one detail that I'd like to re-emphasize here
> >> because
> >> > it
> >> > > > can
> >> > > > > >> > affect the value and design of the blocklist mechanism
> >> (perhaps
> >> > I
> >> > > > > should
> >> > > > > >> > highlight it in the FLIP). We propose two actions in FLIP:
> >> > > > > >> > >
> >> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node
> as
> >> > > > blocked.
> >> > > > > >> > Future slots should not be allocated from the blocked task
> >> > manager
> >> > > > or
> >> > > > > >> node.
> >> > > > > >> > But slots that are already allocated will not be affected.
> A
> >> > > typical
> >> > > > > >> > application scenario is to mitigate machine hotspots. In
> this
> >> > > case,
> >> > > > we
> >> > > > > >> hope
> >> > > > > >> > that subsequent resource allocations will not be on the hot
> >> > > machine,
> >> > > > > but
> >> > > > > >> > tasks currently running on it should not be affected.
> >> > > > > >> > >
> >> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
> >> manager
> >> > or
> >> > > > > node
> >> > > > > >> as
> >> > > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will
> >> be
> >> > > > > >> restarted on
> >> > > > > >> > non-blocked task managers.
> >> > > > > >> > >
> >> > > > > >> > > For the above 2 actions, the former may more highlight
> the
> >> > > meaning
> >> > > > > of
> >> > > > > >> > this FLIP, because the external system cannot do that.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > Regarding *Manually* and *Automatically*, I basically
> agree
> >> > with
> >> > > > > >> @Becket
> >> > > > > >> > Qin: different users have different answers. Not all users’
> >> > > > deployment
> >> > > > > >> > environments have a special external system that can
> perform
> >> the
> >> > > > > anomaly
> >> > > > > >> > detection. In addition, adding pluggable/optional
> >> auto-detection
> >> > > > > doesn't
> >> > > > > >> > require much extra work on top of manual specification.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > I will answer your other questions one by one.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > @Yangze
> >> > > > > >> > >
> >> > > > > >> > > a) I think you are right, we do not need to expose the
> >> > > > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to
> >> > users.
> >> > > > > >> > >
> >> > > > > >> > > b) We can abstract the `notifyException` to a separate
> >> > interface
> >> > > > > >> (maybe
> >> > > > > >> > BlocklistExceptionListener), and the
> >> > > ResourceManagerBlocklistHandler
> >> > > > > can
> >> > > > > >> > implement it in the future.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > @Martijn
> >> > > > > >> > >
> >> > > > > >> > > a) I also think the manual blocking should be done by
> >> cluster
> >> > > > > >> operators.
> >> > > > > >> > >
> >> > > > > >> > > b) I think manual blocking makes sense, because according
> >> to
> >> > my
> >> > > > > >> > experience, users are often the first to perceive the
> machine
> >> > > > problems
> >> > > > > >> > (because of job failover or delay), and they will contact
> >> > cluster
> >> > > > > >> operators
> >> > > > > >> > to solve it, or even tell the cluster operators which
> >> machine is
> >> > > > > >> > problematic. From this point of view, I think the people
> who
> >> > > really
> >> > > > > need
> >> > > > > >> > the manual blocking are the users, and it’s just performed
> by
> >> > the
> >> > > > > >> cluster
> >> > > > > >> > operator, so I think the manual blocking makes sense.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > @Chesnay
> >> > > > > >> > >
> >> > > > > >> > > We need to touch the logic of JM/SlotPool, because for
> >> > > > > >> MARK_BLOCKLISTED
> >> > > > > >> > , we need to know whether the slot is blocklisted when the
> >> task
> >> > is
> >> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release
> >> the
> >> > > slot
> >> > > > > >> > directly to avoid assigning other tasks (of this job) on
> it.
> >> If
> >> > we
> >> > > > > only
> >> > > > > >> > maintain the blocklist information on the RM, JM needs to
> >> > retrieve
> >> > > > it
> >> > > > > by
> >> > > > > >> > RPC. I think the performance overhead of that is relatively
> >> > large,
> >> > > > so
> >> > > > > I
> >> > > > > >> > think it's worth maintaining the blocklist information on
> >> the JM
> >> > > > side
> >> > > > > >> and
> >> > > > > >> > syncing them.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > @Роман
> >> > > > > >> > >
> >> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap might
> >> be
> >> > > > helpful
> >> > > > > >> > here.”  Can you explain it in detail? I don't fully
> >> understand
> >> > > that.
> >> > > > > In
> >> > > > > >> my
> >> > > > > >> > opinion, non-active and active are the same, and no special
> >> > > > treatment
> >> > > > > is
> >> > > > > >> > required.
> >> > > > > >> > >
> >> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I
> will
> >> > add
> >> > > it
> >> > > > > to
> >> > > > > >> > FLIP.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > @Yang
> >> > > > > >> > >
> >> > > > > >> > > As mentioned above, AFAK, the external system cannot
> >> support
> >> > the
> >> > > > > >> > MARK_BLOCKLISTED action.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > Looking forward to your further feedback.
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > Best,
> >> > > > > >> > >
> >> > > > > >> > > Lijie
> >> > > > > >> > >
> >> > > > > >> > >
> >> > > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> >> > > > > >> > >>
> >> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> >> > > > > >> > >>
> >> > > > > >> > >> I want to share some thoughts about Flink cluster
> >> operations.
> >> > > > > >> > >>
> >> > > > > >> > >> In the production environment, the SRE(aka Site
> >> Reliability
> >> > > > > Engineer)
> >> > > > > >> > >> already has many tools to detect the unstable nodes,
> which
> >> > > could
> >> > > > > take
> >> > > > > >> > the
> >> > > > > >> > >> system logs/metrics into consideration.
> >> > > > > >> > >> Then they use graceful-decomission in YARN and taint in
> >> K8s
> >> > to
> >> > > > > >> prevent
> >> > > > > >> > new
> >> > > > > >> > >> allocations on these unstable nodes.
> >> > > > > >> > >> At last, they will evict all the containers and pods
> >> running
> >> > on
> >> > > > > these
> >> > > > > >> > nodes.
> >> > > > > >> > >> This mechanism also works for planned maintenance. So I
> am
> >> > > afraid
> >> > > > > >> this
> >> > > > > >> > is
> >> > > > > >> > >> not the typical use case for FLIP-224.
> >> > > > > >> > >>
> >> > > > > >> > >> If we only support to block nodes manually, then I could
> >> not
> >> > > see
> >> > > > > >> > >> the obvious advantages compared with current SRE's
> >> > approach(via
> >> > > > > *yarn
> >> > > > > >> > >> rmadmin or kubectl taint*).
> >> > > > > >> > >> At least, we need to have a pluggable component which
> >> could
> >> > > > expose
> >> > > > > >> the
> >> > > > > >> > >> potential unstable nodes automatically and block them if
> >> > > enabled
> >> > > > > >> > explicitly.
> >> > > > > >> > >>
> >> > > > > >> > >>
> >> > > > > >> > >> Best,
> >> > > > > >> > >> Yang
> >> > > > > >> > >>
> >> > > > > >> > >>
> >> > > > > >> > >>
> >> > > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> >> > > > > >> > >>
> >> > > > > >> > >> > Thanks for the proposal, Lijie.
> >> > > > > >> > >> >
> >> > > > > >> > >> > This is an interesting feature and discussion, and
> >> somewhat
> >> > > > > related
> >> > > > > >> > to the
> >> > > > > >> > >> > design principle about how people should operate
> Flink.
> >> > > > > >> > >> >
> >> > > > > >> > >> > I think there are three things involved in this FLIP.
> >> > > > > >> > >> >      a) Detect and report the unstable node.
> >> > > > > >> > >> >      b) Collect the information of the unstable node
> and
> >> > > form a
> >> > > > > >> > blocklist.
> >> > > > > >> > >> >      c) Take the action to block nodes.
> >> > > > > >> > >> >
> >> > > > > >> > >> > My two cents:
> >> > > > > >> > >> >
> >> > > > > >> > >> > 1. It looks like people all agree that Flink should
> have
> >> > c).
> >> > > It
> >> > > > > is
> >> > > > > >> > not only
> >> > > > > >> > >> > useful for cases of node failures, but also handy for
> >> some
> >> > > > > planned
> >> > > > > >> > >> > maintenance.
> >> > > > > >> > >> >
> >> > > > > >> > >> > 2. People have different opinions on b), i.e. who
> >> should be
> >> > > the
> >> > > > > >> brain
> >> > > > > >> > to
> >> > > > > >> > >> > make the decision to block a node. I think this
> largely
> >> > > depends
> >> > > > > on
> >> > > > > >> > who we
> >> > > > > >> > >> > talk to. Different users would probably give different
> >> > > answers.
> >> > > > > For
> >> > > > > >> > people
> >> > > > > >> > >> > who do have a centralized node health management
> >> service,
> >> > let
> >> > > > > Flink
> >> > > > > >> > do just
> >> > > > > >> > >> > do a) and c) would be preferred. So essentially Flink
> >> would
> >> > > be
> >> > > > > one
> >> > > > > >> of
> >> > > > > >> > the
> >> > > > > >> > >> > sources that may detect unstable nodes, report it to
> >> that
> >> > > > > service,
> >> > > > > >> > and then
> >> > > > > >> > >> > take the command from that service to block the
> >> problematic
> >> > > > > nodes.
> >> > > > > >> On
> >> > > > > >> > the
> >> > > > > >> > >> > other hand, for users who do not have such a service,
> >> > simply
> >> > > > > >> letting
> >> > > > > >> > Flink
> >> > > > > >> > >> > be clever by itself to block the suspicious nodes
> might
> >> be
> >> > > > > desired
> >> > > > > >> to
> >> > > > > >> > >> > ensure the jobs are running smoothly.
> >> > > > > >> > >> >
> >> > > > > >> > >> > So that indicates a) and b) here should be pluggable /
> >> > > > optional.
> >> > > > > >> > >> >
> >> > > > > >> > >> > In light of this, maybe it would make sense to have
> >> > something
> >> > > > > >> > pluggable
> >> > > > > >> > >> > like a UnstableNodeReporter which exposes unstable
> nodes
> >> > > > > actively.
> >> > > > > >> (A
> >> > > > > >> > more
> >> > > > > >> > >> > general interface should be JobInfoReporter<T> which
> >> can be
> >> > > > used
> >> > > > > to
> >> > > > > >> > report
> >> > > > > >> > >> > any information of type <T>. But I'll just keep the
> >> scope
> >> > > > > relevant
> >> > > > > >> to
> >> > > > > >> > this
> >> > > > > >> > >> > FLIP here). Personally speaking, I think it is OK to
> >> have a
> >> > > > > default
> >> > > > > >> > >> > implementation of a reporter which just tells Flink to
> >> take
> >> > > > > action
> >> > > > > >> to
> >> > > > > >> > block
> >> > > > > >> > >> > problematic nodes and also unblocks them after
> timeout.
> >> > > > > >> > >> >
> >> > > > > >> > >> > Thanks,
> >> > > > > >> > >> >
> >> > > > > >> > >> > Jiangjie (Becket) Qin
> >> > > > > >> > >> >
> >> > > > > >> > >> >
> >> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> >> > > > ro.v.boyko@gmail.com
> >> > > > > >
> >> > > > > >> > wrote:
> >> > > > > >> > >> >
> >> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> >> > > > > >> > >> > >
> >> > > > > >> > >> > > If it's possible I'd like to participate in
> >> development.
> >> > > > > >> > >> > >
> >> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we
> >> should
> >> > > > > consider
> >> > > > > >> > to move
> >> > > > > >> > >> > > somehow the information of blocklisted nodes/TMs
> from
> >> > > active
> >> > > > > >> > >> > > ResourceManager to non-active ones. Probably storing
> >> > inside
> >> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> >> > > > > >> > >> > >
> >> > > > > >> > >> > > And I agree with Martijn that a lot of organizations
> >> > don't
> >> > > > want
> >> > > > > >> to
> >> > > > > >> > expose
> >> > > > > >> > >> > > such API for a cluster user group. But I think it's
> >> > > necessary
> >> > > > > to
> >> > > > > >> > have the
> >> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for
> >> > avoiding
> >> > > > > >> incorrect
> >> > > > > >> > >> > > automatic behaviour.
> >> > > > > >> > >> > >
> >> > > > > >> > >> > > And another one small suggestion - I think it would
> be
> >> > > better
> >> > > > > to
> >> > > > > >> > extend
> >> > > > > >> > >> > the
> >> > > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp*
> field
> >> and
> >> > > > fill
> >> > > > > it
> >> > > > > >> > at the
> >> > > > > >> > >> > > item creation. This simple addition will allow to:
> >> > > > > >> > >> > >
> >> > > > > >> > >> > >    -
> >> > > > > >> > >> > >
> >> > > > > >> > >> > >    Provide the ability to users to setup the exact
> >> time
> >> > of
> >> > > > > >> > blocklist end
> >> > > > > >> > >> > >    through RestAPI
> >> > > > > >> > >> > >    -
> >> > > > > >> > >> > >
> >> > > > > >> > >> > >    Not being tied to a single value of
> >> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> >> > > > > >> > >> > >
> >> > > > > >> > >> > >
> >> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> >> > > > > >> chesnay@apache.org>
> >> > > > > >> > >> > wrote:
> >> > > > > >> > >> > >
> >> > > > > >> > >> > > > I do share the concern between blurring the lines
> a
> >> > bit.
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > > That said, I'd prefer to not have any
> auto-detection
> >> > and
> >> > > > only
> >> > > > > >> > have an
> >> > > > > >> > >> > > > opt-in mechanism
> >> > > > > >> > >> > > > to manually block processes/nodes. To me this
> sounds
> >> > yet
> >> > > > > again
> >> > > > > >> > like one
> >> > > > > >> > >> > > > of those
> >> > > > > >> > >> > > > magical mechanisms that will rarely work just
> right.
> >> > > > > >> > >> > > > An external system can leverage way more
> information
> >> > > after
> >> > > > > all.
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > > Moreover, I'm quite concerned about the complexity
> >> of
> >> > > this
> >> > > > > >> > proposal.
> >> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing between
> >> > > > components;
> >> > > > > >> > >> > adjustments
> >> > > > > >> > >> > > > to the
> >> > > > > >> > >> > > > slot and resource protocol.
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > > In a way it seems overly complicated.
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > > If we look at it purely from an active resource
> >> > > management
> >> > > > > >> > perspective,
> >> > > > > >> > >> > > > then there
> >> > > > > >> > >> > > > isn't really a need to touch the slot protocol at
> >> all
> >> > (or
> >> > > > in
> >> > > > > >> fact
> >> > > > > >> > to
> >> > > > > >> > >> > > > anything in the JobMaster),
> >> > > > > >> > >> > > > because there isn't any point in keeping around
> >> blocked
> >> > > TMs
> >> > > > > in
> >> > > > > >> the
> >> > > > > >> > >> > first
> >> > > > > >> > >> > > > place.
> >> > > > > >> > >> > > > They'd just be idling, potentially shutting down
> >> after
> >> > a
> >> > > > > while
> >> > > > > >> by
> >> > > > > >> > the
> >> > > > > >> > >> > RM
> >> > > > > >> > >> > > > because of
> >> > > > > >> > >> > > > it (unless we _also_ touch that logic).
> >> > > > > >> > >> > > > Here the blocking of a process (be it by blocking
> >> the
> >> > > > process
> >> > > > > >> or
> >> > > > > >> > node)
> >> > > > > >> > >> > is
> >> > > > > >> > >> > > > equivalent with shutting down the blocked
> >> process(es).
> >> > > > > >> > >> > > > Once the block is lifted we can just spin it back
> >> up.
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > > And I do wonder whether we couldn't apply the same
> >> line
> >> > > of
> >> > > > > >> > thinking to
> >> > > > > >> > >> > > > standalone resource management.
> >> > > > > >> > >> > > > Here being able to stop/restart a process/node
> >> manually
> >> > > > > should
> >> > > > > >> be
> >> > > > > >> > a
> >> > > > > >> > >> > core
> >> > > > > >> > >> > > > requirement for a Flink deployment anyway.
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> >> > > > > >> > >> > > > > Hi everyone,
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > > Thanks for creating this FLIP. I can understand
> >> the
> >> > > > problem
> >> > > > > >> and
> >> > > > > >> > I see
> >> > > > > >> > >> > > > value
> >> > > > > >> > >> > > > > in the automatic detection and blocklisting. I
> do
> >> > have
> >> > > > some
> >> > > > > >> > concerns
> >> > > > > >> > >> > > with
> >> > > > > >> > >> > > > > the ability to manually specify to be blocked
> >> > > resources.
> >> > > > I
> >> > > > > >> have
> >> > > > > >> > two
> >> > > > > >> > >> > > > > concerns;
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > > * Most organizations explicitly have a
> separation
> >> of
> >> > > > > >> concerns,
> >> > > > > >> > >> > meaning
> >> > > > > >> > >> > > > that
> >> > > > > >> > >> > > > > there's a group who's responsible for managing a
> >> > > cluster
> >> > > > > and
> >> > > > > >> > there's
> >> > > > > >> > >> > a
> >> > > > > >> > >> > > > user
> >> > > > > >> > >> > > > > group who uses that cluster. With the
> >> introduction of
> >> > > > this
> >> > > > > >> > mechanism,
> >> > > > > >> > >> > > the
> >> > > > > >> > >> > > > > latter group now can influence the
> responsibility
> >> of
> >> > > the
> >> > > > > >> first
> >> > > > > >> > group.
> >> > > > > >> > >> > > So
> >> > > > > >> > >> > > > it
> >> > > > > >> > >> > > > > can be possible that someone from the user group
> >> > blocks
> >> > > > > >> > something,
> >> > > > > >> > >> > > which
> >> > > > > >> > >> > > > > causes an outage (which could result in paging
> >> > > mechanism
> >> > > > > >> > triggering
> >> > > > > >> > >> > > etc)
> >> > > > > >> > >> > > > > which impacts the first group.
> >> > > > > >> > >> > > > > * How big is the group of people who can go
> >> through
> >> > the
> >> > > > > >> process
> >> > > > > >> > of
> >> > > > > >> > >> > > > manually
> >> > > > > >> > >> > > > > identifying a node that isn't behaving as it
> >> should
> >> > > be? I
> >> > > > > do
> >> > > > > >> > think
> >> > > > > >> > >> > this
> >> > > > > >> > >> > > > > group is relatively limited. Does it then make
> >> sense
> >> > to
> >> > > > > >> > introduce
> >> > > > > >> > >> > such
> >> > > > > >> > >> > > a
> >> > > > > >> > >> > > > > feature, which would only be used by a really
> >> small
> >> > > user
> >> > > > > >> group
> >> > > > > >> > of
> >> > > > > >> > >> > > Flink?
> >> > > > > >> > >> > > > We
> >> > > > > >> > >> > > > > still have to maintain, test and support such a
> >> > > feature.
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm
> >> > leaning
> >> > > > > >> towards
> >> > > > > >> > not
> >> > > > > >> > >> > > > exposing
> >> > > > > >> > >> > > > > this to the user group but having this available
> >> > > strictly
> >> > > > > for
> >> > > > > >> > cluster
> >> > > > > >> > >> > > > > operators. They could then also set up their
> >> > > > > >> > paging/metrics/logging
> >> > > > > >> > >> > > > system
> >> > > > > >> > >> > > > > to take this into account.
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > > Best regards,
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > > Martijn Visser
> >> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> >> > > > > >> > >> > > > > https://github.com/MartijnVisser
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> >> > > > > karmagyz@gmail.com
> >> > > > > >> >
> >> > > > > >> > wrote:
> >> > > > > >> > >> > > > >
> >> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > > >> +1 for the overall proposal. Just share some
> >> cents
> >> > > here:
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > > >> - Why do we need to expose
> >> > > > > >> > >> > > > >>
> >> > cluster.resource-blacklist.item.timeout-check-interval
> >> > > > to
> >> > > > > >> the
> >> > > > > >> > user?
> >> > > > > >> > >> > > > >> I think the semantics of
> >> > > > > >> > `cluster.resource-blacklist.item.timeout`
> >> > > > > >> > >> > is
> >> > > > > >> > >> > > > >> sufficient for the user. How to guarantee the
> >> > timeout
> >> > > > > >> > mechanism is
> >> > > > > >> > >> > > > >> Flink's internal implementation. I think it
> will
> >> be
> >> > > very
> >> > > > > >> > confusing
> >> > > > > >> > >> > and
> >> > > > > >> > >> > > > >> we do not need to expose it to users.
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > > >> - ResourceManager can notify the exception of a
> >> task
> >> > > > > >> manager to
> >> > > > > >> > >> > > > >> `BlacklistHandler` as well.
> >> > > > > >> > >> > > > >> For example, the slot allocation might fail in
> >> case
> >> > > the
> >> > > > > >> target
> >> > > > > >> > task
> >> > > > > >> > >> > > > >> manager is busy or has a network jitter. I
> don't
> >> > mean
> >> > > we
> >> > > > > >> need
> >> > > > > >> > to
> >> > > > > >> > >> > cover
> >> > > > > >> > >> > > > >> this case in this version, but we can also
> open a
> >> > > > > >> > `notifyException`
> >> > > > > >> > >> > in
> >> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > > >> - Before we sync the blocklist to
> >> ResourceManager,
> >> > > will
> >> > > > > the
> >> > > > > >> > slot of
> >> > > > > >> > >> > a
> >> > > > > >> > >> > > > >> blocked task manager continues to be released
> and
> >> > > > > allocated?
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > > >> Best,
> >> > > > > >> > >> > > > >> Yangze Guo
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> >> > > > > >> > >> > wangdachui9501@gmail.com>
> >> > > > > >> > >> > > > >> wrote:
> >> > > > > >> > >> > > > >>> Hi Konstantin,
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> Thanks for your feedback. I will response
> your 4
> >> > > > remarks:
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the
> controversy. I
> >> > > think
> >> > > > > >> > “BlockList”
> >> > > > > >> > >> > is
> >> > > > > >> > >> > > > >> good
> >> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good
> >> idea.
> >> > > > Based
> >> > > > > >> on
> >> > > > > >> > the
> >> > > > > >> > >> > > > above, I
> >> > > > > >> > >> > > > >>> would change REST API as following:
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> DELETE
> <host>/blocklist/taskmanager/<identifier>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means
> >> that
> >> > > all
> >> > > > > >> task
> >> > > > > >> > >> > managers
> >> > > > > >> > >> > > > on
> >> > > > > >> > >> > > > >>> this node are blocklisted. All slots on these
> >> TMs
> >> > are
> >> > > > not
> >> > > > > >> > >> > available.
> >> > > > > >> > >> > > > This
> >> > > > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs
> >> are
> >> > > not
> >> > > > > >> really
> >> > > > > >> > lost,
> >> > > > > >> > >> > > > they
> >> > > > > >> > >> > > > >>> are in an unavailable status, and they are
> still
> >> > > > > registered
> >> > > > > >> > in this
> >> > > > > >> > >> > > > flink
> >> > > > > >> > >> > > > >>> cluster. They will be available again once the
> >> > > > > >> corresponding
> >> > > > > >> > >> > > blocklist
> >> > > > > >> > >> > > > >> item
> >> > > > > >> > >> > > > >>> is removed. This behavior is the same in
> >> > > > > active/non-active
> >> > > > > >> > >> > clusters.
> >> > > > > >> > >> > > > >>> However in the active clusters, these TMs may
> be
> >> > > > released
> >> > > > > >> due
> >> > > > > >> > to
> >> > > > > >> > >> > idle
> >> > > > > >> > >> > > > >>> timeouts.
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it.
> >> The
> >> > > > reasons
> >> > > > > >> are
> >> > > > > >> > as
> >> > > > > >> > >> > > > >> following:
> >> > > > > >> > >> > > > >>> a) The timeout will not affect users adding or
> >> > > removing
> >> > > > > >> items
> >> > > > > >> > via
> >> > > > > >> > >> > > REST
> >> > > > > >> > >> > > > >> API,
> >> > > > > >> > >> > > > >>> and users can disable it by configuring it to
> >> > > > > >> Long.MAX_VALUE .
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> b) Some node problems can recover after a
> >> period of
> >> > > > time
> >> > > > > >> > (such as
> >> > > > > >> > >> > > > machine
> >> > > > > >> > >> > > > >>> hotspots), in which case users may prefer that
> >> > Flink
> >> > > > can
> >> > > > > do
> >> > > > > >> > this
> >> > > > > >> > >> > > > >>> automatically instead of requiring the user to
> >> do
> >> > it
> >> > > > > >> manually.
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> Best,
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> Lijie
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org>
> >> 于2022年4月27日周三
> >> > > > > >> 19:23写道：
> >> > > > > >> > >> > > > >>>
> >> > > > > >> > >> > > > >>>> Hi Lijie,
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> I think, this makes sense and +1 to only
> >> support
> >> > > > > manually
> >> > > > > >> > blocking
> >> > > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different
> >> > > strategies
> >> > > > > can
> >> > > > > >> > also be
> >> > > > > >> > >> > > > >>>> maintained outside of Apache Flink.
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> A few remarks:
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist"
> >> due
> >> > to
> >> > > > the
> >> > > > > >> > >> > controversy
> >> > > > > >> > >> > > > >> around
> >> > > > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket
> >> about
> >> > > this
> >> > > > > >> topic a
> >> > > > > >> > >> > while
> >> > > > > >> > >> > > > >> back
> >> > > > > >> > >> > > > >>>> and there was generally a consensus to avoid
> >> the
> >> > > term
> >> > > > > >> > blacklist &
> >> > > > > >> > >> > > > >> whitelist
> >> > > > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> >> > > > > "quarantined"
> >> > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly
> >> > different
> >> > > > > >> design
> >> > > > > >> > as
> >> > > > > >> > >> > verbs
> >> > > > > >> > >> > > > >> like
> >> > > > > >> > >> > > > >>>> add/remove often considered an anti-pattern
> for
> >> > REST
> >> > > > > APIs.
> >> > > > > >> > POST
> >> > > > > >> > >> > on a
> >> > > > > >> > >> > > > >> list
> >> > > > > >> > >> > > > >>>> item is generally the standard to add items.
> >> > DELETE
> >> > > on
> >> > > > > the
> >> > > > > >> > >> > > individual
> >> > > > > >> > >> > > > >>>> resource is standard to remove an item.
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
> >> > > > > >> > >> > > > >>>> DELETE
> <host>/quarantine/items/<itemidentifier>
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> We could also consider to separate
> taskmanagers
> >> > and
> >> > > > > nodes
> >> > > > > >> in
> >> > > > > >> > the
> >> > > > > >> > >> > > REST
> >> > > > > >> > >> > > > >> API
> >> > > > > >> > >> > > > >>>> (and internal data structures). Any opinion
> on
> >> > this?
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> >> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> >> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> >> > > > > >> > >> > > > >>>> DELETE
> >> <host>/quarantine/taskmanager/<identifier>
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with
> >> non-active
> >> > > > > >> resource
> >> > > > > >> > >> > > managers,
> >> > > > > >> > >> > > > >> i.e.
> >> > > > > >> > >> > > > >>>> standalone or reactive mode?
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> 4) To keep the implementation even more
> >> minimal,
> >> > do
> >> > > we
> >> > > > > >> need
> >> > > > > >> > the
> >> > > > > >> > >> > > > timeout
> >> > > > > >> > >> > > > >>>> behavior? If items are added/removed manually
> >> we
> >> > > could
> >> > > > > >> > delegate
> >> > > > > >> > >> > this
> >> > > > > >> > >> > > > >> to the
> >> > > > > >> > >> > > > >>>> user easily. In my opinion the timeout
> behavior
> >> > > would
> >> > > > > >> better
> >> > > > > >> > fit
> >> > > > > >> > >> > > into
> >> > > > > >> > >> > > > >>>> specific strategies at a later point.
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> Cheers and thank you,
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> Konstantin
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> [1]
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > >
> >> > > > > >> > >> >
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >> > > > > >> > >> > > > >>>> [2]
> >> > > https://issues.apache.org/jira/browse/FLINK-18209
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb
> >> Lijie
> >> > > Wang
> >> > > > <
> >> > > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> >> > > > > >> > >> > > > >>>>
> >> > > > > >> > >> > > > >>>>> Hi all,
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> Flink job failures may happen due to cluster
> >> node
> >> > > > > issues
> >> > > > > >> > >> > > > >> (insufficient
> >> > > > > >> > >> > > > >>>> disk
> >> > > > > >> > >> > > > >>>>> space, bad hardware, network abnormalities).
> >> > Flink
> >> > > > will
> >> > > > > >> > take care
> >> > > > > >> > >> > > of
> >> > > > > >> > >> > > > >> the
> >> > > > > >> > >> > > > >>>>> failures and redeploy the tasks. However,
> due
> >> to
> >> > > data
> >> > > > > >> > locality
> >> > > > > >> > >> > and
> >> > > > > >> > >> > > > >>>> limited
> >> > > > > >> > >> > > > >>>>> resources, the new tasks are very likely to
> be
> >> > > > > redeployed
> >> > > > > >> > to the
> >> > > > > >> > >> > > same
> >> > > > > >> > >> > > > >>>>> nodes, which will result in continuous task
> >> > > > > abnormalities
> >> > > > > >> > and
> >> > > > > >> > >> > > affect
> >> > > > > >> > >> > > > >> job
> >> > > > > >> > >> > > > >>>>> progress.
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> Currently, Flink users need to manually
> >> identify
> >> > > the
> >> > > > > >> > problematic
> >> > > > > >> > >> > > > >> node and
> >> > > > > >> > >> > > > >>>>> take it offline to solve this problem. But
> >> this
> >> > > > > approach
> >> > > > > >> has
> >> > > > > >> > >> > > > >> following
> >> > > > > >> > >> > > > >>>>> disadvantages:
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy
> >> process.
> >> > > > Users
> >> > > > > >> may
> >> > > > > >> > need
> >> > > > > >> > >> > to
> >> > > > > >> > >> > > > >>>> contact
> >> > > > > >> > >> > > > >>>>> cluster administors to do this. The
> operation
> >> can
> >> > > > even
> >> > > > > be
> >> > > > > >> > >> > dangerous
> >> > > > > >> > >> > > > >> and
> >> > > > > >> > >> > > > >>>> not
> >> > > > > >> > >> > > > >>>>> allowed during some important business
> events.
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of
> >> problems
> >> > > > > manually
> >> > > > > >> > would
> >> > > > > >> > >> > be
> >> > > > > >> > >> > > > >> slow
> >> > > > > >> > >> > > > >>>> and
> >> > > > > >> > >> > > > >>>>> a waste of human resources.
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose
> >> to
> >> > > > > >> introduce a
> >> > > > > >> > >> > > blacklist
> >> > > > > >> > >> > > > >>>>> mechanism for Flink to filter out
> problematic
> >> > > > > resources.
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> You can find more details in FLIP-224[1].
> >> Looking
> >> > > > > forward
> >> > > > > >> > to your
> >> > > > > >> > >> > > > >>>> feedback.
> >> > > > > >> > >> > > > >>>>> [1]
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > >
> >> > > > > >> > >> >
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> Best,
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > > >>>>> Lijie
> >> > > > > >> > >> > > > >>>>>
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > > >
> >> > > > > >> > >> > >
> >> > > > > >> > >> >
> >> > > > > >> >
> >> > > > > >>
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Roman Boyko
> >> > > e.: ro.v.boyko@gmail.com
> >> > >
> >> >
> >>
> >
>


-- 
https://twitter.com/snntrable
https://github.com/knaufk

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi everyone,

I've had an offline discussion with Becket Qin and Zhu Zhu, and made the
following changes on REST API:
1. To avoid ambiguity, *timeout* and *endTimestamp* can only choose one. If
both are specified, will return error.
2.  If the specified item is already there, the *ADD* operation has two
behaviors:  *return error*(default value) or *merge/update*, and we add a
flag to the request body to control it. You can find more details "Public
Interface" section.

If there is no more feedback, we will start the vote thread next week.

Best,
Lijie

Lijie Wang <wa...@gmail.com> 于2022年5月10日周二 17:14写道：

> Hi Becket Qin,
>
> Thanks for your suggestions.  I have moved the description of
> configurations, metrics and REST API into "Public Interface" section, and
> made a few updates according to your suggestion.  And in this FLIP, there
> no public java Interfaces or pluggables that users need to implement by
> themselves.
>
> Answers for you questions:
> 1. Yes, there 2 block actions: MARK_BLOCKED and.
> MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block items can
> only be added through the REST API, so these 2 action are mentioned in the
> REST API part (The REST API part has beed moved to public interface now).
> 2. I agree with you. I have changed the "Cause" field to String, and allow
> users to specify it via REST API.
> 3. Yes, it is useful to allow different timeouts. As mentioned above, we
> will introduce 2 fields : *timeout* and *endTimestamp* into the ADD REST
> API to specify when to remove the blocked item. These 2 fields are
> optional, if neither is specified, it means that the blocked item is
> permanent and will not be removed. If both are specified, the minimum of
> *currentTimestamp+tiemout *and* endTimestamp* will be used as the time to
> remove the blocked item. To keep the configurations more minimal, we have
> removed the *cluster.resource-blocklist.item.timeout* configuration
> option.
> 4. Yes, the block item will be overridden if the specified item already
> exists. The ADD operation is *ADD or UPDATE*.
> 5. Yes. On JM/RM side, all the blocklist information is maintained in
> JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or abstracted
> to other interfaces) will be propagated to different components.
>
> Best,
> Lijie
>
> Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：
>
>> Hi Lijie,
>>
>> Thanks for updating the FLIP. It looks like the public interface section
>> did not fully reflect all the user sensible behavior and API. Can you put
>> everything that users may be aware of there? That would include the REST
>> API, metrics, configurations, public java Interfaces or pluggables that
>> users may see or implement by themselves, as well as a brief summary of
>> the
>> behavior of the public API.
>>
>> Besides that, I have a few questions:
>>
>> 1. According to the conversation in the discussion thread, it looks like
>> the BlockAction will have "MARK_BLOCKLISTED" and
>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you
>> add
>> that to the public interface as well?
>>
>> 2. At this point, the "Cause" field in the BlockingItem is a Throwable and
>> is not reflected in the REST API. Should that be included in the query
>> response? And should we change that field to be a String so users may
>> specify the cause via the REST API when they block some nodes / TMs?
>>
>> 3. Would it be useful to allow users to have different timeouts for
>> different blocked items? So while there is a default timeout, users can
>> also override it via the REST API when they block an entity.
>>
>> 4. Regarding the ADD operation, if the specified item is already there,
>> will the block item be overridden? For example, if the user wants to
>> extend
>> the timeout of a blocked item, can they just  issue an ADD command again?
>>
>> 5. I am not quite familiar with the details of this, but is there a source
>> of truth for the blocked list? I think it might be good to have a single
>> source of truth for the blocked list and just propagate that list to
>> different components to take the action of actually blocking the resource.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wa...@gmail.com>
>> wrote:
>>
>> > Hi everyone,
>> >
>> > Based on the discussion in the mailing list, I updated the FLIP doc, the
>> > changes include:
>> > 1. Changed the description of the motivation section to more clearly
>> > describe the problem this FLIP is trying to solve.
>> > 2. Only  *Manually* is supported.
>> > 3. Adopted some suggestions, such as *endTimestamp*.
>> >
>> > Best,
>> > Lijie
>> >
>> >
>> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>> >
>> > > Hi Lijie!
>> > >
>> > >
>> > >
>> > >
>> > > *a) “Probably storing inside Zookeeper/Configmap might be
>> helpfulhere.”
>> > > Can you explain it in detail? I don't fully understand that. In
>> > myopinion,
>> > > non-active and active are the same, and no special treatment
>> isrequired.*
>> > >
>> > > Sorry this was a misunderstanding from my side. I thought we were
>> talking
>> > > about the HA mode (but not about Active and Standalone
>> ResourceManager).
>> > > And the original question was - how to handle the blacklisted nodes
>> list
>> > at
>> > > the moment of leader change? Should we simply forget about them or
>> try to
>> > > pre-save that list on the remote storage?
>> > >
>> > > On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com> wrote:
>> > >
>> > > > Thanks Lijie and ZhuZhu for the explanation.
>> > > >
>> > > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is
>> indeed
>> > > > some functionalities the external tools(e.g. kubectl taint) could
>> not
>> > > > support.
>> > > >
>> > > >
>> > > > Best,
>> > > > Yang
>> > > >
>> > > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
>> > > >
>> > > > > Thanks for your feedback, Jiangang and Martijn.
>> > > > >
>> > > > > @Jiangang
>> > > > >
>> > > > >
>> > > > > > For auto-detecting, I wonder how to make the strategy and mark a
>> > node
>> > > > > blocked?
>> > > > >
>> > > > > In fact, we currently plan to not support auto-detection in this
>> > FLIP.
>> > > > The
>> > > > > part about auto-detection may be continued in a separate FLIP in
>> the
>> > > > > future. Some guys have the same concerns as you, and the
>> correctness
>> > > and
>> > > > > necessity of auto-detection may require further discussion in the
>> > > future.
>> > > > >
>> > > > > > In session mode, multi jobs can fail on the same bad node and
>> the
>> > > node
>> > > > > should be marked blocked.
>> > > > > By design, the blocklist information will be shared among all jobs
>> > in a
>> > > > > cluster/session. The JM will sync blocklist information with RM.
>> > > > >
>> > > > > @Martijn
>> > > > >
>> > > > > > I agree with Yang Wang on this.
>> > > > > As Zhu Zhu and I mentioned above, we think the
>> MARK_BLOCKLISTED(Just
>> > > > limits
>> > > > > the load of the node and does not  kill all the processes on it)
>> is
>> > > also
>> > > > > important, and we think that external systems (*yarn rmadmin or
>> > kubectl
>> > > > > taint*) cannot support it. So we think it makes sense even only
>> > > > *manually*.
>> > > > >
>> > > > > > I also agree with Chesnay that magical mechanisms are indeed
>> super
>> > > hard
>> > > > > to get right.
>> > > > > Yes, as you see, Jiangang(and a few others) have the same concern.
>> > > > > However, we currently plan to not support auto-detection in this
>> > FLIP,
>> > > > and
>> > > > > only *manually*. In addition, I'd like to say that the FLIP
>> provides
>> > a
>> > > > > mechanism to support MARK_BLOCKLISTED and
>> > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
>> > > > > the auto-detection may be done by external systems.
>> > > > >
>> > > > > Best,
>> > > > > Lijie
>> > > > >
>> > > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
>> > > > >
>> > > > > > > If we only support to block nodes manually, then I could not
>> see
>> > > > > > the obvious advantages compared with current SRE's approach(via
>> > *yarn
>> > > > > > rmadmin or kubectl taint*).
>> > > > > >
>> > > > > > I agree with Yang Wang on this.
>> > > > > >
>> > > > > > >  To me this sounds yet again like one of those magical
>> mechanisms
>> > > > that
>> > > > > > will rarely work just right.
>> > > > > >
>> > > > > > I also agree with Chesnay that magical mechanisms are indeed
>> super
>> > > hard
>> > > > > to
>> > > > > > get right.
>> > > > > >
>> > > > > > Best regards,
>> > > > > >
>> > > > > > Martijn
>> > > > > >
>> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
>> > liujiangangpeng@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Thanks for the valuable design. The auto-detecting can decrease
>> > > great
>> > > > > work
>> > > > > >> for us. We have implemented the similar feature in our inner
>> flink
>> > > > > >> version.
>> > > > > >> Below is something that I care about:
>> > > > > >>
>> > > > > >>    1. For auto-detecting, I wonder how to make the strategy and
>> > > mark a
>> > > > > >> node
>> > > > > >>    blocked? Sometimes the blocked node is hard to be detected,
>> for
>> > > > > >> example,
>> > > > > >>    the upper node or the down node will be blocked when network
>> > > > > >> unreachable.
>> > > > > >>    2. I see that the strategy is made in JobMaster side. How
>> about
>> > > > > >>    implementing the similar logic in resource manager? In
>> session
>> > > > mode,
>> > > > > >> multi
>> > > > > >>    jobs can fail on the same bad node and the node should be
>> > marked
>> > > > > >> blocked.
>> > > > > >>    If the job makes the strategy, the node may be not marked
>> > blocked
>> > > > if
>> > > > > >> the
>> > > > > >>    fail times don't exceed the threshold.
>> > > > > >>
>> > > > > >>
>> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>> > > > > >>
>> > > > > >> > Thank you for all your feedback!
>> > > > > >> >
>> > > > > >> > Besides the answers from Lijie, I'd like to share some of my
>> > > > thoughts:
>> > > > > >> > 1. Whether to enable automatical blocklist
>> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
>> > > > > >> > The automatical way should be something built upon the
>> blocklist
>> > > > > >> > mechanism and well decoupled. It was designed to be a
>> > configurable
>> > > > > >> > blocklist strategy, but I think we can further decouple it by
>> > > > > >> > introducing a abnormal node detector, as Becket suggested,
>> which
>> > > > just
>> > > > > >> > uses the blocklist mechanism once bad nodes are detected.
>> > However,
>> > > > it
>> > > > > >> > should be a separate FLIP with further dev discussions and
>> > > feedback
>> > > > > >> > from users. I also agree with Becket that different users
>> have
>> > > > > different
>> > > > > >> > requirements, and we should listen to them.
>> > > > > >> >
>> > > > > >> > 2. Is it enough to just take away abnormal nodes externally
>> > > > > >> > My answer is no. As Lijie has mentioned, we need a way to
>> avoid
>> > > > > >> > deploying tasks to temporary hot nodes. In this case, users
>> may
>> > > just
>> > > > > >> > want to limit the load of the node and do not want to kill
>> all
>> > the
>> > > > > >> > processes on it. Another case is the speculative execution[1]
>> > > which
>> > > > > >> > may also leverage this feature to avoid starting mirror
>> tasks on
>> > > > slow
>> > > > > >> > nodes.
>> > > > > >> >
>> > > > > >> > Thanks,
>> > > > > >> > Zhu
>> > > > > >> >
>> > > > > >> > [1]
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> > > > > >> >
>> > > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
>> > > > > >> >
>> > > > > >> > >
>> > > > > >> > > Hi everyone,
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > Thanks for your feedback.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > There's one detail that I'd like to re-emphasize here
>> because
>> > it
>> > > > can
>> > > > > >> > affect the value and design of the blocklist mechanism
>> (perhaps
>> > I
>> > > > > should
>> > > > > >> > highlight it in the FLIP). We propose two actions in FLIP:
>> > > > > >> > >
>> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as
>> > > > blocked.
>> > > > > >> > Future slots should not be allocated from the blocked task
>> > manager
>> > > > or
>> > > > > >> node.
>> > > > > >> > But slots that are already allocated will not be affected. A
>> > > typical
>> > > > > >> > application scenario is to mitigate machine hotspots. In this
>> > > case,
>> > > > we
>> > > > > >> hope
>> > > > > >> > that subsequent resource allocations will not be on the hot
>> > > machine,
>> > > > > but
>> > > > > >> > tasks currently running on it should not be affected.
>> > > > > >> > >
>> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
>> manager
>> > or
>> > > > > node
>> > > > > >> as
>> > > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will
>> be
>> > > > > >> restarted on
>> > > > > >> > non-blocked task managers.
>> > > > > >> > >
>> > > > > >> > > For the above 2 actions, the former may more highlight the
>> > > meaning
>> > > > > of
>> > > > > >> > this FLIP, because the external system cannot do that.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > Regarding *Manually* and *Automatically*, I basically agree
>> > with
>> > > > > >> @Becket
>> > > > > >> > Qin: different users have different answers. Not all users’
>> > > > deployment
>> > > > > >> > environments have a special external system that can perform
>> the
>> > > > > anomaly
>> > > > > >> > detection. In addition, adding pluggable/optional
>> auto-detection
>> > > > > doesn't
>> > > > > >> > require much extra work on top of manual specification.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > I will answer your other questions one by one.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > @Yangze
>> > > > > >> > >
>> > > > > >> > > a) I think you are right, we do not need to expose the
>> > > > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to
>> > users.
>> > > > > >> > >
>> > > > > >> > > b) We can abstract the `notifyException` to a separate
>> > interface
>> > > > > >> (maybe
>> > > > > >> > BlocklistExceptionListener), and the
>> > > ResourceManagerBlocklistHandler
>> > > > > can
>> > > > > >> > implement it in the future.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > @Martijn
>> > > > > >> > >
>> > > > > >> > > a) I also think the manual blocking should be done by
>> cluster
>> > > > > >> operators.
>> > > > > >> > >
>> > > > > >> > > b) I think manual blocking makes sense, because according
>> to
>> > my
>> > > > > >> > experience, users are often the first to perceive the machine
>> > > > problems
>> > > > > >> > (because of job failover or delay), and they will contact
>> > cluster
>> > > > > >> operators
>> > > > > >> > to solve it, or even tell the cluster operators which
>> machine is
>> > > > > >> > problematic. From this point of view, I think the people who
>> > > really
>> > > > > need
>> > > > > >> > the manual blocking are the users, and it’s just performed by
>> > the
>> > > > > >> cluster
>> > > > > >> > operator, so I think the manual blocking makes sense.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > @Chesnay
>> > > > > >> > >
>> > > > > >> > > We need to touch the logic of JM/SlotPool, because for
>> > > > > >> MARK_BLOCKLISTED
>> > > > > >> > , we need to know whether the slot is blocklisted when the
>> task
>> > is
>> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release
>> the
>> > > slot
>> > > > > >> > directly to avoid assigning other tasks (of this job) on it.
>> If
>> > we
>> > > > > only
>> > > > > >> > maintain the blocklist information on the RM, JM needs to
>> > retrieve
>> > > > it
>> > > > > by
>> > > > > >> > RPC. I think the performance overhead of that is relatively
>> > large,
>> > > > so
>> > > > > I
>> > > > > >> > think it's worth maintaining the blocklist information on
>> the JM
>> > > > side
>> > > > > >> and
>> > > > > >> > syncing them.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > @Роман
>> > > > > >> > >
>> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap might
>> be
>> > > > helpful
>> > > > > >> > here.”  Can you explain it in detail? I don't fully
>> understand
>> > > that.
>> > > > > In
>> > > > > >> my
>> > > > > >> > opinion, non-active and active are the same, and no special
>> > > > treatment
>> > > > > is
>> > > > > >> > required.
>> > > > > >> > >
>> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will
>> > add
>> > > it
>> > > > > to
>> > > > > >> > FLIP.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > @Yang
>> > > > > >> > >
>> > > > > >> > > As mentioned above, AFAK, the external system cannot
>> support
>> > the
>> > > > > >> > MARK_BLOCKLISTED action.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > Looking forward to your further feedback.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > Best,
>> > > > > >> > >
>> > > > > >> > > Lijie
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
>> > > > > >> > >>
>> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
>> > > > > >> > >>
>> > > > > >> > >> I want to share some thoughts about Flink cluster
>> operations.
>> > > > > >> > >>
>> > > > > >> > >> In the production environment, the SRE(aka Site
>> Reliability
>> > > > > Engineer)
>> > > > > >> > >> already has many tools to detect the unstable nodes, which
>> > > could
>> > > > > take
>> > > > > >> > the
>> > > > > >> > >> system logs/metrics into consideration.
>> > > > > >> > >> Then they use graceful-decomission in YARN and taint in
>> K8s
>> > to
>> > > > > >> prevent
>> > > > > >> > new
>> > > > > >> > >> allocations on these unstable nodes.
>> > > > > >> > >> At last, they will evict all the containers and pods
>> running
>> > on
>> > > > > these
>> > > > > >> > nodes.
>> > > > > >> > >> This mechanism also works for planned maintenance. So I am
>> > > afraid
>> > > > > >> this
>> > > > > >> > is
>> > > > > >> > >> not the typical use case for FLIP-224.
>> > > > > >> > >>
>> > > > > >> > >> If we only support to block nodes manually, then I could
>> not
>> > > see
>> > > > > >> > >> the obvious advantages compared with current SRE's
>> > approach(via
>> > > > > *yarn
>> > > > > >> > >> rmadmin or kubectl taint*).
>> > > > > >> > >> At least, we need to have a pluggable component which
>> could
>> > > > expose
>> > > > > >> the
>> > > > > >> > >> potential unstable nodes automatically and block them if
>> > > enabled
>> > > > > >> > explicitly.
>> > > > > >> > >>
>> > > > > >> > >>
>> > > > > >> > >> Best,
>> > > > > >> > >> Yang
>> > > > > >> > >>
>> > > > > >> > >>
>> > > > > >> > >>
>> > > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
>> > > > > >> > >>
>> > > > > >> > >> > Thanks for the proposal, Lijie.
>> > > > > >> > >> >
>> > > > > >> > >> > This is an interesting feature and discussion, and
>> somewhat
>> > > > > related
>> > > > > >> > to the
>> > > > > >> > >> > design principle about how people should operate Flink.
>> > > > > >> > >> >
>> > > > > >> > >> > I think there are three things involved in this FLIP.
>> > > > > >> > >> >      a) Detect and report the unstable node.
>> > > > > >> > >> >      b) Collect the information of the unstable node and
>> > > form a
>> > > > > >> > blocklist.
>> > > > > >> > >> >      c) Take the action to block nodes.
>> > > > > >> > >> >
>> > > > > >> > >> > My two cents:
>> > > > > >> > >> >
>> > > > > >> > >> > 1. It looks like people all agree that Flink should have
>> > c).
>> > > It
>> > > > > is
>> > > > > >> > not only
>> > > > > >> > >> > useful for cases of node failures, but also handy for
>> some
>> > > > > planned
>> > > > > >> > >> > maintenance.
>> > > > > >> > >> >
>> > > > > >> > >> > 2. People have different opinions on b), i.e. who
>> should be
>> > > the
>> > > > > >> brain
>> > > > > >> > to
>> > > > > >> > >> > make the decision to block a node. I think this largely
>> > > depends
>> > > > > on
>> > > > > >> > who we
>> > > > > >> > >> > talk to. Different users would probably give different
>> > > answers.
>> > > > > For
>> > > > > >> > people
>> > > > > >> > >> > who do have a centralized node health management
>> service,
>> > let
>> > > > > Flink
>> > > > > >> > do just
>> > > > > >> > >> > do a) and c) would be preferred. So essentially Flink
>> would
>> > > be
>> > > > > one
>> > > > > >> of
>> > > > > >> > the
>> > > > > >> > >> > sources that may detect unstable nodes, report it to
>> that
>> > > > > service,
>> > > > > >> > and then
>> > > > > >> > >> > take the command from that service to block the
>> problematic
>> > > > > nodes.
>> > > > > >> On
>> > > > > >> > the
>> > > > > >> > >> > other hand, for users who do not have such a service,
>> > simply
>> > > > > >> letting
>> > > > > >> > Flink
>> > > > > >> > >> > be clever by itself to block the suspicious nodes might
>> be
>> > > > > desired
>> > > > > >> to
>> > > > > >> > >> > ensure the jobs are running smoothly.
>> > > > > >> > >> >
>> > > > > >> > >> > So that indicates a) and b) here should be pluggable /
>> > > > optional.
>> > > > > >> > >> >
>> > > > > >> > >> > In light of this, maybe it would make sense to have
>> > something
>> > > > > >> > pluggable
>> > > > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes
>> > > > > actively.
>> > > > > >> (A
>> > > > > >> > more
>> > > > > >> > >> > general interface should be JobInfoReporter<T> which
>> can be
>> > > > used
>> > > > > to
>> > > > > >> > report
>> > > > > >> > >> > any information of type <T>. But I'll just keep the
>> scope
>> > > > > relevant
>> > > > > >> to
>> > > > > >> > this
>> > > > > >> > >> > FLIP here). Personally speaking, I think it is OK to
>> have a
>> > > > > default
>> > > > > >> > >> > implementation of a reporter which just tells Flink to
>> take
>> > > > > action
>> > > > > >> to
>> > > > > >> > block
>> > > > > >> > >> > problematic nodes and also unblocks them after timeout.
>> > > > > >> > >> >
>> > > > > >> > >> > Thanks,
>> > > > > >> > >> >
>> > > > > >> > >> > Jiangjie (Becket) Qin
>> > > > > >> > >> >
>> > > > > >> > >> >
>> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
>> > > > ro.v.boyko@gmail.com
>> > > > > >
>> > > > > >> > wrote:
>> > > > > >> > >> >
>> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
>> > > > > >> > >> > >
>> > > > > >> > >> > > If it's possible I'd like to participate in
>> development.
>> > > > > >> > >> > >
>> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we
>> should
>> > > > > consider
>> > > > > >> > to move
>> > > > > >> > >> > > somehow the information of blocklisted nodes/TMs from
>> > > active
>> > > > > >> > >> > > ResourceManager to non-active ones. Probably storing
>> > inside
>> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
>> > > > > >> > >> > >
>> > > > > >> > >> > > And I agree with Martijn that a lot of organizations
>> > don't
>> > > > want
>> > > > > >> to
>> > > > > >> > expose
>> > > > > >> > >> > > such API for a cluster user group. But I think it's
>> > > necessary
>> > > > > to
>> > > > > >> > have the
>> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for
>> > avoiding
>> > > > > >> incorrect
>> > > > > >> > >> > > automatic behaviour.
>> > > > > >> > >> > >
>> > > > > >> > >> > > And another one small suggestion - I think it would be
>> > > better
>> > > > > to
>> > > > > >> > extend
>> > > > > >> > >> > the
>> > > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field
>> and
>> > > > fill
>> > > > > it
>> > > > > >> > at the
>> > > > > >> > >> > > item creation. This simple addition will allow to:
>> > > > > >> > >> > >
>> > > > > >> > >> > >    -
>> > > > > >> > >> > >
>> > > > > >> > >> > >    Provide the ability to users to setup the exact
>> time
>> > of
>> > > > > >> > blocklist end
>> > > > > >> > >> > >    through RestAPI
>> > > > > >> > >> > >    -
>> > > > > >> > >> > >
>> > > > > >> > >> > >    Not being tied to a single value of
>> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
>> > > > > >> > >> > >
>> > > > > >> > >> > >
>> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
>> > > > > >> chesnay@apache.org>
>> > > > > >> > >> > wrote:
>> > > > > >> > >> > >
>> > > > > >> > >> > > > I do share the concern between blurring the lines a
>> > bit.
>> > > > > >> > >> > > >
>> > > > > >> > >> > > > That said, I'd prefer to not have any auto-detection
>> > and
>> > > > only
>> > > > > >> > have an
>> > > > > >> > >> > > > opt-in mechanism
>> > > > > >> > >> > > > to manually block processes/nodes. To me this sounds
>> > yet
>> > > > > again
>> > > > > >> > like one
>> > > > > >> > >> > > > of those
>> > > > > >> > >> > > > magical mechanisms that will rarely work just right.
>> > > > > >> > >> > > > An external system can leverage way more information
>> > > after
>> > > > > all.
>> > > > > >> > >> > > >
>> > > > > >> > >> > > > Moreover, I'm quite concerned about the complexity
>> of
>> > > this
>> > > > > >> > proposal.
>> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing between
>> > > > components;
>> > > > > >> > >> > adjustments
>> > > > > >> > >> > > > to the
>> > > > > >> > >> > > > slot and resource protocol.
>> > > > > >> > >> > > >
>> > > > > >> > >> > > > In a way it seems overly complicated.
>> > > > > >> > >> > > >
>> > > > > >> > >> > > > If we look at it purely from an active resource
>> > > management
>> > > > > >> > perspective,
>> > > > > >> > >> > > > then there
>> > > > > >> > >> > > > isn't really a need to touch the slot protocol at
>> all
>> > (or
>> > > > in
>> > > > > >> fact
>> > > > > >> > to
>> > > > > >> > >> > > > anything in the JobMaster),
>> > > > > >> > >> > > > because there isn't any point in keeping around
>> blocked
>> > > TMs
>> > > > > in
>> > > > > >> the
>> > > > > >> > >> > first
>> > > > > >> > >> > > > place.
>> > > > > >> > >> > > > They'd just be idling, potentially shutting down
>> after
>> > a
>> > > > > while
>> > > > > >> by
>> > > > > >> > the
>> > > > > >> > >> > RM
>> > > > > >> > >> > > > because of
>> > > > > >> > >> > > > it (unless we _also_ touch that logic).
>> > > > > >> > >> > > > Here the blocking of a process (be it by blocking
>> the
>> > > > process
>> > > > > >> or
>> > > > > >> > node)
>> > > > > >> > >> > is
>> > > > > >> > >> > > > equivalent with shutting down the blocked
>> process(es).
>> > > > > >> > >> > > > Once the block is lifted we can just spin it back
>> up.
>> > > > > >> > >> > > >
>> > > > > >> > >> > > > And I do wonder whether we couldn't apply the same
>> line
>> > > of
>> > > > > >> > thinking to
>> > > > > >> > >> > > > standalone resource management.
>> > > > > >> > >> > > > Here being able to stop/restart a process/node
>> manually
>> > > > > should
>> > > > > >> be
>> > > > > >> > a
>> > > > > >> > >> > core
>> > > > > >> > >> > > > requirement for a Flink deployment anyway.
>> > > > > >> > >> > > >
>> > > > > >> > >> > > >
>> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
>> > > > > >> > >> > > > > Hi everyone,
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > > Thanks for creating this FLIP. I can understand
>> the
>> > > > problem
>> > > > > >> and
>> > > > > >> > I see
>> > > > > >> > >> > > > value
>> > > > > >> > >> > > > > in the automatic detection and blocklisting. I do
>> > have
>> > > > some
>> > > > > >> > concerns
>> > > > > >> > >> > > with
>> > > > > >> > >> > > > > the ability to manually specify to be blocked
>> > > resources.
>> > > > I
>> > > > > >> have
>> > > > > >> > two
>> > > > > >> > >> > > > > concerns;
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > > * Most organizations explicitly have a separation
>> of
>> > > > > >> concerns,
>> > > > > >> > >> > meaning
>> > > > > >> > >> > > > that
>> > > > > >> > >> > > > > there's a group who's responsible for managing a
>> > > cluster
>> > > > > and
>> > > > > >> > there's
>> > > > > >> > >> > a
>> > > > > >> > >> > > > user
>> > > > > >> > >> > > > > group who uses that cluster. With the
>> introduction of
>> > > > this
>> > > > > >> > mechanism,
>> > > > > >> > >> > > the
>> > > > > >> > >> > > > > latter group now can influence the responsibility
>> of
>> > > the
>> > > > > >> first
>> > > > > >> > group.
>> > > > > >> > >> > > So
>> > > > > >> > >> > > > it
>> > > > > >> > >> > > > > can be possible that someone from the user group
>> > blocks
>> > > > > >> > something,
>> > > > > >> > >> > > which
>> > > > > >> > >> > > > > causes an outage (which could result in paging
>> > > mechanism
>> > > > > >> > triggering
>> > > > > >> > >> > > etc)
>> > > > > >> > >> > > > > which impacts the first group.
>> > > > > >> > >> > > > > * How big is the group of people who can go
>> through
>> > the
>> > > > > >> process
>> > > > > >> > of
>> > > > > >> > >> > > > manually
>> > > > > >> > >> > > > > identifying a node that isn't behaving as it
>> should
>> > > be? I
>> > > > > do
>> > > > > >> > think
>> > > > > >> > >> > this
>> > > > > >> > >> > > > > group is relatively limited. Does it then make
>> sense
>> > to
>> > > > > >> > introduce
>> > > > > >> > >> > such
>> > > > > >> > >> > > a
>> > > > > >> > >> > > > > feature, which would only be used by a really
>> small
>> > > user
>> > > > > >> group
>> > > > > >> > of
>> > > > > >> > >> > > Flink?
>> > > > > >> > >> > > > We
>> > > > > >> > >> > > > > still have to maintain, test and support such a
>> > > feature.
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm
>> > leaning
>> > > > > >> towards
>> > > > > >> > not
>> > > > > >> > >> > > > exposing
>> > > > > >> > >> > > > > this to the user group but having this available
>> > > strictly
>> > > > > for
>> > > > > >> > cluster
>> > > > > >> > >> > > > > operators. They could then also set up their
>> > > > > >> > paging/metrics/logging
>> > > > > >> > >> > > > system
>> > > > > >> > >> > > > > to take this into account.
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > > Best regards,
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > > Martijn Visser
>> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
>> > > > > >> > >> > > > > https://github.com/MartijnVisser
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>> > > > > karmagyz@gmail.com
>> > > > > >> >
>> > > > > >> > wrote:
>> > > > > >> > >> > > > >
>> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > > >> +1 for the overall proposal. Just share some
>> cents
>> > > here:
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > > >> - Why do we need to expose
>> > > > > >> > >> > > > >>
>> > cluster.resource-blacklist.item.timeout-check-interval
>> > > > to
>> > > > > >> the
>> > > > > >> > user?
>> > > > > >> > >> > > > >> I think the semantics of
>> > > > > >> > `cluster.resource-blacklist.item.timeout`
>> > > > > >> > >> > is
>> > > > > >> > >> > > > >> sufficient for the user. How to guarantee the
>> > timeout
>> > > > > >> > mechanism is
>> > > > > >> > >> > > > >> Flink's internal implementation. I think it will
>> be
>> > > very
>> > > > > >> > confusing
>> > > > > >> > >> > and
>> > > > > >> > >> > > > >> we do not need to expose it to users.
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > > >> - ResourceManager can notify the exception of a
>> task
>> > > > > >> manager to
>> > > > > >> > >> > > > >> `BlacklistHandler` as well.
>> > > > > >> > >> > > > >> For example, the slot allocation might fail in
>> case
>> > > the
>> > > > > >> target
>> > > > > >> > task
>> > > > > >> > >> > > > >> manager is busy or has a network jitter. I don't
>> > mean
>> > > we
>> > > > > >> need
>> > > > > >> > to
>> > > > > >> > >> > cover
>> > > > > >> > >> > > > >> this case in this version, but we can also open a
>> > > > > >> > `notifyException`
>> > > > > >> > >> > in
>> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > > >> - Before we sync the blocklist to
>> ResourceManager,
>> > > will
>> > > > > the
>> > > > > >> > slot of
>> > > > > >> > >> > a
>> > > > > >> > >> > > > >> blocked task manager continues to be released and
>> > > > > allocated?
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > > >> Best,
>> > > > > >> > >> > > > >> Yangze Guo
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
>> > > > > >> > >> > wangdachui9501@gmail.com>
>> > > > > >> > >> > > > >> wrote:
>> > > > > >> > >> > > > >>> Hi Konstantin,
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> Thanks for your feedback. I will response your 4
>> > > > remarks:
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I
>> > > think
>> > > > > >> > “BlockList”
>> > > > > >> > >> > is
>> > > > > >> > >> > > > >> good
>> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good
>> idea.
>> > > > Based
>> > > > > >> on
>> > > > > >> > the
>> > > > > >> > >> > > > above, I
>> > > > > >> > >> > > > >>> would change REST API as following:
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means
>> that
>> > > all
>> > > > > >> task
>> > > > > >> > >> > managers
>> > > > > >> > >> > > > on
>> > > > > >> > >> > > > >>> this node are blocklisted. All slots on these
>> TMs
>> > are
>> > > > not
>> > > > > >> > >> > available.
>> > > > > >> > >> > > > This
>> > > > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs
>> are
>> > > not
>> > > > > >> really
>> > > > > >> > lost,
>> > > > > >> > >> > > > they
>> > > > > >> > >> > > > >>> are in an unavailable status, and they are still
>> > > > > registered
>> > > > > >> > in this
>> > > > > >> > >> > > > flink
>> > > > > >> > >> > > > >>> cluster. They will be available again once the
>> > > > > >> corresponding
>> > > > > >> > >> > > blocklist
>> > > > > >> > >> > > > >> item
>> > > > > >> > >> > > > >>> is removed. This behavior is the same in
>> > > > > active/non-active
>> > > > > >> > >> > clusters.
>> > > > > >> > >> > > > >>> However in the active clusters, these TMs may be
>> > > > released
>> > > > > >> due
>> > > > > >> > to
>> > > > > >> > >> > idle
>> > > > > >> > >> > > > >>> timeouts.
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it.
>> The
>> > > > reasons
>> > > > > >> are
>> > > > > >> > as
>> > > > > >> > >> > > > >> following:
>> > > > > >> > >> > > > >>> a) The timeout will not affect users adding or
>> > > removing
>> > > > > >> items
>> > > > > >> > via
>> > > > > >> > >> > > REST
>> > > > > >> > >> > > > >> API,
>> > > > > >> > >> > > > >>> and users can disable it by configuring it to
>> > > > > >> Long.MAX_VALUE .
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> b) Some node problems can recover after a
>> period of
>> > > > time
>> > > > > >> > (such as
>> > > > > >> > >> > > > machine
>> > > > > >> > >> > > > >>> hotspots), in which case users may prefer that
>> > Flink
>> > > > can
>> > > > > do
>> > > > > >> > this
>> > > > > >> > >> > > > >>> automatically instead of requiring the user to
>> do
>> > it
>> > > > > >> manually.
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> Best,
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> Lijie
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org>
>> 于2022年4月27日周三
>> > > > > >> 19:23写道：
>> > > > > >> > >> > > > >>>
>> > > > > >> > >> > > > >>>> Hi Lijie,
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> I think, this makes sense and +1 to only
>> support
>> > > > > manually
>> > > > > >> > blocking
>> > > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different
>> > > strategies
>> > > > > can
>> > > > > >> > also be
>> > > > > >> > >> > > > >>>> maintained outside of Apache Flink.
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> A few remarks:
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist"
>> due
>> > to
>> > > > the
>> > > > > >> > >> > controversy
>> > > > > >> > >> > > > >> around
>> > > > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket
>> about
>> > > this
>> > > > > >> topic a
>> > > > > >> > >> > while
>> > > > > >> > >> > > > >> back
>> > > > > >> > >> > > > >>>> and there was generally a consensus to avoid
>> the
>> > > term
>> > > > > >> > blacklist &
>> > > > > >> > >> > > > >> whitelist
>> > > > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
>> > > > > "quarantined"
>> > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly
>> > different
>> > > > > >> design
>> > > > > >> > as
>> > > > > >> > >> > verbs
>> > > > > >> > >> > > > >> like
>> > > > > >> > >> > > > >>>> add/remove often considered an anti-pattern for
>> > REST
>> > > > > APIs.
>> > > > > >> > POST
>> > > > > >> > >> > on a
>> > > > > >> > >> > > > >> list
>> > > > > >> > >> > > > >>>> item is generally the standard to add items.
>> > DELETE
>> > > on
>> > > > > the
>> > > > > >> > >> > > individual
>> > > > > >> > >> > > > >>>> resource is standard to remove an item.
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
>> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> We could also consider to separate taskmanagers
>> > and
>> > > > > nodes
>> > > > > >> in
>> > > > > >> > the
>> > > > > >> > >> > > REST
>> > > > > >> > >> > > > >> API
>> > > > > >> > >> > > > >>>> (and internal data structures). Any opinion on
>> > this?
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
>> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
>> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
>> > > > > >> > >> > > > >>>> DELETE
>> <host>/quarantine/taskmanager/<identifier>
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with
>> non-active
>> > > > > >> resource
>> > > > > >> > >> > > managers,
>> > > > > >> > >> > > > >> i.e.
>> > > > > >> > >> > > > >>>> standalone or reactive mode?
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> 4) To keep the implementation even more
>> minimal,
>> > do
>> > > we
>> > > > > >> need
>> > > > > >> > the
>> > > > > >> > >> > > > timeout
>> > > > > >> > >> > > > >>>> behavior? If items are added/removed manually
>> we
>> > > could
>> > > > > >> > delegate
>> > > > > >> > >> > this
>> > > > > >> > >> > > > >> to the
>> > > > > >> > >> > > > >>>> user easily. In my opinion the timeout behavior
>> > > would
>> > > > > >> better
>> > > > > >> > fit
>> > > > > >> > >> > > into
>> > > > > >> > >> > > > >>>> specific strategies at a later point.
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> Cheers and thank you,
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> Konstantin
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> [1]
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > >
>> > > > > >> > >> > >
>> > > > > >> > >> >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>> > > > > >> > >> > > > >>>> [2]
>> > > https://issues.apache.org/jira/browse/FLINK-18209
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb
>> Lijie
>> > > Wang
>> > > > <
>> > > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
>> > > > > >> > >> > > > >>>>
>> > > > > >> > >> > > > >>>>> Hi all,
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> Flink job failures may happen due to cluster
>> node
>> > > > > issues
>> > > > > >> > >> > > > >> (insufficient
>> > > > > >> > >> > > > >>>> disk
>> > > > > >> > >> > > > >>>>> space, bad hardware, network abnormalities).
>> > Flink
>> > > > will
>> > > > > >> > take care
>> > > > > >> > >> > > of
>> > > > > >> > >> > > > >> the
>> > > > > >> > >> > > > >>>>> failures and redeploy the tasks. However, due
>> to
>> > > data
>> > > > > >> > locality
>> > > > > >> > >> > and
>> > > > > >> > >> > > > >>>> limited
>> > > > > >> > >> > > > >>>>> resources, the new tasks are very likely to be
>> > > > > redeployed
>> > > > > >> > to the
>> > > > > >> > >> > > same
>> > > > > >> > >> > > > >>>>> nodes, which will result in continuous task
>> > > > > abnormalities
>> > > > > >> > and
>> > > > > >> > >> > > affect
>> > > > > >> > >> > > > >> job
>> > > > > >> > >> > > > >>>>> progress.
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> Currently, Flink users need to manually
>> identify
>> > > the
>> > > > > >> > problematic
>> > > > > >> > >> > > > >> node and
>> > > > > >> > >> > > > >>>>> take it offline to solve this problem. But
>> this
>> > > > > approach
>> > > > > >> has
>> > > > > >> > >> > > > >> following
>> > > > > >> > >> > > > >>>>> disadvantages:
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy
>> process.
>> > > > Users
>> > > > > >> may
>> > > > > >> > need
>> > > > > >> > >> > to
>> > > > > >> > >> > > > >>>> contact
>> > > > > >> > >> > > > >>>>> cluster administors to do this. The operation
>> can
>> > > > even
>> > > > > be
>> > > > > >> > >> > dangerous
>> > > > > >> > >> > > > >> and
>> > > > > >> > >> > > > >>>> not
>> > > > > >> > >> > > > >>>>> allowed during some important business events.
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of
>> problems
>> > > > > manually
>> > > > > >> > would
>> > > > > >> > >> > be
>> > > > > >> > >> > > > >> slow
>> > > > > >> > >> > > > >>>> and
>> > > > > >> > >> > > > >>>>> a waste of human resources.
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose
>> to
>> > > > > >> introduce a
>> > > > > >> > >> > > blacklist
>> > > > > >> > >> > > > >>>>> mechanism for Flink to filter out problematic
>> > > > > resources.
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> You can find more details in FLIP-224[1].
>> Looking
>> > > > > forward
>> > > > > >> > to your
>> > > > > >> > >> > > > >>>> feedback.
>> > > > > >> > >> > > > >>>>> [1]
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>
>> > > > > >> > >> > > >
>> > > > > >> > >> > >
>> > > > > >> > >> >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> Best,
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > > >>>>> Lijie
>> > > > > >> > >> > > > >>>>>
>> > > > > >> > >> > > >
>> > > > > >> > >> > > >
>> > > > > >> > >> > >
>> > > > > >> > >> >
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > > Roman Boyko
>> > > e.: ro.v.boyko@gmail.com
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi Becket Qin,

Thanks for your suggestions.  I have moved the description of
configurations, metrics and REST API into "Public Interface" section, and
made a few updates according to your suggestion.  And in this FLIP, there
no public java Interfaces or pluggables that users need to implement by
themselves.

Answers for you questions:
1. Yes, there 2 block actions: MARK_BLOCKED and.
MARK_BLOCKED_AND_EVACUATE_TASKS (has renamed). Currently, block items can
only be added through the REST API, so these 2 action are mentioned in the
REST API part (The REST API part has beed moved to public interface now).
2. I agree with you. I have changed the "Cause" field to String, and allow
users to specify it via REST API.
3. Yes, it is useful to allow different timeouts. As mentioned above, we
will introduce 2 fields : *timeout* and *endTimestamp* into the ADD REST
API to specify when to remove the blocked item. These 2 fields are
optional, if neither is specified, it means that the blocked item is
permanent and will not be removed. If both are specified, the minimum of
*currentTimestamp+tiemout *and* endTimestamp* will be used as the time to
remove the blocked item. To keep the configurations more minimal, we have
removed the *cluster.resource-blocklist.item.timeout* configuration option.
4. Yes, the block item will be overridden if the specified item already
exists. The ADD operation is *ADD or UPDATE*.
5. Yes. On JM/RM side, all the blocklist information is maintained in
JMBlocklistHandler/RMBlocklistHandler. The blocklist handler(or abstracted
to other interfaces) will be propagated to different components.

Best,
Lijie

Becket Qin <be...@gmail.com> 于2022年5月10日周二 11:26写道：

> Hi Lijie,
>
> Thanks for updating the FLIP. It looks like the public interface section
> did not fully reflect all the user sensible behavior and API. Can you put
> everything that users may be aware of there? That would include the REST
> API, metrics, configurations, public java Interfaces or pluggables that
> users may see or implement by themselves, as well as a brief summary of the
> behavior of the public API.
>
> Besides that, I have a few questions:
>
> 1. According to the conversation in the discussion thread, it looks like
> the BlockAction will have "MARK_BLOCKLISTED" and
> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you add
> that to the public interface as well?
>
> 2. At this point, the "Cause" field in the BlockingItem is a Throwable and
> is not reflected in the REST API. Should that be included in the query
> response? And should we change that field to be a String so users may
> specify the cause via the REST API when they block some nodes / TMs?
>
> 3. Would it be useful to allow users to have different timeouts for
> different blocked items? So while there is a default timeout, users can
> also override it via the REST API when they block an entity.
>
> 4. Regarding the ADD operation, if the specified item is already there,
> will the block item be overridden? For example, if the user wants to extend
> the timeout of a blocked item, can they just  issue an ADD command again?
>
> 5. I am not quite familiar with the details of this, but is there a source
> of truth for the blocked list? I think it might be good to have a single
> source of truth for the blocked list and just propagate that list to
> different components to take the action of actually blocking the resource.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wa...@gmail.com>
> wrote:
>
> > Hi everyone,
> >
> > Based on the discussion in the mailing list, I updated the FLIP doc, the
> > changes include:
> > 1. Changed the description of the motivation section to more clearly
> > describe the problem this FLIP is trying to solve.
> > 2. Only  *Manually* is supported.
> > 3. Adopted some suggestions, such as *endTimestamp*.
> >
> > Best,
> > Lijie
> >
> >
> > Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
> >
> > > Hi Lijie!
> > >
> > >
> > >
> > >
> > > *a) “Probably storing inside Zookeeper/Configmap might be helpfulhere.”
> > > Can you explain it in detail? I don't fully understand that. In
> > myopinion,
> > > non-active and active are the same, and no special treatment
> isrequired.*
> > >
> > > Sorry this was a misunderstanding from my side. I thought we were
> talking
> > > about the HA mode (but not about Active and Standalone
> ResourceManager).
> > > And the original question was - how to handle the blacklisted nodes
> list
> > at
> > > the moment of leader change? Should we simply forget about them or try
> to
> > > pre-save that list on the remote storage?
> > >
> > > On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com> wrote:
> > >
> > > > Thanks Lijie and ZhuZhu for the explanation.
> > > >
> > > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is
> indeed
> > > > some functionalities the external tools(e.g. kubectl taint) could not
> > > > support.
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> > > >
> > > > > Thanks for your feedback, Jiangang and Martijn.
> > > > >
> > > > > @Jiangang
> > > > >
> > > > >
> > > > > > For auto-detecting, I wonder how to make the strategy and mark a
> > node
> > > > > blocked?
> > > > >
> > > > > In fact, we currently plan to not support auto-detection in this
> > FLIP.
> > > > The
> > > > > part about auto-detection may be continued in a separate FLIP in
> the
> > > > > future. Some guys have the same concerns as you, and the
> correctness
> > > and
> > > > > necessity of auto-detection may require further discussion in the
> > > future.
> > > > >
> > > > > > In session mode, multi jobs can fail on the same bad node and the
> > > node
> > > > > should be marked blocked.
> > > > > By design, the blocklist information will be shared among all jobs
> > in a
> > > > > cluster/session. The JM will sync blocklist information with RM.
> > > > >
> > > > > @Martijn
> > > > >
> > > > > > I agree with Yang Wang on this.
> > > > > As Zhu Zhu and I mentioned above, we think the
> MARK_BLOCKLISTED(Just
> > > > limits
> > > > > the load of the node and does not  kill all the processes on it) is
> > > also
> > > > > important, and we think that external systems (*yarn rmadmin or
> > kubectl
> > > > > taint*) cannot support it. So we think it makes sense even only
> > > > *manually*.
> > > > >
> > > > > > I also agree with Chesnay that magical mechanisms are indeed
> super
> > > hard
> > > > > to get right.
> > > > > Yes, as you see, Jiangang(and a few others) have the same concern.
> > > > > However, we currently plan to not support auto-detection in this
> > FLIP,
> > > > and
> > > > > only *manually*. In addition, I'd like to say that the FLIP
> provides
> > a
> > > > > mechanism to support MARK_BLOCKLISTED and
> > > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > > > > the auto-detection may be done by external systems.
> > > > >
> > > > > Best,
> > > > > Lijie
> > > > >
> > > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
> > > > >
> > > > > > > If we only support to block nodes manually, then I could not
> see
> > > > > > the obvious advantages compared with current SRE's approach(via
> > *yarn
> > > > > > rmadmin or kubectl taint*).
> > > > > >
> > > > > > I agree with Yang Wang on this.
> > > > > >
> > > > > > >  To me this sounds yet again like one of those magical
> mechanisms
> > > > that
> > > > > > will rarely work just right.
> > > > > >
> > > > > > I also agree with Chesnay that magical mechanisms are indeed
> super
> > > hard
> > > > > to
> > > > > > get right.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martijn
> > > > > >
> > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> > liujiangangpeng@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Thanks for the valuable design. The auto-detecting can decrease
> > > great
> > > > > work
> > > > > >> for us. We have implemented the similar feature in our inner
> flink
> > > > > >> version.
> > > > > >> Below is something that I care about:
> > > > > >>
> > > > > >>    1. For auto-detecting, I wonder how to make the strategy and
> > > mark a
> > > > > >> node
> > > > > >>    blocked? Sometimes the blocked node is hard to be detected,
> for
> > > > > >> example,
> > > > > >>    the upper node or the down node will be blocked when network
> > > > > >> unreachable.
> > > > > >>    2. I see that the strategy is made in JobMaster side. How
> about
> > > > > >>    implementing the similar logic in resource manager? In
> session
> > > > mode,
> > > > > >> multi
> > > > > >>    jobs can fail on the same bad node and the node should be
> > marked
> > > > > >> blocked.
> > > > > >>    If the job makes the strategy, the node may be not marked
> > blocked
> > > > if
> > > > > >> the
> > > > > >>    fail times don't exceed the threshold.
> > > > > >>
> > > > > >>
> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > > > > >>
> > > > > >> > Thank you for all your feedback!
> > > > > >> >
> > > > > >> > Besides the answers from Lijie, I'd like to share some of my
> > > > thoughts:
> > > > > >> > 1. Whether to enable automatical blocklist
> > > > > >> > Generally speaking, it is not a goal of FLIP-224.
> > > > > >> > The automatical way should be something built upon the
> blocklist
> > > > > >> > mechanism and well decoupled. It was designed to be a
> > configurable
> > > > > >> > blocklist strategy, but I think we can further decouple it by
> > > > > >> > introducing a abnormal node detector, as Becket suggested,
> which
> > > > just
> > > > > >> > uses the blocklist mechanism once bad nodes are detected.
> > However,
> > > > it
> > > > > >> > should be a separate FLIP with further dev discussions and
> > > feedback
> > > > > >> > from users. I also agree with Becket that different users have
> > > > > different
> > > > > >> > requirements, and we should listen to them.
> > > > > >> >
> > > > > >> > 2. Is it enough to just take away abnormal nodes externally
> > > > > >> > My answer is no. As Lijie has mentioned, we need a way to
> avoid
> > > > > >> > deploying tasks to temporary hot nodes. In this case, users
> may
> > > just
> > > > > >> > want to limit the load of the node and do not want to kill all
> > the
> > > > > >> > processes on it. Another case is the speculative execution[1]
> > > which
> > > > > >> > may also leverage this feature to avoid starting mirror tasks
> on
> > > > slow
> > > > > >> > nodes.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Zhu
> > > > > >> >
> > > > > >> > [1]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > > > >> >
> > > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Hi everyone,
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Thanks for your feedback.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > There's one detail that I'd like to re-emphasize here
> because
> > it
> > > > can
> > > > > >> > affect the value and design of the blocklist mechanism
> (perhaps
> > I
> > > > > should
> > > > > >> > highlight it in the FLIP). We propose two actions in FLIP:
> > > > > >> > >
> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as
> > > > blocked.
> > > > > >> > Future slots should not be allocated from the blocked task
> > manager
> > > > or
> > > > > >> node.
> > > > > >> > But slots that are already allocated will not be affected. A
> > > typical
> > > > > >> > application scenario is to mitigate machine hotspots. In this
> > > case,
> > > > we
> > > > > >> hope
> > > > > >> > that subsequent resource allocations will not be on the hot
> > > machine,
> > > > > but
> > > > > >> > tasks currently running on it should not be affected.
> > > > > >> > >
> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
> manager
> > or
> > > > > node
> > > > > >> as
> > > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be
> > > > > >> restarted on
> > > > > >> > non-blocked task managers.
> > > > > >> > >
> > > > > >> > > For the above 2 actions, the former may more highlight the
> > > meaning
> > > > > of
> > > > > >> > this FLIP, because the external system cannot do that.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Regarding *Manually* and *Automatically*, I basically agree
> > with
> > > > > >> @Becket
> > > > > >> > Qin: different users have different answers. Not all users’
> > > > deployment
> > > > > >> > environments have a special external system that can perform
> the
> > > > > anomaly
> > > > > >> > detection. In addition, adding pluggable/optional
> auto-detection
> > > > > doesn't
> > > > > >> > require much extra work on top of manual specification.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > I will answer your other questions one by one.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Yangze
> > > > > >> > >
> > > > > >> > > a) I think you are right, we do not need to expose the
> > > > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to
> > users.
> > > > > >> > >
> > > > > >> > > b) We can abstract the `notifyException` to a separate
> > interface
> > > > > >> (maybe
> > > > > >> > BlocklistExceptionListener), and the
> > > ResourceManagerBlocklistHandler
> > > > > can
> > > > > >> > implement it in the future.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Martijn
> > > > > >> > >
> > > > > >> > > a) I also think the manual blocking should be done by
> cluster
> > > > > >> operators.
> > > > > >> > >
> > > > > >> > > b) I think manual blocking makes sense, because according to
> > my
> > > > > >> > experience, users are often the first to perceive the machine
> > > > problems
> > > > > >> > (because of job failover or delay), and they will contact
> > cluster
> > > > > >> operators
> > > > > >> > to solve it, or even tell the cluster operators which machine
> is
> > > > > >> > problematic. From this point of view, I think the people who
> > > really
> > > > > need
> > > > > >> > the manual blocking are the users, and it’s just performed by
> > the
> > > > > >> cluster
> > > > > >> > operator, so I think the manual blocking makes sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Chesnay
> > > > > >> > >
> > > > > >> > > We need to touch the logic of JM/SlotPool, because for
> > > > > >> MARK_BLOCKLISTED
> > > > > >> > , we need to know whether the slot is blocklisted when the
> task
> > is
> > > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the
> > > slot
> > > > > >> > directly to avoid assigning other tasks (of this job) on it.
> If
> > we
> > > > > only
> > > > > >> > maintain the blocklist information on the RM, JM needs to
> > retrieve
> > > > it
> > > > > by
> > > > > >> > RPC. I think the performance overhead of that is relatively
> > large,
> > > > so
> > > > > I
> > > > > >> > think it's worth maintaining the blocklist information on the
> JM
> > > > side
> > > > > >> and
> > > > > >> > syncing them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Роман
> > > > > >> > >
> > > > > >> > >     a) “Probably storing inside Zookeeper/Configmap might be
> > > > helpful
> > > > > >> > here.”  Can you explain it in detail? I don't fully understand
> > > that.
> > > > > In
> > > > > >> my
> > > > > >> > opinion, non-active and active are the same, and no special
> > > > treatment
> > > > > is
> > > > > >> > required.
> > > > > >> > >
> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will
> > add
> > > it
> > > > > to
> > > > > >> > FLIP.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > @Yang
> > > > > >> > >
> > > > > >> > > As mentioned above, AFAK, the external system cannot support
> > the
> > > > > >> > MARK_BLOCKLISTED action.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Looking forward to your further feedback.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Best,
> > > > > >> > >
> > > > > >> > > Lijie
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> > > > > >> > >>
> > > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > > > > >> > >>
> > > > > >> > >> I want to share some thoughts about Flink cluster
> operations.
> > > > > >> > >>
> > > > > >> > >> In the production environment, the SRE(aka Site Reliability
> > > > > Engineer)
> > > > > >> > >> already has many tools to detect the unstable nodes, which
> > > could
> > > > > take
> > > > > >> > the
> > > > > >> > >> system logs/metrics into consideration.
> > > > > >> > >> Then they use graceful-decomission in YARN and taint in K8s
> > to
> > > > > >> prevent
> > > > > >> > new
> > > > > >> > >> allocations on these unstable nodes.
> > > > > >> > >> At last, they will evict all the containers and pods
> running
> > on
> > > > > these
> > > > > >> > nodes.
> > > > > >> > >> This mechanism also works for planned maintenance. So I am
> > > afraid
> > > > > >> this
> > > > > >> > is
> > > > > >> > >> not the typical use case for FLIP-224.
> > > > > >> > >>
> > > > > >> > >> If we only support to block nodes manually, then I could
> not
> > > see
> > > > > >> > >> the obvious advantages compared with current SRE's
> > approach(via
> > > > > *yarn
> > > > > >> > >> rmadmin or kubectl taint*).
> > > > > >> > >> At least, we need to have a pluggable component which could
> > > > expose
> > > > > >> the
> > > > > >> > >> potential unstable nodes automatically and block them if
> > > enabled
> > > > > >> > explicitly.
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> Best,
> > > > > >> > >> Yang
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> > > > > >> > >>
> > > > > >> > >> > Thanks for the proposal, Lijie.
> > > > > >> > >> >
> > > > > >> > >> > This is an interesting feature and discussion, and
> somewhat
> > > > > related
> > > > > >> > to the
> > > > > >> > >> > design principle about how people should operate Flink.
> > > > > >> > >> >
> > > > > >> > >> > I think there are three things involved in this FLIP.
> > > > > >> > >> >      a) Detect and report the unstable node.
> > > > > >> > >> >      b) Collect the information of the unstable node and
> > > form a
> > > > > >> > blocklist.
> > > > > >> > >> >      c) Take the action to block nodes.
> > > > > >> > >> >
> > > > > >> > >> > My two cents:
> > > > > >> > >> >
> > > > > >> > >> > 1. It looks like people all agree that Flink should have
> > c).
> > > It
> > > > > is
> > > > > >> > not only
> > > > > >> > >> > useful for cases of node failures, but also handy for
> some
> > > > > planned
> > > > > >> > >> > maintenance.
> > > > > >> > >> >
> > > > > >> > >> > 2. People have different opinions on b), i.e. who should
> be
> > > the
> > > > > >> brain
> > > > > >> > to
> > > > > >> > >> > make the decision to block a node. I think this largely
> > > depends
> > > > > on
> > > > > >> > who we
> > > > > >> > >> > talk to. Different users would probably give different
> > > answers.
> > > > > For
> > > > > >> > people
> > > > > >> > >> > who do have a centralized node health management service,
> > let
> > > > > Flink
> > > > > >> > do just
> > > > > >> > >> > do a) and c) would be preferred. So essentially Flink
> would
> > > be
> > > > > one
> > > > > >> of
> > > > > >> > the
> > > > > >> > >> > sources that may detect unstable nodes, report it to that
> > > > > service,
> > > > > >> > and then
> > > > > >> > >> > take the command from that service to block the
> problematic
> > > > > nodes.
> > > > > >> On
> > > > > >> > the
> > > > > >> > >> > other hand, for users who do not have such a service,
> > simply
> > > > > >> letting
> > > > > >> > Flink
> > > > > >> > >> > be clever by itself to block the suspicious nodes might
> be
> > > > > desired
> > > > > >> to
> > > > > >> > >> > ensure the jobs are running smoothly.
> > > > > >> > >> >
> > > > > >> > >> > So that indicates a) and b) here should be pluggable /
> > > > optional.
> > > > > >> > >> >
> > > > > >> > >> > In light of this, maybe it would make sense to have
> > something
> > > > > >> > pluggable
> > > > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes
> > > > > actively.
> > > > > >> (A
> > > > > >> > more
> > > > > >> > >> > general interface should be JobInfoReporter<T> which can
> be
> > > > used
> > > > > to
> > > > > >> > report
> > > > > >> > >> > any information of type <T>. But I'll just keep the scope
> > > > > relevant
> > > > > >> to
> > > > > >> > this
> > > > > >> > >> > FLIP here). Personally speaking, I think it is OK to
> have a
> > > > > default
> > > > > >> > >> > implementation of a reporter which just tells Flink to
> take
> > > > > action
> > > > > >> to
> > > > > >> > block
> > > > > >> > >> > problematic nodes and also unblocks them after timeout.
> > > > > >> > >> >
> > > > > >> > >> > Thanks,
> > > > > >> > >> >
> > > > > >> > >> > Jiangjie (Becket) Qin
> > > > > >> > >> >
> > > > > >> > >> >
> > > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > > > ro.v.boyko@gmail.com
> > > > > >
> > > > > >> > wrote:
> > > > > >> > >> >
> > > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > > > > >> > >> > >
> > > > > >> > >> > > If it's possible I'd like to participate in
> development.
> > > > > >> > >> > >
> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we
> should
> > > > > consider
> > > > > >> > to move
> > > > > >> > >> > > somehow the information of blocklisted nodes/TMs from
> > > active
> > > > > >> > >> > > ResourceManager to non-active ones. Probably storing
> > inside
> > > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > > > > >> > >> > >
> > > > > >> > >> > > And I agree with Martijn that a lot of organizations
> > don't
> > > > want
> > > > > >> to
> > > > > >> > expose
> > > > > >> > >> > > such API for a cluster user group. But I think it's
> > > necessary
> > > > > to
> > > > > >> > have the
> > > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for
> > avoiding
> > > > > >> incorrect
> > > > > >> > >> > > automatic behaviour.
> > > > > >> > >> > >
> > > > > >> > >> > > And another one small suggestion - I think it would be
> > > better
> > > > > to
> > > > > >> > extend
> > > > > >> > >> > the
> > > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field
> and
> > > > fill
> > > > > it
> > > > > >> > at the
> > > > > >> > >> > > item creation. This simple addition will allow to:
> > > > > >> > >> > >
> > > > > >> > >> > >    -
> > > > > >> > >> > >
> > > > > >> > >> > >    Provide the ability to users to setup the exact time
> > of
> > > > > >> > blocklist end
> > > > > >> > >> > >    through RestAPI
> > > > > >> > >> > >    -
> > > > > >> > >> > >
> > > > > >> > >> > >    Not being tied to a single value of
> > > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > > > > >> > >> > >
> > > > > >> > >> > >
> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > > > > >> chesnay@apache.org>
> > > > > >> > >> > wrote:
> > > > > >> > >> > >
> > > > > >> > >> > > > I do share the concern between blurring the lines a
> > bit.
> > > > > >> > >> > > >
> > > > > >> > >> > > > That said, I'd prefer to not have any auto-detection
> > and
> > > > only
> > > > > >> > have an
> > > > > >> > >> > > > opt-in mechanism
> > > > > >> > >> > > > to manually block processes/nodes. To me this sounds
> > yet
> > > > > again
> > > > > >> > like one
> > > > > >> > >> > > > of those
> > > > > >> > >> > > > magical mechanisms that will rarely work just right.
> > > > > >> > >> > > > An external system can leverage way more information
> > > after
> > > > > all.
> > > > > >> > >> > > >
> > > > > >> > >> > > > Moreover, I'm quite concerned about the complexity of
> > > this
> > > > > >> > proposal.
> > > > > >> > >> > > > Tracking on both the RM/JM side; syncing between
> > > > components;
> > > > > >> > >> > adjustments
> > > > > >> > >> > > > to the
> > > > > >> > >> > > > slot and resource protocol.
> > > > > >> > >> > > >
> > > > > >> > >> > > > In a way it seems overly complicated.
> > > > > >> > >> > > >
> > > > > >> > >> > > > If we look at it purely from an active resource
> > > management
> > > > > >> > perspective,
> > > > > >> > >> > > > then there
> > > > > >> > >> > > > isn't really a need to touch the slot protocol at all
> > (or
> > > > in
> > > > > >> fact
> > > > > >> > to
> > > > > >> > >> > > > anything in the JobMaster),
> > > > > >> > >> > > > because there isn't any point in keeping around
> blocked
> > > TMs
> > > > > in
> > > > > >> the
> > > > > >> > >> > first
> > > > > >> > >> > > > place.
> > > > > >> > >> > > > They'd just be idling, potentially shutting down
> after
> > a
> > > > > while
> > > > > >> by
> > > > > >> > the
> > > > > >> > >> > RM
> > > > > >> > >> > > > because of
> > > > > >> > >> > > > it (unless we _also_ touch that logic).
> > > > > >> > >> > > > Here the blocking of a process (be it by blocking the
> > > > process
> > > > > >> or
> > > > > >> > node)
> > > > > >> > >> > is
> > > > > >> > >> > > > equivalent with shutting down the blocked
> process(es).
> > > > > >> > >> > > > Once the block is lifted we can just spin it back up.
> > > > > >> > >> > > >
> > > > > >> > >> > > > And I do wonder whether we couldn't apply the same
> line
> > > of
> > > > > >> > thinking to
> > > > > >> > >> > > > standalone resource management.
> > > > > >> > >> > > > Here being able to stop/restart a process/node
> manually
> > > > > should
> > > > > >> be
> > > > > >> > a
> > > > > >> > >> > core
> > > > > >> > >> > > > requirement for a Flink deployment anyway.
> > > > > >> > >> > > >
> > > > > >> > >> > > >
> > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > > > >> > >> > > > > Hi everyone,
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Thanks for creating this FLIP. I can understand the
> > > > problem
> > > > > >> and
> > > > > >> > I see
> > > > > >> > >> > > > value
> > > > > >> > >> > > > > in the automatic detection and blocklisting. I do
> > have
> > > > some
> > > > > >> > concerns
> > > > > >> > >> > > with
> > > > > >> > >> > > > > the ability to manually specify to be blocked
> > > resources.
> > > > I
> > > > > >> have
> > > > > >> > two
> > > > > >> > >> > > > > concerns;
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > * Most organizations explicitly have a separation
> of
> > > > > >> concerns,
> > > > > >> > >> > meaning
> > > > > >> > >> > > > that
> > > > > >> > >> > > > > there's a group who's responsible for managing a
> > > cluster
> > > > > and
> > > > > >> > there's
> > > > > >> > >> > a
> > > > > >> > >> > > > user
> > > > > >> > >> > > > > group who uses that cluster. With the introduction
> of
> > > > this
> > > > > >> > mechanism,
> > > > > >> > >> > > the
> > > > > >> > >> > > > > latter group now can influence the responsibility
> of
> > > the
> > > > > >> first
> > > > > >> > group.
> > > > > >> > >> > > So
> > > > > >> > >> > > > it
> > > > > >> > >> > > > > can be possible that someone from the user group
> > blocks
> > > > > >> > something,
> > > > > >> > >> > > which
> > > > > >> > >> > > > > causes an outage (which could result in paging
> > > mechanism
> > > > > >> > triggering
> > > > > >> > >> > > etc)
> > > > > >> > >> > > > > which impacts the first group.
> > > > > >> > >> > > > > * How big is the group of people who can go through
> > the
> > > > > >> process
> > > > > >> > of
> > > > > >> > >> > > > manually
> > > > > >> > >> > > > > identifying a node that isn't behaving as it should
> > > be? I
> > > > > do
> > > > > >> > think
> > > > > >> > >> > this
> > > > > >> > >> > > > > group is relatively limited. Does it then make
> sense
> > to
> > > > > >> > introduce
> > > > > >> > >> > such
> > > > > >> > >> > > a
> > > > > >> > >> > > > > feature, which would only be used by a really small
> > > user
> > > > > >> group
> > > > > >> > of
> > > > > >> > >> > > Flink?
> > > > > >> > >> > > > We
> > > > > >> > >> > > > > still have to maintain, test and support such a
> > > feature.
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm
> > leaning
> > > > > >> towards
> > > > > >> > not
> > > > > >> > >> > > > exposing
> > > > > >> > >> > > > > this to the user group but having this available
> > > strictly
> > > > > for
> > > > > >> > cluster
> > > > > >> > >> > > > > operators. They could then also set up their
> > > > > >> > paging/metrics/logging
> > > > > >> > >> > > > system
> > > > > >> > >> > > > > to take this into account.
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Best regards,
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Martijn Visser
> > > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > > > > >> > >> > > > > https://github.com/MartijnVisser
> > > > > >> > >> > > > >
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > > > karmagyz@gmail.com
> > > > > >> >
> > > > > >> > wrote:
> > > > > >> > >> > > > >
> > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> +1 for the overall proposal. Just share some cents
> > > here:
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> - Why do we need to expose
> > > > > >> > >> > > > >>
> > cluster.resource-blacklist.item.timeout-check-interval
> > > > to
> > > > > >> the
> > > > > >> > user?
> > > > > >> > >> > > > >> I think the semantics of
> > > > > >> > `cluster.resource-blacklist.item.timeout`
> > > > > >> > >> > is
> > > > > >> > >> > > > >> sufficient for the user. How to guarantee the
> > timeout
> > > > > >> > mechanism is
> > > > > >> > >> > > > >> Flink's internal implementation. I think it will
> be
> > > very
> > > > > >> > confusing
> > > > > >> > >> > and
> > > > > >> > >> > > > >> we do not need to expose it to users.
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> - ResourceManager can notify the exception of a
> task
> > > > > >> manager to
> > > > > >> > >> > > > >> `BlacklistHandler` as well.
> > > > > >> > >> > > > >> For example, the slot allocation might fail in
> case
> > > the
> > > > > >> target
> > > > > >> > task
> > > > > >> > >> > > > >> manager is busy or has a network jitter. I don't
> > mean
> > > we
> > > > > >> need
> > > > > >> > to
> > > > > >> > >> > cover
> > > > > >> > >> > > > >> this case in this version, but we can also open a
> > > > > >> > `notifyException`
> > > > > >> > >> > in
> > > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> - Before we sync the blocklist to ResourceManager,
> > > will
> > > > > the
> > > > > >> > slot of
> > > > > >> > >> > a
> > > > > >> > >> > > > >> blocked task manager continues to be released and
> > > > > allocated?
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> Best,
> > > > > >> > >> > > > >> Yangze Guo
> > > > > >> > >> > > > >>
> > > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > > > > >> > >> > wangdachui9501@gmail.com>
> > > > > >> > >> > > > >> wrote:
> > > > > >> > >> > > > >>> Hi Konstantin,
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> Thanks for your feedback. I will response your 4
> > > > remarks:
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I
> > > think
> > > > > >> > “BlockList”
> > > > > >> > >> > is
> > > > > >> > >> > > > >> good
> > > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good
> idea.
> > > > Based
> > > > > >> on
> > > > > >> > the
> > > > > >> > >> > > > above, I
> > > > > >> > >> > > > >>> would change REST API as following:
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means
> that
> > > all
> > > > > >> task
> > > > > >> > >> > managers
> > > > > >> > >> > > > on
> > > > > >> > >> > > > >>> this node are blocklisted. All slots on these TMs
> > are
> > > > not
> > > > > >> > >> > available.
> > > > > >> > >> > > > This
> > > > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs
> are
> > > not
> > > > > >> really
> > > > > >> > lost,
> > > > > >> > >> > > > they
> > > > > >> > >> > > > >>> are in an unavailable status, and they are still
> > > > > registered
> > > > > >> > in this
> > > > > >> > >> > > > flink
> > > > > >> > >> > > > >>> cluster. They will be available again once the
> > > > > >> corresponding
> > > > > >> > >> > > blocklist
> > > > > >> > >> > > > >> item
> > > > > >> > >> > > > >>> is removed. This behavior is the same in
> > > > > active/non-active
> > > > > >> > >> > clusters.
> > > > > >> > >> > > > >>> However in the active clusters, these TMs may be
> > > > released
> > > > > >> due
> > > > > >> > to
> > > > > >> > >> > idle
> > > > > >> > >> > > > >>> timeouts.
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The
> > > > reasons
> > > > > >> are
> > > > > >> > as
> > > > > >> > >> > > > >> following:
> > > > > >> > >> > > > >>> a) The timeout will not affect users adding or
> > > removing
> > > > > >> items
> > > > > >> > via
> > > > > >> > >> > > REST
> > > > > >> > >> > > > >> API,
> > > > > >> > >> > > > >>> and users can disable it by configuring it to
> > > > > >> Long.MAX_VALUE .
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> b) Some node problems can recover after a period
> of
> > > > time
> > > > > >> > (such as
> > > > > >> > >> > > > machine
> > > > > >> > >> > > > >>> hotspots), in which case users may prefer that
> > Flink
> > > > can
> > > > > do
> > > > > >> > this
> > > > > >> > >> > > > >>> automatically instead of requiring the user to do
> > it
> > > > > >> manually.
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> Best,
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> Lijie
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org>
> 于2022年4月27日周三
> > > > > >> 19:23写道：
> > > > > >> > >> > > > >>>
> > > > > >> > >> > > > >>>> Hi Lijie,
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> I think, this makes sense and +1 to only support
> > > > > manually
> > > > > >> > blocking
> > > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different
> > > strategies
> > > > > can
> > > > > >> > also be
> > > > > >> > >> > > > >>>> maintained outside of Apache Flink.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> A few remarks:
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due
> > to
> > > > the
> > > > > >> > >> > controversy
> > > > > >> > >> > > > >> around
> > > > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about
> > > this
> > > > > >> topic a
> > > > > >> > >> > while
> > > > > >> > >> > > > >> back
> > > > > >> > >> > > > >>>> and there was generally a consensus to avoid the
> > > term
> > > > > >> > blacklist &
> > > > > >> > >> > > > >> whitelist
> > > > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> > > > > "quarantined"
> > > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly
> > different
> > > > > >> design
> > > > > >> > as
> > > > > >> > >> > verbs
> > > > > >> > >> > > > >> like
> > > > > >> > >> > > > >>>> add/remove often considered an anti-pattern for
> > REST
> > > > > APIs.
> > > > > >> > POST
> > > > > >> > >> > on a
> > > > > >> > >> > > > >> list
> > > > > >> > >> > > > >>>> item is generally the standard to add items.
> > DELETE
> > > on
> > > > > the
> > > > > >> > >> > > individual
> > > > > >> > >> > > > >>>> resource is standard to remove an item.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> POST <host>/quarantine/items
> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> We could also consider to separate taskmanagers
> > and
> > > > > nodes
> > > > > >> in
> > > > > >> > the
> > > > > >> > >> > > REST
> > > > > >> > >> > > > >> API
> > > > > >> > >> > > > >>>> (and internal data structures). Any opinion on
> > this?
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > > > > >> > >> > > > >>>> DELETE
> <host>/quarantine/taskmanager/<identifier>
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> 3) How would blocking nodes behave with
> non-active
> > > > > >> resource
> > > > > >> > >> > > managers,
> > > > > >> > >> > > > >> i.e.
> > > > > >> > >> > > > >>>> standalone or reactive mode?
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> 4) To keep the implementation even more minimal,
> > do
> > > we
> > > > > >> need
> > > > > >> > the
> > > > > >> > >> > > > timeout
> > > > > >> > >> > > > >>>> behavior? If items are added/removed manually we
> > > could
> > > > > >> > delegate
> > > > > >> > >> > this
> > > > > >> > >> > > > >> to the
> > > > > >> > >> > > > >>>> user easily. In my opinion the timeout behavior
> > > would
> > > > > >> better
> > > > > >> > fit
> > > > > >> > >> > > into
> > > > > >> > >> > > > >>>> specific strategies at a later point.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Looking forward to your thoughts.
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Cheers and thank you,
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Konstantin
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> [1]
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > > > >> > >> > > > >>>> [2]
> > > https://issues.apache.org/jira/browse/FLINK-18209
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie
> > > Wang
> > > > <
> > > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > > > > >> > >> > > > >>>>
> > > > > >> > >> > > > >>>>> Hi all,
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> Flink job failures may happen due to cluster
> node
> > > > > issues
> > > > > >> > >> > > > >> (insufficient
> > > > > >> > >> > > > >>>> disk
> > > > > >> > >> > > > >>>>> space, bad hardware, network abnormalities).
> > Flink
> > > > will
> > > > > >> > take care
> > > > > >> > >> > > of
> > > > > >> > >> > > > >> the
> > > > > >> > >> > > > >>>>> failures and redeploy the tasks. However, due
> to
> > > data
> > > > > >> > locality
> > > > > >> > >> > and
> > > > > >> > >> > > > >>>> limited
> > > > > >> > >> > > > >>>>> resources, the new tasks are very likely to be
> > > > > redeployed
> > > > > >> > to the
> > > > > >> > >> > > same
> > > > > >> > >> > > > >>>>> nodes, which will result in continuous task
> > > > > abnormalities
> > > > > >> > and
> > > > > >> > >> > > affect
> > > > > >> > >> > > > >> job
> > > > > >> > >> > > > >>>>> progress.
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> Currently, Flink users need to manually
> identify
> > > the
> > > > > >> > problematic
> > > > > >> > >> > > > >> node and
> > > > > >> > >> > > > >>>>> take it offline to solve this problem. But this
> > > > > approach
> > > > > >> has
> > > > > >> > >> > > > >> following
> > > > > >> > >> > > > >>>>> disadvantages:
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy
> process.
> > > > Users
> > > > > >> may
> > > > > >> > need
> > > > > >> > >> > to
> > > > > >> > >> > > > >>>> contact
> > > > > >> > >> > > > >>>>> cluster administors to do this. The operation
> can
> > > > even
> > > > > be
> > > > > >> > >> > dangerous
> > > > > >> > >> > > > >> and
> > > > > >> > >> > > > >>>> not
> > > > > >> > >> > > > >>>>> allowed during some important business events.
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of
> problems
> > > > > manually
> > > > > >> > would
> > > > > >> > >> > be
> > > > > >> > >> > > > >> slow
> > > > > >> > >> > > > >>>> and
> > > > > >> > >> > > > >>>>> a waste of human resources.
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to
> > > > > >> introduce a
> > > > > >> > >> > > blacklist
> > > > > >> > >> > > > >>>>> mechanism for Flink to filter out problematic
> > > > > resources.
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> You can find more details in FLIP-224[1].
> Looking
> > > > > forward
> > > > > >> > to your
> > > > > >> > >> > > > >>>> feedback.
> > > > > >> > >> > > > >>>>> [1]
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> Best,
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > > >>>>> Lijie
> > > > > >> > >> > > > >>>>>
> > > > > >> > >> > > >
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Roman Boyko
> > > e.: ro.v.boyko@gmail.com
> > >
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Becket Qin <be...@gmail.com>.

Hi Lijie,

Thanks for updating the FLIP. It looks like the public interface section
did not fully reflect all the user sensible behavior and API. Can you put
everything that users may be aware of there? That would include the REST
API, metrics, configurations, public java Interfaces or pluggables that
users may see or implement by themselves, as well as a brief summary of the
behavior of the public API.

Besides that, I have a few questions:

1. According to the conversation in the discussion thread, it looks like
the BlockAction will have "MARK_BLOCKLISTED" and
"MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you add
that to the public interface as well?

2. At this point, the "Cause" field in the BlockingItem is a Throwable and
is not reflected in the REST API. Should that be included in the query
response? And should we change that field to be a String so users may
specify the cause via the REST API when they block some nodes / TMs?

3. Would it be useful to allow users to have different timeouts for
different blocked items? So while there is a default timeout, users can
also override it via the REST API when they block an entity.

4. Regarding the ADD operation, if the specified item is already there,
will the block item be overridden? For example, if the user wants to extend
the timeout of a blocked item, can they just  issue an ADD command again?

5. I am not quite familiar with the details of this, but is there a source
of truth for the blocked list? I think it might be good to have a single
source of truth for the blocked list and just propagate that list to
different components to take the action of actually blocking the resource.

Thanks,

Jiangjie (Becket) Qin

On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wa...@gmail.com> wrote:

> Hi everyone,
>
> Based on the discussion in the mailing list, I updated the FLIP doc, the
> changes include:
> 1. Changed the description of the motivation section to more clearly
> describe the problem this FLIP is trying to solve.
> 2. Only  *Manually* is supported.
> 3. Adopted some suggestions, such as *endTimestamp*.
>
> Best,
> Lijie
>
>
> Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：
>
> > Hi Lijie!
> >
> >
> >
> >
> > *a) “Probably storing inside Zookeeper/Configmap might be helpfulhere.”
> > Can you explain it in detail? I don't fully understand that. In
> myopinion,
> > non-active and active are the same, and no special treatment isrequired.*
> >
> > Sorry this was a misunderstanding from my side. I thought we were talking
> > about the HA mode (but not about Active and Standalone ResourceManager).
> > And the original question was - how to handle the blacklisted nodes list
> at
> > the moment of leader change? Should we simply forget about them or try to
> > pre-save that list on the remote storage?
> >
> > On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com> wrote:
> >
> > > Thanks Lijie and ZhuZhu for the explanation.
> > >
> > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed
> > > some functionalities the external tools(e.g. kubectl taint) could not
> > > support.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> > >
> > > > Thanks for your feedback, Jiangang and Martijn.
> > > >
> > > > @Jiangang
> > > >
> > > >
> > > > > For auto-detecting, I wonder how to make the strategy and mark a
> node
> > > > blocked?
> > > >
> > > > In fact, we currently plan to not support auto-detection in this
> FLIP.
> > > The
> > > > part about auto-detection may be continued in a separate FLIP in the
> > > > future. Some guys have the same concerns as you, and the correctness
> > and
> > > > necessity of auto-detection may require further discussion in the
> > future.
> > > >
> > > > > In session mode, multi jobs can fail on the same bad node and the
> > node
> > > > should be marked blocked.
> > > > By design, the blocklist information will be shared among all jobs
> in a
> > > > cluster/session. The JM will sync blocklist information with RM.
> > > >
> > > > @Martijn
> > > >
> > > > > I agree with Yang Wang on this.
> > > > As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just
> > > limits
> > > > the load of the node and does not  kill all the processes on it) is
> > also
> > > > important, and we think that external systems (*yarn rmadmin or
> kubectl
> > > > taint*) cannot support it. So we think it makes sense even only
> > > *manually*.
> > > >
> > > > > I also agree with Chesnay that magical mechanisms are indeed super
> > hard
> > > > to get right.
> > > > Yes, as you see, Jiangang(and a few others) have the same concern.
> > > > However, we currently plan to not support auto-detection in this
> FLIP,
> > > and
> > > > only *manually*. In addition, I'd like to say that the FLIP provides
> a
> > > > mechanism to support MARK_BLOCKLISTED and
> > > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > > > the auto-detection may be done by external systems.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
> > > >
> > > > > > If we only support to block nodes manually, then I could not see
> > > > > the obvious advantages compared with current SRE's approach(via
> *yarn
> > > > > rmadmin or kubectl taint*).
> > > > >
> > > > > I agree with Yang Wang on this.
> > > > >
> > > > > >  To me this sounds yet again like one of those magical mechanisms
> > > that
> > > > > will rarely work just right.
> > > > >
> > > > > I also agree with Chesnay that magical mechanisms are indeed super
> > hard
> > > > to
> > > > > get right.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <
> liujiangangpeng@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> Thanks for the valuable design. The auto-detecting can decrease
> > great
> > > > work
> > > > >> for us. We have implemented the similar feature in our inner flink
> > > > >> version.
> > > > >> Below is something that I care about:
> > > > >>
> > > > >>    1. For auto-detecting, I wonder how to make the strategy and
> > mark a
> > > > >> node
> > > > >>    blocked? Sometimes the blocked node is hard to be detected, for
> > > > >> example,
> > > > >>    the upper node or the down node will be blocked when network
> > > > >> unreachable.
> > > > >>    2. I see that the strategy is made in JobMaster side. How about
> > > > >>    implementing the similar logic in resource manager? In session
> > > mode,
> > > > >> multi
> > > > >>    jobs can fail on the same bad node and the node should be
> marked
> > > > >> blocked.
> > > > >>    If the job makes the strategy, the node may be not marked
> blocked
> > > if
> > > > >> the
> > > > >>    fail times don't exceed the threshold.
> > > > >>
> > > > >>
> > > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > > > >>
> > > > >> > Thank you for all your feedback!
> > > > >> >
> > > > >> > Besides the answers from Lijie, I'd like to share some of my
> > > thoughts:
> > > > >> > 1. Whether to enable automatical blocklist
> > > > >> > Generally speaking, it is not a goal of FLIP-224.
> > > > >> > The automatical way should be something built upon the blocklist
> > > > >> > mechanism and well decoupled. It was designed to be a
> configurable
> > > > >> > blocklist strategy, but I think we can further decouple it by
> > > > >> > introducing a abnormal node detector, as Becket suggested, which
> > > just
> > > > >> > uses the blocklist mechanism once bad nodes are detected.
> However,
> > > it
> > > > >> > should be a separate FLIP with further dev discussions and
> > feedback
> > > > >> > from users. I also agree with Becket that different users have
> > > > different
> > > > >> > requirements, and we should listen to them.
> > > > >> >
> > > > >> > 2. Is it enough to just take away abnormal nodes externally
> > > > >> > My answer is no. As Lijie has mentioned, we need a way to avoid
> > > > >> > deploying tasks to temporary hot nodes. In this case, users may
> > just
> > > > >> > want to limit the load of the node and do not want to kill all
> the
> > > > >> > processes on it. Another case is the speculative execution[1]
> > which
> > > > >> > may also leverage this feature to avoid starting mirror tasks on
> > > slow
> > > > >> > nodes.
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Zhu
> > > > >> >
> > > > >> > [1]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > > >> >
> > > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
> > > > >> >
> > > > >> > >
> > > > >> > > Hi everyone,
> > > > >> > >
> > > > >> > >
> > > > >> > > Thanks for your feedback.
> > > > >> > >
> > > > >> > >
> > > > >> > > There's one detail that I'd like to re-emphasize here because
> it
> > > can
> > > > >> > affect the value and design of the blocklist mechanism (perhaps
> I
> > > > should
> > > > >> > highlight it in the FLIP). We propose two actions in FLIP:
> > > > >> > >
> > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as
> > > blocked.
> > > > >> > Future slots should not be allocated from the blocked task
> manager
> > > or
> > > > >> node.
> > > > >> > But slots that are already allocated will not be affected. A
> > typical
> > > > >> > application scenario is to mitigate machine hotspots. In this
> > case,
> > > we
> > > > >> hope
> > > > >> > that subsequent resource allocations will not be on the hot
> > machine,
> > > > but
> > > > >> > tasks currently running on it should not be affected.
> > > > >> > >
> > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager
> or
> > > > node
> > > > >> as
> > > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be
> > > > >> restarted on
> > > > >> > non-blocked task managers.
> > > > >> > >
> > > > >> > > For the above 2 actions, the former may more highlight the
> > meaning
> > > > of
> > > > >> > this FLIP, because the external system cannot do that.
> > > > >> > >
> > > > >> > >
> > > > >> > > Regarding *Manually* and *Automatically*, I basically agree
> with
> > > > >> @Becket
> > > > >> > Qin: different users have different answers. Not all users’
> > > deployment
> > > > >> > environments have a special external system that can perform the
> > > > anomaly
> > > > >> > detection. In addition, adding pluggable/optional auto-detection
> > > > doesn't
> > > > >> > require much extra work on top of manual specification.
> > > > >> > >
> > > > >> > >
> > > > >> > > I will answer your other questions one by one.
> > > > >> > >
> > > > >> > >
> > > > >> > > @Yangze
> > > > >> > >
> > > > >> > > a) I think you are right, we do not need to expose the
> > > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to
> users.
> > > > >> > >
> > > > >> > > b) We can abstract the `notifyException` to a separate
> interface
> > > > >> (maybe
> > > > >> > BlocklistExceptionListener), and the
> > ResourceManagerBlocklistHandler
> > > > can
> > > > >> > implement it in the future.
> > > > >> > >
> > > > >> > >
> > > > >> > > @Martijn
> > > > >> > >
> > > > >> > > a) I also think the manual blocking should be done by cluster
> > > > >> operators.
> > > > >> > >
> > > > >> > > b) I think manual blocking makes sense, because according to
> my
> > > > >> > experience, users are often the first to perceive the machine
> > > problems
> > > > >> > (because of job failover or delay), and they will contact
> cluster
> > > > >> operators
> > > > >> > to solve it, or even tell the cluster operators which machine is
> > > > >> > problematic. From this point of view, I think the people who
> > really
> > > > need
> > > > >> > the manual blocking are the users, and it’s just performed by
> the
> > > > >> cluster
> > > > >> > operator, so I think the manual blocking makes sense.
> > > > >> > >
> > > > >> > >
> > > > >> > > @Chesnay
> > > > >> > >
> > > > >> > > We need to touch the logic of JM/SlotPool, because for
> > > > >> MARK_BLOCKLISTED
> > > > >> > , we need to know whether the slot is blocklisted when the task
> is
> > > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the
> > slot
> > > > >> > directly to avoid assigning other tasks (of this job) on it. If
> we
> > > > only
> > > > >> > maintain the blocklist information on the RM, JM needs to
> retrieve
> > > it
> > > > by
> > > > >> > RPC. I think the performance overhead of that is relatively
> large,
> > > so
> > > > I
> > > > >> > think it's worth maintaining the blocklist information on the JM
> > > side
> > > > >> and
> > > > >> > syncing them.
> > > > >> > >
> > > > >> > >
> > > > >> > > @Роман
> > > > >> > >
> > > > >> > >     a) “Probably storing inside Zookeeper/Configmap might be
> > > helpful
> > > > >> > here.”  Can you explain it in detail? I don't fully understand
> > that.
> > > > In
> > > > >> my
> > > > >> > opinion, non-active and active are the same, and no special
> > > treatment
> > > > is
> > > > >> > required.
> > > > >> > >
> > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will
> add
> > it
> > > > to
> > > > >> > FLIP.
> > > > >> > >
> > > > >> > >
> > > > >> > > @Yang
> > > > >> > >
> > > > >> > > As mentioned above, AFAK, the external system cannot support
> the
> > > > >> > MARK_BLOCKLISTED action.
> > > > >> > >
> > > > >> > >
> > > > >> > > Looking forward to your further feedback.
> > > > >> > >
> > > > >> > >
> > > > >> > > Best,
> > > > >> > >
> > > > >> > > Lijie
> > > > >> > >
> > > > >> > >
> > > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> > > > >> > >>
> > > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > > > >> > >>
> > > > >> > >> I want to share some thoughts about Flink cluster operations.
> > > > >> > >>
> > > > >> > >> In the production environment, the SRE(aka Site Reliability
> > > > Engineer)
> > > > >> > >> already has many tools to detect the unstable nodes, which
> > could
> > > > take
> > > > >> > the
> > > > >> > >> system logs/metrics into consideration.
> > > > >> > >> Then they use graceful-decomission in YARN and taint in K8s
> to
> > > > >> prevent
> > > > >> > new
> > > > >> > >> allocations on these unstable nodes.
> > > > >> > >> At last, they will evict all the containers and pods running
> on
> > > > these
> > > > >> > nodes.
> > > > >> > >> This mechanism also works for planned maintenance. So I am
> > afraid
> > > > >> this
> > > > >> > is
> > > > >> > >> not the typical use case for FLIP-224.
> > > > >> > >>
> > > > >> > >> If we only support to block nodes manually, then I could not
> > see
> > > > >> > >> the obvious advantages compared with current SRE's
> approach(via
> > > > *yarn
> > > > >> > >> rmadmin or kubectl taint*).
> > > > >> > >> At least, we need to have a pluggable component which could
> > > expose
> > > > >> the
> > > > >> > >> potential unstable nodes automatically and block them if
> > enabled
> > > > >> > explicitly.
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> Best,
> > > > >> > >> Yang
> > > > >> > >>
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> > > > >> > >>
> > > > >> > >> > Thanks for the proposal, Lijie.
> > > > >> > >> >
> > > > >> > >> > This is an interesting feature and discussion, and somewhat
> > > > related
> > > > >> > to the
> > > > >> > >> > design principle about how people should operate Flink.
> > > > >> > >> >
> > > > >> > >> > I think there are three things involved in this FLIP.
> > > > >> > >> >      a) Detect and report the unstable node.
> > > > >> > >> >      b) Collect the information of the unstable node and
> > form a
> > > > >> > blocklist.
> > > > >> > >> >      c) Take the action to block nodes.
> > > > >> > >> >
> > > > >> > >> > My two cents:
> > > > >> > >> >
> > > > >> > >> > 1. It looks like people all agree that Flink should have
> c).
> > It
> > > > is
> > > > >> > not only
> > > > >> > >> > useful for cases of node failures, but also handy for some
> > > > planned
> > > > >> > >> > maintenance.
> > > > >> > >> >
> > > > >> > >> > 2. People have different opinions on b), i.e. who should be
> > the
> > > > >> brain
> > > > >> > to
> > > > >> > >> > make the decision to block a node. I think this largely
> > depends
> > > > on
> > > > >> > who we
> > > > >> > >> > talk to. Different users would probably give different
> > answers.
> > > > For
> > > > >> > people
> > > > >> > >> > who do have a centralized node health management service,
> let
> > > > Flink
> > > > >> > do just
> > > > >> > >> > do a) and c) would be preferred. So essentially Flink would
> > be
> > > > one
> > > > >> of
> > > > >> > the
> > > > >> > >> > sources that may detect unstable nodes, report it to that
> > > > service,
> > > > >> > and then
> > > > >> > >> > take the command from that service to block the problematic
> > > > nodes.
> > > > >> On
> > > > >> > the
> > > > >> > >> > other hand, for users who do not have such a service,
> simply
> > > > >> letting
> > > > >> > Flink
> > > > >> > >> > be clever by itself to block the suspicious nodes might be
> > > > desired
> > > > >> to
> > > > >> > >> > ensure the jobs are running smoothly.
> > > > >> > >> >
> > > > >> > >> > So that indicates a) and b) here should be pluggable /
> > > optional.
> > > > >> > >> >
> > > > >> > >> > In light of this, maybe it would make sense to have
> something
> > > > >> > pluggable
> > > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes
> > > > actively.
> > > > >> (A
> > > > >> > more
> > > > >> > >> > general interface should be JobInfoReporter<T> which can be
> > > used
> > > > to
> > > > >> > report
> > > > >> > >> > any information of type <T>. But I'll just keep the scope
> > > > relevant
> > > > >> to
> > > > >> > this
> > > > >> > >> > FLIP here). Personally speaking, I think it is OK to have a
> > > > default
> > > > >> > >> > implementation of a reporter which just tells Flink to take
> > > > action
> > > > >> to
> > > > >> > block
> > > > >> > >> > problematic nodes and also unblocks them after timeout.
> > > > >> > >> >
> > > > >> > >> > Thanks,
> > > > >> > >> >
> > > > >> > >> > Jiangjie (Becket) Qin
> > > > >> > >> >
> > > > >> > >> >
> > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > > ro.v.boyko@gmail.com
> > > > >
> > > > >> > wrote:
> > > > >> > >> >
> > > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > > > >> > >> > >
> > > > >> > >> > > If it's possible I'd like to participate in development.
> > > > >> > >> > >
> > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we should
> > > > consider
> > > > >> > to move
> > > > >> > >> > > somehow the information of blocklisted nodes/TMs from
> > active
> > > > >> > >> > > ResourceManager to non-active ones. Probably storing
> inside
> > > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > > > >> > >> > >
> > > > >> > >> > > And I agree with Martijn that a lot of organizations
> don't
> > > want
> > > > >> to
> > > > >> > expose
> > > > >> > >> > > such API for a cluster user group. But I think it's
> > necessary
> > > > to
> > > > >> > have the
> > > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for
> avoiding
> > > > >> incorrect
> > > > >> > >> > > automatic behaviour.
> > > > >> > >> > >
> > > > >> > >> > > And another one small suggestion - I think it would be
> > better
> > > > to
> > > > >> > extend
> > > > >> > >> > the
> > > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field and
> > > fill
> > > > it
> > > > >> > at the
> > > > >> > >> > > item creation. This simple addition will allow to:
> > > > >> > >> > >
> > > > >> > >> > >    -
> > > > >> > >> > >
> > > > >> > >> > >    Provide the ability to users to setup the exact time
> of
> > > > >> > blocklist end
> > > > >> > >> > >    through RestAPI
> > > > >> > >> > >    -
> > > > >> > >> > >
> > > > >> > >> > >    Not being tied to a single value of
> > > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > > > >> > >> > >
> > > > >> > >> > >
> > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > > > >> chesnay@apache.org>
> > > > >> > >> > wrote:
> > > > >> > >> > >
> > > > >> > >> > > > I do share the concern between blurring the lines a
> bit.
> > > > >> > >> > > >
> > > > >> > >> > > > That said, I'd prefer to not have any auto-detection
> and
> > > only
> > > > >> > have an
> > > > >> > >> > > > opt-in mechanism
> > > > >> > >> > > > to manually block processes/nodes. To me this sounds
> yet
> > > > again
> > > > >> > like one
> > > > >> > >> > > > of those
> > > > >> > >> > > > magical mechanisms that will rarely work just right.
> > > > >> > >> > > > An external system can leverage way more information
> > after
> > > > all.
> > > > >> > >> > > >
> > > > >> > >> > > > Moreover, I'm quite concerned about the complexity of
> > this
> > > > >> > proposal.
> > > > >> > >> > > > Tracking on both the RM/JM side; syncing between
> > > components;
> > > > >> > >> > adjustments
> > > > >> > >> > > > to the
> > > > >> > >> > > > slot and resource protocol.
> > > > >> > >> > > >
> > > > >> > >> > > > In a way it seems overly complicated.
> > > > >> > >> > > >
> > > > >> > >> > > > If we look at it purely from an active resource
> > management
> > > > >> > perspective,
> > > > >> > >> > > > then there
> > > > >> > >> > > > isn't really a need to touch the slot protocol at all
> (or
> > > in
> > > > >> fact
> > > > >> > to
> > > > >> > >> > > > anything in the JobMaster),
> > > > >> > >> > > > because there isn't any point in keeping around blocked
> > TMs
> > > > in
> > > > >> the
> > > > >> > >> > first
> > > > >> > >> > > > place.
> > > > >> > >> > > > They'd just be idling, potentially shutting down after
> a
> > > > while
> > > > >> by
> > > > >> > the
> > > > >> > >> > RM
> > > > >> > >> > > > because of
> > > > >> > >> > > > it (unless we _also_ touch that logic).
> > > > >> > >> > > > Here the blocking of a process (be it by blocking the
> > > process
> > > > >> or
> > > > >> > node)
> > > > >> > >> > is
> > > > >> > >> > > > equivalent with shutting down the blocked process(es).
> > > > >> > >> > > > Once the block is lifted we can just spin it back up.
> > > > >> > >> > > >
> > > > >> > >> > > > And I do wonder whether we couldn't apply the same line
> > of
> > > > >> > thinking to
> > > > >> > >> > > > standalone resource management.
> > > > >> > >> > > > Here being able to stop/restart a process/node manually
> > > > should
> > > > >> be
> > > > >> > a
> > > > >> > >> > core
> > > > >> > >> > > > requirement for a Flink deployment anyway.
> > > > >> > >> > > >
> > > > >> > >> > > >
> > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > > >> > >> > > > > Hi everyone,
> > > > >> > >> > > > >
> > > > >> > >> > > > > Thanks for creating this FLIP. I can understand the
> > > problem
> > > > >> and
> > > > >> > I see
> > > > >> > >> > > > value
> > > > >> > >> > > > > in the automatic detection and blocklisting. I do
> have
> > > some
> > > > >> > concerns
> > > > >> > >> > > with
> > > > >> > >> > > > > the ability to manually specify to be blocked
> > resources.
> > > I
> > > > >> have
> > > > >> > two
> > > > >> > >> > > > > concerns;
> > > > >> > >> > > > >
> > > > >> > >> > > > > * Most organizations explicitly have a separation of
> > > > >> concerns,
> > > > >> > >> > meaning
> > > > >> > >> > > > that
> > > > >> > >> > > > > there's a group who's responsible for managing a
> > cluster
> > > > and
> > > > >> > there's
> > > > >> > >> > a
> > > > >> > >> > > > user
> > > > >> > >> > > > > group who uses that cluster. With the introduction of
> > > this
> > > > >> > mechanism,
> > > > >> > >> > > the
> > > > >> > >> > > > > latter group now can influence the responsibility of
> > the
> > > > >> first
> > > > >> > group.
> > > > >> > >> > > So
> > > > >> > >> > > > it
> > > > >> > >> > > > > can be possible that someone from the user group
> blocks
> > > > >> > something,
> > > > >> > >> > > which
> > > > >> > >> > > > > causes an outage (which could result in paging
> > mechanism
> > > > >> > triggering
> > > > >> > >> > > etc)
> > > > >> > >> > > > > which impacts the first group.
> > > > >> > >> > > > > * How big is the group of people who can go through
> the
> > > > >> process
> > > > >> > of
> > > > >> > >> > > > manually
> > > > >> > >> > > > > identifying a node that isn't behaving as it should
> > be? I
> > > > do
> > > > >> > think
> > > > >> > >> > this
> > > > >> > >> > > > > group is relatively limited. Does it then make sense
> to
> > > > >> > introduce
> > > > >> > >> > such
> > > > >> > >> > > a
> > > > >> > >> > > > > feature, which would only be used by a really small
> > user
> > > > >> group
> > > > >> > of
> > > > >> > >> > > Flink?
> > > > >> > >> > > > We
> > > > >> > >> > > > > still have to maintain, test and support such a
> > feature.
> > > > >> > >> > > > >
> > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm
> leaning
> > > > >> towards
> > > > >> > not
> > > > >> > >> > > > exposing
> > > > >> > >> > > > > this to the user group but having this available
> > strictly
> > > > for
> > > > >> > cluster
> > > > >> > >> > > > > operators. They could then also set up their
> > > > >> > paging/metrics/logging
> > > > >> > >> > > > system
> > > > >> > >> > > > > to take this into account.
> > > > >> > >> > > > >
> > > > >> > >> > > > > Best regards,
> > > > >> > >> > > > >
> > > > >> > >> > > > > Martijn Visser
> > > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > > > >> > >> > > > > https://github.com/MartijnVisser
> > > > >> > >> > > > >
> > > > >> > >> > > > >
> > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > > karmagyz@gmail.com
> > > > >> >
> > > > >> > wrote:
> > > > >> > >> > > > >
> > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> +1 for the overall proposal. Just share some cents
> > here:
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> - Why do we need to expose
> > > > >> > >> > > > >>
> cluster.resource-blacklist.item.timeout-check-interval
> > > to
> > > > >> the
> > > > >> > user?
> > > > >> > >> > > > >> I think the semantics of
> > > > >> > `cluster.resource-blacklist.item.timeout`
> > > > >> > >> > is
> > > > >> > >> > > > >> sufficient for the user. How to guarantee the
> timeout
> > > > >> > mechanism is
> > > > >> > >> > > > >> Flink's internal implementation. I think it will be
> > very
> > > > >> > confusing
> > > > >> > >> > and
> > > > >> > >> > > > >> we do not need to expose it to users.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> - ResourceManager can notify the exception of a task
> > > > >> manager to
> > > > >> > >> > > > >> `BlacklistHandler` as well.
> > > > >> > >> > > > >> For example, the slot allocation might fail in case
> > the
> > > > >> target
> > > > >> > task
> > > > >> > >> > > > >> manager is busy or has a network jitter. I don't
> mean
> > we
> > > > >> need
> > > > >> > to
> > > > >> > >> > cover
> > > > >> > >> > > > >> this case in this version, but we can also open a
> > > > >> > `notifyException`
> > > > >> > >> > in
> > > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> - Before we sync the blocklist to ResourceManager,
> > will
> > > > the
> > > > >> > slot of
> > > > >> > >> > a
> > > > >> > >> > > > >> blocked task manager continues to be released and
> > > > allocated?
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> Best,
> > > > >> > >> > > > >> Yangze Guo
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > > > >> > >> > wangdachui9501@gmail.com>
> > > > >> > >> > > > >> wrote:
> > > > >> > >> > > > >>> Hi Konstantin,
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> Thanks for your feedback. I will response your 4
> > > remarks:
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I
> > think
> > > > >> > “BlockList”
> > > > >> > >> > is
> > > > >> > >> > > > >> good
> > > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea.
> > > Based
> > > > >> on
> > > > >> > the
> > > > >> > >> > > > above, I
> > > > >> > >> > > > >>> would change REST API as following:
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that
> > all
> > > > >> task
> > > > >> > >> > managers
> > > > >> > >> > > > on
> > > > >> > >> > > > >>> this node are blocklisted. All slots on these TMs
> are
> > > not
> > > > >> > >> > available.
> > > > >> > >> > > > This
> > > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs are
> > not
> > > > >> really
> > > > >> > lost,
> > > > >> > >> > > > they
> > > > >> > >> > > > >>> are in an unavailable status, and they are still
> > > > registered
> > > > >> > in this
> > > > >> > >> > > > flink
> > > > >> > >> > > > >>> cluster. They will be available again once the
> > > > >> corresponding
> > > > >> > >> > > blocklist
> > > > >> > >> > > > >> item
> > > > >> > >> > > > >>> is removed. This behavior is the same in
> > > > active/non-active
> > > > >> > >> > clusters.
> > > > >> > >> > > > >>> However in the active clusters, these TMs may be
> > > released
> > > > >> due
> > > > >> > to
> > > > >> > >> > idle
> > > > >> > >> > > > >>> timeouts.
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The
> > > reasons
> > > > >> are
> > > > >> > as
> > > > >> > >> > > > >> following:
> > > > >> > >> > > > >>> a) The timeout will not affect users adding or
> > removing
> > > > >> items
> > > > >> > via
> > > > >> > >> > > REST
> > > > >> > >> > > > >> API,
> > > > >> > >> > > > >>> and users can disable it by configuring it to
> > > > >> Long.MAX_VALUE .
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> b) Some node problems can recover after a period of
> > > time
> > > > >> > (such as
> > > > >> > >> > > > machine
> > > > >> > >> > > > >>> hotspots), in which case users may prefer that
> Flink
> > > can
> > > > do
> > > > >> > this
> > > > >> > >> > > > >>> automatically instead of requiring the user to do
> it
> > > > >> manually.
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> Best,
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> Lijie
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三
> > > > >> 19:23写道：
> > > > >> > >> > > > >>>
> > > > >> > >> > > > >>>> Hi Lijie,
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> I think, this makes sense and +1 to only support
> > > > manually
> > > > >> > blocking
> > > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different
> > strategies
> > > > can
> > > > >> > also be
> > > > >> > >> > > > >>>> maintained outside of Apache Flink.
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> A few remarks:
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due
> to
> > > the
> > > > >> > >> > controversy
> > > > >> > >> > > > >> around
> > > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about
> > this
> > > > >> topic a
> > > > >> > >> > while
> > > > >> > >> > > > >> back
> > > > >> > >> > > > >>>> and there was generally a consensus to avoid the
> > term
> > > > >> > blacklist &
> > > > >> > >> > > > >> whitelist
> > > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> > > > "quarantined"
> > > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly
> different
> > > > >> design
> > > > >> > as
> > > > >> > >> > verbs
> > > > >> > >> > > > >> like
> > > > >> > >> > > > >>>> add/remove often considered an anti-pattern for
> REST
> > > > APIs.
> > > > >> > POST
> > > > >> > >> > on a
> > > > >> > >> > > > >> list
> > > > >> > >> > > > >>>> item is generally the standard to add items.
> DELETE
> > on
> > > > the
> > > > >> > >> > > individual
> > > > >> > >> > > > >>>> resource is standard to remove an item.
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> POST <host>/quarantine/items
> > > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> We could also consider to separate taskmanagers
> and
> > > > nodes
> > > > >> in
> > > > >> > the
> > > > >> > >> > > REST
> > > > >> > >> > > > >> API
> > > > >> > >> > > > >>>> (and internal data structures). Any opinion on
> this?
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > > > >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> 3) How would blocking nodes behave with non-active
> > > > >> resource
> > > > >> > >> > > managers,
> > > > >> > >> > > > >> i.e.
> > > > >> > >> > > > >>>> standalone or reactive mode?
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> 4) To keep the implementation even more minimal,
> do
> > we
> > > > >> need
> > > > >> > the
> > > > >> > >> > > > timeout
> > > > >> > >> > > > >>>> behavior? If items are added/removed manually we
> > could
> > > > >> > delegate
> > > > >> > >> > this
> > > > >> > >> > > > >> to the
> > > > >> > >> > > > >>>> user easily. In my opinion the timeout behavior
> > would
> > > > >> better
> > > > >> > fit
> > > > >> > >> > > into
> > > > >> > >> > > > >>>> specific strategies at a later point.
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> Looking forward to your thoughts.
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> Cheers and thank you,
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> Konstantin
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> [1]
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>
> > > > >> > >> > > >
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > > >> > >> > > > >>>> [2]
> > https://issues.apache.org/jira/browse/FLINK-18209
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie
> > Wang
> > > <
> > > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > > > >> > >> > > > >>>>
> > > > >> > >> > > > >>>>> Hi all,
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> Flink job failures may happen due to cluster node
> > > > issues
> > > > >> > >> > > > >> (insufficient
> > > > >> > >> > > > >>>> disk
> > > > >> > >> > > > >>>>> space, bad hardware, network abnormalities).
> Flink
> > > will
> > > > >> > take care
> > > > >> > >> > > of
> > > > >> > >> > > > >> the
> > > > >> > >> > > > >>>>> failures and redeploy the tasks. However, due to
> > data
> > > > >> > locality
> > > > >> > >> > and
> > > > >> > >> > > > >>>> limited
> > > > >> > >> > > > >>>>> resources, the new tasks are very likely to be
> > > > redeployed
> > > > >> > to the
> > > > >> > >> > > same
> > > > >> > >> > > > >>>>> nodes, which will result in continuous task
> > > > abnormalities
> > > > >> > and
> > > > >> > >> > > affect
> > > > >> > >> > > > >> job
> > > > >> > >> > > > >>>>> progress.
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> Currently, Flink users need to manually identify
> > the
> > > > >> > problematic
> > > > >> > >> > > > >> node and
> > > > >> > >> > > > >>>>> take it offline to solve this problem. But this
> > > > approach
> > > > >> has
> > > > >> > >> > > > >> following
> > > > >> > >> > > > >>>>> disadvantages:
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process.
> > > Users
> > > > >> may
> > > > >> > need
> > > > >> > >> > to
> > > > >> > >> > > > >>>> contact
> > > > >> > >> > > > >>>>> cluster administors to do this. The operation can
> > > even
> > > > be
> > > > >> > >> > dangerous
> > > > >> > >> > > > >> and
> > > > >> > >> > > > >>>> not
> > > > >> > >> > > > >>>>> allowed during some important business events.
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> 2. Identifying and solving this kind of problems
> > > > manually
> > > > >> > would
> > > > >> > >> > be
> > > > >> > >> > > > >> slow
> > > > >> > >> > > > >>>> and
> > > > >> > >> > > > >>>>> a waste of human resources.
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to
> > > > >> introduce a
> > > > >> > >> > > blacklist
> > > > >> > >> > > > >>>>> mechanism for Flink to filter out problematic
> > > > resources.
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking
> > > > forward
> > > > >> > to your
> > > > >> > >> > > > >>>> feedback.
> > > > >> > >> > > > >>>>> [1]
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>
> > > > >> > >> > > >
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> Best,
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > > >>>>> Lijie
> > > > >> > >> > > > >>>>>
> > > > >> > >> > > >
> > > > >> > >> > > >
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Roman Boyko
> > e.: ro.v.boyko@gmail.com
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi everyone,

Based on the discussion in the mailing list, I updated the FLIP doc, the
changes include:
1. Changed the description of the motivation section to more clearly
describe the problem this FLIP is trying to solve.
2. Only  *Manually* is supported.
3. Adopted some suggestions, such as *endTimestamp*.

Best,
Lijie


Roman Boyko <ro...@gmail.com> 于2022年5月7日周六 19:25写道：

> Hi Lijie!
>
>
>
>
> *a) “Probably storing inside Zookeeper/Configmap might be helpfulhere.”
> Can you explain it in detail? I don't fully understand that. In myopinion,
> non-active and active are the same, and no special treatment isrequired.*
>
> Sorry this was a misunderstanding from my side. I thought we were talking
> about the HA mode (but not about Active and Standalone ResourceManager).
> And the original question was - how to handle the blacklisted nodes list at
> the moment of leader change? Should we simply forget about them or try to
> pre-save that list on the remote storage?
>
> On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com> wrote:
>
> > Thanks Lijie and ZhuZhu for the explanation.
> >
> > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed
> > some functionalities the external tools(e.g. kubectl taint) could not
> > support.
> >
> >
> > Best,
> > Yang
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
> >
> > > Thanks for your feedback, Jiangang and Martijn.
> > >
> > > @Jiangang
> > >
> > >
> > > > For auto-detecting, I wonder how to make the strategy and mark a node
> > > blocked?
> > >
> > > In fact, we currently plan to not support auto-detection in this FLIP.
> > The
> > > part about auto-detection may be continued in a separate FLIP in the
> > > future. Some guys have the same concerns as you, and the correctness
> and
> > > necessity of auto-detection may require further discussion in the
> future.
> > >
> > > > In session mode, multi jobs can fail on the same bad node and the
> node
> > > should be marked blocked.
> > > By design, the blocklist information will be shared among all jobs in a
> > > cluster/session. The JM will sync blocklist information with RM.
> > >
> > > @Martijn
> > >
> > > > I agree with Yang Wang on this.
> > > As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just
> > limits
> > > the load of the node and does not  kill all the processes on it) is
> also
> > > important, and we think that external systems (*yarn rmadmin or kubectl
> > > taint*) cannot support it. So we think it makes sense even only
> > *manually*.
> > >
> > > > I also agree with Chesnay that magical mechanisms are indeed super
> hard
> > > to get right.
> > > Yes, as you see, Jiangang(and a few others) have the same concern.
> > > However, we currently plan to not support auto-detection in this FLIP,
> > and
> > > only *manually*. In addition, I'd like to say that the FLIP provides a
> > > mechanism to support MARK_BLOCKLISTED and
> > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > > the auto-detection may be done by external systems.
> > >
> > > Best,
> > > Lijie
> > >
> > > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
> > >
> > > > > If we only support to block nodes manually, then I could not see
> > > > the obvious advantages compared with current SRE's approach(via *yarn
> > > > rmadmin or kubectl taint*).
> > > >
> > > > I agree with Yang Wang on this.
> > > >
> > > > >  To me this sounds yet again like one of those magical mechanisms
> > that
> > > > will rarely work just right.
> > > >
> > > > I also agree with Chesnay that magical mechanisms are indeed super
> hard
> > > to
> > > > get right.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangpeng@gmail.com
> >
> > > > wrote:
> > > >
> > > >> Thanks for the valuable design. The auto-detecting can decrease
> great
> > > work
> > > >> for us. We have implemented the similar feature in our inner flink
> > > >> version.
> > > >> Below is something that I care about:
> > > >>
> > > >>    1. For auto-detecting, I wonder how to make the strategy and
> mark a
> > > >> node
> > > >>    blocked? Sometimes the blocked node is hard to be detected, for
> > > >> example,
> > > >>    the upper node or the down node will be blocked when network
> > > >> unreachable.
> > > >>    2. I see that the strategy is made in JobMaster side. How about
> > > >>    implementing the similar logic in resource manager? In session
> > mode,
> > > >> multi
> > > >>    jobs can fail on the same bad node and the node should be marked
> > > >> blocked.
> > > >>    If the job makes the strategy, the node may be not marked blocked
> > if
> > > >> the
> > > >>    fail times don't exceed the threshold.
> > > >>
> > > >>
> > > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > > >>
> > > >> > Thank you for all your feedback!
> > > >> >
> > > >> > Besides the answers from Lijie, I'd like to share some of my
> > thoughts:
> > > >> > 1. Whether to enable automatical blocklist
> > > >> > Generally speaking, it is not a goal of FLIP-224.
> > > >> > The automatical way should be something built upon the blocklist
> > > >> > mechanism and well decoupled. It was designed to be a configurable
> > > >> > blocklist strategy, but I think we can further decouple it by
> > > >> > introducing a abnormal node detector, as Becket suggested, which
> > just
> > > >> > uses the blocklist mechanism once bad nodes are detected. However,
> > it
> > > >> > should be a separate FLIP with further dev discussions and
> feedback
> > > >> > from users. I also agree with Becket that different users have
> > > different
> > > >> > requirements, and we should listen to them.
> > > >> >
> > > >> > 2. Is it enough to just take away abnormal nodes externally
> > > >> > My answer is no. As Lijie has mentioned, we need a way to avoid
> > > >> > deploying tasks to temporary hot nodes. In this case, users may
> just
> > > >> > want to limit the load of the node and do not want to kill all the
> > > >> > processes on it. Another case is the speculative execution[1]
> which
> > > >> > may also leverage this feature to avoid starting mirror tasks on
> > slow
> > > >> > nodes.
> > > >> >
> > > >> > Thanks,
> > > >> > Zhu
> > > >> >
> > > >> > [1]
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > >> >
> > > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
> > > >> >
> > > >> > >
> > > >> > > Hi everyone,
> > > >> > >
> > > >> > >
> > > >> > > Thanks for your feedback.
> > > >> > >
> > > >> > >
> > > >> > > There's one detail that I'd like to re-emphasize here because it
> > can
> > > >> > affect the value and design of the blocklist mechanism (perhaps I
> > > should
> > > >> > highlight it in the FLIP). We propose two actions in FLIP:
> > > >> > >
> > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as
> > blocked.
> > > >> > Future slots should not be allocated from the blocked task manager
> > or
> > > >> node.
> > > >> > But slots that are already allocated will not be affected. A
> typical
> > > >> > application scenario is to mitigate machine hotspots. In this
> case,
> > we
> > > >> hope
> > > >> > that subsequent resource allocations will not be on the hot
> machine,
> > > but
> > > >> > tasks currently running on it should not be affected.
> > > >> > >
> > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or
> > > node
> > > >> as
> > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be
> > > >> restarted on
> > > >> > non-blocked task managers.
> > > >> > >
> > > >> > > For the above 2 actions, the former may more highlight the
> meaning
> > > of
> > > >> > this FLIP, because the external system cannot do that.
> > > >> > >
> > > >> > >
> > > >> > > Regarding *Manually* and *Automatically*, I basically agree with
> > > >> @Becket
> > > >> > Qin: different users have different answers. Not all users’
> > deployment
> > > >> > environments have a special external system that can perform the
> > > anomaly
> > > >> > detection. In addition, adding pluggable/optional auto-detection
> > > doesn't
> > > >> > require much extra work on top of manual specification.
> > > >> > >
> > > >> > >
> > > >> > > I will answer your other questions one by one.
> > > >> > >
> > > >> > >
> > > >> > > @Yangze
> > > >> > >
> > > >> > > a) I think you are right, we do not need to expose the
> > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to users.
> > > >> > >
> > > >> > > b) We can abstract the `notifyException` to a separate interface
> > > >> (maybe
> > > >> > BlocklistExceptionListener), and the
> ResourceManagerBlocklistHandler
> > > can
> > > >> > implement it in the future.
> > > >> > >
> > > >> > >
> > > >> > > @Martijn
> > > >> > >
> > > >> > > a) I also think the manual blocking should be done by cluster
> > > >> operators.
> > > >> > >
> > > >> > > b) I think manual blocking makes sense, because according to my
> > > >> > experience, users are often the first to perceive the machine
> > problems
> > > >> > (because of job failover or delay), and they will contact cluster
> > > >> operators
> > > >> > to solve it, or even tell the cluster operators which machine is
> > > >> > problematic. From this point of view, I think the people who
> really
> > > need
> > > >> > the manual blocking are the users, and it’s just performed by the
> > > >> cluster
> > > >> > operator, so I think the manual blocking makes sense.
> > > >> > >
> > > >> > >
> > > >> > > @Chesnay
> > > >> > >
> > > >> > > We need to touch the logic of JM/SlotPool, because for
> > > >> MARK_BLOCKLISTED
> > > >> > , we need to know whether the slot is blocklisted when the task is
> > > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the
> slot
> > > >> > directly to avoid assigning other tasks (of this job) on it. If we
> > > only
> > > >> > maintain the blocklist information on the RM, JM needs to retrieve
> > it
> > > by
> > > >> > RPC. I think the performance overhead of that is relatively large,
> > so
> > > I
> > > >> > think it's worth maintaining the blocklist information on the JM
> > side
> > > >> and
> > > >> > syncing them.
> > > >> > >
> > > >> > >
> > > >> > > @Роман
> > > >> > >
> > > >> > >     a) “Probably storing inside Zookeeper/Configmap might be
> > helpful
> > > >> > here.”  Can you explain it in detail? I don't fully understand
> that.
> > > In
> > > >> my
> > > >> > opinion, non-active and active are the same, and no special
> > treatment
> > > is
> > > >> > required.
> > > >> > >
> > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will add
> it
> > > to
> > > >> > FLIP.
> > > >> > >
> > > >> > >
> > > >> > > @Yang
> > > >> > >
> > > >> > > As mentioned above, AFAK, the external system cannot support the
> > > >> > MARK_BLOCKLISTED action.
> > > >> > >
> > > >> > >
> > > >> > > Looking forward to your further feedback.
> > > >> > >
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Lijie
> > > >> > >
> > > >> > >
> > > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> > > >> > >>
> > > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > > >> > >>
> > > >> > >> I want to share some thoughts about Flink cluster operations.
> > > >> > >>
> > > >> > >> In the production environment, the SRE(aka Site Reliability
> > > Engineer)
> > > >> > >> already has many tools to detect the unstable nodes, which
> could
> > > take
> > > >> > the
> > > >> > >> system logs/metrics into consideration.
> > > >> > >> Then they use graceful-decomission in YARN and taint in K8s to
> > > >> prevent
> > > >> > new
> > > >> > >> allocations on these unstable nodes.
> > > >> > >> At last, they will evict all the containers and pods running on
> > > these
> > > >> > nodes.
> > > >> > >> This mechanism also works for planned maintenance. So I am
> afraid
> > > >> this
> > > >> > is
> > > >> > >> not the typical use case for FLIP-224.
> > > >> > >>
> > > >> > >> If we only support to block nodes manually, then I could not
> see
> > > >> > >> the obvious advantages compared with current SRE's approach(via
> > > *yarn
> > > >> > >> rmadmin or kubectl taint*).
> > > >> > >> At least, we need to have a pluggable component which could
> > expose
> > > >> the
> > > >> > >> potential unstable nodes automatically and block them if
> enabled
> > > >> > explicitly.
> > > >> > >>
> > > >> > >>
> > > >> > >> Best,
> > > >> > >> Yang
> > > >> > >>
> > > >> > >>
> > > >> > >>
> > > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> > > >> > >>
> > > >> > >> > Thanks for the proposal, Lijie.
> > > >> > >> >
> > > >> > >> > This is an interesting feature and discussion, and somewhat
> > > related
> > > >> > to the
> > > >> > >> > design principle about how people should operate Flink.
> > > >> > >> >
> > > >> > >> > I think there are three things involved in this FLIP.
> > > >> > >> >      a) Detect and report the unstable node.
> > > >> > >> >      b) Collect the information of the unstable node and
> form a
> > > >> > blocklist.
> > > >> > >> >      c) Take the action to block nodes.
> > > >> > >> >
> > > >> > >> > My two cents:
> > > >> > >> >
> > > >> > >> > 1. It looks like people all agree that Flink should have c).
> It
> > > is
> > > >> > not only
> > > >> > >> > useful for cases of node failures, but also handy for some
> > > planned
> > > >> > >> > maintenance.
> > > >> > >> >
> > > >> > >> > 2. People have different opinions on b), i.e. who should be
> the
> > > >> brain
> > > >> > to
> > > >> > >> > make the decision to block a node. I think this largely
> depends
> > > on
> > > >> > who we
> > > >> > >> > talk to. Different users would probably give different
> answers.
> > > For
> > > >> > people
> > > >> > >> > who do have a centralized node health management service, let
> > > Flink
> > > >> > do just
> > > >> > >> > do a) and c) would be preferred. So essentially Flink would
> be
> > > one
> > > >> of
> > > >> > the
> > > >> > >> > sources that may detect unstable nodes, report it to that
> > > service,
> > > >> > and then
> > > >> > >> > take the command from that service to block the problematic
> > > nodes.
> > > >> On
> > > >> > the
> > > >> > >> > other hand, for users who do not have such a service, simply
> > > >> letting
> > > >> > Flink
> > > >> > >> > be clever by itself to block the suspicious nodes might be
> > > desired
> > > >> to
> > > >> > >> > ensure the jobs are running smoothly.
> > > >> > >> >
> > > >> > >> > So that indicates a) and b) here should be pluggable /
> > optional.
> > > >> > >> >
> > > >> > >> > In light of this, maybe it would make sense to have something
> > > >> > pluggable
> > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes
> > > actively.
> > > >> (A
> > > >> > more
> > > >> > >> > general interface should be JobInfoReporter<T> which can be
> > used
> > > to
> > > >> > report
> > > >> > >> > any information of type <T>. But I'll just keep the scope
> > > relevant
> > > >> to
> > > >> > this
> > > >> > >> > FLIP here). Personally speaking, I think it is OK to have a
> > > default
> > > >> > >> > implementation of a reporter which just tells Flink to take
> > > action
> > > >> to
> > > >> > block
> > > >> > >> > problematic nodes and also unblocks them after timeout.
> > > >> > >> >
> > > >> > >> > Thanks,
> > > >> > >> >
> > > >> > >> > Jiangjie (Becket) Qin
> > > >> > >> >
> > > >> > >> >
> > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> > ro.v.boyko@gmail.com
> > > >
> > > >> > wrote:
> > > >> > >> >
> > > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > > >> > >> > >
> > > >> > >> > > If it's possible I'd like to participate in development.
> > > >> > >> > >
> > > >> > >> > > I agree with 3rd point of Konstantin's reply - we should
> > > consider
> > > >> > to move
> > > >> > >> > > somehow the information of blocklisted nodes/TMs from
> active
> > > >> > >> > > ResourceManager to non-active ones. Probably storing inside
> > > >> > >> > > Zookeeper/Configmap might be helpful here.
> > > >> > >> > >
> > > >> > >> > > And I agree with Martijn that a lot of organizations don't
> > want
> > > >> to
> > > >> > expose
> > > >> > >> > > such API for a cluster user group. But I think it's
> necessary
> > > to
> > > >> > have the
> > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding
> > > >> incorrect
> > > >> > >> > > automatic behaviour.
> > > >> > >> > >
> > > >> > >> > > And another one small suggestion - I think it would be
> better
> > > to
> > > >> > extend
> > > >> > >> > the
> > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field and
> > fill
> > > it
> > > >> > at the
> > > >> > >> > > item creation. This simple addition will allow to:
> > > >> > >> > >
> > > >> > >> > >    -
> > > >> > >> > >
> > > >> > >> > >    Provide the ability to users to setup the exact time of
> > > >> > blocklist end
> > > >> > >> > >    through RestAPI
> > > >> > >> > >    -
> > > >> > >> > >
> > > >> > >> > >    Not being tied to a single value of
> > > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > > >> > >> > >
> > > >> > >> > >
> > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > > >> chesnay@apache.org>
> > > >> > >> > wrote:
> > > >> > >> > >
> > > >> > >> > > > I do share the concern between blurring the lines a bit.
> > > >> > >> > > >
> > > >> > >> > > > That said, I'd prefer to not have any auto-detection and
> > only
> > > >> > have an
> > > >> > >> > > > opt-in mechanism
> > > >> > >> > > > to manually block processes/nodes. To me this sounds yet
> > > again
> > > >> > like one
> > > >> > >> > > > of those
> > > >> > >> > > > magical mechanisms that will rarely work just right.
> > > >> > >> > > > An external system can leverage way more information
> after
> > > all.
> > > >> > >> > > >
> > > >> > >> > > > Moreover, I'm quite concerned about the complexity of
> this
> > > >> > proposal.
> > > >> > >> > > > Tracking on both the RM/JM side; syncing between
> > components;
> > > >> > >> > adjustments
> > > >> > >> > > > to the
> > > >> > >> > > > slot and resource protocol.
> > > >> > >> > > >
> > > >> > >> > > > In a way it seems overly complicated.
> > > >> > >> > > >
> > > >> > >> > > > If we look at it purely from an active resource
> management
> > > >> > perspective,
> > > >> > >> > > > then there
> > > >> > >> > > > isn't really a need to touch the slot protocol at all (or
> > in
> > > >> fact
> > > >> > to
> > > >> > >> > > > anything in the JobMaster),
> > > >> > >> > > > because there isn't any point in keeping around blocked
> TMs
> > > in
> > > >> the
> > > >> > >> > first
> > > >> > >> > > > place.
> > > >> > >> > > > They'd just be idling, potentially shutting down after a
> > > while
> > > >> by
> > > >> > the
> > > >> > >> > RM
> > > >> > >> > > > because of
> > > >> > >> > > > it (unless we _also_ touch that logic).
> > > >> > >> > > > Here the blocking of a process (be it by blocking the
> > process
> > > >> or
> > > >> > node)
> > > >> > >> > is
> > > >> > >> > > > equivalent with shutting down the blocked process(es).
> > > >> > >> > > > Once the block is lifted we can just spin it back up.
> > > >> > >> > > >
> > > >> > >> > > > And I do wonder whether we couldn't apply the same line
> of
> > > >> > thinking to
> > > >> > >> > > > standalone resource management.
> > > >> > >> > > > Here being able to stop/restart a process/node manually
> > > should
> > > >> be
> > > >> > a
> > > >> > >> > core
> > > >> > >> > > > requirement for a Flink deployment anyway.
> > > >> > >> > > >
> > > >> > >> > > >
> > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > >> > >> > > > > Hi everyone,
> > > >> > >> > > > >
> > > >> > >> > > > > Thanks for creating this FLIP. I can understand the
> > problem
> > > >> and
> > > >> > I see
> > > >> > >> > > > value
> > > >> > >> > > > > in the automatic detection and blocklisting. I do have
> > some
> > > >> > concerns
> > > >> > >> > > with
> > > >> > >> > > > > the ability to manually specify to be blocked
> resources.
> > I
> > > >> have
> > > >> > two
> > > >> > >> > > > > concerns;
> > > >> > >> > > > >
> > > >> > >> > > > > * Most organizations explicitly have a separation of
> > > >> concerns,
> > > >> > >> > meaning
> > > >> > >> > > > that
> > > >> > >> > > > > there's a group who's responsible for managing a
> cluster
> > > and
> > > >> > there's
> > > >> > >> > a
> > > >> > >> > > > user
> > > >> > >> > > > > group who uses that cluster. With the introduction of
> > this
> > > >> > mechanism,
> > > >> > >> > > the
> > > >> > >> > > > > latter group now can influence the responsibility of
> the
> > > >> first
> > > >> > group.
> > > >> > >> > > So
> > > >> > >> > > > it
> > > >> > >> > > > > can be possible that someone from the user group blocks
> > > >> > something,
> > > >> > >> > > which
> > > >> > >> > > > > causes an outage (which could result in paging
> mechanism
> > > >> > triggering
> > > >> > >> > > etc)
> > > >> > >> > > > > which impacts the first group.
> > > >> > >> > > > > * How big is the group of people who can go through the
> > > >> process
> > > >> > of
> > > >> > >> > > > manually
> > > >> > >> > > > > identifying a node that isn't behaving as it should
> be? I
> > > do
> > > >> > think
> > > >> > >> > this
> > > >> > >> > > > > group is relatively limited. Does it then make sense to
> > > >> > introduce
> > > >> > >> > such
> > > >> > >> > > a
> > > >> > >> > > > > feature, which would only be used by a really small
> user
> > > >> group
> > > >> > of
> > > >> > >> > > Flink?
> > > >> > >> > > > We
> > > >> > >> > > > > still have to maintain, test and support such a
> feature.
> > > >> > >> > > > >
> > > >> > >> > > > > I'm +1 for the autodetection features, but I'm leaning
> > > >> towards
> > > >> > not
> > > >> > >> > > > exposing
> > > >> > >> > > > > this to the user group but having this available
> strictly
> > > for
> > > >> > cluster
> > > >> > >> > > > > operators. They could then also set up their
> > > >> > paging/metrics/logging
> > > >> > >> > > > system
> > > >> > >> > > > > to take this into account.
> > > >> > >> > > > >
> > > >> > >> > > > > Best regards,
> > > >> > >> > > > >
> > > >> > >> > > > > Martijn Visser
> > > >> > >> > > > > https://twitter.com/MartijnVisser82
> > > >> > >> > > > > https://github.com/MartijnVisser
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > > karmagyz@gmail.com
> > > >> >
> > > >> > wrote:
> > > >> > >> > > > >
> > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > > >> > >> > > > >>
> > > >> > >> > > > >> +1 for the overall proposal. Just share some cents
> here:
> > > >> > >> > > > >>
> > > >> > >> > > > >> - Why do we need to expose
> > > >> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval
> > to
> > > >> the
> > > >> > user?
> > > >> > >> > > > >> I think the semantics of
> > > >> > `cluster.resource-blacklist.item.timeout`
> > > >> > >> > is
> > > >> > >> > > > >> sufficient for the user. How to guarantee the timeout
> > > >> > mechanism is
> > > >> > >> > > > >> Flink's internal implementation. I think it will be
> very
> > > >> > confusing
> > > >> > >> > and
> > > >> > >> > > > >> we do not need to expose it to users.
> > > >> > >> > > > >>
> > > >> > >> > > > >> - ResourceManager can notify the exception of a task
> > > >> manager to
> > > >> > >> > > > >> `BlacklistHandler` as well.
> > > >> > >> > > > >> For example, the slot allocation might fail in case
> the
> > > >> target
> > > >> > task
> > > >> > >> > > > >> manager is busy or has a network jitter. I don't mean
> we
> > > >> need
> > > >> > to
> > > >> > >> > cover
> > > >> > >> > > > >> this case in this version, but we can also open a
> > > >> > `notifyException`
> > > >> > >> > in
> > > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > > >> > >> > > > >>
> > > >> > >> > > > >> - Before we sync the blocklist to ResourceManager,
> will
> > > the
> > > >> > slot of
> > > >> > >> > a
> > > >> > >> > > > >> blocked task manager continues to be released and
> > > allocated?
> > > >> > >> > > > >>
> > > >> > >> > > > >> Best,
> > > >> > >> > > > >> Yangze Guo
> > > >> > >> > > > >>
> > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > > >> > >> > wangdachui9501@gmail.com>
> > > >> > >> > > > >> wrote:
> > > >> > >> > > > >>> Hi Konstantin,
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> Thanks for your feedback. I will response your 4
> > remarks:
> > > >> > >> > > > >>>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I
> think
> > > >> > “BlockList”
> > > >> > >> > is
> > > >> > >> > > > >> good
> > > >> > >> > > > >>> enough, and I will change it in FLIP.
> > > >> > >> > > > >>>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea.
> > Based
> > > >> on
> > > >> > the
> > > >> > >> > > > above, I
> > > >> > >> > > > >>> would change REST API as following:
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that
> all
> > > >> task
> > > >> > >> > managers
> > > >> > >> > > > on
> > > >> > >> > > > >>> this node are blocklisted. All slots on these TMs are
> > not
> > > >> > >> > available.
> > > >> > >> > > > This
> > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs are
> not
> > > >> really
> > > >> > lost,
> > > >> > >> > > > they
> > > >> > >> > > > >>> are in an unavailable status, and they are still
> > > registered
> > > >> > in this
> > > >> > >> > > > flink
> > > >> > >> > > > >>> cluster. They will be available again once the
> > > >> corresponding
> > > >> > >> > > blocklist
> > > >> > >> > > > >> item
> > > >> > >> > > > >>> is removed. This behavior is the same in
> > > active/non-active
> > > >> > >> > clusters.
> > > >> > >> > > > >>> However in the active clusters, these TMs may be
> > released
> > > >> due
> > > >> > to
> > > >> > >> > idle
> > > >> > >> > > > >>> timeouts.
> > > >> > >> > > > >>>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The
> > reasons
> > > >> are
> > > >> > as
> > > >> > >> > > > >> following:
> > > >> > >> > > > >>> a) The timeout will not affect users adding or
> removing
> > > >> items
> > > >> > via
> > > >> > >> > > REST
> > > >> > >> > > > >> API,
> > > >> > >> > > > >>> and users can disable it by configuring it to
> > > >> Long.MAX_VALUE .
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> b) Some node problems can recover after a period of
> > time
> > > >> > (such as
> > > >> > >> > > > machine
> > > >> > >> > > > >>> hotspots), in which case users may prefer that Flink
> > can
> > > do
> > > >> > this
> > > >> > >> > > > >>> automatically instead of requiring the user to do it
> > > >> manually.
> > > >> > >> > > > >>>
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> Best,
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> Lijie
> > > >> > >> > > > >>>
> > > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三
> > > >> 19:23写道：
> > > >> > >> > > > >>>
> > > >> > >> > > > >>>> Hi Lijie,
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> I think, this makes sense and +1 to only support
> > > manually
> > > >> > blocking
> > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different
> strategies
> > > can
> > > >> > also be
> > > >> > >> > > > >>>> maintained outside of Apache Flink.
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> A few remarks:
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to
> > the
> > > >> > >> > controversy
> > > >> > >> > > > >> around
> > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about
> this
> > > >> topic a
> > > >> > >> > while
> > > >> > >> > > > >> back
> > > >> > >> > > > >>>> and there was generally a consensus to avoid the
> term
> > > >> > blacklist &
> > > >> > >> > > > >> whitelist
> > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> > > "quarantined"
> > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different
> > > >> design
> > > >> > as
> > > >> > >> > verbs
> > > >> > >> > > > >> like
> > > >> > >> > > > >>>> add/remove often considered an anti-pattern for REST
> > > APIs.
> > > >> > POST
> > > >> > >> > on a
> > > >> > >> > > > >> list
> > > >> > >> > > > >>>> item is generally the standard to add items. DELETE
> on
> > > the
> > > >> > >> > > individual
> > > >> > >> > > > >>>> resource is standard to remove an item.
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> POST <host>/quarantine/items
> > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> We could also consider to separate taskmanagers and
> > > nodes
> > > >> in
> > > >> > the
> > > >> > >> > > REST
> > > >> > >> > > > >> API
> > > >> > >> > > > >>>> (and internal data structures). Any opinion on this?
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > > >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> 3) How would blocking nodes behave with non-active
> > > >> resource
> > > >> > >> > > managers,
> > > >> > >> > > > >> i.e.
> > > >> > >> > > > >>>> standalone or reactive mode?
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> 4) To keep the implementation even more minimal, do
> we
> > > >> need
> > > >> > the
> > > >> > >> > > > timeout
> > > >> > >> > > > >>>> behavior? If items are added/removed manually we
> could
> > > >> > delegate
> > > >> > >> > this
> > > >> > >> > > > >> to the
> > > >> > >> > > > >>>> user easily. In my opinion the timeout behavior
> would
> > > >> better
> > > >> > fit
> > > >> > >> > > into
> > > >> > >> > > > >>>> specific strategies at a later point.
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> Looking forward to your thoughts.
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> Cheers and thank you,
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> Konstantin
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> [1]
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> >
> > > >>
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > >> > >> > > > >>>> [2]
> https://issues.apache.org/jira/browse/FLINK-18209
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie
> Wang
> > <
> > > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > > >> > >> > > > >>>>
> > > >> > >> > > > >>>>> Hi all,
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> Flink job failures may happen due to cluster node
> > > issues
> > > >> > >> > > > >> (insufficient
> > > >> > >> > > > >>>> disk
> > > >> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink
> > will
> > > >> > take care
> > > >> > >> > > of
> > > >> > >> > > > >> the
> > > >> > >> > > > >>>>> failures and redeploy the tasks. However, due to
> data
> > > >> > locality
> > > >> > >> > and
> > > >> > >> > > > >>>> limited
> > > >> > >> > > > >>>>> resources, the new tasks are very likely to be
> > > redeployed
> > > >> > to the
> > > >> > >> > > same
> > > >> > >> > > > >>>>> nodes, which will result in continuous task
> > > abnormalities
> > > >> > and
> > > >> > >> > > affect
> > > >> > >> > > > >> job
> > > >> > >> > > > >>>>> progress.
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> Currently, Flink users need to manually identify
> the
> > > >> > problematic
> > > >> > >> > > > >> node and
> > > >> > >> > > > >>>>> take it offline to solve this problem. But this
> > > approach
> > > >> has
> > > >> > >> > > > >> following
> > > >> > >> > > > >>>>> disadvantages:
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process.
> > Users
> > > >> may
> > > >> > need
> > > >> > >> > to
> > > >> > >> > > > >>>> contact
> > > >> > >> > > > >>>>> cluster administors to do this. The operation can
> > even
> > > be
> > > >> > >> > dangerous
> > > >> > >> > > > >> and
> > > >> > >> > > > >>>> not
> > > >> > >> > > > >>>>> allowed during some important business events.
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> 2. Identifying and solving this kind of problems
> > > manually
> > > >> > would
> > > >> > >> > be
> > > >> > >> > > > >> slow
> > > >> > >> > > > >>>> and
> > > >> > >> > > > >>>>> a waste of human resources.
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to
> > > >> introduce a
> > > >> > >> > > blacklist
> > > >> > >> > > > >>>>> mechanism for Flink to filter out problematic
> > > resources.
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking
> > > forward
> > > >> > to your
> > > >> > >> > > > >>>> feedback.
> > > >> > >> > > > >>>>> [1]
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> Best,
> > > >> > >> > > > >>>>>
> > > >> > >> > > > >>>>> Lijie
> > > >> > >> > > > >>>>>
> > > >> > >> > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> >
> > > >>
> > > >
> > >
> >
>
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.boyko@gmail.com
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Roman Boyko <ro...@gmail.com>.

Hi Lijie!




*a) “Probably storing inside Zookeeper/Configmap might be helpfulhere.”
Can you explain it in detail? I don't fully understand that. In myopinion,
non-active and active are the same, and no special treatment isrequired.*

Sorry this was a misunderstanding from my side. I thought we were talking
about the HA mode (but not about Active and Standalone ResourceManager).
And the original question was - how to handle the blacklisted nodes list at
the moment of leader change? Should we simply forget about them or try to
pre-save that list on the remote storage?

On Sat, 7 May 2022 at 10:51, Yang Wang <da...@gmail.com> wrote:

> Thanks Lijie and ZhuZhu for the explanation.
>
> I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed
> some functionalities the external tools(e.g. kubectl taint) could not
> support.
>
>
> Best,
> Yang
>
> Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：
>
> > Thanks for your feedback, Jiangang and Martijn.
> >
> > @Jiangang
> >
> >
> > > For auto-detecting, I wonder how to make the strategy and mark a node
> > blocked?
> >
> > In fact, we currently plan to not support auto-detection in this FLIP.
> The
> > part about auto-detection may be continued in a separate FLIP in the
> > future. Some guys have the same concerns as you, and the correctness and
> > necessity of auto-detection may require further discussion in the future.
> >
> > > In session mode, multi jobs can fail on the same bad node and the node
> > should be marked blocked.
> > By design, the blocklist information will be shared among all jobs in a
> > cluster/session. The JM will sync blocklist information with RM.
> >
> > @Martijn
> >
> > > I agree with Yang Wang on this.
> > As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just
> limits
> > the load of the node and does not  kill all the processes on it) is also
> > important, and we think that external systems (*yarn rmadmin or kubectl
> > taint*) cannot support it. So we think it makes sense even only
> *manually*.
> >
> > > I also agree with Chesnay that magical mechanisms are indeed super hard
> > to get right.
> > Yes, as you see, Jiangang(and a few others) have the same concern.
> > However, we currently plan to not support auto-detection in this FLIP,
> and
> > only *manually*. In addition, I'd like to say that the FLIP provides a
> > mechanism to support MARK_BLOCKLISTED and
> > MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> > the auto-detection may be done by external systems.
> >
> > Best,
> > Lijie
> >
> > Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
> >
> > > > If we only support to block nodes manually, then I could not see
> > > the obvious advantages compared with current SRE's approach(via *yarn
> > > rmadmin or kubectl taint*).
> > >
> > > I agree with Yang Wang on this.
> > >
> > > >  To me this sounds yet again like one of those magical mechanisms
> that
> > > will rarely work just right.
> > >
> > > I also agree with Chesnay that magical mechanisms are indeed super hard
> > to
> > > get right.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <li...@gmail.com>
> > > wrote:
> > >
> > >> Thanks for the valuable design. The auto-detecting can decrease great
> > work
> > >> for us. We have implemented the similar feature in our inner flink
> > >> version.
> > >> Below is something that I care about:
> > >>
> > >>    1. For auto-detecting, I wonder how to make the strategy and mark a
> > >> node
> > >>    blocked? Sometimes the blocked node is hard to be detected, for
> > >> example,
> > >>    the upper node or the down node will be blocked when network
> > >> unreachable.
> > >>    2. I see that the strategy is made in JobMaster side. How about
> > >>    implementing the similar logic in resource manager? In session
> mode,
> > >> multi
> > >>    jobs can fail on the same bad node and the node should be marked
> > >> blocked.
> > >>    If the job makes the strategy, the node may be not marked blocked
> if
> > >> the
> > >>    fail times don't exceed the threshold.
> > >>
> > >>
> > >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> > >>
> > >> > Thank you for all your feedback!
> > >> >
> > >> > Besides the answers from Lijie, I'd like to share some of my
> thoughts:
> > >> > 1. Whether to enable automatical blocklist
> > >> > Generally speaking, it is not a goal of FLIP-224.
> > >> > The automatical way should be something built upon the blocklist
> > >> > mechanism and well decoupled. It was designed to be a configurable
> > >> > blocklist strategy, but I think we can further decouple it by
> > >> > introducing a abnormal node detector, as Becket suggested, which
> just
> > >> > uses the blocklist mechanism once bad nodes are detected. However,
> it
> > >> > should be a separate FLIP with further dev discussions and feedback
> > >> > from users. I also agree with Becket that different users have
> > different
> > >> > requirements, and we should listen to them.
> > >> >
> > >> > 2. Is it enough to just take away abnormal nodes externally
> > >> > My answer is no. As Lijie has mentioned, we need a way to avoid
> > >> > deploying tasks to temporary hot nodes. In this case, users may just
> > >> > want to limit the load of the node and do not want to kill all the
> > >> > processes on it. Another case is the speculative execution[1] which
> > >> > may also leverage this feature to avoid starting mirror tasks on
> slow
> > >> > nodes.
> > >> >
> > >> > Thanks,
> > >> > Zhu
> > >> >
> > >> > [1]
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > >> >
> > >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
> > >> >
> > >> > >
> > >> > > Hi everyone,
> > >> > >
> > >> > >
> > >> > > Thanks for your feedback.
> > >> > >
> > >> > >
> > >> > > There's one detail that I'd like to re-emphasize here because it
> can
> > >> > affect the value and design of the blocklist mechanism (perhaps I
> > should
> > >> > highlight it in the FLIP). We propose two actions in FLIP:
> > >> > >
> > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as
> blocked.
> > >> > Future slots should not be allocated from the blocked task manager
> or
> > >> node.
> > >> > But slots that are already allocated will not be affected. A typical
> > >> > application scenario is to mitigate machine hotspots. In this case,
> we
> > >> hope
> > >> > that subsequent resource allocations will not be on the hot machine,
> > but
> > >> > tasks currently running on it should not be affected.
> > >> > >
> > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or
> > node
> > >> as
> > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be
> > >> restarted on
> > >> > non-blocked task managers.
> > >> > >
> > >> > > For the above 2 actions, the former may more highlight the meaning
> > of
> > >> > this FLIP, because the external system cannot do that.
> > >> > >
> > >> > >
> > >> > > Regarding *Manually* and *Automatically*, I basically agree with
> > >> @Becket
> > >> > Qin: different users have different answers. Not all users’
> deployment
> > >> > environments have a special external system that can perform the
> > anomaly
> > >> > detection. In addition, adding pluggable/optional auto-detection
> > doesn't
> > >> > require much extra work on top of manual specification.
> > >> > >
> > >> > >
> > >> > > I will answer your other questions one by one.
> > >> > >
> > >> > >
> > >> > > @Yangze
> > >> > >
> > >> > > a) I think you are right, we do not need to expose the
> > >> > `cluster.resource-blocklist.item.timeout-check-interval` to users.
> > >> > >
> > >> > > b) We can abstract the `notifyException` to a separate interface
> > >> (maybe
> > >> > BlocklistExceptionListener), and the ResourceManagerBlocklistHandler
> > can
> > >> > implement it in the future.
> > >> > >
> > >> > >
> > >> > > @Martijn
> > >> > >
> > >> > > a) I also think the manual blocking should be done by cluster
> > >> operators.
> > >> > >
> > >> > > b) I think manual blocking makes sense, because according to my
> > >> > experience, users are often the first to perceive the machine
> problems
> > >> > (because of job failover or delay), and they will contact cluster
> > >> operators
> > >> > to solve it, or even tell the cluster operators which machine is
> > >> > problematic. From this point of view, I think the people who really
> > need
> > >> > the manual blocking are the users, and it’s just performed by the
> > >> cluster
> > >> > operator, so I think the manual blocking makes sense.
> > >> > >
> > >> > >
> > >> > > @Chesnay
> > >> > >
> > >> > > We need to touch the logic of JM/SlotPool, because for
> > >> MARK_BLOCKLISTED
> > >> > , we need to know whether the slot is blocklisted when the task is
> > >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot
> > >> > directly to avoid assigning other tasks (of this job) on it. If we
> > only
> > >> > maintain the blocklist information on the RM, JM needs to retrieve
> it
> > by
> > >> > RPC. I think the performance overhead of that is relatively large,
> so
> > I
> > >> > think it's worth maintaining the blocklist information on the JM
> side
> > >> and
> > >> > syncing them.
> > >> > >
> > >> > >
> > >> > > @Роман
> > >> > >
> > >> > >     a) “Probably storing inside Zookeeper/Configmap might be
> helpful
> > >> > here.”  Can you explain it in detail? I don't fully understand that.
> > In
> > >> my
> > >> > opinion, non-active and active are the same, and no special
> treatment
> > is
> > >> > required.
> > >> > >
> > >> > > b) I agree with you, the `endTimestamp` makes sense, I will add it
> > to
> > >> > FLIP.
> > >> > >
> > >> > >
> > >> > > @Yang
> > >> > >
> > >> > > As mentioned above, AFAK, the external system cannot support the
> > >> > MARK_BLOCKLISTED action.
> > >> > >
> > >> > >
> > >> > > Looking forward to your further feedback.
> > >> > >
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Lijie
> > >> > >
> > >> > >
> > >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> > >> > >>
> > >> > >> Thanks Lijie and Zhu for creating the proposal.
> > >> > >>
> > >> > >> I want to share some thoughts about Flink cluster operations.
> > >> > >>
> > >> > >> In the production environment, the SRE(aka Site Reliability
> > Engineer)
> > >> > >> already has many tools to detect the unstable nodes, which could
> > take
> > >> > the
> > >> > >> system logs/metrics into consideration.
> > >> > >> Then they use graceful-decomission in YARN and taint in K8s to
> > >> prevent
> > >> > new
> > >> > >> allocations on these unstable nodes.
> > >> > >> At last, they will evict all the containers and pods running on
> > these
> > >> > nodes.
> > >> > >> This mechanism also works for planned maintenance. So I am afraid
> > >> this
> > >> > is
> > >> > >> not the typical use case for FLIP-224.
> > >> > >>
> > >> > >> If we only support to block nodes manually, then I could not see
> > >> > >> the obvious advantages compared with current SRE's approach(via
> > *yarn
> > >> > >> rmadmin or kubectl taint*).
> > >> > >> At least, we need to have a pluggable component which could
> expose
> > >> the
> > >> > >> potential unstable nodes automatically and block them if enabled
> > >> > explicitly.
> > >> > >>
> > >> > >>
> > >> > >> Best,
> > >> > >> Yang
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> > >> > >>
> > >> > >> > Thanks for the proposal, Lijie.
> > >> > >> >
> > >> > >> > This is an interesting feature and discussion, and somewhat
> > related
> > >> > to the
> > >> > >> > design principle about how people should operate Flink.
> > >> > >> >
> > >> > >> > I think there are three things involved in this FLIP.
> > >> > >> >      a) Detect and report the unstable node.
> > >> > >> >      b) Collect the information of the unstable node and form a
> > >> > blocklist.
> > >> > >> >      c) Take the action to block nodes.
> > >> > >> >
> > >> > >> > My two cents:
> > >> > >> >
> > >> > >> > 1. It looks like people all agree that Flink should have c). It
> > is
> > >> > not only
> > >> > >> > useful for cases of node failures, but also handy for some
> > planned
> > >> > >> > maintenance.
> > >> > >> >
> > >> > >> > 2. People have different opinions on b), i.e. who should be the
> > >> brain
> > >> > to
> > >> > >> > make the decision to block a node. I think this largely depends
> > on
> > >> > who we
> > >> > >> > talk to. Different users would probably give different answers.
> > For
> > >> > people
> > >> > >> > who do have a centralized node health management service, let
> > Flink
> > >> > do just
> > >> > >> > do a) and c) would be preferred. So essentially Flink would be
> > one
> > >> of
> > >> > the
> > >> > >> > sources that may detect unstable nodes, report it to that
> > service,
> > >> > and then
> > >> > >> > take the command from that service to block the problematic
> > nodes.
> > >> On
> > >> > the
> > >> > >> > other hand, for users who do not have such a service, simply
> > >> letting
> > >> > Flink
> > >> > >> > be clever by itself to block the suspicious nodes might be
> > desired
> > >> to
> > >> > >> > ensure the jobs are running smoothly.
> > >> > >> >
> > >> > >> > So that indicates a) and b) here should be pluggable /
> optional.
> > >> > >> >
> > >> > >> > In light of this, maybe it would make sense to have something
> > >> > pluggable
> > >> > >> > like a UnstableNodeReporter which exposes unstable nodes
> > actively.
> > >> (A
> > >> > more
> > >> > >> > general interface should be JobInfoReporter<T> which can be
> used
> > to
> > >> > report
> > >> > >> > any information of type <T>. But I'll just keep the scope
> > relevant
> > >> to
> > >> > this
> > >> > >> > FLIP here). Personally speaking, I think it is OK to have a
> > default
> > >> > >> > implementation of a reporter which just tells Flink to take
> > action
> > >> to
> > >> > block
> > >> > >> > problematic nodes and also unblocks them after timeout.
> > >> > >> >
> > >> > >> > Thanks,
> > >> > >> >
> > >> > >> > Jiangjie (Becket) Qin
> > >> > >> >
> > >> > >> >
> > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <
> ro.v.boyko@gmail.com
> > >
> > >> > wrote:
> > >> > >> >
> > >> > >> > > Thanks for good initiative, Lijie and Zhu!
> > >> > >> > >
> > >> > >> > > If it's possible I'd like to participate in development.
> > >> > >> > >
> > >> > >> > > I agree with 3rd point of Konstantin's reply - we should
> > consider
> > >> > to move
> > >> > >> > > somehow the information of blocklisted nodes/TMs from active
> > >> > >> > > ResourceManager to non-active ones. Probably storing inside
> > >> > >> > > Zookeeper/Configmap might be helpful here.
> > >> > >> > >
> > >> > >> > > And I agree with Martijn that a lot of organizations don't
> want
> > >> to
> > >> > expose
> > >> > >> > > such API for a cluster user group. But I think it's necessary
> > to
> > >> > have the
> > >> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding
> > >> incorrect
> > >> > >> > > automatic behaviour.
> > >> > >> > >
> > >> > >> > > And another one small suggestion - I think it would be better
> > to
> > >> > extend
> > >> > >> > the
> > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field and
> fill
> > it
> > >> > at the
> > >> > >> > > item creation. This simple addition will allow to:
> > >> > >> > >
> > >> > >> > >    -
> > >> > >> > >
> > >> > >> > >    Provide the ability to users to setup the exact time of
> > >> > blocklist end
> > >> > >> > >    through RestAPI
> > >> > >> > >    -
> > >> > >> > >
> > >> > >> > >    Not being tied to a single value of
> > >> > >> > >    *cluster.resource-blacklist.item.timeout*
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> > >> chesnay@apache.org>
> > >> > >> > wrote:
> > >> > >> > >
> > >> > >> > > > I do share the concern between blurring the lines a bit.
> > >> > >> > > >
> > >> > >> > > > That said, I'd prefer to not have any auto-detection and
> only
> > >> > have an
> > >> > >> > > > opt-in mechanism
> > >> > >> > > > to manually block processes/nodes. To me this sounds yet
> > again
> > >> > like one
> > >> > >> > > > of those
> > >> > >> > > > magical mechanisms that will rarely work just right.
> > >> > >> > > > An external system can leverage way more information after
> > all.
> > >> > >> > > >
> > >> > >> > > > Moreover, I'm quite concerned about the complexity of this
> > >> > proposal.
> > >> > >> > > > Tracking on both the RM/JM side; syncing between
> components;
> > >> > >> > adjustments
> > >> > >> > > > to the
> > >> > >> > > > slot and resource protocol.
> > >> > >> > > >
> > >> > >> > > > In a way it seems overly complicated.
> > >> > >> > > >
> > >> > >> > > > If we look at it purely from an active resource management
> > >> > perspective,
> > >> > >> > > > then there
> > >> > >> > > > isn't really a need to touch the slot protocol at all (or
> in
> > >> fact
> > >> > to
> > >> > >> > > > anything in the JobMaster),
> > >> > >> > > > because there isn't any point in keeping around blocked TMs
> > in
> > >> the
> > >> > >> > first
> > >> > >> > > > place.
> > >> > >> > > > They'd just be idling, potentially shutting down after a
> > while
> > >> by
> > >> > the
> > >> > >> > RM
> > >> > >> > > > because of
> > >> > >> > > > it (unless we _also_ touch that logic).
> > >> > >> > > > Here the blocking of a process (be it by blocking the
> process
> > >> or
> > >> > node)
> > >> > >> > is
> > >> > >> > > > equivalent with shutting down the blocked process(es).
> > >> > >> > > > Once the block is lifted we can just spin it back up.
> > >> > >> > > >
> > >> > >> > > > And I do wonder whether we couldn't apply the same line of
> > >> > thinking to
> > >> > >> > > > standalone resource management.
> > >> > >> > > > Here being able to stop/restart a process/node manually
> > should
> > >> be
> > >> > a
> > >> > >> > core
> > >> > >> > > > requirement for a Flink deployment anyway.
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > >> > >> > > > > Hi everyone,
> > >> > >> > > > >
> > >> > >> > > > > Thanks for creating this FLIP. I can understand the
> problem
> > >> and
> > >> > I see
> > >> > >> > > > value
> > >> > >> > > > > in the automatic detection and blocklisting. I do have
> some
> > >> > concerns
> > >> > >> > > with
> > >> > >> > > > > the ability to manually specify to be blocked resources.
> I
> > >> have
> > >> > two
> > >> > >> > > > > concerns;
> > >> > >> > > > >
> > >> > >> > > > > * Most organizations explicitly have a separation of
> > >> concerns,
> > >> > >> > meaning
> > >> > >> > > > that
> > >> > >> > > > > there's a group who's responsible for managing a cluster
> > and
> > >> > there's
> > >> > >> > a
> > >> > >> > > > user
> > >> > >> > > > > group who uses that cluster. With the introduction of
> this
> > >> > mechanism,
> > >> > >> > > the
> > >> > >> > > > > latter group now can influence the responsibility of the
> > >> first
> > >> > group.
> > >> > >> > > So
> > >> > >> > > > it
> > >> > >> > > > > can be possible that someone from the user group blocks
> > >> > something,
> > >> > >> > > which
> > >> > >> > > > > causes an outage (which could result in paging mechanism
> > >> > triggering
> > >> > >> > > etc)
> > >> > >> > > > > which impacts the first group.
> > >> > >> > > > > * How big is the group of people who can go through the
> > >> process
> > >> > of
> > >> > >> > > > manually
> > >> > >> > > > > identifying a node that isn't behaving as it should be? I
> > do
> > >> > think
> > >> > >> > this
> > >> > >> > > > > group is relatively limited. Does it then make sense to
> > >> > introduce
> > >> > >> > such
> > >> > >> > > a
> > >> > >> > > > > feature, which would only be used by a really small user
> > >> group
> > >> > of
> > >> > >> > > Flink?
> > >> > >> > > > We
> > >> > >> > > > > still have to maintain, test and support such a feature.
> > >> > >> > > > >
> > >> > >> > > > > I'm +1 for the autodetection features, but I'm leaning
> > >> towards
> > >> > not
> > >> > >> > > > exposing
> > >> > >> > > > > this to the user group but having this available strictly
> > for
> > >> > cluster
> > >> > >> > > > > operators. They could then also set up their
> > >> > paging/metrics/logging
> > >> > >> > > > system
> > >> > >> > > > > to take this into account.
> > >> > >> > > > >
> > >> > >> > > > > Best regards,
> > >> > >> > > > >
> > >> > >> > > > > Martijn Visser
> > >> > >> > > > > https://twitter.com/MartijnVisser82
> > >> > >> > > > > https://github.com/MartijnVisser
> > >> > >> > > > >
> > >> > >> > > > >
> > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> > karmagyz@gmail.com
> > >> >
> > >> > wrote:
> > >> > >> > > > >
> > >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > >> > >> > > > >>
> > >> > >> > > > >> +1 for the overall proposal. Just share some cents here:
> > >> > >> > > > >>
> > >> > >> > > > >> - Why do we need to expose
> > >> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval
> to
> > >> the
> > >> > user?
> > >> > >> > > > >> I think the semantics of
> > >> > `cluster.resource-blacklist.item.timeout`
> > >> > >> > is
> > >> > >> > > > >> sufficient for the user. How to guarantee the timeout
> > >> > mechanism is
> > >> > >> > > > >> Flink's internal implementation. I think it will be very
> > >> > confusing
> > >> > >> > and
> > >> > >> > > > >> we do not need to expose it to users.
> > >> > >> > > > >>
> > >> > >> > > > >> - ResourceManager can notify the exception of a task
> > >> manager to
> > >> > >> > > > >> `BlacklistHandler` as well.
> > >> > >> > > > >> For example, the slot allocation might fail in case the
> > >> target
> > >> > task
> > >> > >> > > > >> manager is busy or has a network jitter. I don't mean we
> > >> need
> > >> > to
> > >> > >> > cover
> > >> > >> > > > >> this case in this version, but we can also open a
> > >> > `notifyException`
> > >> > >> > in
> > >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > >> > >> > > > >>
> > >> > >> > > > >> - Before we sync the blocklist to ResourceManager, will
> > the
> > >> > slot of
> > >> > >> > a
> > >> > >> > > > >> blocked task manager continues to be released and
> > allocated?
> > >> > >> > > > >>
> > >> > >> > > > >> Best,
> > >> > >> > > > >> Yangze Guo
> > >> > >> > > > >>
> > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > >> > >> > wangdachui9501@gmail.com>
> > >> > >> > > > >> wrote:
> > >> > >> > > > >>> Hi Konstantin,
> > >> > >> > > > >>>
> > >> > >> > > > >>> Thanks for your feedback. I will response your 4
> remarks:
> > >> > >> > > > >>>
> > >> > >> > > > >>>
> > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I think
> > >> > “BlockList”
> > >> > >> > is
> > >> > >> > > > >> good
> > >> > >> > > > >>> enough, and I will change it in FLIP.
> > >> > >> > > > >>>
> > >> > >> > > > >>>
> > >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea.
> Based
> > >> on
> > >> > the
> > >> > >> > > > above, I
> > >> > >> > > > >>> would change REST API as following:
> > >> > >> > > > >>>
> > >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > >> > >> > > > >>>
> > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > >> > >> > > > >>>
> > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > >> > >> > > > >>>
> > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > >> > >> > > > >>>
> > >> > >> > > > >>>
> > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that all
> > >> task
> > >> > >> > managers
> > >> > >> > > > on
> > >> > >> > > > >>> this node are blocklisted. All slots on these TMs are
> not
> > >> > >> > available.
> > >> > >> > > > This
> > >> > >> > > > >>> is actually a bit like TM losts, but these TMs are not
> > >> really
> > >> > lost,
> > >> > >> > > > they
> > >> > >> > > > >>> are in an unavailable status, and they are still
> > registered
> > >> > in this
> > >> > >> > > > flink
> > >> > >> > > > >>> cluster. They will be available again once the
> > >> corresponding
> > >> > >> > > blocklist
> > >> > >> > > > >> item
> > >> > >> > > > >>> is removed. This behavior is the same in
> > active/non-active
> > >> > >> > clusters.
> > >> > >> > > > >>> However in the active clusters, these TMs may be
> released
> > >> due
> > >> > to
> > >> > >> > idle
> > >> > >> > > > >>> timeouts.
> > >> > >> > > > >>>
> > >> > >> > > > >>>
> > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The
> reasons
> > >> are
> > >> > as
> > >> > >> > > > >> following:
> > >> > >> > > > >>> a) The timeout will not affect users adding or removing
> > >> items
> > >> > via
> > >> > >> > > REST
> > >> > >> > > > >> API,
> > >> > >> > > > >>> and users can disable it by configuring it to
> > >> Long.MAX_VALUE .
> > >> > >> > > > >>>
> > >> > >> > > > >>> b) Some node problems can recover after a period of
> time
> > >> > (such as
> > >> > >> > > > machine
> > >> > >> > > > >>> hotspots), in which case users may prefer that Flink
> can
> > do
> > >> > this
> > >> > >> > > > >>> automatically instead of requiring the user to do it
> > >> manually.
> > >> > >> > > > >>>
> > >> > >> > > > >>>
> > >> > >> > > > >>> Best,
> > >> > >> > > > >>>
> > >> > >> > > > >>> Lijie
> > >> > >> > > > >>>
> > >> > >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三
> > >> 19:23写道：
> > >> > >> > > > >>>
> > >> > >> > > > >>>> Hi Lijie,
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> I think, this makes sense and +1 to only support
> > manually
> > >> > blocking
> > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies
> > can
> > >> > also be
> > >> > >> > > > >>>> maintained outside of Apache Flink.
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> A few remarks:
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to
> the
> > >> > >> > controversy
> > >> > >> > > > >> around
> > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this
> > >> topic a
> > >> > >> > while
> > >> > >> > > > >> back
> > >> > >> > > > >>>> and there was generally a consensus to avoid the term
> > >> > blacklist &
> > >> > >> > > > >> whitelist
> > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> > "quarantined"
> > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different
> > >> design
> > >> > as
> > >> > >> > verbs
> > >> > >> > > > >> like
> > >> > >> > > > >>>> add/remove often considered an anti-pattern for REST
> > APIs.
> > >> > POST
> > >> > >> > on a
> > >> > >> > > > >> list
> > >> > >> > > > >>>> item is generally the standard to add items. DELETE on
> > the
> > >> > >> > > individual
> > >> > >> > > > >>>> resource is standard to remove an item.
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> POST <host>/quarantine/items
> > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> We could also consider to separate taskmanagers and
> > nodes
> > >> in
> > >> > the
> > >> > >> > > REST
> > >> > >> > > > >> API
> > >> > >> > > > >>>> (and internal data structures). Any opinion on this?
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> 3) How would blocking nodes behave with non-active
> > >> resource
> > >> > >> > > managers,
> > >> > >> > > > >> i.e.
> > >> > >> > > > >>>> standalone or reactive mode?
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> 4) To keep the implementation even more minimal, do we
> > >> need
> > >> > the
> > >> > >> > > > timeout
> > >> > >> > > > >>>> behavior? If items are added/removed manually we could
> > >> > delegate
> > >> > >> > this
> > >> > >> > > > >> to the
> > >> > >> > > > >>>> user easily. In my opinion the timeout behavior would
> > >> better
> > >> > fit
> > >> > >> > > into
> > >> > >> > > > >>>> specific strategies at a later point.
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> Looking forward to your thoughts.
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> Cheers and thank you,
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> Konstantin
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> [1]
> > >> > >> > > > >>>>
> > >> > >> > > > >>>>
> > >> > >> > > > >>
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> >
> > >>
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > >> > >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> > >> > >> > > > >>>>
> > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang
> <
> > >> > >> > > > >>>> wangdachui9501@gmail.com>:
> > >> > >> > > > >>>>
> > >> > >> > > > >>>>> Hi all,
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> Flink job failures may happen due to cluster node
> > issues
> > >> > >> > > > >> (insufficient
> > >> > >> > > > >>>> disk
> > >> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink
> will
> > >> > take care
> > >> > >> > > of
> > >> > >> > > > >> the
> > >> > >> > > > >>>>> failures and redeploy the tasks. However, due to data
> > >> > locality
> > >> > >> > and
> > >> > >> > > > >>>> limited
> > >> > >> > > > >>>>> resources, the new tasks are very likely to be
> > redeployed
> > >> > to the
> > >> > >> > > same
> > >> > >> > > > >>>>> nodes, which will result in continuous task
> > abnormalities
> > >> > and
> > >> > >> > > affect
> > >> > >> > > > >> job
> > >> > >> > > > >>>>> progress.
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> Currently, Flink users need to manually identify the
> > >> > problematic
> > >> > >> > > > >> node and
> > >> > >> > > > >>>>> take it offline to solve this problem. But this
> > approach
> > >> has
> > >> > >> > > > >> following
> > >> > >> > > > >>>>> disadvantages:
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process.
> Users
> > >> may
> > >> > need
> > >> > >> > to
> > >> > >> > > > >>>> contact
> > >> > >> > > > >>>>> cluster administors to do this. The operation can
> even
> > be
> > >> > >> > dangerous
> > >> > >> > > > >> and
> > >> > >> > > > >>>> not
> > >> > >> > > > >>>>> allowed during some important business events.
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> 2. Identifying and solving this kind of problems
> > manually
> > >> > would
> > >> > >> > be
> > >> > >> > > > >> slow
> > >> > >> > > > >>>> and
> > >> > >> > > > >>>>> a waste of human resources.
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to
> > >> introduce a
> > >> > >> > > blacklist
> > >> > >> > > > >>>>> mechanism for Flink to filter out problematic
> > resources.
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking
> > forward
> > >> > to your
> > >> > >> > > > >>>> feedback.
> > >> > >> > > > >>>>> [1]
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>>
> > >> > >> > > > >>
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> Best,
> > >> > >> > > > >>>>>
> > >> > >> > > > >>>>> Lijie
> > >> > >> > > > >>>>>
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> >
> > >>
> > >
> >
>


-- 
Best regards,
Roman Boyko
e.: ro.v.boyko@gmail.com

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Yang Wang <da...@gmail.com>.

Thanks Lijie and ZhuZhu for the explanation.

I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed
some functionalities the external tools(e.g. kubectl taint) could not
support.


Best,
Yang

Lijie Wang <wa...@gmail.com> 于2022年5月6日周五 22:18写道：

> Thanks for your feedback, Jiangang and Martijn.
>
> @Jiangang
>
>
> > For auto-detecting, I wonder how to make the strategy and mark a node
> blocked?
>
> In fact, we currently plan to not support auto-detection in this FLIP. The
> part about auto-detection may be continued in a separate FLIP in the
> future. Some guys have the same concerns as you, and the correctness and
> necessity of auto-detection may require further discussion in the future.
>
> > In session mode, multi jobs can fail on the same bad node and the node
> should be marked blocked.
> By design, the blocklist information will be shared among all jobs in a
> cluster/session. The JM will sync blocklist information with RM.
>
> @Martijn
>
> > I agree with Yang Wang on this.
> As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just limits
> the load of the node and does not  kill all the processes on it) is also
> important, and we think that external systems (*yarn rmadmin or kubectl
> taint*) cannot support it. So we think it makes sense even only *manually*.
>
> > I also agree with Chesnay that magical mechanisms are indeed super hard
> to get right.
> Yes, as you see, Jiangang(and a few others) have the same concern.
> However, we currently plan to not support auto-detection in this FLIP, and
> only *manually*. In addition, I'd like to say that the FLIP provides a
> mechanism to support MARK_BLOCKLISTED and
> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> the auto-detection may be done by external systems.
>
> Best,
> Lijie
>
> Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：
>
> > > If we only support to block nodes manually, then I could not see
> > the obvious advantages compared with current SRE's approach(via *yarn
> > rmadmin or kubectl taint*).
> >
> > I agree with Yang Wang on this.
> >
> > >  To me this sounds yet again like one of those magical mechanisms that
> > will rarely work just right.
> >
> > I also agree with Chesnay that magical mechanisms are indeed super hard
> to
> > get right.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Fri, 6 May 2022 at 12:03, Jiangang Liu <li...@gmail.com>
> > wrote:
> >
> >> Thanks for the valuable design. The auto-detecting can decrease great
> work
> >> for us. We have implemented the similar feature in our inner flink
> >> version.
> >> Below is something that I care about:
> >>
> >>    1. For auto-detecting, I wonder how to make the strategy and mark a
> >> node
> >>    blocked? Sometimes the blocked node is hard to be detected, for
> >> example,
> >>    the upper node or the down node will be blocked when network
> >> unreachable.
> >>    2. I see that the strategy is made in JobMaster side. How about
> >>    implementing the similar logic in resource manager? In session mode,
> >> multi
> >>    jobs can fail on the same bad node and the node should be marked
> >> blocked.
> >>    If the job makes the strategy, the node may be not marked blocked if
> >> the
> >>    fail times don't exceed the threshold.
> >>
> >>
> >> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
> >>
> >> > Thank you for all your feedback!
> >> >
> >> > Besides the answers from Lijie, I'd like to share some of my thoughts:
> >> > 1. Whether to enable automatical blocklist
> >> > Generally speaking, it is not a goal of FLIP-224.
> >> > The automatical way should be something built upon the blocklist
> >> > mechanism and well decoupled. It was designed to be a configurable
> >> > blocklist strategy, but I think we can further decouple it by
> >> > introducing a abnormal node detector, as Becket suggested, which just
> >> > uses the blocklist mechanism once bad nodes are detected. However, it
> >> > should be a separate FLIP with further dev discussions and feedback
> >> > from users. I also agree with Becket that different users have
> different
> >> > requirements, and we should listen to them.
> >> >
> >> > 2. Is it enough to just take away abnormal nodes externally
> >> > My answer is no. As Lijie has mentioned, we need a way to avoid
> >> > deploying tasks to temporary hot nodes. In this case, users may just
> >> > want to limit the load of the node and do not want to kill all the
> >> > processes on it. Another case is the speculative execution[1] which
> >> > may also leverage this feature to avoid starting mirror tasks on slow
> >> > nodes.
> >> >
> >> > Thanks,
> >> > Zhu
> >> >
> >> > [1]
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> >
> >> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
> >> >
> >> > >
> >> > > Hi everyone,
> >> > >
> >> > >
> >> > > Thanks for your feedback.
> >> > >
> >> > >
> >> > > There's one detail that I'd like to re-emphasize here because it can
> >> > affect the value and design of the blocklist mechanism (perhaps I
> should
> >> > highlight it in the FLIP). We propose two actions in FLIP:
> >> > >
> >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked.
> >> > Future slots should not be allocated from the blocked task manager or
> >> node.
> >> > But slots that are already allocated will not be affected. A typical
> >> > application scenario is to mitigate machine hotspots. In this case, we
> >> hope
> >> > that subsequent resource allocations will not be on the hot machine,
> but
> >> > tasks currently running on it should not be affected.
> >> > >
> >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or
> node
> >> as
> >> > blocked, and evacuate all tasks on it. Evacuated tasks will be
> >> restarted on
> >> > non-blocked task managers.
> >> > >
> >> > > For the above 2 actions, the former may more highlight the meaning
> of
> >> > this FLIP, because the external system cannot do that.
> >> > >
> >> > >
> >> > > Regarding *Manually* and *Automatically*, I basically agree with
> >> @Becket
> >> > Qin: different users have different answers. Not all users’ deployment
> >> > environments have a special external system that can perform the
> anomaly
> >> > detection. In addition, adding pluggable/optional auto-detection
> doesn't
> >> > require much extra work on top of manual specification.
> >> > >
> >> > >
> >> > > I will answer your other questions one by one.
> >> > >
> >> > >
> >> > > @Yangze
> >> > >
> >> > > a) I think you are right, we do not need to expose the
> >> > `cluster.resource-blocklist.item.timeout-check-interval` to users.
> >> > >
> >> > > b) We can abstract the `notifyException` to a separate interface
> >> (maybe
> >> > BlocklistExceptionListener), and the ResourceManagerBlocklistHandler
> can
> >> > implement it in the future.
> >> > >
> >> > >
> >> > > @Martijn
> >> > >
> >> > > a) I also think the manual blocking should be done by cluster
> >> operators.
> >> > >
> >> > > b) I think manual blocking makes sense, because according to my
> >> > experience, users are often the first to perceive the machine problems
> >> > (because of job failover or delay), and they will contact cluster
> >> operators
> >> > to solve it, or even tell the cluster operators which machine is
> >> > problematic. From this point of view, I think the people who really
> need
> >> > the manual blocking are the users, and it’s just performed by the
> >> cluster
> >> > operator, so I think the manual blocking makes sense.
> >> > >
> >> > >
> >> > > @Chesnay
> >> > >
> >> > > We need to touch the logic of JM/SlotPool, because for
> >> MARK_BLOCKLISTED
> >> > , we need to know whether the slot is blocklisted when the task is
> >> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot
> >> > directly to avoid assigning other tasks (of this job) on it. If we
> only
> >> > maintain the blocklist information on the RM, JM needs to retrieve it
> by
> >> > RPC. I think the performance overhead of that is relatively large, so
> I
> >> > think it's worth maintaining the blocklist information on the JM side
> >> and
> >> > syncing them.
> >> > >
> >> > >
> >> > > @Роман
> >> > >
> >> > >     a) “Probably storing inside Zookeeper/Configmap might be helpful
> >> > here.”  Can you explain it in detail? I don't fully understand that.
> In
> >> my
> >> > opinion, non-active and active are the same, and no special treatment
> is
> >> > required.
> >> > >
> >> > > b) I agree with you, the `endTimestamp` makes sense, I will add it
> to
> >> > FLIP.
> >> > >
> >> > >
> >> > > @Yang
> >> > >
> >> > > As mentioned above, AFAK, the external system cannot support the
> >> > MARK_BLOCKLISTED action.
> >> > >
> >> > >
> >> > > Looking forward to your further feedback.
> >> > >
> >> > >
> >> > > Best,
> >> > >
> >> > > Lijie
> >> > >
> >> > >
> >> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> >> > >>
> >> > >> Thanks Lijie and Zhu for creating the proposal.
> >> > >>
> >> > >> I want to share some thoughts about Flink cluster operations.
> >> > >>
> >> > >> In the production environment, the SRE(aka Site Reliability
> Engineer)
> >> > >> already has many tools to detect the unstable nodes, which could
> take
> >> > the
> >> > >> system logs/metrics into consideration.
> >> > >> Then they use graceful-decomission in YARN and taint in K8s to
> >> prevent
> >> > new
> >> > >> allocations on these unstable nodes.
> >> > >> At last, they will evict all the containers and pods running on
> these
> >> > nodes.
> >> > >> This mechanism also works for planned maintenance. So I am afraid
> >> this
> >> > is
> >> > >> not the typical use case for FLIP-224.
> >> > >>
> >> > >> If we only support to block nodes manually, then I could not see
> >> > >> the obvious advantages compared with current SRE's approach(via
> *yarn
> >> > >> rmadmin or kubectl taint*).
> >> > >> At least, we need to have a pluggable component which could expose
> >> the
> >> > >> potential unstable nodes automatically and block them if enabled
> >> > explicitly.
> >> > >>
> >> > >>
> >> > >> Best,
> >> > >> Yang
> >> > >>
> >> > >>
> >> > >>
> >> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> >> > >>
> >> > >> > Thanks for the proposal, Lijie.
> >> > >> >
> >> > >> > This is an interesting feature and discussion, and somewhat
> related
> >> > to the
> >> > >> > design principle about how people should operate Flink.
> >> > >> >
> >> > >> > I think there are three things involved in this FLIP.
> >> > >> >      a) Detect and report the unstable node.
> >> > >> >      b) Collect the information of the unstable node and form a
> >> > blocklist.
> >> > >> >      c) Take the action to block nodes.
> >> > >> >
> >> > >> > My two cents:
> >> > >> >
> >> > >> > 1. It looks like people all agree that Flink should have c). It
> is
> >> > not only
> >> > >> > useful for cases of node failures, but also handy for some
> planned
> >> > >> > maintenance.
> >> > >> >
> >> > >> > 2. People have different opinions on b), i.e. who should be the
> >> brain
> >> > to
> >> > >> > make the decision to block a node. I think this largely depends
> on
> >> > who we
> >> > >> > talk to. Different users would probably give different answers.
> For
> >> > people
> >> > >> > who do have a centralized node health management service, let
> Flink
> >> > do just
> >> > >> > do a) and c) would be preferred. So essentially Flink would be
> one
> >> of
> >> > the
> >> > >> > sources that may detect unstable nodes, report it to that
> service,
> >> > and then
> >> > >> > take the command from that service to block the problematic
> nodes.
> >> On
> >> > the
> >> > >> > other hand, for users who do not have such a service, simply
> >> letting
> >> > Flink
> >> > >> > be clever by itself to block the suspicious nodes might be
> desired
> >> to
> >> > >> > ensure the jobs are running smoothly.
> >> > >> >
> >> > >> > So that indicates a) and b) here should be pluggable / optional.
> >> > >> >
> >> > >> > In light of this, maybe it would make sense to have something
> >> > pluggable
> >> > >> > like a UnstableNodeReporter which exposes unstable nodes
> actively.
> >> (A
> >> > more
> >> > >> > general interface should be JobInfoReporter<T> which can be used
> to
> >> > report
> >> > >> > any information of type <T>. But I'll just keep the scope
> relevant
> >> to
> >> > this
> >> > >> > FLIP here). Personally speaking, I think it is OK to have a
> default
> >> > >> > implementation of a reporter which just tells Flink to take
> action
> >> to
> >> > block
> >> > >> > problematic nodes and also unblocks them after timeout.
> >> > >> >
> >> > >> > Thanks,
> >> > >> >
> >> > >> > Jiangjie (Becket) Qin
> >> > >> >
> >> > >> >
> >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro.v.boyko@gmail.com
> >
> >> > wrote:
> >> > >> >
> >> > >> > > Thanks for good initiative, Lijie and Zhu!
> >> > >> > >
> >> > >> > > If it's possible I'd like to participate in development.
> >> > >> > >
> >> > >> > > I agree with 3rd point of Konstantin's reply - we should
> consider
> >> > to move
> >> > >> > > somehow the information of blocklisted nodes/TMs from active
> >> > >> > > ResourceManager to non-active ones. Probably storing inside
> >> > >> > > Zookeeper/Configmap might be helpful here.
> >> > >> > >
> >> > >> > > And I agree with Martijn that a lot of organizations don't want
> >> to
> >> > expose
> >> > >> > > such API for a cluster user group. But I think it's necessary
> to
> >> > have the
> >> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding
> >> incorrect
> >> > >> > > automatic behaviour.
> >> > >> > >
> >> > >> > > And another one small suggestion - I think it would be better
> to
> >> > extend
> >> > >> > the
> >> > >> > > *BlocklistedItem* class with the *endTimestamp* field and fill
> it
> >> > at the
> >> > >> > > item creation. This simple addition will allow to:
> >> > >> > >
> >> > >> > >    -
> >> > >> > >
> >> > >> > >    Provide the ability to users to setup the exact time of
> >> > blocklist end
> >> > >> > >    through RestAPI
> >> > >> > >    -
> >> > >> > >
> >> > >> > >    Not being tied to a single value of
> >> > >> > >    *cluster.resource-blacklist.item.timeout*
> >> > >> > >
> >> > >> > >
> >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
> >> chesnay@apache.org>
> >> > >> > wrote:
> >> > >> > >
> >> > >> > > > I do share the concern between blurring the lines a bit.
> >> > >> > > >
> >> > >> > > > That said, I'd prefer to not have any auto-detection and only
> >> > have an
> >> > >> > > > opt-in mechanism
> >> > >> > > > to manually block processes/nodes. To me this sounds yet
> again
> >> > like one
> >> > >> > > > of those
> >> > >> > > > magical mechanisms that will rarely work just right.
> >> > >> > > > An external system can leverage way more information after
> all.
> >> > >> > > >
> >> > >> > > > Moreover, I'm quite concerned about the complexity of this
> >> > proposal.
> >> > >> > > > Tracking on both the RM/JM side; syncing between components;
> >> > >> > adjustments
> >> > >> > > > to the
> >> > >> > > > slot and resource protocol.
> >> > >> > > >
> >> > >> > > > In a way it seems overly complicated.
> >> > >> > > >
> >> > >> > > > If we look at it purely from an active resource management
> >> > perspective,
> >> > >> > > > then there
> >> > >> > > > isn't really a need to touch the slot protocol at all (or in
> >> fact
> >> > to
> >> > >> > > > anything in the JobMaster),
> >> > >> > > > because there isn't any point in keeping around blocked TMs
> in
> >> the
> >> > >> > first
> >> > >> > > > place.
> >> > >> > > > They'd just be idling, potentially shutting down after a
> while
> >> by
> >> > the
> >> > >> > RM
> >> > >> > > > because of
> >> > >> > > > it (unless we _also_ touch that logic).
> >> > >> > > > Here the blocking of a process (be it by blocking the process
> >> or
> >> > node)
> >> > >> > is
> >> > >> > > > equivalent with shutting down the blocked process(es).
> >> > >> > > > Once the block is lifted we can just spin it back up.
> >> > >> > > >
> >> > >> > > > And I do wonder whether we couldn't apply the same line of
> >> > thinking to
> >> > >> > > > standalone resource management.
> >> > >> > > > Here being able to stop/restart a process/node manually
> should
> >> be
> >> > a
> >> > >> > core
> >> > >> > > > requirement for a Flink deployment anyway.
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> >> > >> > > > > Hi everyone,
> >> > >> > > > >
> >> > >> > > > > Thanks for creating this FLIP. I can understand the problem
> >> and
> >> > I see
> >> > >> > > > value
> >> > >> > > > > in the automatic detection and blocklisting. I do have some
> >> > concerns
> >> > >> > > with
> >> > >> > > > > the ability to manually specify to be blocked resources. I
> >> have
> >> > two
> >> > >> > > > > concerns;
> >> > >> > > > >
> >> > >> > > > > * Most organizations explicitly have a separation of
> >> concerns,
> >> > >> > meaning
> >> > >> > > > that
> >> > >> > > > > there's a group who's responsible for managing a cluster
> and
> >> > there's
> >> > >> > a
> >> > >> > > > user
> >> > >> > > > > group who uses that cluster. With the introduction of this
> >> > mechanism,
> >> > >> > > the
> >> > >> > > > > latter group now can influence the responsibility of the
> >> first
> >> > group.
> >> > >> > > So
> >> > >> > > > it
> >> > >> > > > > can be possible that someone from the user group blocks
> >> > something,
> >> > >> > > which
> >> > >> > > > > causes an outage (which could result in paging mechanism
> >> > triggering
> >> > >> > > etc)
> >> > >> > > > > which impacts the first group.
> >> > >> > > > > * How big is the group of people who can go through the
> >> process
> >> > of
> >> > >> > > > manually
> >> > >> > > > > identifying a node that isn't behaving as it should be? I
> do
> >> > think
> >> > >> > this
> >> > >> > > > > group is relatively limited. Does it then make sense to
> >> > introduce
> >> > >> > such
> >> > >> > > a
> >> > >> > > > > feature, which would only be used by a really small user
> >> group
> >> > of
> >> > >> > > Flink?
> >> > >> > > > We
> >> > >> > > > > still have to maintain, test and support such a feature.
> >> > >> > > > >
> >> > >> > > > > I'm +1 for the autodetection features, but I'm leaning
> >> towards
> >> > not
> >> > >> > > > exposing
> >> > >> > > > > this to the user group but having this available strictly
> for
> >> > cluster
> >> > >> > > > > operators. They could then also set up their
> >> > paging/metrics/logging
> >> > >> > > > system
> >> > >> > > > > to take this into account.
> >> > >> > > > >
> >> > >> > > > > Best regards,
> >> > >> > > > >
> >> > >> > > > > Martijn Visser
> >> > >> > > > > https://twitter.com/MartijnVisser82
> >> > >> > > > > https://github.com/MartijnVisser
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
> karmagyz@gmail.com
> >> >
> >> > wrote:
> >> > >> > > > >
> >> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> >> > >> > > > >>
> >> > >> > > > >> +1 for the overall proposal. Just share some cents here:
> >> > >> > > > >>
> >> > >> > > > >> - Why do we need to expose
> >> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval to
> >> the
> >> > user?
> >> > >> > > > >> I think the semantics of
> >> > `cluster.resource-blacklist.item.timeout`
> >> > >> > is
> >> > >> > > > >> sufficient for the user. How to guarantee the timeout
> >> > mechanism is
> >> > >> > > > >> Flink's internal implementation. I think it will be very
> >> > confusing
> >> > >> > and
> >> > >> > > > >> we do not need to expose it to users.
> >> > >> > > > >>
> >> > >> > > > >> - ResourceManager can notify the exception of a task
> >> manager to
> >> > >> > > > >> `BlacklistHandler` as well.
> >> > >> > > > >> For example, the slot allocation might fail in case the
> >> target
> >> > task
> >> > >> > > > >> manager is busy or has a network jitter. I don't mean we
> >> need
> >> > to
> >> > >> > cover
> >> > >> > > > >> this case in this version, but we can also open a
> >> > `notifyException`
> >> > >> > in
> >> > >> > > > >> `ResourceManagerBlacklistHandler`.
> >> > >> > > > >>
> >> > >> > > > >> - Before we sync the blocklist to ResourceManager, will
> the
> >> > slot of
> >> > >> > a
> >> > >> > > > >> blocked task manager continues to be released and
> allocated?
> >> > >> > > > >>
> >> > >> > > > >> Best,
> >> > >> > > > >> Yangze Guo
> >> > >> > > > >>
> >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> >> > >> > wangdachui9501@gmail.com>
> >> > >> > > > >> wrote:
> >> > >> > > > >>> Hi Konstantin,
> >> > >> > > > >>>
> >> > >> > > > >>> Thanks for your feedback. I will response your 4 remarks:
> >> > >> > > > >>>
> >> > >> > > > >>>
> >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I think
> >> > “BlockList”
> >> > >> > is
> >> > >> > > > >> good
> >> > >> > > > >>> enough, and I will change it in FLIP.
> >> > >> > > > >>>
> >> > >> > > > >>>
> >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea. Based
> >> on
> >> > the
> >> > >> > > > above, I
> >> > >> > > > >>> would change REST API as following:
> >> > >> > > > >>>
> >> > >> > > > >>> POST/GET <host>/blocklist/nodes
> >> > >> > > > >>>
> >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> >> > >> > > > >>>
> >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> >> > >> > > > >>>
> >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> >> > >> > > > >>>
> >> > >> > > > >>>
> >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that all
> >> task
> >> > >> > managers
> >> > >> > > > on
> >> > >> > > > >>> this node are blocklisted. All slots on these TMs are not
> >> > >> > available.
> >> > >> > > > This
> >> > >> > > > >>> is actually a bit like TM losts, but these TMs are not
> >> really
> >> > lost,
> >> > >> > > > they
> >> > >> > > > >>> are in an unavailable status, and they are still
> registered
> >> > in this
> >> > >> > > > flink
> >> > >> > > > >>> cluster. They will be available again once the
> >> corresponding
> >> > >> > > blocklist
> >> > >> > > > >> item
> >> > >> > > > >>> is removed. This behavior is the same in
> active/non-active
> >> > >> > clusters.
> >> > >> > > > >>> However in the active clusters, these TMs may be released
> >> due
> >> > to
> >> > >> > idle
> >> > >> > > > >>> timeouts.
> >> > >> > > > >>>
> >> > >> > > > >>>
> >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons
> >> are
> >> > as
> >> > >> > > > >> following:
> >> > >> > > > >>> a) The timeout will not affect users adding or removing
> >> items
> >> > via
> >> > >> > > REST
> >> > >> > > > >> API,
> >> > >> > > > >>> and users can disable it by configuring it to
> >> Long.MAX_VALUE .
> >> > >> > > > >>>
> >> > >> > > > >>> b) Some node problems can recover after a period of time
> >> > (such as
> >> > >> > > > machine
> >> > >> > > > >>> hotspots), in which case users may prefer that Flink can
> do
> >> > this
> >> > >> > > > >>> automatically instead of requiring the user to do it
> >> manually.
> >> > >> > > > >>>
> >> > >> > > > >>>
> >> > >> > > > >>> Best,
> >> > >> > > > >>>
> >> > >> > > > >>> Lijie
> >> > >> > > > >>>
> >> > >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三
> >> 19:23写道：
> >> > >> > > > >>>
> >> > >> > > > >>>> Hi Lijie,
> >> > >> > > > >>>>
> >> > >> > > > >>>> I think, this makes sense and +1 to only support
> manually
> >> > blocking
> >> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies
> can
> >> > also be
> >> > >> > > > >>>> maintained outside of Apache Flink.
> >> > >> > > > >>>>
> >> > >> > > > >>>> A few remarks:
> >> > >> > > > >>>>
> >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to the
> >> > >> > controversy
> >> > >> > > > >> around
> >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this
> >> topic a
> >> > >> > while
> >> > >> > > > >> back
> >> > >> > > > >>>> and there was generally a consensus to avoid the term
> >> > blacklist &
> >> > >> > > > >> whitelist
> >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or
> "quarantined"
> >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different
> >> design
> >> > as
> >> > >> > verbs
> >> > >> > > > >> like
> >> > >> > > > >>>> add/remove often considered an anti-pattern for REST
> APIs.
> >> > POST
> >> > >> > on a
> >> > >> > > > >> list
> >> > >> > > > >>>> item is generally the standard to add items. DELETE on
> the
> >> > >> > > individual
> >> > >> > > > >>>> resource is standard to remove an item.
> >> > >> > > > >>>>
> >> > >> > > > >>>> POST <host>/quarantine/items
> >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> >> > >> > > > >>>>
> >> > >> > > > >>>> We could also consider to separate taskmanagers and
> nodes
> >> in
> >> > the
> >> > >> > > REST
> >> > >> > > > >> API
> >> > >> > > > >>>> (and internal data structures). Any opinion on this?
> >> > >> > > > >>>>
> >> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> >> > >> > > > >>>>
> >> > >> > > > >>>> 3) How would blocking nodes behave with non-active
> >> resource
> >> > >> > > managers,
> >> > >> > > > >> i.e.
> >> > >> > > > >>>> standalone or reactive mode?
> >> > >> > > > >>>>
> >> > >> > > > >>>> 4) To keep the implementation even more minimal, do we
> >> need
> >> > the
> >> > >> > > > timeout
> >> > >> > > > >>>> behavior? If items are added/removed manually we could
> >> > delegate
> >> > >> > this
> >> > >> > > > >> to the
> >> > >> > > > >>>> user easily. In my opinion the timeout behavior would
> >> better
> >> > fit
> >> > >> > > into
> >> > >> > > > >>>> specific strategies at a later point.
> >> > >> > > > >>>>
> >> > >> > > > >>>> Looking forward to your thoughts.
> >> > >> > > > >>>>
> >> > >> > > > >>>> Cheers and thank you,
> >> > >> > > > >>>>
> >> > >> > > > >>>> Konstantin
> >> > >> > > > >>>>
> >> > >> > > > >>>> [1]
> >> > >> > > > >>>>
> >> > >> > > > >>>>
> >> > >> > > > >>
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> >
> >>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >> > >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> >> > >> > > > >>>>
> >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> >> > >> > > > >>>> wangdachui9501@gmail.com>:
> >> > >> > > > >>>>
> >> > >> > > > >>>>> Hi all,
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> Flink job failures may happen due to cluster node
> issues
> >> > >> > > > >> (insufficient
> >> > >> > > > >>>> disk
> >> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink will
> >> > take care
> >> > >> > > of
> >> > >> > > > >> the
> >> > >> > > > >>>>> failures and redeploy the tasks. However, due to data
> >> > locality
> >> > >> > and
> >> > >> > > > >>>> limited
> >> > >> > > > >>>>> resources, the new tasks are very likely to be
> redeployed
> >> > to the
> >> > >> > > same
> >> > >> > > > >>>>> nodes, which will result in continuous task
> abnormalities
> >> > and
> >> > >> > > affect
> >> > >> > > > >> job
> >> > >> > > > >>>>> progress.
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> Currently, Flink users need to manually identify the
> >> > problematic
> >> > >> > > > >> node and
> >> > >> > > > >>>>> take it offline to solve this problem. But this
> approach
> >> has
> >> > >> > > > >> following
> >> > >> > > > >>>>> disadvantages:
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process. Users
> >> may
> >> > need
> >> > >> > to
> >> > >> > > > >>>> contact
> >> > >> > > > >>>>> cluster administors to do this. The operation can even
> be
> >> > >> > dangerous
> >> > >> > > > >> and
> >> > >> > > > >>>> not
> >> > >> > > > >>>>> allowed during some important business events.
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> 2. Identifying and solving this kind of problems
> manually
> >> > would
> >> > >> > be
> >> > >> > > > >> slow
> >> > >> > > > >>>> and
> >> > >> > > > >>>>> a waste of human resources.
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to
> >> introduce a
> >> > >> > > blacklist
> >> > >> > > > >>>>> mechanism for Flink to filter out problematic
> resources.
> >> > >> > > > >>>>>
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking
> forward
> >> > to your
> >> > >> > > > >>>> feedback.
> >> > >> > > > >>>>> [1]
> >> > >> > > > >>>>>
> >> > >> > > > >>>>>
> >> > >> > > > >>
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> Best,
> >> > >> > > > >>>>>
> >> > >> > > > >>>>> Lijie
> >> > >> > > > >>>>>
> >> > >> > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> >
> >>
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Thanks for your feedback, Jiangang and Martijn.

@Jiangang


> For auto-detecting, I wonder how to make the strategy and mark a node
blocked?

In fact, we currently plan to not support auto-detection in this FLIP. The
part about auto-detection may be continued in a separate FLIP in the
future. Some guys have the same concerns as you, and the correctness and
necessity of auto-detection may require further discussion in the future.

> In session mode, multi jobs can fail on the same bad node and the node
should be marked blocked.
By design, the blocklist information will be shared among all jobs in a
cluster/session. The JM will sync blocklist information with RM.

@Martijn

> I agree with Yang Wang on this.
As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just limits
the load of the node and does not  kill all the processes on it) is also
important, and we think that external systems (*yarn rmadmin or kubectl
taint*) cannot support it. So we think it makes sense even only *manually*.

> I also agree with Chesnay that magical mechanisms are indeed super hard
to get right.
Yes, as you see, Jiangang(and a few others) have the same concern.
However, we currently plan to not support auto-detection in this FLIP, and
only *manually*. In addition, I'd like to say that the FLIP provides a
mechanism to support MARK_BLOCKLISTED and MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
the auto-detection may be done by external systems.

Best,
Lijie

Martijn Visser <ma...@ververica.com> 于2022年5月6日周五 19:04写道：

> > If we only support to block nodes manually, then I could not see
> the obvious advantages compared with current SRE's approach(via *yarn
> rmadmin or kubectl taint*).
>
> I agree with Yang Wang on this.
>
> >  To me this sounds yet again like one of those magical mechanisms that
> will rarely work just right.
>
> I also agree with Chesnay that magical mechanisms are indeed super hard to
> get right.
>
> Best regards,
>
> Martijn
>
> On Fri, 6 May 2022 at 12:03, Jiangang Liu <li...@gmail.com>
> wrote:
>
>> Thanks for the valuable design. The auto-detecting can decrease great work
>> for us. We have implemented the similar feature in our inner flink
>> version.
>> Below is something that I care about:
>>
>>    1. For auto-detecting, I wonder how to make the strategy and mark a
>> node
>>    blocked? Sometimes the blocked node is hard to be detected, for
>> example,
>>    the upper node or the down node will be blocked when network
>> unreachable.
>>    2. I see that the strategy is made in JobMaster side. How about
>>    implementing the similar logic in resource manager? In session mode,
>> multi
>>    jobs can fail on the same bad node and the node should be marked
>> blocked.
>>    If the job makes the strategy, the node may be not marked blocked if
>> the
>>    fail times don't exceed the threshold.
>>
>>
>> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>>
>> > Thank you for all your feedback!
>> >
>> > Besides the answers from Lijie, I'd like to share some of my thoughts:
>> > 1. Whether to enable automatical blocklist
>> > Generally speaking, it is not a goal of FLIP-224.
>> > The automatical way should be something built upon the blocklist
>> > mechanism and well decoupled. It was designed to be a configurable
>> > blocklist strategy, but I think we can further decouple it by
>> > introducing a abnormal node detector, as Becket suggested, which just
>> > uses the blocklist mechanism once bad nodes are detected. However, it
>> > should be a separate FLIP with further dev discussions and feedback
>> > from users. I also agree with Becket that different users have different
>> > requirements, and we should listen to them.
>> >
>> > 2. Is it enough to just take away abnormal nodes externally
>> > My answer is no. As Lijie has mentioned, we need a way to avoid
>> > deploying tasks to temporary hot nodes. In this case, users may just
>> > want to limit the load of the node and do not want to kill all the
>> > processes on it. Another case is the speculative execution[1] which
>> > may also leverage this feature to avoid starting mirror tasks on slow
>> > nodes.
>> >
>> > Thanks,
>> > Zhu
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> >
>> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
>> >
>> > >
>> > > Hi everyone,
>> > >
>> > >
>> > > Thanks for your feedback.
>> > >
>> > >
>> > > There's one detail that I'd like to re-emphasize here because it can
>> > affect the value and design of the blocklist mechanism (perhaps I should
>> > highlight it in the FLIP). We propose two actions in FLIP:
>> > >
>> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked.
>> > Future slots should not be allocated from the blocked task manager or
>> node.
>> > But slots that are already allocated will not be affected. A typical
>> > application scenario is to mitigate machine hotspots. In this case, we
>> hope
>> > that subsequent resource allocations will not be on the hot machine, but
>> > tasks currently running on it should not be affected.
>> > >
>> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node
>> as
>> > blocked, and evacuate all tasks on it. Evacuated tasks will be
>> restarted on
>> > non-blocked task managers.
>> > >
>> > > For the above 2 actions, the former may more highlight the meaning of
>> > this FLIP, because the external system cannot do that.
>> > >
>> > >
>> > > Regarding *Manually* and *Automatically*, I basically agree with
>> @Becket
>> > Qin: different users have different answers. Not all users’ deployment
>> > environments have a special external system that can perform the anomaly
>> > detection. In addition, adding pluggable/optional auto-detection doesn't
>> > require much extra work on top of manual specification.
>> > >
>> > >
>> > > I will answer your other questions one by one.
>> > >
>> > >
>> > > @Yangze
>> > >
>> > > a) I think you are right, we do not need to expose the
>> > `cluster.resource-blocklist.item.timeout-check-interval` to users.
>> > >
>> > > b) We can abstract the `notifyException` to a separate interface
>> (maybe
>> > BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can
>> > implement it in the future.
>> > >
>> > >
>> > > @Martijn
>> > >
>> > > a) I also think the manual blocking should be done by cluster
>> operators.
>> > >
>> > > b) I think manual blocking makes sense, because according to my
>> > experience, users are often the first to perceive the machine problems
>> > (because of job failover or delay), and they will contact cluster
>> operators
>> > to solve it, or even tell the cluster operators which machine is
>> > problematic. From this point of view, I think the people who really need
>> > the manual blocking are the users, and it’s just performed by the
>> cluster
>> > operator, so I think the manual blocking makes sense.
>> > >
>> > >
>> > > @Chesnay
>> > >
>> > > We need to touch the logic of JM/SlotPool, because for
>> MARK_BLOCKLISTED
>> > , we need to know whether the slot is blocklisted when the task is
>> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot
>> > directly to avoid assigning other tasks (of this job) on it. If we only
>> > maintain the blocklist information on the RM, JM needs to retrieve it by
>> > RPC. I think the performance overhead of that is relatively large, so I
>> > think it's worth maintaining the blocklist information on the JM side
>> and
>> > syncing them.
>> > >
>> > >
>> > > @Роман
>> > >
>> > >     a) “Probably storing inside Zookeeper/Configmap might be helpful
>> > here.”  Can you explain it in detail? I don't fully understand that. In
>> my
>> > opinion, non-active and active are the same, and no special treatment is
>> > required.
>> > >
>> > > b) I agree with you, the `endTimestamp` makes sense, I will add it to
>> > FLIP.
>> > >
>> > >
>> > > @Yang
>> > >
>> > > As mentioned above, AFAK, the external system cannot support the
>> > MARK_BLOCKLISTED action.
>> > >
>> > >
>> > > Looking forward to your further feedback.
>> > >
>> > >
>> > > Best,
>> > >
>> > > Lijie
>> > >
>> > >
>> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
>> > >>
>> > >> Thanks Lijie and Zhu for creating the proposal.
>> > >>
>> > >> I want to share some thoughts about Flink cluster operations.
>> > >>
>> > >> In the production environment, the SRE(aka Site Reliability Engineer)
>> > >> already has many tools to detect the unstable nodes, which could take
>> > the
>> > >> system logs/metrics into consideration.
>> > >> Then they use graceful-decomission in YARN and taint in K8s to
>> prevent
>> > new
>> > >> allocations on these unstable nodes.
>> > >> At last, they will evict all the containers and pods running on these
>> > nodes.
>> > >> This mechanism also works for planned maintenance. So I am afraid
>> this
>> > is
>> > >> not the typical use case for FLIP-224.
>> > >>
>> > >> If we only support to block nodes manually, then I could not see
>> > >> the obvious advantages compared with current SRE's approach(via *yarn
>> > >> rmadmin or kubectl taint*).
>> > >> At least, we need to have a pluggable component which could expose
>> the
>> > >> potential unstable nodes automatically and block them if enabled
>> > explicitly.
>> > >>
>> > >>
>> > >> Best,
>> > >> Yang
>> > >>
>> > >>
>> > >>
>> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
>> > >>
>> > >> > Thanks for the proposal, Lijie.
>> > >> >
>> > >> > This is an interesting feature and discussion, and somewhat related
>> > to the
>> > >> > design principle about how people should operate Flink.
>> > >> >
>> > >> > I think there are three things involved in this FLIP.
>> > >> >      a) Detect and report the unstable node.
>> > >> >      b) Collect the information of the unstable node and form a
>> > blocklist.
>> > >> >      c) Take the action to block nodes.
>> > >> >
>> > >> > My two cents:
>> > >> >
>> > >> > 1. It looks like people all agree that Flink should have c). It is
>> > not only
>> > >> > useful for cases of node failures, but also handy for some planned
>> > >> > maintenance.
>> > >> >
>> > >> > 2. People have different opinions on b), i.e. who should be the
>> brain
>> > to
>> > >> > make the decision to block a node. I think this largely depends on
>> > who we
>> > >> > talk to. Different users would probably give different answers. For
>> > people
>> > >> > who do have a centralized node health management service, let Flink
>> > do just
>> > >> > do a) and c) would be preferred. So essentially Flink would be one
>> of
>> > the
>> > >> > sources that may detect unstable nodes, report it to that service,
>> > and then
>> > >> > take the command from that service to block the problematic nodes.
>> On
>> > the
>> > >> > other hand, for users who do not have such a service, simply
>> letting
>> > Flink
>> > >> > be clever by itself to block the suspicious nodes might be desired
>> to
>> > >> > ensure the jobs are running smoothly.
>> > >> >
>> > >> > So that indicates a) and b) here should be pluggable / optional.
>> > >> >
>> > >> > In light of this, maybe it would make sense to have something
>> > pluggable
>> > >> > like a UnstableNodeReporter which exposes unstable nodes actively.
>> (A
>> > more
>> > >> > general interface should be JobInfoReporter<T> which can be used to
>> > report
>> > >> > any information of type <T>. But I'll just keep the scope relevant
>> to
>> > this
>> > >> > FLIP here). Personally speaking, I think it is OK to have a default
>> > >> > implementation of a reporter which just tells Flink to take action
>> to
>> > block
>> > >> > problematic nodes and also unblocks them after timeout.
>> > >> >
>> > >> > Thanks,
>> > >> >
>> > >> > Jiangjie (Becket) Qin
>> > >> >
>> > >> >
>> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com>
>> > wrote:
>> > >> >
>> > >> > > Thanks for good initiative, Lijie and Zhu!
>> > >> > >
>> > >> > > If it's possible I'd like to participate in development.
>> > >> > >
>> > >> > > I agree with 3rd point of Konstantin's reply - we should consider
>> > to move
>> > >> > > somehow the information of blocklisted nodes/TMs from active
>> > >> > > ResourceManager to non-active ones. Probably storing inside
>> > >> > > Zookeeper/Configmap might be helpful here.
>> > >> > >
>> > >> > > And I agree with Martijn that a lot of organizations don't want
>> to
>> > expose
>> > >> > > such API for a cluster user group. But I think it's necessary to
>> > have the
>> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding
>> incorrect
>> > >> > > automatic behaviour.
>> > >> > >
>> > >> > > And another one small suggestion - I think it would be better to
>> > extend
>> > >> > the
>> > >> > > *BlocklistedItem* class with the *endTimestamp* field and fill it
>> > at the
>> > >> > > item creation. This simple addition will allow to:
>> > >> > >
>> > >> > >    -
>> > >> > >
>> > >> > >    Provide the ability to users to setup the exact time of
>> > blocklist end
>> > >> > >    through RestAPI
>> > >> > >    -
>> > >> > >
>> > >> > >    Not being tied to a single value of
>> > >> > >    *cluster.resource-blacklist.item.timeout*
>> > >> > >
>> > >> > >
>> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <
>> chesnay@apache.org>
>> > >> > wrote:
>> > >> > >
>> > >> > > > I do share the concern between blurring the lines a bit.
>> > >> > > >
>> > >> > > > That said, I'd prefer to not have any auto-detection and only
>> > have an
>> > >> > > > opt-in mechanism
>> > >> > > > to manually block processes/nodes. To me this sounds yet again
>> > like one
>> > >> > > > of those
>> > >> > > > magical mechanisms that will rarely work just right.
>> > >> > > > An external system can leverage way more information after all.
>> > >> > > >
>> > >> > > > Moreover, I'm quite concerned about the complexity of this
>> > proposal.
>> > >> > > > Tracking on both the RM/JM side; syncing between components;
>> > >> > adjustments
>> > >> > > > to the
>> > >> > > > slot and resource protocol.
>> > >> > > >
>> > >> > > > In a way it seems overly complicated.
>> > >> > > >
>> > >> > > > If we look at it purely from an active resource management
>> > perspective,
>> > >> > > > then there
>> > >> > > > isn't really a need to touch the slot protocol at all (or in
>> fact
>> > to
>> > >> > > > anything in the JobMaster),
>> > >> > > > because there isn't any point in keeping around blocked TMs in
>> the
>> > >> > first
>> > >> > > > place.
>> > >> > > > They'd just be idling, potentially shutting down after a while
>> by
>> > the
>> > >> > RM
>> > >> > > > because of
>> > >> > > > it (unless we _also_ touch that logic).
>> > >> > > > Here the blocking of a process (be it by blocking the process
>> or
>> > node)
>> > >> > is
>> > >> > > > equivalent with shutting down the blocked process(es).
>> > >> > > > Once the block is lifted we can just spin it back up.
>> > >> > > >
>> > >> > > > And I do wonder whether we couldn't apply the same line of
>> > thinking to
>> > >> > > > standalone resource management.
>> > >> > > > Here being able to stop/restart a process/node manually should
>> be
>> > a
>> > >> > core
>> > >> > > > requirement for a Flink deployment anyway.
>> > >> > > >
>> > >> > > >
>> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
>> > >> > > > > Hi everyone,
>> > >> > > > >
>> > >> > > > > Thanks for creating this FLIP. I can understand the problem
>> and
>> > I see
>> > >> > > > value
>> > >> > > > > in the automatic detection and blocklisting. I do have some
>> > concerns
>> > >> > > with
>> > >> > > > > the ability to manually specify to be blocked resources. I
>> have
>> > two
>> > >> > > > > concerns;
>> > >> > > > >
>> > >> > > > > * Most organizations explicitly have a separation of
>> concerns,
>> > >> > meaning
>> > >> > > > that
>> > >> > > > > there's a group who's responsible for managing a cluster and
>> > there's
>> > >> > a
>> > >> > > > user
>> > >> > > > > group who uses that cluster. With the introduction of this
>> > mechanism,
>> > >> > > the
>> > >> > > > > latter group now can influence the responsibility of the
>> first
>> > group.
>> > >> > > So
>> > >> > > > it
>> > >> > > > > can be possible that someone from the user group blocks
>> > something,
>> > >> > > which
>> > >> > > > > causes an outage (which could result in paging mechanism
>> > triggering
>> > >> > > etc)
>> > >> > > > > which impacts the first group.
>> > >> > > > > * How big is the group of people who can go through the
>> process
>> > of
>> > >> > > > manually
>> > >> > > > > identifying a node that isn't behaving as it should be? I do
>> > think
>> > >> > this
>> > >> > > > > group is relatively limited. Does it then make sense to
>> > introduce
>> > >> > such
>> > >> > > a
>> > >> > > > > feature, which would only be used by a really small user
>> group
>> > of
>> > >> > > Flink?
>> > >> > > > We
>> > >> > > > > still have to maintain, test and support such a feature.
>> > >> > > > >
>> > >> > > > > I'm +1 for the autodetection features, but I'm leaning
>> towards
>> > not
>> > >> > > > exposing
>> > >> > > > > this to the user group but having this available strictly for
>> > cluster
>> > >> > > > > operators. They could then also set up their
>> > paging/metrics/logging
>> > >> > > > system
>> > >> > > > > to take this into account.
>> > >> > > > >
>> > >> > > > > Best regards,
>> > >> > > > >
>> > >> > > > > Martijn Visser
>> > >> > > > > https://twitter.com/MartijnVisser82
>> > >> > > > > https://github.com/MartijnVisser
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <karmagyz@gmail.com
>> >
>> > wrote:
>> > >> > > > >
>> > >> > > > >> Thanks for driving this, Zhu and Lijie.
>> > >> > > > >>
>> > >> > > > >> +1 for the overall proposal. Just share some cents here:
>> > >> > > > >>
>> > >> > > > >> - Why do we need to expose
>> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval to
>> the
>> > user?
>> > >> > > > >> I think the semantics of
>> > `cluster.resource-blacklist.item.timeout`
>> > >> > is
>> > >> > > > >> sufficient for the user. How to guarantee the timeout
>> > mechanism is
>> > >> > > > >> Flink's internal implementation. I think it will be very
>> > confusing
>> > >> > and
>> > >> > > > >> we do not need to expose it to users.
>> > >> > > > >>
>> > >> > > > >> - ResourceManager can notify the exception of a task
>> manager to
>> > >> > > > >> `BlacklistHandler` as well.
>> > >> > > > >> For example, the slot allocation might fail in case the
>> target
>> > task
>> > >> > > > >> manager is busy or has a network jitter. I don't mean we
>> need
>> > to
>> > >> > cover
>> > >> > > > >> this case in this version, but we can also open a
>> > `notifyException`
>> > >> > in
>> > >> > > > >> `ResourceManagerBlacklistHandler`.
>> > >> > > > >>
>> > >> > > > >> - Before we sync the blocklist to ResourceManager, will the
>> > slot of
>> > >> > a
>> > >> > > > >> blocked task manager continues to be released and allocated?
>> > >> > > > >>
>> > >> > > > >> Best,
>> > >> > > > >> Yangze Guo
>> > >> > > > >>
>> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
>> > >> > wangdachui9501@gmail.com>
>> > >> > > > >> wrote:
>> > >> > > > >>> Hi Konstantin,
>> > >> > > > >>>
>> > >> > > > >>> Thanks for your feedback. I will response your 4 remarks:
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> 1) Thanks for reminding me of the controversy. I think
>> > “BlockList”
>> > >> > is
>> > >> > > > >> good
>> > >> > > > >>> enough, and I will change it in FLIP.
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> 2) Your suggestion for the REST API is a good idea. Based
>> on
>> > the
>> > >> > > > above, I
>> > >> > > > >>> would change REST API as following:
>> > >> > > > >>>
>> > >> > > > >>> POST/GET <host>/blocklist/nodes
>> > >> > > > >>>
>> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
>> > >> > > > >>>
>> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
>> > >> > > > >>>
>> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that all
>> task
>> > >> > managers
>> > >> > > > on
>> > >> > > > >>> this node are blocklisted. All slots on these TMs are not
>> > >> > available.
>> > >> > > > This
>> > >> > > > >>> is actually a bit like TM losts, but these TMs are not
>> really
>> > lost,
>> > >> > > > they
>> > >> > > > >>> are in an unavailable status, and they are still registered
>> > in this
>> > >> > > > flink
>> > >> > > > >>> cluster. They will be available again once the
>> corresponding
>> > >> > > blocklist
>> > >> > > > >> item
>> > >> > > > >>> is removed. This behavior is the same in active/non-active
>> > >> > clusters.
>> > >> > > > >>> However in the active clusters, these TMs may be released
>> due
>> > to
>> > >> > idle
>> > >> > > > >>> timeouts.
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons
>> are
>> > as
>> > >> > > > >> following:
>> > >> > > > >>> a) The timeout will not affect users adding or removing
>> items
>> > via
>> > >> > > REST
>> > >> > > > >> API,
>> > >> > > > >>> and users can disable it by configuring it to
>> Long.MAX_VALUE .
>> > >> > > > >>>
>> > >> > > > >>> b) Some node problems can recover after a period of time
>> > (such as
>> > >> > > > machine
>> > >> > > > >>> hotspots), in which case users may prefer that Flink can do
>> > this
>> > >> > > > >>> automatically instead of requiring the user to do it
>> manually.
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> Best,
>> > >> > > > >>>
>> > >> > > > >>> Lijie
>> > >> > > > >>>
>> > >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三
>> 19:23写道：
>> > >> > > > >>>
>> > >> > > > >>>> Hi Lijie,
>> > >> > > > >>>>
>> > >> > > > >>>> I think, this makes sense and +1 to only support manually
>> > blocking
>> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies can
>> > also be
>> > >> > > > >>>> maintained outside of Apache Flink.
>> > >> > > > >>>>
>> > >> > > > >>>> A few remarks:
>> > >> > > > >>>>
>> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to the
>> > >> > controversy
>> > >> > > > >> around
>> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this
>> topic a
>> > >> > while
>> > >> > > > >> back
>> > >> > > > >>>> and there was generally a consensus to avoid the term
>> > blacklist &
>> > >> > > > >> whitelist
>> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
>> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different
>> design
>> > as
>> > >> > verbs
>> > >> > > > >> like
>> > >> > > > >>>> add/remove often considered an anti-pattern for REST APIs.
>> > POST
>> > >> > on a
>> > >> > > > >> list
>> > >> > > > >>>> item is generally the standard to add items. DELETE on the
>> > >> > > individual
>> > >> > > > >>>> resource is standard to remove an item.
>> > >> > > > >>>>
>> > >> > > > >>>> POST <host>/quarantine/items
>> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
>> > >> > > > >>>>
>> > >> > > > >>>> We could also consider to separate taskmanagers and nodes
>> in
>> > the
>> > >> > > REST
>> > >> > > > >> API
>> > >> > > > >>>> (and internal data structures). Any opinion on this?
>> > >> > > > >>>>
>> > >> > > > >>>> POST/GET <host>/quarantine/nodes
>> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
>> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
>> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
>> > >> > > > >>>>
>> > >> > > > >>>> 3) How would blocking nodes behave with non-active
>> resource
>> > >> > > managers,
>> > >> > > > >> i.e.
>> > >> > > > >>>> standalone or reactive mode?
>> > >> > > > >>>>
>> > >> > > > >>>> 4) To keep the implementation even more minimal, do we
>> need
>> > the
>> > >> > > > timeout
>> > >> > > > >>>> behavior? If items are added/removed manually we could
>> > delegate
>> > >> > this
>> > >> > > > >> to the
>> > >> > > > >>>> user easily. In my opinion the timeout behavior would
>> better
>> > fit
>> > >> > > into
>> > >> > > > >>>> specific strategies at a later point.
>> > >> > > > >>>>
>> > >> > > > >>>> Looking forward to your thoughts.
>> > >> > > > >>>>
>> > >> > > > >>>> Cheers and thank you,
>> > >> > > > >>>>
>> > >> > > > >>>> Konstantin
>> > >> > > > >>>>
>> > >> > > > >>>> [1]
>> > >> > > > >>>>
>> > >> > > > >>>>
>> > >> > > > >>
>> > >> > > >
>> > >> > >
>> > >> >
>> >
>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>> > >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
>> > >> > > > >>>>
>> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
>> > >> > > > >>>> wangdachui9501@gmail.com>:
>> > >> > > > >>>>
>> > >> > > > >>>>> Hi all,
>> > >> > > > >>>>>
>> > >> > > > >>>>> Flink job failures may happen due to cluster node issues
>> > >> > > > >> (insufficient
>> > >> > > > >>>> disk
>> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink will
>> > take care
>> > >> > > of
>> > >> > > > >> the
>> > >> > > > >>>>> failures and redeploy the tasks. However, due to data
>> > locality
>> > >> > and
>> > >> > > > >>>> limited
>> > >> > > > >>>>> resources, the new tasks are very likely to be redeployed
>> > to the
>> > >> > > same
>> > >> > > > >>>>> nodes, which will result in continuous task abnormalities
>> > and
>> > >> > > affect
>> > >> > > > >> job
>> > >> > > > >>>>> progress.
>> > >> > > > >>>>>
>> > >> > > > >>>>> Currently, Flink users need to manually identify the
>> > problematic
>> > >> > > > >> node and
>> > >> > > > >>>>> take it offline to solve this problem. But this approach
>> has
>> > >> > > > >> following
>> > >> > > > >>>>> disadvantages:
>> > >> > > > >>>>>
>> > >> > > > >>>>> 1. Taking a node offline can be a heavy process. Users
>> may
>> > need
>> > >> > to
>> > >> > > > >>>> contact
>> > >> > > > >>>>> cluster administors to do this. The operation can even be
>> > >> > dangerous
>> > >> > > > >> and
>> > >> > > > >>>> not
>> > >> > > > >>>>> allowed during some important business events.
>> > >> > > > >>>>>
>> > >> > > > >>>>> 2. Identifying and solving this kind of problems manually
>> > would
>> > >> > be
>> > >> > > > >> slow
>> > >> > > > >>>> and
>> > >> > > > >>>>> a waste of human resources.
>> > >> > > > >>>>>
>> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to
>> introduce a
>> > >> > > blacklist
>> > >> > > > >>>>> mechanism for Flink to filter out problematic resources.
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking forward
>> > to your
>> > >> > > > >>>> feedback.
>> > >> > > > >>>>> [1]
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>
>> > >> > > >
>> > >> > >
>> > >> >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>> > >> > > > >>>>>
>> > >> > > > >>>>> Best,
>> > >> > > > >>>>>
>> > >> > > > >>>>> Lijie
>> > >> > > > >>>>>
>> > >> > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> >
>>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Martijn Visser <ma...@ververica.com>.

> If we only support to block nodes manually, then I could not see
the obvious advantages compared with current SRE's approach(via *yarn
rmadmin or kubectl taint*).

I agree with Yang Wang on this.

>  To me this sounds yet again like one of those magical mechanisms that
will rarely work just right.

I also agree with Chesnay that magical mechanisms are indeed super hard to
get right.

Best regards,

Martijn

On Fri, 6 May 2022 at 12:03, Jiangang Liu <li...@gmail.com> wrote:

> Thanks for the valuable design. The auto-detecting can decrease great work
> for us. We have implemented the similar feature in our inner flink version.
> Below is something that I care about:
>
>    1. For auto-detecting, I wonder how to make the strategy and mark a node
>    blocked? Sometimes the blocked node is hard to be detected, for example,
>    the upper node or the down node will be blocked when network
> unreachable.
>    2. I see that the strategy is made in JobMaster side. How about
>    implementing the similar logic in resource manager? In session mode,
> multi
>    jobs can fail on the same bad node and the node should be marked
> blocked.
>    If the job makes the strategy, the node may be not marked blocked if the
>    fail times don't exceed the threshold.
>
>
> Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：
>
> > Thank you for all your feedback!
> >
> > Besides the answers from Lijie, I'd like to share some of my thoughts:
> > 1. Whether to enable automatical blocklist
> > Generally speaking, it is not a goal of FLIP-224.
> > The automatical way should be something built upon the blocklist
> > mechanism and well decoupled. It was designed to be a configurable
> > blocklist strategy, but I think we can further decouple it by
> > introducing a abnormal node detector, as Becket suggested, which just
> > uses the blocklist mechanism once bad nodes are detected. However, it
> > should be a separate FLIP with further dev discussions and feedback
> > from users. I also agree with Becket that different users have different
> > requirements, and we should listen to them.
> >
> > 2. Is it enough to just take away abnormal nodes externally
> > My answer is no. As Lijie has mentioned, we need a way to avoid
> > deploying tasks to temporary hot nodes. In this case, users may just
> > want to limit the load of the node and do not want to kill all the
> > processes on it. Another case is the speculative execution[1] which
> > may also leverage this feature to avoid starting mirror tasks on slow
> > nodes.
> >
> > Thanks,
> > Zhu
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >
> > Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
> >
> > >
> > > Hi everyone,
> > >
> > >
> > > Thanks for your feedback.
> > >
> > >
> > > There's one detail that I'd like to re-emphasize here because it can
> > affect the value and design of the blocklist mechanism (perhaps I should
> > highlight it in the FLIP). We propose two actions in FLIP:
> > >
> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked.
> > Future slots should not be allocated from the blocked task manager or
> node.
> > But slots that are already allocated will not be affected. A typical
> > application scenario is to mitigate machine hotspots. In this case, we
> hope
> > that subsequent resource allocations will not be on the hot machine, but
> > tasks currently running on it should not be affected.
> > >
> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node
> as
> > blocked, and evacuate all tasks on it. Evacuated tasks will be restarted
> on
> > non-blocked task managers.
> > >
> > > For the above 2 actions, the former may more highlight the meaning of
> > this FLIP, because the external system cannot do that.
> > >
> > >
> > > Regarding *Manually* and *Automatically*, I basically agree with
> @Becket
> > Qin: different users have different answers. Not all users’ deployment
> > environments have a special external system that can perform the anomaly
> > detection. In addition, adding pluggable/optional auto-detection doesn't
> > require much extra work on top of manual specification.
> > >
> > >
> > > I will answer your other questions one by one.
> > >
> > >
> > > @Yangze
> > >
> > > a) I think you are right, we do not need to expose the
> > `cluster.resource-blocklist.item.timeout-check-interval` to users.
> > >
> > > b) We can abstract the `notifyException` to a separate interface (maybe
> > BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can
> > implement it in the future.
> > >
> > >
> > > @Martijn
> > >
> > > a) I also think the manual blocking should be done by cluster
> operators.
> > >
> > > b) I think manual blocking makes sense, because according to my
> > experience, users are often the first to perceive the machine problems
> > (because of job failover or delay), and they will contact cluster
> operators
> > to solve it, or even tell the cluster operators which machine is
> > problematic. From this point of view, I think the people who really need
> > the manual blocking are the users, and it’s just performed by the cluster
> > operator, so I think the manual blocking makes sense.
> > >
> > >
> > > @Chesnay
> > >
> > > We need to touch the logic of JM/SlotPool, because for MARK_BLOCKLISTED
> > , we need to know whether the slot is blocklisted when the task is
> > FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot
> > directly to avoid assigning other tasks (of this job) on it. If we only
> > maintain the blocklist information on the RM, JM needs to retrieve it by
> > RPC. I think the performance overhead of that is relatively large, so I
> > think it's worth maintaining the blocklist information on the JM side and
> > syncing them.
> > >
> > >
> > > @Роман
> > >
> > >     a) “Probably storing inside Zookeeper/Configmap might be helpful
> > here.”  Can you explain it in detail? I don't fully understand that. In
> my
> > opinion, non-active and active are the same, and no special treatment is
> > required.
> > >
> > > b) I agree with you, the `endTimestamp` makes sense, I will add it to
> > FLIP.
> > >
> > >
> > > @Yang
> > >
> > > As mentioned above, AFAK, the external system cannot support the
> > MARK_BLOCKLISTED action.
> > >
> > >
> > > Looking forward to your further feedback.
> > >
> > >
> > > Best,
> > >
> > > Lijie
> > >
> > >
> > > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> > >>
> > >> Thanks Lijie and Zhu for creating the proposal.
> > >>
> > >> I want to share some thoughts about Flink cluster operations.
> > >>
> > >> In the production environment, the SRE(aka Site Reliability Engineer)
> > >> already has many tools to detect the unstable nodes, which could take
> > the
> > >> system logs/metrics into consideration.
> > >> Then they use graceful-decomission in YARN and taint in K8s to prevent
> > new
> > >> allocations on these unstable nodes.
> > >> At last, they will evict all the containers and pods running on these
> > nodes.
> > >> This mechanism also works for planned maintenance. So I am afraid this
> > is
> > >> not the typical use case for FLIP-224.
> > >>
> > >> If we only support to block nodes manually, then I could not see
> > >> the obvious advantages compared with current SRE's approach(via *yarn
> > >> rmadmin or kubectl taint*).
> > >> At least, we need to have a pluggable component which could expose the
> > >> potential unstable nodes automatically and block them if enabled
> > explicitly.
> > >>
> > >>
> > >> Best,
> > >> Yang
> > >>
> > >>
> > >>
> > >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> > >>
> > >> > Thanks for the proposal, Lijie.
> > >> >
> > >> > This is an interesting feature and discussion, and somewhat related
> > to the
> > >> > design principle about how people should operate Flink.
> > >> >
> > >> > I think there are three things involved in this FLIP.
> > >> >      a) Detect and report the unstable node.
> > >> >      b) Collect the information of the unstable node and form a
> > blocklist.
> > >> >      c) Take the action to block nodes.
> > >> >
> > >> > My two cents:
> > >> >
> > >> > 1. It looks like people all agree that Flink should have c). It is
> > not only
> > >> > useful for cases of node failures, but also handy for some planned
> > >> > maintenance.
> > >> >
> > >> > 2. People have different opinions on b), i.e. who should be the
> brain
> > to
> > >> > make the decision to block a node. I think this largely depends on
> > who we
> > >> > talk to. Different users would probably give different answers. For
> > people
> > >> > who do have a centralized node health management service, let Flink
> > do just
> > >> > do a) and c) would be preferred. So essentially Flink would be one
> of
> > the
> > >> > sources that may detect unstable nodes, report it to that service,
> > and then
> > >> > take the command from that service to block the problematic nodes.
> On
> > the
> > >> > other hand, for users who do not have such a service, simply letting
> > Flink
> > >> > be clever by itself to block the suspicious nodes might be desired
> to
> > >> > ensure the jobs are running smoothly.
> > >> >
> > >> > So that indicates a) and b) here should be pluggable / optional.
> > >> >
> > >> > In light of this, maybe it would make sense to have something
> > pluggable
> > >> > like a UnstableNodeReporter which exposes unstable nodes actively.
> (A
> > more
> > >> > general interface should be JobInfoReporter<T> which can be used to
> > report
> > >> > any information of type <T>. But I'll just keep the scope relevant
> to
> > this
> > >> > FLIP here). Personally speaking, I think it is OK to have a default
> > >> > implementation of a reporter which just tells Flink to take action
> to
> > block
> > >> > problematic nodes and also unblocks them after timeout.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jiangjie (Becket) Qin
> > >> >
> > >> >
> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com>
> > wrote:
> > >> >
> > >> > > Thanks for good initiative, Lijie and Zhu!
> > >> > >
> > >> > > If it's possible I'd like to participate in development.
> > >> > >
> > >> > > I agree with 3rd point of Konstantin's reply - we should consider
> > to move
> > >> > > somehow the information of blocklisted nodes/TMs from active
> > >> > > ResourceManager to non-active ones. Probably storing inside
> > >> > > Zookeeper/Configmap might be helpful here.
> > >> > >
> > >> > > And I agree with Martijn that a lot of organizations don't want to
> > expose
> > >> > > such API for a cluster user group. But I think it's necessary to
> > have the
> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding
> incorrect
> > >> > > automatic behaviour.
> > >> > >
> > >> > > And another one small suggestion - I think it would be better to
> > extend
> > >> > the
> > >> > > *BlocklistedItem* class with the *endTimestamp* field and fill it
> > at the
> > >> > > item creation. This simple addition will allow to:
> > >> > >
> > >> > >    -
> > >> > >
> > >> > >    Provide the ability to users to setup the exact time of
> > blocklist end
> > >> > >    through RestAPI
> > >> > >    -
> > >> > >
> > >> > >    Not being tied to a single value of
> > >> > >    *cluster.resource-blacklist.item.timeout*
> > >> > >
> > >> > >
> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <chesnay@apache.org
> >
> > >> > wrote:
> > >> > >
> > >> > > > I do share the concern between blurring the lines a bit.
> > >> > > >
> > >> > > > That said, I'd prefer to not have any auto-detection and only
> > have an
> > >> > > > opt-in mechanism
> > >> > > > to manually block processes/nodes. To me this sounds yet again
> > like one
> > >> > > > of those
> > >> > > > magical mechanisms that will rarely work just right.
> > >> > > > An external system can leverage way more information after all.
> > >> > > >
> > >> > > > Moreover, I'm quite concerned about the complexity of this
> > proposal.
> > >> > > > Tracking on both the RM/JM side; syncing between components;
> > >> > adjustments
> > >> > > > to the
> > >> > > > slot and resource protocol.
> > >> > > >
> > >> > > > In a way it seems overly complicated.
> > >> > > >
> > >> > > > If we look at it purely from an active resource management
> > perspective,
> > >> > > > then there
> > >> > > > isn't really a need to touch the slot protocol at all (or in
> fact
> > to
> > >> > > > anything in the JobMaster),
> > >> > > > because there isn't any point in keeping around blocked TMs in
> the
> > >> > first
> > >> > > > place.
> > >> > > > They'd just be idling, potentially shutting down after a while
> by
> > the
> > >> > RM
> > >> > > > because of
> > >> > > > it (unless we _also_ touch that logic).
> > >> > > > Here the blocking of a process (be it by blocking the process or
> > node)
> > >> > is
> > >> > > > equivalent with shutting down the blocked process(es).
> > >> > > > Once the block is lifted we can just spin it back up.
> > >> > > >
> > >> > > > And I do wonder whether we couldn't apply the same line of
> > thinking to
> > >> > > > standalone resource management.
> > >> > > > Here being able to stop/restart a process/node manually should
> be
> > a
> > >> > core
> > >> > > > requirement for a Flink deployment anyway.
> > >> > > >
> > >> > > >
> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > >> > > > > Hi everyone,
> > >> > > > >
> > >> > > > > Thanks for creating this FLIP. I can understand the problem
> and
> > I see
> > >> > > > value
> > >> > > > > in the automatic detection and blocklisting. I do have some
> > concerns
> > >> > > with
> > >> > > > > the ability to manually specify to be blocked resources. I
> have
> > two
> > >> > > > > concerns;
> > >> > > > >
> > >> > > > > * Most organizations explicitly have a separation of concerns,
> > >> > meaning
> > >> > > > that
> > >> > > > > there's a group who's responsible for managing a cluster and
> > there's
> > >> > a
> > >> > > > user
> > >> > > > > group who uses that cluster. With the introduction of this
> > mechanism,
> > >> > > the
> > >> > > > > latter group now can influence the responsibility of the first
> > group.
> > >> > > So
> > >> > > > it
> > >> > > > > can be possible that someone from the user group blocks
> > something,
> > >> > > which
> > >> > > > > causes an outage (which could result in paging mechanism
> > triggering
> > >> > > etc)
> > >> > > > > which impacts the first group.
> > >> > > > > * How big is the group of people who can go through the
> process
> > of
> > >> > > > manually
> > >> > > > > identifying a node that isn't behaving as it should be? I do
> > think
> > >> > this
> > >> > > > > group is relatively limited. Does it then make sense to
> > introduce
> > >> > such
> > >> > > a
> > >> > > > > feature, which would only be used by a really small user group
> > of
> > >> > > Flink?
> > >> > > > We
> > >> > > > > still have to maintain, test and support such a feature.
> > >> > > > >
> > >> > > > > I'm +1 for the autodetection features, but I'm leaning towards
> > not
> > >> > > > exposing
> > >> > > > > this to the user group but having this available strictly for
> > cluster
> > >> > > > > operators. They could then also set up their
> > paging/metrics/logging
> > >> > > > system
> > >> > > > > to take this into account.
> > >> > > > >
> > >> > > > > Best regards,
> > >> > > > >
> > >> > > > > Martijn Visser
> > >> > > > > https://twitter.com/MartijnVisser82
> > >> > > > > https://github.com/MartijnVisser
> > >> > > > >
> > >> > > > >
> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com>
> > wrote:
> > >> > > > >
> > >> > > > >> Thanks for driving this, Zhu and Lijie.
> > >> > > > >>
> > >> > > > >> +1 for the overall proposal. Just share some cents here:
> > >> > > > >>
> > >> > > > >> - Why do we need to expose
> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval to the
> > user?
> > >> > > > >> I think the semantics of
> > `cluster.resource-blacklist.item.timeout`
> > >> > is
> > >> > > > >> sufficient for the user. How to guarantee the timeout
> > mechanism is
> > >> > > > >> Flink's internal implementation. I think it will be very
> > confusing
> > >> > and
> > >> > > > >> we do not need to expose it to users.
> > >> > > > >>
> > >> > > > >> - ResourceManager can notify the exception of a task manager
> to
> > >> > > > >> `BlacklistHandler` as well.
> > >> > > > >> For example, the slot allocation might fail in case the
> target
> > task
> > >> > > > >> manager is busy or has a network jitter. I don't mean we need
> > to
> > >> > cover
> > >> > > > >> this case in this version, but we can also open a
> > `notifyException`
> > >> > in
> > >> > > > >> `ResourceManagerBlacklistHandler`.
> > >> > > > >>
> > >> > > > >> - Before we sync the blocklist to ResourceManager, will the
> > slot of
> > >> > a
> > >> > > > >> blocked task manager continues to be released and allocated?
> > >> > > > >>
> > >> > > > >> Best,
> > >> > > > >> Yangze Guo
> > >> > > > >>
> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > >> > wangdachui9501@gmail.com>
> > >> > > > >> wrote:
> > >> > > > >>> Hi Konstantin,
> > >> > > > >>>
> > >> > > > >>> Thanks for your feedback. I will response your 4 remarks:
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> 1) Thanks for reminding me of the controversy. I think
> > “BlockList”
> > >> > is
> > >> > > > >> good
> > >> > > > >>> enough, and I will change it in FLIP.
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> 2) Your suggestion for the REST API is a good idea. Based on
> > the
> > >> > > > above, I
> > >> > > > >>> would change REST API as following:
> > >> > > > >>>
> > >> > > > >>> POST/GET <host>/blocklist/nodes
> > >> > > > >>>
> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > >> > > > >>>
> > >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > >> > > > >>>
> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that all task
> > >> > managers
> > >> > > > on
> > >> > > > >>> this node are blocklisted. All slots on these TMs are not
> > >> > available.
> > >> > > > This
> > >> > > > >>> is actually a bit like TM losts, but these TMs are not
> really
> > lost,
> > >> > > > they
> > >> > > > >>> are in an unavailable status, and they are still registered
> > in this
> > >> > > > flink
> > >> > > > >>> cluster. They will be available again once the corresponding
> > >> > > blocklist
> > >> > > > >> item
> > >> > > > >>> is removed. This behavior is the same in active/non-active
> > >> > clusters.
> > >> > > > >>> However in the active clusters, these TMs may be released
> due
> > to
> > >> > idle
> > >> > > > >>> timeouts.
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons
> are
> > as
> > >> > > > >> following:
> > >> > > > >>> a) The timeout will not affect users adding or removing
> items
> > via
> > >> > > REST
> > >> > > > >> API,
> > >> > > > >>> and users can disable it by configuring it to
> Long.MAX_VALUE .
> > >> > > > >>>
> > >> > > > >>> b) Some node problems can recover after a period of time
> > (such as
> > >> > > > machine
> > >> > > > >>> hotspots), in which case users may prefer that Flink can do
> > this
> > >> > > > >>> automatically instead of requiring the user to do it
> manually.
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> Best,
> > >> > > > >>>
> > >> > > > >>> Lijie
> > >> > > > >>>
> > >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> > >> > > > >>>
> > >> > > > >>>> Hi Lijie,
> > >> > > > >>>>
> > >> > > > >>>> I think, this makes sense and +1 to only support manually
> > blocking
> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies can
> > also be
> > >> > > > >>>> maintained outside of Apache Flink.
> > >> > > > >>>>
> > >> > > > >>>> A few remarks:
> > >> > > > >>>>
> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to the
> > >> > controversy
> > >> > > > >> around
> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this
> topic a
> > >> > while
> > >> > > > >> back
> > >> > > > >>>> and there was generally a consensus to avoid the term
> > blacklist &
> > >> > > > >> whitelist
> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different design
> > as
> > >> > verbs
> > >> > > > >> like
> > >> > > > >>>> add/remove often considered an anti-pattern for REST APIs.
> > POST
> > >> > on a
> > >> > > > >> list
> > >> > > > >>>> item is generally the standard to add items. DELETE on the
> > >> > > individual
> > >> > > > >>>> resource is standard to remove an item.
> > >> > > > >>>>
> > >> > > > >>>> POST <host>/quarantine/items
> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > >> > > > >>>>
> > >> > > > >>>> We could also consider to separate taskmanagers and nodes
> in
> > the
> > >> > > REST
> > >> > > > >> API
> > >> > > > >>>> (and internal data structures). Any opinion on this?
> > >> > > > >>>>
> > >> > > > >>>> POST/GET <host>/quarantine/nodes
> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > >> > > > >>>>
> > >> > > > >>>> 3) How would blocking nodes behave with non-active resource
> > >> > > managers,
> > >> > > > >> i.e.
> > >> > > > >>>> standalone or reactive mode?
> > >> > > > >>>>
> > >> > > > >>>> 4) To keep the implementation even more minimal, do we need
> > the
> > >> > > > timeout
> > >> > > > >>>> behavior? If items are added/removed manually we could
> > delegate
> > >> > this
> > >> > > > >> to the
> > >> > > > >>>> user easily. In my opinion the timeout behavior would
> better
> > fit
> > >> > > into
> > >> > > > >>>> specific strategies at a later point.
> > >> > > > >>>>
> > >> > > > >>>> Looking forward to your thoughts.
> > >> > > > >>>>
> > >> > > > >>>> Cheers and thank you,
> > >> > > > >>>>
> > >> > > > >>>> Konstantin
> > >> > > > >>>>
> > >> > > > >>>> [1]
> > >> > > > >>>>
> > >> > > > >>>>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> > >> > > > >>>>
> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> > >> > > > >>>> wangdachui9501@gmail.com>:
> > >> > > > >>>>
> > >> > > > >>>>> Hi all,
> > >> > > > >>>>>
> > >> > > > >>>>> Flink job failures may happen due to cluster node issues
> > >> > > > >> (insufficient
> > >> > > > >>>> disk
> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink will
> > take care
> > >> > > of
> > >> > > > >> the
> > >> > > > >>>>> failures and redeploy the tasks. However, due to data
> > locality
> > >> > and
> > >> > > > >>>> limited
> > >> > > > >>>>> resources, the new tasks are very likely to be redeployed
> > to the
> > >> > > same
> > >> > > > >>>>> nodes, which will result in continuous task abnormalities
> > and
> > >> > > affect
> > >> > > > >> job
> > >> > > > >>>>> progress.
> > >> > > > >>>>>
> > >> > > > >>>>> Currently, Flink users need to manually identify the
> > problematic
> > >> > > > >> node and
> > >> > > > >>>>> take it offline to solve this problem. But this approach
> has
> > >> > > > >> following
> > >> > > > >>>>> disadvantages:
> > >> > > > >>>>>
> > >> > > > >>>>> 1. Taking a node offline can be a heavy process. Users may
> > need
> > >> > to
> > >> > > > >>>> contact
> > >> > > > >>>>> cluster administors to do this. The operation can even be
> > >> > dangerous
> > >> > > > >> and
> > >> > > > >>>> not
> > >> > > > >>>>> allowed during some important business events.
> > >> > > > >>>>>
> > >> > > > >>>>> 2. Identifying and solving this kind of problems manually
> > would
> > >> > be
> > >> > > > >> slow
> > >> > > > >>>> and
> > >> > > > >>>>> a waste of human resources.
> > >> > > > >>>>>
> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to introduce
> a
> > >> > > blacklist
> > >> > > > >>>>> mechanism for Flink to filter out problematic resources.
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking forward
> > to your
> > >> > > > >>>> feedback.
> > >> > > > >>>>> [1]
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >> > > > >>>>>
> > >> > > > >>>>> Best,
> > >> > > > >>>>>
> > >> > > > >>>>> Lijie
> > >> > > > >>>>>
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Jiangang Liu <li...@gmail.com>.

Thanks for the valuable design. The auto-detecting can decrease great work
for us. We have implemented the similar feature in our inner flink version.
Below is something that I care about:

   1. For auto-detecting, I wonder how to make the strategy and mark a node
   blocked? Sometimes the blocked node is hard to be detected, for example,
   the upper node or the down node will be blocked when network unreachable.
   2. I see that the strategy is made in JobMaster side. How about
   implementing the similar logic in resource manager? In session mode, multi
   jobs can fail on the same bad node and the node should be marked blocked.
   If the job makes the strategy, the node may be not marked blocked if the
   fail times don't exceed the threshold.


Zhu Zhu <re...@gmail.com> 于2022年5月5日周四 23:35写道：

> Thank you for all your feedback!
>
> Besides the answers from Lijie, I'd like to share some of my thoughts:
> 1. Whether to enable automatical blocklist
> Generally speaking, it is not a goal of FLIP-224.
> The automatical way should be something built upon the blocklist
> mechanism and well decoupled. It was designed to be a configurable
> blocklist strategy, but I think we can further decouple it by
> introducing a abnormal node detector, as Becket suggested, which just
> uses the blocklist mechanism once bad nodes are detected. However, it
> should be a separate FLIP with further dev discussions and feedback
> from users. I also agree with Becket that different users have different
> requirements, and we should listen to them.
>
> 2. Is it enough to just take away abnormal nodes externally
> My answer is no. As Lijie has mentioned, we need a way to avoid
> deploying tasks to temporary hot nodes. In this case, users may just
> want to limit the load of the node and do not want to kill all the
> processes on it. Another case is the speculative execution[1] which
> may also leverage this feature to avoid starting mirror tasks on slow
> nodes.
>
> Thanks,
> Zhu
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>
> Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：
>
> >
> > Hi everyone,
> >
> >
> > Thanks for your feedback.
> >
> >
> > There's one detail that I'd like to re-emphasize here because it can
> affect the value and design of the blocklist mechanism (perhaps I should
> highlight it in the FLIP). We propose two actions in FLIP:
> >
> > 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked.
> Future slots should not be allocated from the blocked task manager or node.
> But slots that are already allocated will not be affected. A typical
> application scenario is to mitigate machine hotspots. In this case, we hope
> that subsequent resource allocations will not be on the hot machine, but
> tasks currently running on it should not be affected.
> >
> > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node as
> blocked, and evacuate all tasks on it. Evacuated tasks will be restarted on
> non-blocked task managers.
> >
> > For the above 2 actions, the former may more highlight the meaning of
> this FLIP, because the external system cannot do that.
> >
> >
> > Regarding *Manually* and *Automatically*, I basically agree with @Becket
> Qin: different users have different answers. Not all users’ deployment
> environments have a special external system that can perform the anomaly
> detection. In addition, adding pluggable/optional auto-detection doesn't
> require much extra work on top of manual specification.
> >
> >
> > I will answer your other questions one by one.
> >
> >
> > @Yangze
> >
> > a) I think you are right, we do not need to expose the
> `cluster.resource-blocklist.item.timeout-check-interval` to users.
> >
> > b) We can abstract the `notifyException` to a separate interface (maybe
> BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can
> implement it in the future.
> >
> >
> > @Martijn
> >
> > a) I also think the manual blocking should be done by cluster operators.
> >
> > b) I think manual blocking makes sense, because according to my
> experience, users are often the first to perceive the machine problems
> (because of job failover or delay), and they will contact cluster operators
> to solve it, or even tell the cluster operators which machine is
> problematic. From this point of view, I think the people who really need
> the manual blocking are the users, and it’s just performed by the cluster
> operator, so I think the manual blocking makes sense.
> >
> >
> > @Chesnay
> >
> > We need to touch the logic of JM/SlotPool, because for MARK_BLOCKLISTED
> , we need to know whether the slot is blocklisted when the task is
> FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot
> directly to avoid assigning other tasks (of this job) on it. If we only
> maintain the blocklist information on the RM, JM needs to retrieve it by
> RPC. I think the performance overhead of that is relatively large, so I
> think it's worth maintaining the blocklist information on the JM side and
> syncing them.
> >
> >
> > @Роман
> >
> >     a) “Probably storing inside Zookeeper/Configmap might be helpful
> here.”  Can you explain it in detail? I don't fully understand that. In my
> opinion, non-active and active are the same, and no special treatment is
> required.
> >
> > b) I agree with you, the `endTimestamp` makes sense, I will add it to
> FLIP.
> >
> >
> > @Yang
> >
> > As mentioned above, AFAK, the external system cannot support the
> MARK_BLOCKLISTED action.
> >
> >
> > Looking forward to your further feedback.
> >
> >
> > Best,
> >
> > Lijie
> >
> >
> > Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
> >>
> >> Thanks Lijie and Zhu for creating the proposal.
> >>
> >> I want to share some thoughts about Flink cluster operations.
> >>
> >> In the production environment, the SRE(aka Site Reliability Engineer)
> >> already has many tools to detect the unstable nodes, which could take
> the
> >> system logs/metrics into consideration.
> >> Then they use graceful-decomission in YARN and taint in K8s to prevent
> new
> >> allocations on these unstable nodes.
> >> At last, they will evict all the containers and pods running on these
> nodes.
> >> This mechanism also works for planned maintenance. So I am afraid this
> is
> >> not the typical use case for FLIP-224.
> >>
> >> If we only support to block nodes manually, then I could not see
> >> the obvious advantages compared with current SRE's approach(via *yarn
> >> rmadmin or kubectl taint*).
> >> At least, we need to have a pluggable component which could expose the
> >> potential unstable nodes automatically and block them if enabled
> explicitly.
> >>
> >>
> >> Best,
> >> Yang
> >>
> >>
> >>
> >> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
> >>
> >> > Thanks for the proposal, Lijie.
> >> >
> >> > This is an interesting feature and discussion, and somewhat related
> to the
> >> > design principle about how people should operate Flink.
> >> >
> >> > I think there are three things involved in this FLIP.
> >> >      a) Detect and report the unstable node.
> >> >      b) Collect the information of the unstable node and form a
> blocklist.
> >> >      c) Take the action to block nodes.
> >> >
> >> > My two cents:
> >> >
> >> > 1. It looks like people all agree that Flink should have c). It is
> not only
> >> > useful for cases of node failures, but also handy for some planned
> >> > maintenance.
> >> >
> >> > 2. People have different opinions on b), i.e. who should be the brain
> to
> >> > make the decision to block a node. I think this largely depends on
> who we
> >> > talk to. Different users would probably give different answers. For
> people
> >> > who do have a centralized node health management service, let Flink
> do just
> >> > do a) and c) would be preferred. So essentially Flink would be one of
> the
> >> > sources that may detect unstable nodes, report it to that service,
> and then
> >> > take the command from that service to block the problematic nodes. On
> the
> >> > other hand, for users who do not have such a service, simply letting
> Flink
> >> > be clever by itself to block the suspicious nodes might be desired to
> >> > ensure the jobs are running smoothly.
> >> >
> >> > So that indicates a) and b) here should be pluggable / optional.
> >> >
> >> > In light of this, maybe it would make sense to have something
> pluggable
> >> > like a UnstableNodeReporter which exposes unstable nodes actively. (A
> more
> >> > general interface should be JobInfoReporter<T> which can be used to
> report
> >> > any information of type <T>. But I'll just keep the scope relevant to
> this
> >> > FLIP here). Personally speaking, I think it is OK to have a default
> >> > implementation of a reporter which just tells Flink to take action to
> block
> >> > problematic nodes and also unblocks them after timeout.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> >
> >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com>
> wrote:
> >> >
> >> > > Thanks for good initiative, Lijie and Zhu!
> >> > >
> >> > > If it's possible I'd like to participate in development.
> >> > >
> >> > > I agree with 3rd point of Konstantin's reply - we should consider
> to move
> >> > > somehow the information of blocklisted nodes/TMs from active
> >> > > ResourceManager to non-active ones. Probably storing inside
> >> > > Zookeeper/Configmap might be helpful here.
> >> > >
> >> > > And I agree with Martijn that a lot of organizations don't want to
> expose
> >> > > such API for a cluster user group. But I think it's necessary to
> have the
> >> > > mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
> >> > > automatic behaviour.
> >> > >
> >> > > And another one small suggestion - I think it would be better to
> extend
> >> > the
> >> > > *BlocklistedItem* class with the *endTimestamp* field and fill it
> at the
> >> > > item creation. This simple addition will allow to:
> >> > >
> >> > >    -
> >> > >
> >> > >    Provide the ability to users to setup the exact time of
> blocklist end
> >> > >    through RestAPI
> >> > >    -
> >> > >
> >> > >    Not being tied to a single value of
> >> > >    *cluster.resource-blacklist.item.timeout*
> >> > >
> >> > >
> >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ch...@apache.org>
> >> > wrote:
> >> > >
> >> > > > I do share the concern between blurring the lines a bit.
> >> > > >
> >> > > > That said, I'd prefer to not have any auto-detection and only
> have an
> >> > > > opt-in mechanism
> >> > > > to manually block processes/nodes. To me this sounds yet again
> like one
> >> > > > of those
> >> > > > magical mechanisms that will rarely work just right.
> >> > > > An external system can leverage way more information after all.
> >> > > >
> >> > > > Moreover, I'm quite concerned about the complexity of this
> proposal.
> >> > > > Tracking on both the RM/JM side; syncing between components;
> >> > adjustments
> >> > > > to the
> >> > > > slot and resource protocol.
> >> > > >
> >> > > > In a way it seems overly complicated.
> >> > > >
> >> > > > If we look at it purely from an active resource management
> perspective,
> >> > > > then there
> >> > > > isn't really a need to touch the slot protocol at all (or in fact
> to
> >> > > > anything in the JobMaster),
> >> > > > because there isn't any point in keeping around blocked TMs in the
> >> > first
> >> > > > place.
> >> > > > They'd just be idling, potentially shutting down after a while by
> the
> >> > RM
> >> > > > because of
> >> > > > it (unless we _also_ touch that logic).
> >> > > > Here the blocking of a process (be it by blocking the process or
> node)
> >> > is
> >> > > > equivalent with shutting down the blocked process(es).
> >> > > > Once the block is lifted we can just spin it back up.
> >> > > >
> >> > > > And I do wonder whether we couldn't apply the same line of
> thinking to
> >> > > > standalone resource management.
> >> > > > Here being able to stop/restart a process/node manually should be
> a
> >> > core
> >> > > > requirement for a Flink deployment anyway.
> >> > > >
> >> > > >
> >> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> >> > > > > Hi everyone,
> >> > > > >
> >> > > > > Thanks for creating this FLIP. I can understand the problem and
> I see
> >> > > > value
> >> > > > > in the automatic detection and blocklisting. I do have some
> concerns
> >> > > with
> >> > > > > the ability to manually specify to be blocked resources. I have
> two
> >> > > > > concerns;
> >> > > > >
> >> > > > > * Most organizations explicitly have a separation of concerns,
> >> > meaning
> >> > > > that
> >> > > > > there's a group who's responsible for managing a cluster and
> there's
> >> > a
> >> > > > user
> >> > > > > group who uses that cluster. With the introduction of this
> mechanism,
> >> > > the
> >> > > > > latter group now can influence the responsibility of the first
> group.
> >> > > So
> >> > > > it
> >> > > > > can be possible that someone from the user group blocks
> something,
> >> > > which
> >> > > > > causes an outage (which could result in paging mechanism
> triggering
> >> > > etc)
> >> > > > > which impacts the first group.
> >> > > > > * How big is the group of people who can go through the process
> of
> >> > > > manually
> >> > > > > identifying a node that isn't behaving as it should be? I do
> think
> >> > this
> >> > > > > group is relatively limited. Does it then make sense to
> introduce
> >> > such
> >> > > a
> >> > > > > feature, which would only be used by a really small user group
> of
> >> > > Flink?
> >> > > > We
> >> > > > > still have to maintain, test and support such a feature.
> >> > > > >
> >> > > > > I'm +1 for the autodetection features, but I'm leaning towards
> not
> >> > > > exposing
> >> > > > > this to the user group but having this available strictly for
> cluster
> >> > > > > operators. They could then also set up their
> paging/metrics/logging
> >> > > > system
> >> > > > > to take this into account.
> >> > > > >
> >> > > > > Best regards,
> >> > > > >
> >> > > > > Martijn Visser
> >> > > > > https://twitter.com/MartijnVisser82
> >> > > > > https://github.com/MartijnVisser
> >> > > > >
> >> > > > >
> >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com>
> wrote:
> >> > > > >
> >> > > > >> Thanks for driving this, Zhu and Lijie.
> >> > > > >>
> >> > > > >> +1 for the overall proposal. Just share some cents here:
> >> > > > >>
> >> > > > >> - Why do we need to expose
> >> > > > >> cluster.resource-blacklist.item.timeout-check-interval to the
> user?
> >> > > > >> I think the semantics of
> `cluster.resource-blacklist.item.timeout`
> >> > is
> >> > > > >> sufficient for the user. How to guarantee the timeout
> mechanism is
> >> > > > >> Flink's internal implementation. I think it will be very
> confusing
> >> > and
> >> > > > >> we do not need to expose it to users.
> >> > > > >>
> >> > > > >> - ResourceManager can notify the exception of a task manager to
> >> > > > >> `BlacklistHandler` as well.
> >> > > > >> For example, the slot allocation might fail in case the target
> task
> >> > > > >> manager is busy or has a network jitter. I don't mean we need
> to
> >> > cover
> >> > > > >> this case in this version, but we can also open a
> `notifyException`
> >> > in
> >> > > > >> `ResourceManagerBlacklistHandler`.
> >> > > > >>
> >> > > > >> - Before we sync the blocklist to ResourceManager, will the
> slot of
> >> > a
> >> > > > >> blocked task manager continues to be released and allocated?
> >> > > > >>
> >> > > > >> Best,
> >> > > > >> Yangze Guo
> >> > > > >>
> >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> >> > wangdachui9501@gmail.com>
> >> > > > >> wrote:
> >> > > > >>> Hi Konstantin,
> >> > > > >>>
> >> > > > >>> Thanks for your feedback. I will response your 4 remarks:
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> 1) Thanks for reminding me of the controversy. I think
> “BlockList”
> >> > is
> >> > > > >> good
> >> > > > >>> enough, and I will change it in FLIP.
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> 2) Your suggestion for the REST API is a good idea. Based on
> the
> >> > > > above, I
> >> > > > >>> would change REST API as following:
> >> > > > >>>
> >> > > > >>> POST/GET <host>/blocklist/nodes
> >> > > > >>>
> >> > > > >>> POST/GET <host>/blocklist/taskmanagers
> >> > > > >>>
> >> > > > >>> DELETE <host>/blocklist/node/<identifier>
> >> > > > >>>
> >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> 3) If a node is blocking/blocklisted, it means that all task
> >> > managers
> >> > > > on
> >> > > > >>> this node are blocklisted. All slots on these TMs are not
> >> > available.
> >> > > > This
> >> > > > >>> is actually a bit like TM losts, but these TMs are not really
> lost,
> >> > > > they
> >> > > > >>> are in an unavailable status, and they are still registered
> in this
> >> > > > flink
> >> > > > >>> cluster. They will be available again once the corresponding
> >> > > blocklist
> >> > > > >> item
> >> > > > >>> is removed. This behavior is the same in active/non-active
> >> > clusters.
> >> > > > >>> However in the active clusters, these TMs may be released due
> to
> >> > idle
> >> > > > >>> timeouts.
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons are
> as
> >> > > > >> following:
> >> > > > >>> a) The timeout will not affect users adding or removing items
> via
> >> > > REST
> >> > > > >> API,
> >> > > > >>> and users can disable it by configuring it to Long.MAX_VALUE .
> >> > > > >>>
> >> > > > >>> b) Some node problems can recover after a period of time
> (such as
> >> > > > machine
> >> > > > >>> hotspots), in which case users may prefer that Flink can do
> this
> >> > > > >>> automatically instead of requiring the user to do it manually.
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> Best,
> >> > > > >>>
> >> > > > >>> Lijie
> >> > > > >>>
> >> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> >> > > > >>>
> >> > > > >>>> Hi Lijie,
> >> > > > >>>>
> >> > > > >>>> I think, this makes sense and +1 to only support manually
> blocking
> >> > > > >>>> taskmanagers and nodes. Maybe the different strategies can
> also be
> >> > > > >>>> maintained outside of Apache Flink.
> >> > > > >>>>
> >> > > > >>>> A few remarks:
> >> > > > >>>>
> >> > > > >>>> 1) Can we use another term than "bla.cklist" due to the
> >> > controversy
> >> > > > >> around
> >> > > > >>>> the term? [1] There was also a Jira Ticket about this topic a
> >> > while
> >> > > > >> back
> >> > > > >>>> and there was generally a consensus to avoid the term
> blacklist &
> >> > > > >> whitelist
> >> > > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
> >> > > > >>>> 2) For the REST API, I'd prefer a slightly different design
> as
> >> > verbs
> >> > > > >> like
> >> > > > >>>> add/remove often considered an anti-pattern for REST APIs.
> POST
> >> > on a
> >> > > > >> list
> >> > > > >>>> item is generally the standard to add items. DELETE on the
> >> > > individual
> >> > > > >>>> resource is standard to remove an item.
> >> > > > >>>>
> >> > > > >>>> POST <host>/quarantine/items
> >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> >> > > > >>>>
> >> > > > >>>> We could also consider to separate taskmanagers and nodes in
> the
> >> > > REST
> >> > > > >> API
> >> > > > >>>> (and internal data structures). Any opinion on this?
> >> > > > >>>>
> >> > > > >>>> POST/GET <host>/quarantine/nodes
> >> > > > >>>> POST/GET <host>/quarantine/taskmanager
> >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> >> > > > >>>>
> >> > > > >>>> 3) How would blocking nodes behave with non-active resource
> >> > > managers,
> >> > > > >> i.e.
> >> > > > >>>> standalone or reactive mode?
> >> > > > >>>>
> >> > > > >>>> 4) To keep the implementation even more minimal, do we need
> the
> >> > > > timeout
> >> > > > >>>> behavior? If items are added/removed manually we could
> delegate
> >> > this
> >> > > > >> to the
> >> > > > >>>> user easily. In my opinion the timeout behavior would better
> fit
> >> > > into
> >> > > > >>>> specific strategies at a later point.
> >> > > > >>>>
> >> > > > >>>> Looking forward to your thoughts.
> >> > > > >>>>
> >> > > > >>>> Cheers and thank you,
> >> > > > >>>>
> >> > > > >>>> Konstantin
> >> > > > >>>>
> >> > > > >>>> [1]
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>
> >> > > >
> >> > >
> >> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> >> > > > >>>>
> >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> >> > > > >>>> wangdachui9501@gmail.com>:
> >> > > > >>>>
> >> > > > >>>>> Hi all,
> >> > > > >>>>>
> >> > > > >>>>> Flink job failures may happen due to cluster node issues
> >> > > > >> (insufficient
> >> > > > >>>> disk
> >> > > > >>>>> space, bad hardware, network abnormalities). Flink will
> take care
> >> > > of
> >> > > > >> the
> >> > > > >>>>> failures and redeploy the tasks. However, due to data
> locality
> >> > and
> >> > > > >>>> limited
> >> > > > >>>>> resources, the new tasks are very likely to be redeployed
> to the
> >> > > same
> >> > > > >>>>> nodes, which will result in continuous task abnormalities
> and
> >> > > affect
> >> > > > >> job
> >> > > > >>>>> progress.
> >> > > > >>>>>
> >> > > > >>>>> Currently, Flink users need to manually identify the
> problematic
> >> > > > >> node and
> >> > > > >>>>> take it offline to solve this problem. But this approach has
> >> > > > >> following
> >> > > > >>>>> disadvantages:
> >> > > > >>>>>
> >> > > > >>>>> 1. Taking a node offline can be a heavy process. Users may
> need
> >> > to
> >> > > > >>>> contact
> >> > > > >>>>> cluster administors to do this. The operation can even be
> >> > dangerous
> >> > > > >> and
> >> > > > >>>> not
> >> > > > >>>>> allowed during some important business events.
> >> > > > >>>>>
> >> > > > >>>>> 2. Identifying and solving this kind of problems manually
> would
> >> > be
> >> > > > >> slow
> >> > > > >>>> and
> >> > > > >>>>> a waste of human resources.
> >> > > > >>>>>
> >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to introduce a
> >> > > blacklist
> >> > > > >>>>> mechanism for Flink to filter out problematic resources.
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > > >>>>> You can find more details in FLIP-224[1]. Looking forward
> to your
> >> > > > >>>> feedback.
> >> > > > >>>>> [1]
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > > >>
> >> > > >
> >> > >
> >> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >> > > > >>>>>
> >> > > > >>>>> Best,
> >> > > > >>>>>
> >> > > > >>>>> Lijie
> >> > > > >>>>>
> >> > > >
> >> > > >
> >> > >
> >> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Zhu Zhu <re...@gmail.com>.

Thank you for all your feedback!

Besides the answers from Lijie, I'd like to share some of my thoughts:
1. Whether to enable automatical blocklist
Generally speaking, it is not a goal of FLIP-224.
The automatical way should be something built upon the blocklist
mechanism and well decoupled. It was designed to be a configurable
blocklist strategy, but I think we can further decouple it by
introducing a abnormal node detector, as Becket suggested, which just
uses the blocklist mechanism once bad nodes are detected. However, it
should be a separate FLIP with further dev discussions and feedback
from users. I also agree with Becket that different users have different
requirements, and we should listen to them.

2. Is it enough to just take away abnormal nodes externally
My answer is no. As Lijie has mentioned, we need a way to avoid
deploying tasks to temporary hot nodes. In this case, users may just
want to limit the load of the node and do not want to kill all the
processes on it. Another case is the speculative execution[1] which
may also leverage this feature to avoid starting mirror tasks on slow
nodes.

Thanks,
Zhu

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job

Lijie Wang <wa...@gmail.com> 于2022年5月5日周四 15:56写道：

>
> Hi everyone,
>
>
> Thanks for your feedback.
>
>
> There's one detail that I'd like to re-emphasize here because it can affect the value and design of the blocklist mechanism (perhaps I should highlight it in the FLIP). We propose two actions in FLIP:
>
> 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked. Future slots should not be allocated from the blocked task manager or node. But slots that are already allocated will not be affected. A typical application scenario is to mitigate machine hotspots. In this case, we hope that subsequent resource allocations will not be on the hot machine, but tasks currently running on it should not be affected.
>
> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node as blocked, and evacuate all tasks on it. Evacuated tasks will be restarted on non-blocked task managers.
>
> For the above 2 actions, the former may more highlight the meaning of this FLIP, because the external system cannot do that.
>
>
> Regarding *Manually* and *Automatically*, I basically agree with @Becket Qin: different users have different answers. Not all users’ deployment environments have a special external system that can perform the anomaly detection. In addition, adding pluggable/optional auto-detection doesn't require much extra work on top of manual specification.
>
>
> I will answer your other questions one by one.
>
>
> @Yangze
>
> a) I think you are right, we do not need to expose the `cluster.resource-blocklist.item.timeout-check-interval` to users.
>
> b) We can abstract the `notifyException` to a separate interface (maybe BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can implement it in the future.
>
>
> @Martijn
>
> a) I also think the manual blocking should be done by cluster operators.
>
> b) I think manual blocking makes sense, because according to my experience, users are often the first to perceive the machine problems (because of job failover or delay), and they will contact cluster operators to solve it, or even tell the cluster operators which machine is problematic. From this point of view, I think the people who really need the manual blocking are the users, and it’s just performed by the cluster operator, so I think the manual blocking makes sense.
>
>
> @Chesnay
>
> We need to touch the logic of JM/SlotPool, because for MARK_BLOCKLISTED , we need to know whether the slot is blocklisted when the task is FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot directly to avoid assigning other tasks (of this job) on it. If we only maintain the blocklist information on the RM, JM needs to retrieve it by RPC. I think the performance overhead of that is relatively large, so I think it's worth maintaining the blocklist information on the JM side and syncing them.
>
>
> @Роман
>
>     a) “Probably storing inside Zookeeper/Configmap might be helpful here.”  Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.
>
> b) I agree with you, the `endTimestamp` makes sense, I will add it to FLIP.
>
>
> @Yang
>
> As mentioned above, AFAK, the external system cannot support the MARK_BLOCKLISTED action.
>
>
> Looking forward to your further feedback.
>
>
> Best,
>
> Lijie
>
>
> Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：
>>
>> Thanks Lijie and Zhu for creating the proposal.
>>
>> I want to share some thoughts about Flink cluster operations.
>>
>> In the production environment, the SRE(aka Site Reliability Engineer)
>> already has many tools to detect the unstable nodes, which could take the
>> system logs/metrics into consideration.
>> Then they use graceful-decomission in YARN and taint in K8s to prevent new
>> allocations on these unstable nodes.
>> At last, they will evict all the containers and pods running on these nodes.
>> This mechanism also works for planned maintenance. So I am afraid this is
>> not the typical use case for FLIP-224.
>>
>> If we only support to block nodes manually, then I could not see
>> the obvious advantages compared with current SRE's approach(via *yarn
>> rmadmin or kubectl taint*).
>> At least, we need to have a pluggable component which could expose the
>> potential unstable nodes automatically and block them if enabled explicitly.
>>
>>
>> Best,
>> Yang
>>
>>
>>
>> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
>>
>> > Thanks for the proposal, Lijie.
>> >
>> > This is an interesting feature and discussion, and somewhat related to the
>> > design principle about how people should operate Flink.
>> >
>> > I think there are three things involved in this FLIP.
>> >      a) Detect and report the unstable node.
>> >      b) Collect the information of the unstable node and form a blocklist.
>> >      c) Take the action to block nodes.
>> >
>> > My two cents:
>> >
>> > 1. It looks like people all agree that Flink should have c). It is not only
>> > useful for cases of node failures, but also handy for some planned
>> > maintenance.
>> >
>> > 2. People have different opinions on b), i.e. who should be the brain to
>> > make the decision to block a node. I think this largely depends on who we
>> > talk to. Different users would probably give different answers. For people
>> > who do have a centralized node health management service, let Flink do just
>> > do a) and c) would be preferred. So essentially Flink would be one of the
>> > sources that may detect unstable nodes, report it to that service, and then
>> > take the command from that service to block the problematic nodes. On the
>> > other hand, for users who do not have such a service, simply letting Flink
>> > be clever by itself to block the suspicious nodes might be desired to
>> > ensure the jobs are running smoothly.
>> >
>> > So that indicates a) and b) here should be pluggable / optional.
>> >
>> > In light of this, maybe it would make sense to have something pluggable
>> > like a UnstableNodeReporter which exposes unstable nodes actively. (A more
>> > general interface should be JobInfoReporter<T> which can be used to report
>> > any information of type <T>. But I'll just keep the scope relevant to this
>> > FLIP here). Personally speaking, I think it is OK to have a default
>> > implementation of a reporter which just tells Flink to take action to block
>> > problematic nodes and also unblocks them after timeout.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com> wrote:
>> >
>> > > Thanks for good initiative, Lijie and Zhu!
>> > >
>> > > If it's possible I'd like to participate in development.
>> > >
>> > > I agree with 3rd point of Konstantin's reply - we should consider to move
>> > > somehow the information of blocklisted nodes/TMs from active
>> > > ResourceManager to non-active ones. Probably storing inside
>> > > Zookeeper/Configmap might be helpful here.
>> > >
>> > > And I agree with Martijn that a lot of organizations don't want to expose
>> > > such API for a cluster user group. But I think it's necessary to have the
>> > > mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
>> > > automatic behaviour.
>> > >
>> > > And another one small suggestion - I think it would be better to extend
>> > the
>> > > *BlocklistedItem* class with the *endTimestamp* field and fill it at the
>> > > item creation. This simple addition will allow to:
>> > >
>> > >    -
>> > >
>> > >    Provide the ability to users to setup the exact time of blocklist end
>> > >    through RestAPI
>> > >    -
>> > >
>> > >    Not being tied to a single value of
>> > >    *cluster.resource-blacklist.item.timeout*
>> > >
>> > >
>> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ch...@apache.org>
>> > wrote:
>> > >
>> > > > I do share the concern between blurring the lines a bit.
>> > > >
>> > > > That said, I'd prefer to not have any auto-detection and only have an
>> > > > opt-in mechanism
>> > > > to manually block processes/nodes. To me this sounds yet again like one
>> > > > of those
>> > > > magical mechanisms that will rarely work just right.
>> > > > An external system can leverage way more information after all.
>> > > >
>> > > > Moreover, I'm quite concerned about the complexity of this proposal.
>> > > > Tracking on both the RM/JM side; syncing between components;
>> > adjustments
>> > > > to the
>> > > > slot and resource protocol.
>> > > >
>> > > > In a way it seems overly complicated.
>> > > >
>> > > > If we look at it purely from an active resource management perspective,
>> > > > then there
>> > > > isn't really a need to touch the slot protocol at all (or in fact to
>> > > > anything in the JobMaster),
>> > > > because there isn't any point in keeping around blocked TMs in the
>> > first
>> > > > place.
>> > > > They'd just be idling, potentially shutting down after a while by the
>> > RM
>> > > > because of
>> > > > it (unless we _also_ touch that logic).
>> > > > Here the blocking of a process (be it by blocking the process or node)
>> > is
>> > > > equivalent with shutting down the blocked process(es).
>> > > > Once the block is lifted we can just spin it back up.
>> > > >
>> > > > And I do wonder whether we couldn't apply the same line of thinking to
>> > > > standalone resource management.
>> > > > Here being able to stop/restart a process/node manually should be a
>> > core
>> > > > requirement for a Flink deployment anyway.
>> > > >
>> > > >
>> > > > On 02/05/2022 08:49, Martijn Visser wrote:
>> > > > > Hi everyone,
>> > > > >
>> > > > > Thanks for creating this FLIP. I can understand the problem and I see
>> > > > value
>> > > > > in the automatic detection and blocklisting. I do have some concerns
>> > > with
>> > > > > the ability to manually specify to be blocked resources. I have two
>> > > > > concerns;
>> > > > >
>> > > > > * Most organizations explicitly have a separation of concerns,
>> > meaning
>> > > > that
>> > > > > there's a group who's responsible for managing a cluster and there's
>> > a
>> > > > user
>> > > > > group who uses that cluster. With the introduction of this mechanism,
>> > > the
>> > > > > latter group now can influence the responsibility of the first group.
>> > > So
>> > > > it
>> > > > > can be possible that someone from the user group blocks something,
>> > > which
>> > > > > causes an outage (which could result in paging mechanism triggering
>> > > etc)
>> > > > > which impacts the first group.
>> > > > > * How big is the group of people who can go through the process of
>> > > > manually
>> > > > > identifying a node that isn't behaving as it should be? I do think
>> > this
>> > > > > group is relatively limited. Does it then make sense to introduce
>> > such
>> > > a
>> > > > > feature, which would only be used by a really small user group of
>> > > Flink?
>> > > > We
>> > > > > still have to maintain, test and support such a feature.
>> > > > >
>> > > > > I'm +1 for the autodetection features, but I'm leaning towards not
>> > > > exposing
>> > > > > this to the user group but having this available strictly for cluster
>> > > > > operators. They could then also set up their paging/metrics/logging
>> > > > system
>> > > > > to take this into account.
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > Martijn Visser
>> > > > > https://twitter.com/MartijnVisser82
>> > > > > https://github.com/MartijnVisser
>> > > > >
>> > > > >
>> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com> wrote:
>> > > > >
>> > > > >> Thanks for driving this, Zhu and Lijie.
>> > > > >>
>> > > > >> +1 for the overall proposal. Just share some cents here:
>> > > > >>
>> > > > >> - Why do we need to expose
>> > > > >> cluster.resource-blacklist.item.timeout-check-interval to the user?
>> > > > >> I think the semantics of `cluster.resource-blacklist.item.timeout`
>> > is
>> > > > >> sufficient for the user. How to guarantee the timeout mechanism is
>> > > > >> Flink's internal implementation. I think it will be very confusing
>> > and
>> > > > >> we do not need to expose it to users.
>> > > > >>
>> > > > >> - ResourceManager can notify the exception of a task manager to
>> > > > >> `BlacklistHandler` as well.
>> > > > >> For example, the slot allocation might fail in case the target task
>> > > > >> manager is busy or has a network jitter. I don't mean we need to
>> > cover
>> > > > >> this case in this version, but we can also open a `notifyException`
>> > in
>> > > > >> `ResourceManagerBlacklistHandler`.
>> > > > >>
>> > > > >> - Before we sync the blocklist to ResourceManager, will the slot of
>> > a
>> > > > >> blocked task manager continues to be released and allocated?
>> > > > >>
>> > > > >> Best,
>> > > > >> Yangze Guo
>> > > > >>
>> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
>> > wangdachui9501@gmail.com>
>> > > > >> wrote:
>> > > > >>> Hi Konstantin,
>> > > > >>>
>> > > > >>> Thanks for your feedback. I will response your 4 remarks:
>> > > > >>>
>> > > > >>>
>> > > > >>> 1) Thanks for reminding me of the controversy. I think “BlockList”
>> > is
>> > > > >> good
>> > > > >>> enough, and I will change it in FLIP.
>> > > > >>>
>> > > > >>>
>> > > > >>> 2) Your suggestion for the REST API is a good idea. Based on the
>> > > > above, I
>> > > > >>> would change REST API as following:
>> > > > >>>
>> > > > >>> POST/GET <host>/blocklist/nodes
>> > > > >>>
>> > > > >>> POST/GET <host>/blocklist/taskmanagers
>> > > > >>>
>> > > > >>> DELETE <host>/blocklist/node/<identifier>
>> > > > >>>
>> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
>> > > > >>>
>> > > > >>>
>> > > > >>> 3) If a node is blocking/blocklisted, it means that all task
>> > managers
>> > > > on
>> > > > >>> this node are blocklisted. All slots on these TMs are not
>> > available.
>> > > > This
>> > > > >>> is actually a bit like TM losts, but these TMs are not really lost,
>> > > > they
>> > > > >>> are in an unavailable status, and they are still registered in this
>> > > > flink
>> > > > >>> cluster. They will be available again once the corresponding
>> > > blocklist
>> > > > >> item
>> > > > >>> is removed. This behavior is the same in active/non-active
>> > clusters.
>> > > > >>> However in the active clusters, these TMs may be released due to
>> > idle
>> > > > >>> timeouts.
>> > > > >>>
>> > > > >>>
>> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons are as
>> > > > >> following:
>> > > > >>> a) The timeout will not affect users adding or removing items via
>> > > REST
>> > > > >> API,
>> > > > >>> and users can disable it by configuring it to Long.MAX_VALUE .
>> > > > >>>
>> > > > >>> b) Some node problems can recover after a period of time (such as
>> > > > machine
>> > > > >>> hotspots), in which case users may prefer that Flink can do this
>> > > > >>> automatically instead of requiring the user to do it manually.
>> > > > >>>
>> > > > >>>
>> > > > >>> Best,
>> > > > >>>
>> > > > >>> Lijie
>> > > > >>>
>> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
>> > > > >>>
>> > > > >>>> Hi Lijie,
>> > > > >>>>
>> > > > >>>> I think, this makes sense and +1 to only support manually blocking
>> > > > >>>> taskmanagers and nodes. Maybe the different strategies can also be
>> > > > >>>> maintained outside of Apache Flink.
>> > > > >>>>
>> > > > >>>> A few remarks:
>> > > > >>>>
>> > > > >>>> 1) Can we use another term than "bla.cklist" due to the
>> > controversy
>> > > > >> around
>> > > > >>>> the term? [1] There was also a Jira Ticket about this topic a
>> > while
>> > > > >> back
>> > > > >>>> and there was generally a consensus to avoid the term blacklist &
>> > > > >> whitelist
>> > > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
>> > > > >>>> 2) For the REST API, I'd prefer a slightly different design as
>> > verbs
>> > > > >> like
>> > > > >>>> add/remove often considered an anti-pattern for REST APIs. POST
>> > on a
>> > > > >> list
>> > > > >>>> item is generally the standard to add items. DELETE on the
>> > > individual
>> > > > >>>> resource is standard to remove an item.
>> > > > >>>>
>> > > > >>>> POST <host>/quarantine/items
>> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
>> > > > >>>>
>> > > > >>>> We could also consider to separate taskmanagers and nodes in the
>> > > REST
>> > > > >> API
>> > > > >>>> (and internal data structures). Any opinion on this?
>> > > > >>>>
>> > > > >>>> POST/GET <host>/quarantine/nodes
>> > > > >>>> POST/GET <host>/quarantine/taskmanager
>> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
>> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
>> > > > >>>>
>> > > > >>>> 3) How would blocking nodes behave with non-active resource
>> > > managers,
>> > > > >> i.e.
>> > > > >>>> standalone or reactive mode?
>> > > > >>>>
>> > > > >>>> 4) To keep the implementation even more minimal, do we need the
>> > > > timeout
>> > > > >>>> behavior? If items are added/removed manually we could delegate
>> > this
>> > > > >> to the
>> > > > >>>> user easily. In my opinion the timeout behavior would better fit
>> > > into
>> > > > >>>> specific strategies at a later point.
>> > > > >>>>
>> > > > >>>> Looking forward to your thoughts.
>> > > > >>>>
>> > > > >>>> Cheers and thank you,
>> > > > >>>>
>> > > > >>>> Konstantin
>> > > > >>>>
>> > > > >>>> [1]
>> > > > >>>>
>> > > > >>>>
>> > > > >>
>> > > >
>> > >
>> > https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
>> > > > >>>>
>> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
>> > > > >>>> wangdachui9501@gmail.com>:
>> > > > >>>>
>> > > > >>>>> Hi all,
>> > > > >>>>>
>> > > > >>>>> Flink job failures may happen due to cluster node issues
>> > > > >> (insufficient
>> > > > >>>> disk
>> > > > >>>>> space, bad hardware, network abnormalities). Flink will take care
>> > > of
>> > > > >> the
>> > > > >>>>> failures and redeploy the tasks. However, due to data locality
>> > and
>> > > > >>>> limited
>> > > > >>>>> resources, the new tasks are very likely to be redeployed to the
>> > > same
>> > > > >>>>> nodes, which will result in continuous task abnormalities and
>> > > affect
>> > > > >> job
>> > > > >>>>> progress.
>> > > > >>>>>
>> > > > >>>>> Currently, Flink users need to manually identify the problematic
>> > > > >> node and
>> > > > >>>>> take it offline to solve this problem. But this approach has
>> > > > >> following
>> > > > >>>>> disadvantages:
>> > > > >>>>>
>> > > > >>>>> 1. Taking a node offline can be a heavy process. Users may need
>> > to
>> > > > >>>> contact
>> > > > >>>>> cluster administors to do this. The operation can even be
>> > dangerous
>> > > > >> and
>> > > > >>>> not
>> > > > >>>>> allowed during some important business events.
>> > > > >>>>>
>> > > > >>>>> 2. Identifying and solving this kind of problems manually would
>> > be
>> > > > >> slow
>> > > > >>>> and
>> > > > >>>>> a waste of human resources.
>> > > > >>>>>
>> > > > >>>>> To solve this problem, Zhu Zhu and I propose to introduce a
>> > > blacklist
>> > > > >>>>> mechanism for Flink to filter out problematic resources.
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>> You can find more details in FLIP-224[1]. Looking forward to your
>> > > > >>>> feedback.
>> > > > >>>>> [1]
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>
>> > > >
>> > >
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>> > > > >>>>>
>> > > > >>>>> Best,
>> > > > >>>>>
>> > > > >>>>> Lijie
>> > > > >>>>>
>> > > >
>> > > >
>> > >
>> >

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Lijie Wang <wa...@gmail.com>.

Hi everyone,

Thanks for your feedback.


There's one detail that I'd like to re-emphasize here because it can affect
the value and design of the blocklist mechanism (perhaps I should highlight
it in the FLIP). We propose two actions in FLIP:

1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked. Future
slots should not be allocated from the blocked task manager or node. But
slots that are already allocated will not be affected. A typical
application scenario is to mitigate machine hotspots. In this case, we hope
that subsequent resource allocations will not be on the hot machine, but
tasks currently running on it should not be affected.

2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node as
blocked, and evacuate all tasks on it. Evacuated tasks will be restarted on
non-blocked task managers.

For the above 2 actions, the former may more highlight the meaning of this
FLIP, because the external system cannot do that.


Regarding *Manually* and *Automatically*, I basically agree with @Becket
Qin: different users have different answers. Not all users’ deployment
environments have a special external system that can perform the anomaly
detection. In addition, adding pluggable/optional auto-detection doesn't
require much extra work on top of manual specification.

I will answer your other questions one by one.


@Yangze

a) I think you are right, we do not need to expose the
`cluster.resource-blocklist.item.timeout-check-interval` to users.

b) We can abstract the `notifyException` to a separate interface (maybe
BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can
implement it in the future.

@Martijn

a) I also think the manual blocking should be done by cluster operators.

b) I think manual blocking makes sense, because according to my experience,
users are often the first to perceive the machine problems (because of job
failover or delay), and they will contact cluster operators to solve it, or
even tell the cluster operators which machine is problematic. From this
point of view, I think the people who really need the manual blocking are
the users, and it’s just performed by the cluster operator, so I think the
manual blocking makes sense.

@Chesnay

We need to touch the logic of JM/SlotPool, because for MARK_BLOCKLISTED ,
we need to know whether the slot is blocklisted when the task is
FINISHED/CANCELLED/FAILED. If so,  SlotPool should release the slot
directly to avoid assigning other tasks (of this job) on it. If we only
maintain the blocklist information on the RM, JM needs to retrieve it by
RPC. I think the performance overhead of that is relatively large, so I
think it's worth maintaining the blocklist information on the JM side and
syncing them.

@Роман

    a) “Probably storing inside Zookeeper/Configmap might be helpful
here.”  Can you explain it in detail? I don't fully understand that. In my
opinion, non-active and active are the same, and no special treatment is
required.

b) I agree with you, the `endTimestamp` makes sense, I will add it to FLIP.

@Yang

As mentioned above, AFAK, the external system cannot support the
MARK_BLOCKLISTED action.


Looking forward to your further feedback.


Best,

Lijie

Yang Wang <da...@gmail.com> 于2022年5月3日周二 21:09写道：

> Thanks Lijie and Zhu for creating the proposal.
>
> I want to share some thoughts about Flink cluster operations.
>
> In the production environment, the SRE(aka Site Reliability Engineer)
> already has many tools to detect the unstable nodes, which could take the
> system logs/metrics into consideration.
> Then they use graceful-decomission in YARN and taint in K8s to prevent new
> allocations on these unstable nodes.
> At last, they will evict all the containers and pods running on these
> nodes.
> This mechanism also works for planned maintenance. So I am afraid this is
> not the typical use case for FLIP-224.
>
> If we only support to block nodes manually, then I could not see
> the obvious advantages compared with current SRE's approach(via *yarn
> rmadmin or kubectl taint*).
> At least, we need to have a pluggable component which could expose the
> potential unstable nodes automatically and block them if enabled
> explicitly.
>
>
> Best,
> Yang
>
>
>
> Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：
>
> > Thanks for the proposal, Lijie.
> >
> > This is an interesting feature and discussion, and somewhat related to
> the
> > design principle about how people should operate Flink.
> >
> > I think there are three things involved in this FLIP.
> >      a) Detect and report the unstable node.
> >      b) Collect the information of the unstable node and form a
> blocklist.
> >      c) Take the action to block nodes.
> >
> > My two cents:
> >
> > 1. It looks like people all agree that Flink should have c). It is not
> only
> > useful for cases of node failures, but also handy for some planned
> > maintenance.
> >
> > 2. People have different opinions on b), i.e. who should be the brain to
> > make the decision to block a node. I think this largely depends on who we
> > talk to. Different users would probably give different answers. For
> people
> > who do have a centralized node health management service, let Flink do
> just
> > do a) and c) would be preferred. So essentially Flink would be one of the
> > sources that may detect unstable nodes, report it to that service, and
> then
> > take the command from that service to block the problematic nodes. On the
> > other hand, for users who do not have such a service, simply letting
> Flink
> > be clever by itself to block the suspicious nodes might be desired to
> > ensure the jobs are running smoothly.
> >
> > So that indicates a) and b) here should be pluggable / optional.
> >
> > In light of this, maybe it would make sense to have something pluggable
> > like a UnstableNodeReporter which exposes unstable nodes actively. (A
> more
> > general interface should be JobInfoReporter<T> which can be used to
> report
> > any information of type <T>. But I'll just keep the scope relevant to
> this
> > FLIP here). Personally speaking, I think it is OK to have a default
> > implementation of a reporter which just tells Flink to take action to
> block
> > problematic nodes and also unblocks them after timeout.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com> wrote:
> >
> > > Thanks for good initiative, Lijie and Zhu!
> > >
> > > If it's possible I'd like to participate in development.
> > >
> > > I agree with 3rd point of Konstantin's reply - we should consider to
> move
> > > somehow the information of blocklisted nodes/TMs from active
> > > ResourceManager to non-active ones. Probably storing inside
> > > Zookeeper/Configmap might be helpful here.
> > >
> > > And I agree with Martijn that a lot of organizations don't want to
> expose
> > > such API for a cluster user group. But I think it's necessary to have
> the
> > > mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
> > > automatic behaviour.
> > >
> > > And another one small suggestion - I think it would be better to extend
> > the
> > > *BlocklistedItem* class with the *endTimestamp* field and fill it at
> the
> > > item creation. This simple addition will allow to:
> > >
> > >    -
> > >
> > >    Provide the ability to users to setup the exact time of blocklist
> end
> > >    through RestAPI
> > >    -
> > >
> > >    Not being tied to a single value of
> > >    *cluster.resource-blacklist.item.timeout*
> > >
> > >
> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ch...@apache.org>
> > wrote:
> > >
> > > > I do share the concern between blurring the lines a bit.
> > > >
> > > > That said, I'd prefer to not have any auto-detection and only have an
> > > > opt-in mechanism
> > > > to manually block processes/nodes. To me this sounds yet again like
> one
> > > > of those
> > > > magical mechanisms that will rarely work just right.
> > > > An external system can leverage way more information after all.
> > > >
> > > > Moreover, I'm quite concerned about the complexity of this proposal.
> > > > Tracking on both the RM/JM side; syncing between components;
> > adjustments
> > > > to the
> > > > slot and resource protocol.
> > > >
> > > > In a way it seems overly complicated.
> > > >
> > > > If we look at it purely from an active resource management
> perspective,
> > > > then there
> > > > isn't really a need to touch the slot protocol at all (or in fact to
> > > > anything in the JobMaster),
> > > > because there isn't any point in keeping around blocked TMs in the
> > first
> > > > place.
> > > > They'd just be idling, potentially shutting down after a while by the
> > RM
> > > > because of
> > > > it (unless we _also_ touch that logic).
> > > > Here the blocking of a process (be it by blocking the process or
> node)
> > is
> > > > equivalent with shutting down the blocked process(es).
> > > > Once the block is lifted we can just spin it back up.
> > > >
> > > > And I do wonder whether we couldn't apply the same line of thinking
> to
> > > > standalone resource management.
> > > > Here being able to stop/restart a process/node manually should be a
> > core
> > > > requirement for a Flink deployment anyway.
> > > >
> > > >
> > > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > > > Hi everyone,
> > > > >
> > > > > Thanks for creating this FLIP. I can understand the problem and I
> see
> > > > value
> > > > > in the automatic detection and blocklisting. I do have some
> concerns
> > > with
> > > > > the ability to manually specify to be blocked resources. I have two
> > > > > concerns;
> > > > >
> > > > > * Most organizations explicitly have a separation of concerns,
> > meaning
> > > > that
> > > > > there's a group who's responsible for managing a cluster and
> there's
> > a
> > > > user
> > > > > group who uses that cluster. With the introduction of this
> mechanism,
> > > the
> > > > > latter group now can influence the responsibility of the first
> group.
> > > So
> > > > it
> > > > > can be possible that someone from the user group blocks something,
> > > which
> > > > > causes an outage (which could result in paging mechanism triggering
> > > etc)
> > > > > which impacts the first group.
> > > > > * How big is the group of people who can go through the process of
> > > > manually
> > > > > identifying a node that isn't behaving as it should be? I do think
> > this
> > > > > group is relatively limited. Does it then make sense to introduce
> > such
> > > a
> > > > > feature, which would only be used by a really small user group of
> > > Flink?
> > > > We
> > > > > still have to maintain, test and support such a feature.
> > > > >
> > > > > I'm +1 for the autodetection features, but I'm leaning towards not
> > > > exposing
> > > > > this to the user group but having this available strictly for
> cluster
> > > > > operators. They could then also set up their paging/metrics/logging
> > > > system
> > > > > to take this into account.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn Visser
> > > > > https://twitter.com/MartijnVisser82
> > > > > https://github.com/MartijnVisser
> > > > >
> > > > >
> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com>
> wrote:
> > > > >
> > > > >> Thanks for driving this, Zhu and Lijie.
> > > > >>
> > > > >> +1 for the overall proposal. Just share some cents here:
> > > > >>
> > > > >> - Why do we need to expose
> > > > >> cluster.resource-blacklist.item.timeout-check-interval to the
> user?
> > > > >> I think the semantics of `cluster.resource-blacklist.item.timeout`
> > is
> > > > >> sufficient for the user. How to guarantee the timeout mechanism is
> > > > >> Flink's internal implementation. I think it will be very confusing
> > and
> > > > >> we do not need to expose it to users.
> > > > >>
> > > > >> - ResourceManager can notify the exception of a task manager to
> > > > >> `BlacklistHandler` as well.
> > > > >> For example, the slot allocation might fail in case the target
> task
> > > > >> manager is busy or has a network jitter. I don't mean we need to
> > cover
> > > > >> this case in this version, but we can also open a
> `notifyException`
> > in
> > > > >> `ResourceManagerBlacklistHandler`.
> > > > >>
> > > > >> - Before we sync the blocklist to ResourceManager, will the slot
> of
> > a
> > > > >> blocked task manager continues to be released and allocated?
> > > > >>
> > > > >> Best,
> > > > >> Yangze Guo
> > > > >>
> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> > wangdachui9501@gmail.com>
> > > > >> wrote:
> > > > >>> Hi Konstantin,
> > > > >>>
> > > > >>> Thanks for your feedback. I will response your 4 remarks:
> > > > >>>
> > > > >>>
> > > > >>> 1) Thanks for reminding me of the controversy. I think
> “BlockList”
> > is
> > > > >> good
> > > > >>> enough, and I will change it in FLIP.
> > > > >>>
> > > > >>>
> > > > >>> 2) Your suggestion for the REST API is a good idea. Based on the
> > > > above, I
> > > > >>> would change REST API as following:
> > > > >>>
> > > > >>> POST/GET <host>/blocklist/nodes
> > > > >>>
> > > > >>> POST/GET <host>/blocklist/taskmanagers
> > > > >>>
> > > > >>> DELETE <host>/blocklist/node/<identifier>
> > > > >>>
> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > > > >>>
> > > > >>>
> > > > >>> 3) If a node is blocking/blocklisted, it means that all task
> > managers
> > > > on
> > > > >>> this node are blocklisted. All slots on these TMs are not
> > available.
> > > > This
> > > > >>> is actually a bit like TM losts, but these TMs are not really
> lost,
> > > > they
> > > > >>> are in an unavailable status, and they are still registered in
> this
> > > > flink
> > > > >>> cluster. They will be available again once the corresponding
> > > blocklist
> > > > >> item
> > > > >>> is removed. This behavior is the same in active/non-active
> > clusters.
> > > > >>> However in the active clusters, these TMs may be released due to
> > idle
> > > > >>> timeouts.
> > > > >>>
> > > > >>>
> > > > >>> 4) For the item timeout, I prefer to keep it. The reasons are as
> > > > >> following:
> > > > >>> a) The timeout will not affect users adding or removing items via
> > > REST
> > > > >> API,
> > > > >>> and users can disable it by configuring it to Long.MAX_VALUE .
> > > > >>>
> > > > >>> b) Some node problems can recover after a period of time (such as
> > > > machine
> > > > >>> hotspots), in which case users may prefer that Flink can do this
> > > > >>> automatically instead of requiring the user to do it manually.
> > > > >>>
> > > > >>>
> > > > >>> Best,
> > > > >>>
> > > > >>> Lijie
> > > > >>>
> > > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> > > > >>>
> > > > >>>> Hi Lijie,
> > > > >>>>
> > > > >>>> I think, this makes sense and +1 to only support manually
> blocking
> > > > >>>> taskmanagers and nodes. Maybe the different strategies can also
> be
> > > > >>>> maintained outside of Apache Flink.
> > > > >>>>
> > > > >>>> A few remarks:
> > > > >>>>
> > > > >>>> 1) Can we use another term than "bla.cklist" due to the
> > controversy
> > > > >> around
> > > > >>>> the term? [1] There was also a Jira Ticket about this topic a
> > while
> > > > >> back
> > > > >>>> and there was generally a consensus to avoid the term blacklist
> &
> > > > >> whitelist
> > > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
> > > > >>>> 2) For the REST API, I'd prefer a slightly different design as
> > verbs
> > > > >> like
> > > > >>>> add/remove often considered an anti-pattern for REST APIs. POST
> > on a
> > > > >> list
> > > > >>>> item is generally the standard to add items. DELETE on the
> > > individual
> > > > >>>> resource is standard to remove an item.
> > > > >>>>
> > > > >>>> POST <host>/quarantine/items
> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > > > >>>>
> > > > >>>> We could also consider to separate taskmanagers and nodes in the
> > > REST
> > > > >> API
> > > > >>>> (and internal data structures). Any opinion on this?
> > > > >>>>
> > > > >>>> POST/GET <host>/quarantine/nodes
> > > > >>>> POST/GET <host>/quarantine/taskmanager
> > > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > > > >>>>
> > > > >>>> 3) How would blocking nodes behave with non-active resource
> > > managers,
> > > > >> i.e.
> > > > >>>> standalone or reactive mode?
> > > > >>>>
> > > > >>>> 4) To keep the implementation even more minimal, do we need the
> > > > timeout
> > > > >>>> behavior? If items are added/removed manually we could delegate
> > this
> > > > >> to the
> > > > >>>> user easily. In my opinion the timeout behavior would better fit
> > > into
> > > > >>>> specific strategies at a later point.
> > > > >>>>
> > > > >>>> Looking forward to your thoughts.
> > > > >>>>
> > > > >>>> Cheers and thank you,
> > > > >>>>
> > > > >>>> Konstantin
> > > > >>>>
> > > > >>>> [1]
> > > > >>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> > > > >>>>
> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> > > > >>>> wangdachui9501@gmail.com>:
> > > > >>>>
> > > > >>>>> Hi all,
> > > > >>>>>
> > > > >>>>> Flink job failures may happen due to cluster node issues
> > > > >> (insufficient
> > > > >>>> disk
> > > > >>>>> space, bad hardware, network abnormalities). Flink will take
> care
> > > of
> > > > >> the
> > > > >>>>> failures and redeploy the tasks. However, due to data locality
> > and
> > > > >>>> limited
> > > > >>>>> resources, the new tasks are very likely to be redeployed to
> the
> > > same
> > > > >>>>> nodes, which will result in continuous task abnormalities and
> > > affect
> > > > >> job
> > > > >>>>> progress.
> > > > >>>>>
> > > > >>>>> Currently, Flink users need to manually identify the
> problematic
> > > > >> node and
> > > > >>>>> take it offline to solve this problem. But this approach has
> > > > >> following
> > > > >>>>> disadvantages:
> > > > >>>>>
> > > > >>>>> 1. Taking a node offline can be a heavy process. Users may need
> > to
> > > > >>>> contact
> > > > >>>>> cluster administors to do this. The operation can even be
> > dangerous
> > > > >> and
> > > > >>>> not
> > > > >>>>> allowed during some important business events.
> > > > >>>>>
> > > > >>>>> 2. Identifying and solving this kind of problems manually would
> > be
> > > > >> slow
> > > > >>>> and
> > > > >>>>> a waste of human resources.
> > > > >>>>>
> > > > >>>>> To solve this problem, Zhu Zhu and I propose to introduce a
> > > blacklist
> > > > >>>>> mechanism for Flink to filter out problematic resources.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> You can find more details in FLIP-224[1]. Looking forward to
> your
> > > > >>>> feedback.
> > > > >>>>> [1]
> > > > >>>>>
> > > > >>>>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>>
> > > > >>>>> Lijie
> > > > >>>>>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Yang Wang <da...@gmail.com>.

Thanks Lijie and Zhu for creating the proposal.

I want to share some thoughts about Flink cluster operations.

In the production environment, the SRE(aka Site Reliability Engineer)
already has many tools to detect the unstable nodes, which could take the
system logs/metrics into consideration.
Then they use graceful-decomission in YARN and taint in K8s to prevent new
allocations on these unstable nodes.
At last, they will evict all the containers and pods running on these nodes.
This mechanism also works for planned maintenance. So I am afraid this is
not the typical use case for FLIP-224.

If we only support to block nodes manually, then I could not see
the obvious advantages compared with current SRE's approach(via *yarn
rmadmin or kubectl taint*).
At least, we need to have a pluggable component which could expose the
potential unstable nodes automatically and block them if enabled explicitly.


Best,
Yang



Becket Qin <be...@gmail.com> 于2022年5月2日周一 16:36写道：

> Thanks for the proposal, Lijie.
>
> This is an interesting feature and discussion, and somewhat related to the
> design principle about how people should operate Flink.
>
> I think there are three things involved in this FLIP.
>      a) Detect and report the unstable node.
>      b) Collect the information of the unstable node and form a blocklist.
>      c) Take the action to block nodes.
>
> My two cents:
>
> 1. It looks like people all agree that Flink should have c). It is not only
> useful for cases of node failures, but also handy for some planned
> maintenance.
>
> 2. People have different opinions on b), i.e. who should be the brain to
> make the decision to block a node. I think this largely depends on who we
> talk to. Different users would probably give different answers. For people
> who do have a centralized node health management service, let Flink do just
> do a) and c) would be preferred. So essentially Flink would be one of the
> sources that may detect unstable nodes, report it to that service, and then
> take the command from that service to block the problematic nodes. On the
> other hand, for users who do not have such a service, simply letting Flink
> be clever by itself to block the suspicious nodes might be desired to
> ensure the jobs are running smoothly.
>
> So that indicates a) and b) here should be pluggable / optional.
>
> In light of this, maybe it would make sense to have something pluggable
> like a UnstableNodeReporter which exposes unstable nodes actively. (A more
> general interface should be JobInfoReporter<T> which can be used to report
> any information of type <T>. But I'll just keep the scope relevant to this
> FLIP here). Personally speaking, I think it is OK to have a default
> implementation of a reporter which just tells Flink to take action to block
> problematic nodes and also unblocks them after timeout.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com> wrote:
>
> > Thanks for good initiative, Lijie and Zhu!
> >
> > If it's possible I'd like to participate in development.
> >
> > I agree with 3rd point of Konstantin's reply - we should consider to move
> > somehow the information of blocklisted nodes/TMs from active
> > ResourceManager to non-active ones. Probably storing inside
> > Zookeeper/Configmap might be helpful here.
> >
> > And I agree with Martijn that a lot of organizations don't want to expose
> > such API for a cluster user group. But I think it's necessary to have the
> > mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
> > automatic behaviour.
> >
> > And another one small suggestion - I think it would be better to extend
> the
> > *BlocklistedItem* class with the *endTimestamp* field and fill it at the
> > item creation. This simple addition will allow to:
> >
> >    -
> >
> >    Provide the ability to users to setup the exact time of blocklist end
> >    through RestAPI
> >    -
> >
> >    Not being tied to a single value of
> >    *cluster.resource-blacklist.item.timeout*
> >
> >
> > On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ch...@apache.org>
> wrote:
> >
> > > I do share the concern between blurring the lines a bit.
> > >
> > > That said, I'd prefer to not have any auto-detection and only have an
> > > opt-in mechanism
> > > to manually block processes/nodes. To me this sounds yet again like one
> > > of those
> > > magical mechanisms that will rarely work just right.
> > > An external system can leverage way more information after all.
> > >
> > > Moreover, I'm quite concerned about the complexity of this proposal.
> > > Tracking on both the RM/JM side; syncing between components;
> adjustments
> > > to the
> > > slot and resource protocol.
> > >
> > > In a way it seems overly complicated.
> > >
> > > If we look at it purely from an active resource management perspective,
> > > then there
> > > isn't really a need to touch the slot protocol at all (or in fact to
> > > anything in the JobMaster),
> > > because there isn't any point in keeping around blocked TMs in the
> first
> > > place.
> > > They'd just be idling, potentially shutting down after a while by the
> RM
> > > because of
> > > it (unless we _also_ touch that logic).
> > > Here the blocking of a process (be it by blocking the process or node)
> is
> > > equivalent with shutting down the blocked process(es).
> > > Once the block is lifted we can just spin it back up.
> > >
> > > And I do wonder whether we couldn't apply the same line of thinking to
> > > standalone resource management.
> > > Here being able to stop/restart a process/node manually should be a
> core
> > > requirement for a Flink deployment anyway.
> > >
> > >
> > > On 02/05/2022 08:49, Martijn Visser wrote:
> > > > Hi everyone,
> > > >
> > > > Thanks for creating this FLIP. I can understand the problem and I see
> > > value
> > > > in the automatic detection and blocklisting. I do have some concerns
> > with
> > > > the ability to manually specify to be blocked resources. I have two
> > > > concerns;
> > > >
> > > > * Most organizations explicitly have a separation of concerns,
> meaning
> > > that
> > > > there's a group who's responsible for managing a cluster and there's
> a
> > > user
> > > > group who uses that cluster. With the introduction of this mechanism,
> > the
> > > > latter group now can influence the responsibility of the first group.
> > So
> > > it
> > > > can be possible that someone from the user group blocks something,
> > which
> > > > causes an outage (which could result in paging mechanism triggering
> > etc)
> > > > which impacts the first group.
> > > > * How big is the group of people who can go through the process of
> > > manually
> > > > identifying a node that isn't behaving as it should be? I do think
> this
> > > > group is relatively limited. Does it then make sense to introduce
> such
> > a
> > > > feature, which would only be used by a really small user group of
> > Flink?
> > > We
> > > > still have to maintain, test and support such a feature.
> > > >
> > > > I'm +1 for the autodetection features, but I'm leaning towards not
> > > exposing
> > > > this to the user group but having this available strictly for cluster
> > > > operators. They could then also set up their paging/metrics/logging
> > > system
> > > > to take this into account.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn Visser
> > > > https://twitter.com/MartijnVisser82
> > > > https://github.com/MartijnVisser
> > > >
> > > >
> > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com> wrote:
> > > >
> > > >> Thanks for driving this, Zhu and Lijie.
> > > >>
> > > >> +1 for the overall proposal. Just share some cents here:
> > > >>
> > > >> - Why do we need to expose
> > > >> cluster.resource-blacklist.item.timeout-check-interval to the user?
> > > >> I think the semantics of `cluster.resource-blacklist.item.timeout`
> is
> > > >> sufficient for the user. How to guarantee the timeout mechanism is
> > > >> Flink's internal implementation. I think it will be very confusing
> and
> > > >> we do not need to expose it to users.
> > > >>
> > > >> - ResourceManager can notify the exception of a task manager to
> > > >> `BlacklistHandler` as well.
> > > >> For example, the slot allocation might fail in case the target task
> > > >> manager is busy or has a network jitter. I don't mean we need to
> cover
> > > >> this case in this version, but we can also open a `notifyException`
> in
> > > >> `ResourceManagerBlacklistHandler`.
> > > >>
> > > >> - Before we sync the blocklist to ResourceManager, will the slot of
> a
> > > >> blocked task manager continues to be released and allocated?
> > > >>
> > > >> Best,
> > > >> Yangze Guo
> > > >>
> > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <
> wangdachui9501@gmail.com>
> > > >> wrote:
> > > >>> Hi Konstantin,
> > > >>>
> > > >>> Thanks for your feedback. I will response your 4 remarks:
> > > >>>
> > > >>>
> > > >>> 1) Thanks for reminding me of the controversy. I think “BlockList”
> is
> > > >> good
> > > >>> enough, and I will change it in FLIP.
> > > >>>
> > > >>>
> > > >>> 2) Your suggestion for the REST API is a good idea. Based on the
> > > above, I
> > > >>> would change REST API as following:
> > > >>>
> > > >>> POST/GET <host>/blocklist/nodes
> > > >>>
> > > >>> POST/GET <host>/blocklist/taskmanagers
> > > >>>
> > > >>> DELETE <host>/blocklist/node/<identifier>
> > > >>>
> > > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > > >>>
> > > >>>
> > > >>> 3) If a node is blocking/blocklisted, it means that all task
> managers
> > > on
> > > >>> this node are blocklisted. All slots on these TMs are not
> available.
> > > This
> > > >>> is actually a bit like TM losts, but these TMs are not really lost,
> > > they
> > > >>> are in an unavailable status, and they are still registered in this
> > > flink
> > > >>> cluster. They will be available again once the corresponding
> > blocklist
> > > >> item
> > > >>> is removed. This behavior is the same in active/non-active
> clusters.
> > > >>> However in the active clusters, these TMs may be released due to
> idle
> > > >>> timeouts.
> > > >>>
> > > >>>
> > > >>> 4) For the item timeout, I prefer to keep it. The reasons are as
> > > >> following:
> > > >>> a) The timeout will not affect users adding or removing items via
> > REST
> > > >> API,
> > > >>> and users can disable it by configuring it to Long.MAX_VALUE .
> > > >>>
> > > >>> b) Some node problems can recover after a period of time (such as
> > > machine
> > > >>> hotspots), in which case users may prefer that Flink can do this
> > > >>> automatically instead of requiring the user to do it manually.
> > > >>>
> > > >>>
> > > >>> Best,
> > > >>>
> > > >>> Lijie
> > > >>>
> > > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> > > >>>
> > > >>>> Hi Lijie,
> > > >>>>
> > > >>>> I think, this makes sense and +1 to only support manually blocking
> > > >>>> taskmanagers and nodes. Maybe the different strategies can also be
> > > >>>> maintained outside of Apache Flink.
> > > >>>>
> > > >>>> A few remarks:
> > > >>>>
> > > >>>> 1) Can we use another term than "bla.cklist" due to the
> controversy
> > > >> around
> > > >>>> the term? [1] There was also a Jira Ticket about this topic a
> while
> > > >> back
> > > >>>> and there was generally a consensus to avoid the term blacklist &
> > > >> whitelist
> > > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
> > > >>>> 2) For the REST API, I'd prefer a slightly different design as
> verbs
> > > >> like
> > > >>>> add/remove often considered an anti-pattern for REST APIs. POST
> on a
> > > >> list
> > > >>>> item is generally the standard to add items. DELETE on the
> > individual
> > > >>>> resource is standard to remove an item.
> > > >>>>
> > > >>>> POST <host>/quarantine/items
> > > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > > >>>>
> > > >>>> We could also consider to separate taskmanagers and nodes in the
> > REST
> > > >> API
> > > >>>> (and internal data structures). Any opinion on this?
> > > >>>>
> > > >>>> POST/GET <host>/quarantine/nodes
> > > >>>> POST/GET <host>/quarantine/taskmanager
> > > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > > >>>>
> > > >>>> 3) How would blocking nodes behave with non-active resource
> > managers,
> > > >> i.e.
> > > >>>> standalone or reactive mode?
> > > >>>>
> > > >>>> 4) To keep the implementation even more minimal, do we need the
> > > timeout
> > > >>>> behavior? If items are added/removed manually we could delegate
> this
> > > >> to the
> > > >>>> user easily. In my opinion the timeout behavior would better fit
> > into
> > > >>>> specific strategies at a later point.
> > > >>>>
> > > >>>> Looking forward to your thoughts.
> > > >>>>
> > > >>>> Cheers and thank you,
> > > >>>>
> > > >>>> Konstantin
> > > >>>>
> > > >>>> [1]
> > > >>>>
> > > >>>>
> > > >>
> > >
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> > > >>>>
> > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> > > >>>> wangdachui9501@gmail.com>:
> > > >>>>
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> Flink job failures may happen due to cluster node issues
> > > >> (insufficient
> > > >>>> disk
> > > >>>>> space, bad hardware, network abnormalities). Flink will take care
> > of
> > > >> the
> > > >>>>> failures and redeploy the tasks. However, due to data locality
> and
> > > >>>> limited
> > > >>>>> resources, the new tasks are very likely to be redeployed to the
> > same
> > > >>>>> nodes, which will result in continuous task abnormalities and
> > affect
> > > >> job
> > > >>>>> progress.
> > > >>>>>
> > > >>>>> Currently, Flink users need to manually identify the problematic
> > > >> node and
> > > >>>>> take it offline to solve this problem. But this approach has
> > > >> following
> > > >>>>> disadvantages:
> > > >>>>>
> > > >>>>> 1. Taking a node offline can be a heavy process. Users may need
> to
> > > >>>> contact
> > > >>>>> cluster administors to do this. The operation can even be
> dangerous
> > > >> and
> > > >>>> not
> > > >>>>> allowed during some important business events.
> > > >>>>>
> > > >>>>> 2. Identifying and solving this kind of problems manually would
> be
> > > >> slow
> > > >>>> and
> > > >>>>> a waste of human resources.
> > > >>>>>
> > > >>>>> To solve this problem, Zhu Zhu and I propose to introduce a
> > blacklist
> > > >>>>> mechanism for Flink to filter out problematic resources.
> > > >>>>>
> > > >>>>>
> > > >>>>> You can find more details in FLIP-224[1]. Looking forward to your
> > > >>>> feedback.
> > > >>>>> [1]
> > > >>>>>
> > > >>>>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > > >>>>>
> > > >>>>> Best,
> > > >>>>>
> > > >>>>> Lijie
> > > >>>>>
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Becket Qin <be...@gmail.com>.

Thanks for the proposal, Lijie.

This is an interesting feature and discussion, and somewhat related to the
design principle about how people should operate Flink.

I think there are three things involved in this FLIP.
     a) Detect and report the unstable node.
     b) Collect the information of the unstable node and form a blocklist.
     c) Take the action to block nodes.

My two cents:

1. It looks like people all agree that Flink should have c). It is not only
useful for cases of node failures, but also handy for some planned
maintenance.

2. People have different opinions on b), i.e. who should be the brain to
make the decision to block a node. I think this largely depends on who we
talk to. Different users would probably give different answers. For people
who do have a centralized node health management service, let Flink do just
do a) and c) would be preferred. So essentially Flink would be one of the
sources that may detect unstable nodes, report it to that service, and then
take the command from that service to block the problematic nodes. On the
other hand, for users who do not have such a service, simply letting Flink
be clever by itself to block the suspicious nodes might be desired to
ensure the jobs are running smoothly.

So that indicates a) and b) here should be pluggable / optional.

In light of this, maybe it would make sense to have something pluggable
like a UnstableNodeReporter which exposes unstable nodes actively. (A more
general interface should be JobInfoReporter<T> which can be used to report
any information of type <T>. But I'll just keep the scope relevant to this
FLIP here). Personally speaking, I think it is OK to have a default
implementation of a reporter which just tells Flink to take action to block
problematic nodes and also unblocks them after timeout.

Thanks,

Jiangjie (Becket) Qin


On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro...@gmail.com> wrote:

> Thanks for good initiative, Lijie and Zhu!
>
> If it's possible I'd like to participate in development.
>
> I agree with 3rd point of Konstantin's reply - we should consider to move
> somehow the information of blocklisted nodes/TMs from active
> ResourceManager to non-active ones. Probably storing inside
> Zookeeper/Configmap might be helpful here.
>
> And I agree with Martijn that a lot of organizations don't want to expose
> such API for a cluster user group. But I think it's necessary to have the
> mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
> automatic behaviour.
>
> And another one small suggestion - I think it would be better to extend the
> *BlocklistedItem* class with the *endTimestamp* field and fill it at the
> item creation. This simple addition will allow to:
>
>    -
>
>    Provide the ability to users to setup the exact time of blocklist end
>    through RestAPI
>    -
>
>    Not being tied to a single value of
>    *cluster.resource-blacklist.item.timeout*
>
>
> On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ch...@apache.org> wrote:
>
> > I do share the concern between blurring the lines a bit.
> >
> > That said, I'd prefer to not have any auto-detection and only have an
> > opt-in mechanism
> > to manually block processes/nodes. To me this sounds yet again like one
> > of those
> > magical mechanisms that will rarely work just right.
> > An external system can leverage way more information after all.
> >
> > Moreover, I'm quite concerned about the complexity of this proposal.
> > Tracking on both the RM/JM side; syncing between components; adjustments
> > to the
> > slot and resource protocol.
> >
> > In a way it seems overly complicated.
> >
> > If we look at it purely from an active resource management perspective,
> > then there
> > isn't really a need to touch the slot protocol at all (or in fact to
> > anything in the JobMaster),
> > because there isn't any point in keeping around blocked TMs in the first
> > place.
> > They'd just be idling, potentially shutting down after a while by the RM
> > because of
> > it (unless we _also_ touch that logic).
> > Here the blocking of a process (be it by blocking the process or node) is
> > equivalent with shutting down the blocked process(es).
> > Once the block is lifted we can just spin it back up.
> >
> > And I do wonder whether we couldn't apply the same line of thinking to
> > standalone resource management.
> > Here being able to stop/restart a process/node manually should be a core
> > requirement for a Flink deployment anyway.
> >
> >
> > On 02/05/2022 08:49, Martijn Visser wrote:
> > > Hi everyone,
> > >
> > > Thanks for creating this FLIP. I can understand the problem and I see
> > value
> > > in the automatic detection and blocklisting. I do have some concerns
> with
> > > the ability to manually specify to be blocked resources. I have two
> > > concerns;
> > >
> > > * Most organizations explicitly have a separation of concerns, meaning
> > that
> > > there's a group who's responsible for managing a cluster and there's a
> > user
> > > group who uses that cluster. With the introduction of this mechanism,
> the
> > > latter group now can influence the responsibility of the first group.
> So
> > it
> > > can be possible that someone from the user group blocks something,
> which
> > > causes an outage (which could result in paging mechanism triggering
> etc)
> > > which impacts the first group.
> > > * How big is the group of people who can go through the process of
> > manually
> > > identifying a node that isn't behaving as it should be? I do think this
> > > group is relatively limited. Does it then make sense to introduce such
> a
> > > feature, which would only be used by a really small user group of
> Flink?
> > We
> > > still have to maintain, test and support such a feature.
> > >
> > > I'm +1 for the autodetection features, but I'm leaning towards not
> > exposing
> > > this to the user group but having this available strictly for cluster
> > > operators. They could then also set up their paging/metrics/logging
> > system
> > > to take this into account.
> > >
> > > Best regards,
> > >
> > > Martijn Visser
> > > https://twitter.com/MartijnVisser82
> > > https://github.com/MartijnVisser
> > >
> > >
> > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com> wrote:
> > >
> > >> Thanks for driving this, Zhu and Lijie.
> > >>
> > >> +1 for the overall proposal. Just share some cents here:
> > >>
> > >> - Why do we need to expose
> > >> cluster.resource-blacklist.item.timeout-check-interval to the user?
> > >> I think the semantics of `cluster.resource-blacklist.item.timeout` is
> > >> sufficient for the user. How to guarantee the timeout mechanism is
> > >> Flink's internal implementation. I think it will be very confusing and
> > >> we do not need to expose it to users.
> > >>
> > >> - ResourceManager can notify the exception of a task manager to
> > >> `BlacklistHandler` as well.
> > >> For example, the slot allocation might fail in case the target task
> > >> manager is busy or has a network jitter. I don't mean we need to cover
> > >> this case in this version, but we can also open a `notifyException` in
> > >> `ResourceManagerBlacklistHandler`.
> > >>
> > >> - Before we sync the blocklist to ResourceManager, will the slot of a
> > >> blocked task manager continues to be released and allocated?
> > >>
> > >> Best,
> > >> Yangze Guo
> > >>
> > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wa...@gmail.com>
> > >> wrote:
> > >>> Hi Konstantin,
> > >>>
> > >>> Thanks for your feedback. I will response your 4 remarks:
> > >>>
> > >>>
> > >>> 1) Thanks for reminding me of the controversy. I think “BlockList” is
> > >> good
> > >>> enough, and I will change it in FLIP.
> > >>>
> > >>>
> > >>> 2) Your suggestion for the REST API is a good idea. Based on the
> > above, I
> > >>> would change REST API as following:
> > >>>
> > >>> POST/GET <host>/blocklist/nodes
> > >>>
> > >>> POST/GET <host>/blocklist/taskmanagers
> > >>>
> > >>> DELETE <host>/blocklist/node/<identifier>
> > >>>
> > >>> DELETE <host>/blocklist/taskmanager/<identifier>
> > >>>
> > >>>
> > >>> 3) If a node is blocking/blocklisted, it means that all task managers
> > on
> > >>> this node are blocklisted. All slots on these TMs are not available.
> > This
> > >>> is actually a bit like TM losts, but these TMs are not really lost,
> > they
> > >>> are in an unavailable status, and they are still registered in this
> > flink
> > >>> cluster. They will be available again once the corresponding
> blocklist
> > >> item
> > >>> is removed. This behavior is the same in active/non-active clusters.
> > >>> However in the active clusters, these TMs may be released due to idle
> > >>> timeouts.
> > >>>
> > >>>
> > >>> 4) For the item timeout, I prefer to keep it. The reasons are as
> > >> following:
> > >>> a) The timeout will not affect users adding or removing items via
> REST
> > >> API,
> > >>> and users can disable it by configuring it to Long.MAX_VALUE .
> > >>>
> > >>> b) Some node problems can recover after a period of time (such as
> > machine
> > >>> hotspots), in which case users may prefer that Flink can do this
> > >>> automatically instead of requiring the user to do it manually.
> > >>>
> > >>>
> > >>> Best,
> > >>>
> > >>> Lijie
> > >>>
> > >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> > >>>
> > >>>> Hi Lijie,
> > >>>>
> > >>>> I think, this makes sense and +1 to only support manually blocking
> > >>>> taskmanagers and nodes. Maybe the different strategies can also be
> > >>>> maintained outside of Apache Flink.
> > >>>>
> > >>>> A few remarks:
> > >>>>
> > >>>> 1) Can we use another term than "bla.cklist" due to the controversy
> > >> around
> > >>>> the term? [1] There was also a Jira Ticket about this topic a while
> > >> back
> > >>>> and there was generally a consensus to avoid the term blacklist &
> > >> whitelist
> > >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
> > >>>> 2) For the REST API, I'd prefer a slightly different design as verbs
> > >> like
> > >>>> add/remove often considered an anti-pattern for REST APIs. POST on a
> > >> list
> > >>>> item is generally the standard to add items. DELETE on the
> individual
> > >>>> resource is standard to remove an item.
> > >>>>
> > >>>> POST <host>/quarantine/items
> > >>>> DELETE <host>/quarantine/items/<itemidentifier>
> > >>>>
> > >>>> We could also consider to separate taskmanagers and nodes in the
> REST
> > >> API
> > >>>> (and internal data structures). Any opinion on this?
> > >>>>
> > >>>> POST/GET <host>/quarantine/nodes
> > >>>> POST/GET <host>/quarantine/taskmanager
> > >>>> DELETE <host>/quarantine/nodes/<identifier>
> > >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> > >>>>
> > >>>> 3) How would blocking nodes behave with non-active resource
> managers,
> > >> i.e.
> > >>>> standalone or reactive mode?
> > >>>>
> > >>>> 4) To keep the implementation even more minimal, do we need the
> > timeout
> > >>>> behavior? If items are added/removed manually we could delegate this
> > >> to the
> > >>>> user easily. In my opinion the timeout behavior would better fit
> into
> > >>>> specific strategies at a later point.
> > >>>>
> > >>>> Looking forward to your thoughts.
> > >>>>
> > >>>> Cheers and thank you,
> > >>>>
> > >>>> Konstantin
> > >>>>
> > >>>> [1]
> > >>>>
> > >>>>
> > >>
> >
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> > >>>>
> > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> > >>>> wangdachui9501@gmail.com>:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> Flink job failures may happen due to cluster node issues
> > >> (insufficient
> > >>>> disk
> > >>>>> space, bad hardware, network abnormalities). Flink will take care
> of
> > >> the
> > >>>>> failures and redeploy the tasks. However, due to data locality and
> > >>>> limited
> > >>>>> resources, the new tasks are very likely to be redeployed to the
> same
> > >>>>> nodes, which will result in continuous task abnormalities and
> affect
> > >> job
> > >>>>> progress.
> > >>>>>
> > >>>>> Currently, Flink users need to manually identify the problematic
> > >> node and
> > >>>>> take it offline to solve this problem. But this approach has
> > >> following
> > >>>>> disadvantages:
> > >>>>>
> > >>>>> 1. Taking a node offline can be a heavy process. Users may need to
> > >>>> contact
> > >>>>> cluster administors to do this. The operation can even be dangerous
> > >> and
> > >>>> not
> > >>>>> allowed during some important business events.
> > >>>>>
> > >>>>> 2. Identifying and solving this kind of problems manually would be
> > >> slow
> > >>>> and
> > >>>>> a waste of human resources.
> > >>>>>
> > >>>>> To solve this problem, Zhu Zhu and I propose to introduce a
> blacklist
> > >>>>> mechanism for Flink to filter out problematic resources.
> > >>>>>
> > >>>>>
> > >>>>> You can find more details in FLIP-224[1]. Looking forward to your
> > >>>> feedback.
> > >>>>> [1]
> > >>>>>
> > >>>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Lijie
> > >>>>>
> >
> >
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Роман Бойко <ro...@gmail.com>.

Thanks for good initiative, Lijie and Zhu!

If it's possible I'd like to participate in development.

I agree with 3rd point of Konstantin's reply - we should consider to move
somehow the information of blocklisted nodes/TMs from active
ResourceManager to non-active ones. Probably storing inside
Zookeeper/Configmap might be helpful here.

And I agree with Martijn that a lot of organizations don't want to expose
such API for a cluster user group. But I think it's necessary to have the
mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
automatic behaviour.

And another one small suggestion - I think it would be better to extend the
*BlocklistedItem* class with the *endTimestamp* field and fill it at the
item creation. This simple addition will allow to:

   -

   Provide the ability to users to setup the exact time of blocklist end
   through RestAPI
   -

   Not being tied to a single value of
   *cluster.resource-blacklist.item.timeout*


On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ch...@apache.org> wrote:

> I do share the concern between blurring the lines a bit.
>
> That said, I'd prefer to not have any auto-detection and only have an
> opt-in mechanism
> to manually block processes/nodes. To me this sounds yet again like one
> of those
> magical mechanisms that will rarely work just right.
> An external system can leverage way more information after all.
>
> Moreover, I'm quite concerned about the complexity of this proposal.
> Tracking on both the RM/JM side; syncing between components; adjustments
> to the
> slot and resource protocol.
>
> In a way it seems overly complicated.
>
> If we look at it purely from an active resource management perspective,
> then there
> isn't really a need to touch the slot protocol at all (or in fact to
> anything in the JobMaster),
> because there isn't any point in keeping around blocked TMs in the first
> place.
> They'd just be idling, potentially shutting down after a while by the RM
> because of
> it (unless we _also_ touch that logic).
> Here the blocking of a process (be it by blocking the process or node) is
> equivalent with shutting down the blocked process(es).
> Once the block is lifted we can just spin it back up.
>
> And I do wonder whether we couldn't apply the same line of thinking to
> standalone resource management.
> Here being able to stop/restart a process/node manually should be a core
> requirement for a Flink deployment anyway.
>
>
> On 02/05/2022 08:49, Martijn Visser wrote:
> > Hi everyone,
> >
> > Thanks for creating this FLIP. I can understand the problem and I see
> value
> > in the automatic detection and blocklisting. I do have some concerns with
> > the ability to manually specify to be blocked resources. I have two
> > concerns;
> >
> > * Most organizations explicitly have a separation of concerns, meaning
> that
> > there's a group who's responsible for managing a cluster and there's a
> user
> > group who uses that cluster. With the introduction of this mechanism, the
> > latter group now can influence the responsibility of the first group. So
> it
> > can be possible that someone from the user group blocks something, which
> > causes an outage (which could result in paging mechanism triggering etc)
> > which impacts the first group.
> > * How big is the group of people who can go through the process of
> manually
> > identifying a node that isn't behaving as it should be? I do think this
> > group is relatively limited. Does it then make sense to introduce such a
> > feature, which would only be used by a really small user group of Flink?
> We
> > still have to maintain, test and support such a feature.
> >
> > I'm +1 for the autodetection features, but I'm leaning towards not
> exposing
> > this to the user group but having this available strictly for cluster
> > operators. They could then also set up their paging/metrics/logging
> system
> > to take this into account.
> >
> > Best regards,
> >
> > Martijn Visser
> > https://twitter.com/MartijnVisser82
> > https://github.com/MartijnVisser
> >
> >
> > On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com> wrote:
> >
> >> Thanks for driving this, Zhu and Lijie.
> >>
> >> +1 for the overall proposal. Just share some cents here:
> >>
> >> - Why do we need to expose
> >> cluster.resource-blacklist.item.timeout-check-interval to the user?
> >> I think the semantics of `cluster.resource-blacklist.item.timeout` is
> >> sufficient for the user. How to guarantee the timeout mechanism is
> >> Flink's internal implementation. I think it will be very confusing and
> >> we do not need to expose it to users.
> >>
> >> - ResourceManager can notify the exception of a task manager to
> >> `BlacklistHandler` as well.
> >> For example, the slot allocation might fail in case the target task
> >> manager is busy or has a network jitter. I don't mean we need to cover
> >> this case in this version, but we can also open a `notifyException` in
> >> `ResourceManagerBlacklistHandler`.
> >>
> >> - Before we sync the blocklist to ResourceManager, will the slot of a
> >> blocked task manager continues to be released and allocated?
> >>
> >> Best,
> >> Yangze Guo
> >>
> >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wa...@gmail.com>
> >> wrote:
> >>> Hi Konstantin,
> >>>
> >>> Thanks for your feedback. I will response your 4 remarks:
> >>>
> >>>
> >>> 1) Thanks for reminding me of the controversy. I think “BlockList” is
> >> good
> >>> enough, and I will change it in FLIP.
> >>>
> >>>
> >>> 2) Your suggestion for the REST API is a good idea. Based on the
> above, I
> >>> would change REST API as following:
> >>>
> >>> POST/GET <host>/blocklist/nodes
> >>>
> >>> POST/GET <host>/blocklist/taskmanagers
> >>>
> >>> DELETE <host>/blocklist/node/<identifier>
> >>>
> >>> DELETE <host>/blocklist/taskmanager/<identifier>
> >>>
> >>>
> >>> 3) If a node is blocking/blocklisted, it means that all task managers
> on
> >>> this node are blocklisted. All slots on these TMs are not available.
> This
> >>> is actually a bit like TM losts, but these TMs are not really lost,
> they
> >>> are in an unavailable status, and they are still registered in this
> flink
> >>> cluster. They will be available again once the corresponding blocklist
> >> item
> >>> is removed. This behavior is the same in active/non-active clusters.
> >>> However in the active clusters, these TMs may be released due to idle
> >>> timeouts.
> >>>
> >>>
> >>> 4) For the item timeout, I prefer to keep it. The reasons are as
> >> following:
> >>> a) The timeout will not affect users adding or removing items via REST
> >> API,
> >>> and users can disable it by configuring it to Long.MAX_VALUE .
> >>>
> >>> b) Some node problems can recover after a period of time (such as
> machine
> >>> hotspots), in which case users may prefer that Flink can do this
> >>> automatically instead of requiring the user to do it manually.
> >>>
> >>>
> >>> Best,
> >>>
> >>> Lijie
> >>>
> >>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
> >>>
> >>>> Hi Lijie,
> >>>>
> >>>> I think, this makes sense and +1 to only support manually blocking
> >>>> taskmanagers and nodes. Maybe the different strategies can also be
> >>>> maintained outside of Apache Flink.
> >>>>
> >>>> A few remarks:
> >>>>
> >>>> 1) Can we use another term than "bla.cklist" due to the controversy
> >> around
> >>>> the term? [1] There was also a Jira Ticket about this topic a while
> >> back
> >>>> and there was generally a consensus to avoid the term blacklist &
> >> whitelist
> >>>> [2]? We could use "blocklist" "denylist" or "quarantined"
> >>>> 2) For the REST API, I'd prefer a slightly different design as verbs
> >> like
> >>>> add/remove often considered an anti-pattern for REST APIs. POST on a
> >> list
> >>>> item is generally the standard to add items. DELETE on the individual
> >>>> resource is standard to remove an item.
> >>>>
> >>>> POST <host>/quarantine/items
> >>>> DELETE <host>/quarantine/items/<itemidentifier>
> >>>>
> >>>> We could also consider to separate taskmanagers and nodes in the REST
> >> API
> >>>> (and internal data structures). Any opinion on this?
> >>>>
> >>>> POST/GET <host>/quarantine/nodes
> >>>> POST/GET <host>/quarantine/taskmanager
> >>>> DELETE <host>/quarantine/nodes/<identifier>
> >>>> DELETE <host>/quarantine/taskmanager/<identifier>
> >>>>
> >>>> 3) How would blocking nodes behave with non-active resource managers,
> >> i.e.
> >>>> standalone or reactive mode?
> >>>>
> >>>> 4) To keep the implementation even more minimal, do we need the
> timeout
> >>>> behavior? If items are added/removed manually we could delegate this
> >> to the
> >>>> user easily. In my opinion the timeout behavior would better fit into
> >>>> specific strategies at a later point.
> >>>>
> >>>> Looking forward to your thoughts.
> >>>>
> >>>> Cheers and thank you,
> >>>>
> >>>> Konstantin
> >>>>
> >>>> [1]
> >>>>
> >>>>
> >>
> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
> >>>>
> >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
> >>>> wangdachui9501@gmail.com>:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Flink job failures may happen due to cluster node issues
> >> (insufficient
> >>>> disk
> >>>>> space, bad hardware, network abnormalities). Flink will take care of
> >> the
> >>>>> failures and redeploy the tasks. However, due to data locality and
> >>>> limited
> >>>>> resources, the new tasks are very likely to be redeployed to the same
> >>>>> nodes, which will result in continuous task abnormalities and affect
> >> job
> >>>>> progress.
> >>>>>
> >>>>> Currently, Flink users need to manually identify the problematic
> >> node and
> >>>>> take it offline to solve this problem. But this approach has
> >> following
> >>>>> disadvantages:
> >>>>>
> >>>>> 1. Taking a node offline can be a heavy process. Users may need to
> >>>> contact
> >>>>> cluster administors to do this. The operation can even be dangerous
> >> and
> >>>> not
> >>>>> allowed during some important business events.
> >>>>>
> >>>>> 2. Identifying and solving this kind of problems manually would be
> >> slow
> >>>> and
> >>>>> a waste of human resources.
> >>>>>
> >>>>> To solve this problem, Zhu Zhu and I propose to introduce a blacklist
> >>>>> mechanism for Flink to filter out problematic resources.
> >>>>>
> >>>>>
> >>>>> You can find more details in FLIP-224[1]. Looking forward to your
> >>>> feedback.
> >>>>> [1]
> >>>>>
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Lijie
> >>>>>
>
>

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Posted by Chesnay Schepler <ch...@apache.org>.

I do share the concern between blurring the lines a bit.

That said, I'd prefer to not have any auto-detection and only have an 
opt-in mechanism
to manually block processes/nodes. To me this sounds yet again like one 
of those
magical mechanisms that will rarely work just right.
An external system can leverage way more information after all.

Moreover, I'm quite concerned about the complexity of this proposal.
Tracking on both the RM/JM side; syncing between components; adjustments 
to the
slot and resource protocol.

In a way it seems overly complicated.

If we look at it purely from an active resource management perspective, 
then there
isn't really a need to touch the slot protocol at all (or in fact to 
anything in the JobMaster),
because there isn't any point in keeping around blocked TMs in the first 
place.
They'd just be idling, potentially shutting down after a while by the RM 
because of
it (unless we _also_ touch that logic).
Here the blocking of a process (be it by blocking the process or node) is
equivalent with shutting down the blocked process(es).
Once the block is lifted we can just spin it back up.

And I do wonder whether we couldn't apply the same line of thinking to 
standalone resource management.
Here being able to stop/restart a process/node manually should be a core 
requirement for a Flink deployment anyway.


On 02/05/2022 08:49, Martijn Visser wrote:
> Hi everyone,
>
> Thanks for creating this FLIP. I can understand the problem and I see value
> in the automatic detection and blocklisting. I do have some concerns with
> the ability to manually specify to be blocked resources. I have two
> concerns;
>
> * Most organizations explicitly have a separation of concerns, meaning that
> there's a group who's responsible for managing a cluster and there's a user
> group who uses that cluster. With the introduction of this mechanism, the
> latter group now can influence the responsibility of the first group. So it
> can be possible that someone from the user group blocks something, which
> causes an outage (which could result in paging mechanism triggering etc)
> which impacts the first group.
> * How big is the group of people who can go through the process of manually
> identifying a node that isn't behaving as it should be? I do think this
> group is relatively limited. Does it then make sense to introduce such a
> feature, which would only be used by a really small user group of Flink? We
> still have to maintain, test and support such a feature.
>
> I'm +1 for the autodetection features, but I'm leaning towards not exposing
> this to the user group but having this available strictly for cluster
> operators. They could then also set up their paging/metrics/logging system
> to take this into account.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
>
> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <ka...@gmail.com> wrote:
>
>> Thanks for driving this, Zhu and Lijie.
>>
>> +1 for the overall proposal. Just share some cents here:
>>
>> - Why do we need to expose
>> cluster.resource-blacklist.item.timeout-check-interval to the user?
>> I think the semantics of `cluster.resource-blacklist.item.timeout` is
>> sufficient for the user. How to guarantee the timeout mechanism is
>> Flink's internal implementation. I think it will be very confusing and
>> we do not need to expose it to users.
>>
>> - ResourceManager can notify the exception of a task manager to
>> `BlacklistHandler` as well.
>> For example, the slot allocation might fail in case the target task
>> manager is busy or has a network jitter. I don't mean we need to cover
>> this case in this version, but we can also open a `notifyException` in
>> `ResourceManagerBlacklistHandler`.
>>
>> - Before we sync the blocklist to ResourceManager, will the slot of a
>> blocked task manager continues to be released and allocated?
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wa...@gmail.com>
>> wrote:
>>> Hi Konstantin,
>>>
>>> Thanks for your feedback. I will response your 4 remarks:
>>>
>>>
>>> 1) Thanks for reminding me of the controversy. I think “BlockList” is
>> good
>>> enough, and I will change it in FLIP.
>>>
>>>
>>> 2) Your suggestion for the REST API is a good idea. Based on the above, I
>>> would change REST API as following:
>>>
>>> POST/GET <host>/blocklist/nodes
>>>
>>> POST/GET <host>/blocklist/taskmanagers
>>>
>>> DELETE <host>/blocklist/node/<identifier>
>>>
>>> DELETE <host>/blocklist/taskmanager/<identifier>
>>>
>>>
>>> 3) If a node is blocking/blocklisted, it means that all task managers on
>>> this node are blocklisted. All slots on these TMs are not available. This
>>> is actually a bit like TM losts, but these TMs are not really lost, they
>>> are in an unavailable status, and they are still registered in this flink
>>> cluster. They will be available again once the corresponding blocklist
>> item
>>> is removed. This behavior is the same in active/non-active clusters.
>>> However in the active clusters, these TMs may be released due to idle
>>> timeouts.
>>>
>>>
>>> 4) For the item timeout, I prefer to keep it. The reasons are as
>> following:
>>> a) The timeout will not affect users adding or removing items via REST
>> API,
>>> and users can disable it by configuring it to Long.MAX_VALUE .
>>>
>>> b) Some node problems can recover after a period of time (such as machine
>>> hotspots), in which case users may prefer that Flink can do this
>>> automatically instead of requiring the user to do it manually.
>>>
>>>
>>> Best,
>>>
>>> Lijie
>>>
>>> Konstantin Knauf <kn...@apache.org> 于2022年4月27日周三 19:23写道：
>>>
>>>> Hi Lijie,
>>>>
>>>> I think, this makes sense and +1 to only support manually blocking
>>>> taskmanagers and nodes. Maybe the different strategies can also be
>>>> maintained outside of Apache Flink.
>>>>
>>>> A few remarks:
>>>>
>>>> 1) Can we use another term than "bla.cklist" due to the controversy
>> around
>>>> the term? [1] There was also a Jira Ticket about this topic a while
>> back
>>>> and there was generally a consensus to avoid the term blacklist &
>> whitelist
>>>> [2]? We could use "blocklist" "denylist" or "quarantined"
>>>> 2) For the REST API, I'd prefer a slightly different design as verbs
>> like
>>>> add/remove often considered an anti-pattern for REST APIs. POST on a
>> list
>>>> item is generally the standard to add items. DELETE on the individual
>>>> resource is standard to remove an item.
>>>>
>>>> POST <host>/quarantine/items
>>>> DELETE <host>/quarantine/items/<itemidentifier>
>>>>
>>>> We could also consider to separate taskmanagers and nodes in the REST
>> API
>>>> (and internal data structures). Any opinion on this?
>>>>
>>>> POST/GET <host>/quarantine/nodes
>>>> POST/GET <host>/quarantine/taskmanager
>>>> DELETE <host>/quarantine/nodes/<identifier>
>>>> DELETE <host>/quarantine/taskmanager/<identifier>
>>>>
>>>> 3) How would blocking nodes behave with non-active resource managers,
>> i.e.
>>>> standalone or reactive mode?
>>>>
>>>> 4) To keep the implementation even more minimal, do we need the timeout
>>>> behavior? If items are added/removed manually we could delegate this
>> to the
>>>> user easily. In my opinion the timeout behavior would better fit into
>>>> specific strategies at a later point.
>>>>
>>>> Looking forward to your thoughts.
>>>>
>>>> Cheers and thank you,
>>>>
>>>> Konstantin
>>>>
>>>> [1]
>>>>
>>>>
>> https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
>>>>
>>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang <
>>>> wangdachui9501@gmail.com>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Flink job failures may happen due to cluster node issues
>> (insufficient
>>>> disk
>>>>> space, bad hardware, network abnormalities). Flink will take care of
>> the
>>>>> failures and redeploy the tasks. However, due to data locality and
>>>> limited
>>>>> resources, the new tasks are very likely to be redeployed to the same
>>>>> nodes, which will result in continuous task abnormalities and affect
>> job
>>>>> progress.
>>>>>
>>>>> Currently, Flink users need to manually identify the problematic
>> node and
>>>>> take it offline to solve this problem. But this approach has
>> following
>>>>> disadvantages:
>>>>>
>>>>> 1. Taking a node offline can be a heavy process. Users may need to
>>>> contact
>>>>> cluster administors to do this. The operation can even be dangerous
>> and
>>>> not
>>>>> allowed during some important business events.
>>>>>
>>>>> 2. Identifying and solving this kind of problems manually would be
>> slow
>>>> and
>>>>> a waste of human resources.
>>>>>
>>>>> To solve this problem, Zhu Zhu and I propose to introduce a blacklist
>>>>> mechanism for Flink to filter out problematic resources.
>>>>>
>>>>>
>>>>> You can find more details in FLIP-224[1]. Looking forward to your
>>>> feedback.
>>>>> [1]
>>>>>
>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>>>>
>>>>> Best,
>>>>>
>>>>> Lijie
>>>>>