You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Aaron Levin <aa...@stripe.com> on 2019/07/11 19:38:55 UTC

Graceful Task Manager Termination and Replacement

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing
it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm
interested in a way to replace a Task Manager that has currently-running
tasks. It would be great if it was possible to terminate a Task Manager
without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and
security. Each time we do this we stop the task manager running on the host
being cycled. This causes the entire job to restart, resulting in downtime
for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Re: Graceful Task Manager Termination and Replacement

Posted by Hao Sun <ha...@zendesk.com>.

I have a common interest in this topic. My k8s recycle hosts, and I am
facing the same issue. Flink can tolerate this situation, but I am
wondering if I can do better

On Thu, Jul 11, 2019, 12:39 Aaron Levin <aa...@stripe.com> wrote:

> Hello,
>
> Is there a way to gracefully terminate a Task Manager beyond just killing
> it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm
> interested in a way to replace a Task Manager that has currently-running
> tasks. It would be great if it was possible to terminate a Task Manager
> without restarting the job, though I'm not sure if this is possible.
>
> Context: at my work we regularly cycle our hosts for maintenance and
> security. Each time we do this we stop the task manager running on the host
> being cycled. This causes the entire job to restart, resulting in downtime
> for the job. I'd love to decrease this downtime if at all possible.
>
> Thanks! Any insight is appreciated!
>
> Best,
>
> Aaron Levin
>

Re: Graceful Task Manager Termination and Replacement

Posted by Biao Liu <mm...@gmail.com>.

Hi Yu,

That's a great proposal. Wish to see this feature soon!

On Mon, Jul 29, 2019 at 4:59 PM Yu Li <ca...@gmail.com> wrote:

> Belated but FWIW, besides the region failover and best-efforts failover
> efforts, I believe stop with checkpoint as proposed in FLINK-12619 and
> FLIP-45 could also help here, FYI.
>
> W.r.t k8s, there're also some offline discussion about supporting local
> recovery with persistent volume even when task assigned to other TMs during
> job failover.
>
> [1] https://issues.apache.org/jira/browse/FLINK-12619
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
>
> Best Regards,
> Yu
>
>
> On Wed, 24 Jul 2019 at 17:00, Aaron Levin <aa...@stripe.com> wrote:
>
>> I was on vacation but wanted to thank Biao for summarizing the current
>> state! Thanks!
>>
>> On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mm...@gmail.com> wrote:
>>
>>> Hi Aaron,
>>>
>>> From my understanding, you want shutting down a Task Manager without
>>> restart the job which has tasks running on this Task Manager?
>>>
>>> Based on current implementation, if there is a Task Manager is down, the
>>> tasks on it would be treated as failed. The behavior of task failure is
>>> defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
>>> That's the reason why the whole job restarts when a Task Manager has
>>> gone. As Paul said, you could try "region restart failover strategy" when
>>> 1.9 is released. It might be helpful however it depends on your job
>>> topology.
>>>
>>> The deeper reason of this issue is the consistency semantics of Flink,
>>> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there
>>> is no much choice of `FailoverStrategy`.
>>> It might be improved in the future. There are some discussions in the
>>> mailing list that providing some weaker consistency semantics to improve
>>> the `FailoverStrategy`. We are pushing forward this improvement. I hope it
>>> can be included in 1.10.
>>>
>>> Regarding your question, I guess the answer is no for now. A more
>>> frequent checkpoint or a savepoint manually triggered might be helpful by a
>>> quicker recovery.
>>>
>>>
>>> Paul Lam <pa...@gmail.com> 于2019年7月12日周五 上午10:25写道：
>>>
>>>> Hi,
>>>>
>>>> Maybe region restart strategy can help. It restarts minimum required
>>>> tasks. Note that it’s recommended to use only after 1.9 release, see [1],
>>>> unless you’re running a stateless job.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>>>
>>>> Best,
>>>> Paul Lam
>>>>
>>>> 在 2019年7月12日，03:38，Aaron Levin <aa...@stripe.com> 写道：
>>>>
>>>> Hello,
>>>>
>>>> Is there a way to gracefully terminate a Task Manager beyond just
>>>> killing it (this seems to be what `./taskmanager.sh stop` does)?
>>>> Specifically I'm interested in a way to replace a Task Manager that has
>>>> currently-running tasks. It would be great if it was possible to terminate
>>>> a Task Manager without restarting the job, though I'm not sure if this is
>>>> possible.
>>>>
>>>> Context: at my work we regularly cycle our hosts for maintenance and
>>>> security. Each time we do this we stop the task manager running on the host
>>>> being cycled. This causes the entire job to restart, resulting in downtime
>>>> for the job. I'd love to decrease this downtime if at all possible.
>>>>
>>>> Thanks! Any insight is appreciated!
>>>>
>>>> Best,
>>>>
>>>> Aaron Levin
>>>>
>>>>
>>>>

Re: Graceful Task Manager Termination and Replacement

Posted by Yu Li <ca...@gmail.com>.

Belated but FWIW, besides the region failover and best-efforts failover
efforts, I believe stop with checkpoint as proposed in FLINK-12619 and
FLIP-45 could also help here, FYI.

W.r.t k8s, there're also some offline discussion about supporting local
recovery with persistent volume even when task assigned to other TMs during
job failover.

[1] https://issues.apache.org/jira/browse/FLINK-12619
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

Best Regards,
Yu


On Wed, 24 Jul 2019 at 17:00, Aaron Levin <aa...@stripe.com> wrote:

> I was on vacation but wanted to thank Biao for summarizing the current
> state! Thanks!
>
> On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mm...@gmail.com> wrote:
>
>> Hi Aaron,
>>
>> From my understanding, you want shutting down a Task Manager without
>> restart the job which has tasks running on this Task Manager?
>>
>> Based on current implementation, if there is a Task Manager is down, the
>> tasks on it would be treated as failed. The behavior of task failure is
>> defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
>> That's the reason why the whole job restarts when a Task Manager has
>> gone. As Paul said, you could try "region restart failover strategy" when
>> 1.9 is released. It might be helpful however it depends on your job
>> topology.
>>
>> The deeper reason of this issue is the consistency semantics of Flink,
>> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there
>> is no much choice of `FailoverStrategy`.
>> It might be improved in the future. There are some discussions in the
>> mailing list that providing some weaker consistency semantics to improve
>> the `FailoverStrategy`. We are pushing forward this improvement. I hope it
>> can be included in 1.10.
>>
>> Regarding your question, I guess the answer is no for now. A more
>> frequent checkpoint or a savepoint manually triggered might be helpful by a
>> quicker recovery.
>>
>>
>> Paul Lam <pa...@gmail.com> 于2019年7月12日周五 上午10:25写道：
>>
>>> Hi,
>>>
>>> Maybe region restart strategy can help. It restarts minimum required
>>> tasks. Note that it’s recommended to use only after 1.9 release, see [1],
>>> unless you’re running a stateless job.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>>
>>> Best,
>>> Paul Lam
>>>
>>> 在 2019年7月12日，03:38，Aaron Levin <aa...@stripe.com> 写道：
>>>
>>> Hello,
>>>
>>> Is there a way to gracefully terminate a Task Manager beyond just
>>> killing it (this seems to be what `./taskmanager.sh stop` does)?
>>> Specifically I'm interested in a way to replace a Task Manager that has
>>> currently-running tasks. It would be great if it was possible to terminate
>>> a Task Manager without restarting the job, though I'm not sure if this is
>>> possible.
>>>
>>> Context: at my work we regularly cycle our hosts for maintenance and
>>> security. Each time we do this we stop the task manager running on the host
>>> being cycled. This causes the entire job to restart, resulting in downtime
>>> for the job. I'd love to decrease this downtime if at all possible.
>>>
>>> Thanks! Any insight is appreciated!
>>>
>>> Best,
>>>
>>> Aaron Levin
>>>
>>>
>>>

Re: Graceful Task Manager Termination and Replacement

Posted by Aaron Levin <aa...@stripe.com>.

I was on vacation but wanted to thank Biao for summarizing the current
state! Thanks!

On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mm...@gmail.com> wrote:

> Hi Aaron,
>
> From my understanding, you want shutting down a Task Manager without
> restart the job which has tasks running on this Task Manager?
>
> Based on current implementation, if there is a Task Manager is down, the
> tasks on it would be treated as failed. The behavior of task failure is
> defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
> That's the reason why the whole job restarts when a Task Manager has gone.
> As Paul said, you could try "region restart failover strategy" when 1.9 is
> released. It might be helpful however it depends on your job topology.
>
> The deeper reason of this issue is the consistency semantics of Flink,
> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there
> is no much choice of `FailoverStrategy`.
> It might be improved in the future. There are some discussions in the
> mailing list that providing some weaker consistency semantics to improve
> the `FailoverStrategy`. We are pushing forward this improvement. I hope it
> can be included in 1.10.
>
> Regarding your question, I guess the answer is no for now. A more frequent
> checkpoint or a savepoint manually triggered might be helpful by a quicker
> recovery.
>
>
> Paul Lam <pa...@gmail.com> 于2019年7月12日周五 上午10:25写道：
>
>> Hi,
>>
>> Maybe region restart strategy can help. It restarts minimum required
>> tasks. Note that it’s recommended to use only after 1.9 release, see [1],
>> unless you’re running a stateless job.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>
>> Best,
>> Paul Lam
>>
>> 在 2019年7月12日，03:38，Aaron Levin <aa...@stripe.com> 写道：
>>
>> Hello,
>>
>> Is there a way to gracefully terminate a Task Manager beyond just killing
>> it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm
>> interested in a way to replace a Task Manager that has currently-running
>> tasks. It would be great if it was possible to terminate a Task Manager
>> without restarting the job, though I'm not sure if this is possible.
>>
>> Context: at my work we regularly cycle our hosts for maintenance and
>> security. Each time we do this we stop the task manager running on the host
>> being cycled. This causes the entire job to restart, resulting in downtime
>> for the job. I'd love to decrease this downtime if at all possible.
>>
>> Thanks! Any insight is appreciated!
>>
>> Best,
>>
>> Aaron Levin
>>
>>
>>

Re: Graceful Task Manager Termination and Replacement

Posted by Biao Liu <mm...@gmail.com>.

Hi Aaron,

From my understanding, you want shutting down a Task Manager without
restart the job which has tasks running on this Task Manager?

Based on current implementation, if there is a Task Manager is down, the
tasks on it would be treated as failed. The behavior of task failure is
defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
That's the reason why the whole job restarts when a Task Manager has gone.
As Paul said, you could try "region restart failover strategy" when 1.9 is
released. It might be helpful however it depends on your job topology.

The deeper reason of this issue is the consistency semantics of Flink,
AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there
is no much choice of `FailoverStrategy`.
It might be improved in the future. There are some discussions in the
mailing list that providing some weaker consistency semantics to improve
the `FailoverStrategy`. We are pushing forward this improvement. I hope it
can be included in 1.10.

Regarding your question, I guess the answer is no for now. A more frequent
checkpoint or a savepoint manually triggered might be helpful by a quicker
recovery.


Paul Lam <pa...@gmail.com> 于2019年7月12日周五 上午10:25写道：

> Hi,
>
> Maybe region restart strategy can help. It restarts minimum required
> tasks. Note that it’s recommended to use only after 1.9 release, see [1],
> unless you’re running a stateless job.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10712
>
> Best,
> Paul Lam
>
> 在 2019年7月12日，03:38，Aaron Levin <aa...@stripe.com> 写道：
>
> Hello,
>
> Is there a way to gracefully terminate a Task Manager beyond just killing
> it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm
> interested in a way to replace a Task Manager that has currently-running
> tasks. It would be great if it was possible to terminate a Task Manager
> without restarting the job, though I'm not sure if this is possible.
>
> Context: at my work we regularly cycle our hosts for maintenance and
> security. Each time we do this we stop the task manager running on the host
> being cycled. This causes the entire job to restart, resulting in downtime
> for the job. I'd love to decrease this downtime if at all possible.
>
> Thanks! Any insight is appreciated!
>
> Best,
>
> Aaron Levin
>
>
>

Re: Graceful Task Manager Termination and Replacement

Posted by Paul Lam <pa...@gmail.com>.

Hi,

Maybe region restart strategy can help. It restarts minimum required tasks. Note that it’s recommended to use only after 1.9 release, see [1], unless you’re running a stateless job.

[1] https://issues.apache.org/jira/browse/FLINK-10712 <https://issues.apache.org/jira/browse/FLINK-10712>

Best,
Paul Lam

> 在 2019年7月12日，03:38，Aaron Levin <aa...@stripe.com> 写道：
> 
> Hello,
> 
> Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.
> 
> Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.
> 
> Thanks! Any insight is appreciated!
> 
> Best,
> 
> Aaron Levin