Posted to dev@flink.apache.org by Weihua Hu <hu...@gmail.com> on 2022/11/04 08:45:35 UTC

[DISCUSS]Introduce a time-segment based restart strategy

Hi, everyone

I'd like to bring up a discussion about restart strategies. Flink supports 3
kinds of restart strategies. These work very well for jobs with specific
configs, but for platform users who manage hundreds of jobs, there is no
common strategy that fits all of them.

Let me explain the reason. We manage a lot of jobs: some are
keyby-connected with one region per job, others are rescale-connected with
many regions per job. When using the failure-rate restart strategy, we
cannot achieve the same control over both with the same configuration.

For example, if I want the job to fail when there are 3 exceptions within 5
minutes, the config would look like this:

> restart-strategy.failure-rate.max-failures-per-interval: 3
>
> restart-strategy.failure-rate.failure-rate-interval: 5 min
>
For the keyby-connected job, this config works well.

However, for the rescale-connected job, we need to consider the number of
regions and the number of slots per TaskManager. If each TM has 3 slots,
and these 3 slots run tasks of 3 different regions, then when one TaskManager
crashes, it triggers 3 regions to fail, and the job fails because this
exceeds the threshold of the restart strategy. To avoid the effect of a
single TM crash, I must increase max-failures-per-interval to 9, but after
that change, user task exceptions will be tolerated far more than I want.


Therefore, I want to introduce a new restart strategy based on time
segments. A continuous period of time (e.g., 5 minutes) is divided into
segments of a specific length (e.g., 1 minute). If an exception occurs
within a segment (no matter how many times), it is marked as a failed
segment. Similar to the failure-rate restart strategy, the job fails when
there are 'm' failed segments within an interval of 'n'.
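
To make the idea concrete, here is a minimal sketch of the counting logic
(the class and the names in it are made up for illustration, not actual
Flink code):

    import java.util.TreeSet;

    // Illustrative sketch of the proposed time-segment counting.
    class TimeSegmentFailureCounter {
        private final long segmentLengthMillis; // segment length, e.g. 1 minute
        private final long intervalMillis;      // whole interval 'n', e.g. 5 minutes
        private final int maxFailedSegments;    // threshold 'm'
        private final TreeSet<Long> failedSegments = new TreeSet<>();

        TimeSegmentFailureCounter(long segmentLengthMillis, long intervalMillis, int maxFailedSegments) {
            this.segmentLengthMillis = segmentLengthMillis;
            this.intervalMillis = intervalMillis;
            this.maxFailedSegments = maxFailedSegments;
        }

        // Records a failure; returns true if the job should fail instead of restarting.
        boolean recordFailure(long nowMillis) {
            long segment = nowMillis / segmentLengthMillis;
            failedSegments.add(segment); // repeated failures mark the segment only once
            long oldestInInterval = segment - intervalMillis / segmentLengthMillis + 1;
            failedSegments.headSet(oldestInInterval).clear(); // forget segments outside 'n'
            return failedSegments.size() >= maxFailedSegments;
        }
    }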

In this mode, the keyby-connected and rescale-connected jobs can use
unified configurations.
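
For illustration, the configuration could look like this (the option names
are tentative, not existing Flink options):

> restart-strategy: time-segment
>
> restart-strategy.time-segment.max-failed-segments: 3
>
> restart-strategy.time-segment.segment-length: 1 min
>
> restart-strategy.time-segment.interval: 5 min
>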

This is a user-facing change, so if you think it is worth doing, maybe
I can create a FLIP to describe it in detail.
Best,
Weihua

Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by weijie guo <gu...@gmail.com>.
Hi, all,

Thank you very much for this interesting discussion.

TBH, Dong's proposal makes me very excited. Our users wouldn't need to be
tortured by choosing the right one among many strategies.

However, as Gen said, it may require some changes to the
RestartBackoffTimeStrategy interface, as the interface has no notion of
pipelined regions now, even though it would be easy to pass the required
parameters (the involved regions) to it. Personally, I think it is very
important to keep the protocol simple and to maintain each component's own
responsibilities. The pipelined region should not be known to the
RestartBackoffTimeStrategy, at least at present; I see no need to make this
change before seeing more benefits.

Based on the above reasons and the other factors he mentioned, +1 for Gen's
proposal.

Best regards,

Weijie



Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Dong Lin <li...@gmail.com>.
Hi Gen,

Thanks for the suggestions!

Regarding how to implement the per-region RestartBackoffTimeStrategy as
proposed previously, I think your approach works well.

Here are more details:
- Keep the RestartBackoffTimeStrategy interface API unchanged and only
change its semantics, such that all strategies (e.g. failure rate, fixed
delay, exponential delay) are applied per region.
- Update ExecutionFailureHandler to create one
RestartBackoffTimeStrategy instance for each region.
ExecutionFailureHandler can get the region information from its
SchedulingTopology.
- ExecutionFailureHandler::getFailureHandlingResult() will use the strategy
instance for the given failedTask's region to make the failover decision.
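
A minimal sketch of this wiring (simplified stand-ins rather than the actual
Flink classes; the real RestartBackoffTimeStrategy lives in flink-runtime):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Supplier;

    // Stand-in for Flink's RestartBackoffTimeStrategy interface.
    interface RestartBackoffTimeStrategy {
        boolean canRestart();
        long getBackoffTime();
        void notifyFailure(Throwable cause);
    }

    // Sketch: the failure handler keeps one strategy instance per region.
    class PerRegionFailureHandling {
        private final Map<Object, RestartBackoffTimeStrategy> strategyPerRegion = new HashMap<>();
        private final Supplier<RestartBackoffTimeStrategy> strategyFactory;

        PerRegionFailureHandling(Supplier<RestartBackoffTimeStrategy> strategyFactory) {
            this.strategyFactory = strategyFactory;
        }

        // 'region' would be the SchedulingPipelinedRegion of the failed task,
        // obtained from the SchedulingTopology.
        boolean handleFailure(Object region, Throwable cause) {
            RestartBackoffTimeStrategy strategy =
                    strategyPerRegion.computeIfAbsent(region, r -> strategyFactory.get());
            strategy.notifyFailure(cause);
            return strategy.canRestart();
        }
    }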

Please see the other comment inline.

Regards,
Dong

On Fri, Nov 25, 2022 at 7:42 PM Gen Luo <lu...@gmail.com> wrote:

> On the other hand, I realize another case where the restart strategy may
> need to be aware of the types and occurrence rate of the exceptions: to
> avoid failing over and instead directly fail the job when certain errors
> happen. I know that there's an annotation
> `@ThrowableAnnotation(ThrowableType.NonRecoverableError)` that can fail the
> job, but I'm afraid there can be scenarios where one cannot annotate the
> exceptions, or catch and wrap them with an annotated exception.
>

While this is possible, do you have a concrete use case that cannot catch
the exceptions and wrap them with an annotated one? It is probably safer to
only add a new strategy (which is a public API) when we are sure we need it :)


Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Gen Luo <lu...@gmail.com>.
Hi all,

Sorry for the late jumping in.

To meet Weihua's need, Dong's proposal seems pretty fine, but the
modification it requires, I'm afraid, is not really easy.
RestartBackoffTimeStrategy is quite a simple interface. The strategy doesn't
even know which task is failing, not to mention the division of pipelined
regions.
To distinguish the failure count of each region, it lacks too much
information, which is not easy for the strategy to acquire.
One approach I can figure out is to create different strategy instances for
different regions. In this way we do not need to modify the strategy, but we
do need to modify the schedulers or the ExecutionFailureHandler.

On the other hand, I realize another case where the restart strategy may
need to be aware of the types and occurrence rate of the exceptions: to
avoid failing over and instead directly fail the job when certain errors
happen. I know that there's an annotation
`@ThrowableAnnotation(ThrowableType.NonRecoverableError)` that can fail the
job, but I'm afraid there can be scenarios where one cannot annotate the
exceptions, or catch and wrap them with an annotated exception.
In such cases, handling this in the restart strategy can be a good choice.
Such a strategy could even be combined with the existing strategies, which
handle the failure rate rather than the cause type.
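
For context, annotating a custom exception looks roughly like this (the
exception class here is hypothetical; the annotation and enum come from
flink-runtime's org.apache.flink.runtime.throwable package):

    import org.apache.flink.runtime.throwable.ThrowableAnnotation;
    import org.apache.flink.runtime.throwable.ThrowableType;

    // Hypothetical user exception; the annotation tells Flink's failure
    // handling to fail the job directly instead of restarting it.
    @ThrowableAnnotation(ThrowableType.NonRecoverableError)
    public class UnrecoverableUserException extends RuntimeException {
        public UnrecoverableUserException(String message, Throwable cause) {
            super(message, cause);
        }
    }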

Besides, given that new strategies may be necessary, and existing
strategies may also need to be enhanced, maybe we should make the
RestartBackoffTimeStrategy a plugin rather than an enumeration, or
introduce a new custom strategy type which can load customized
implementations.
This cannot solve the problem immediately, but it makes the choice of
restart strategy more flexible.
What do you think about this?

Thanks.


Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Paul Lam <pa...@gmail.com>.
Dong’s proposal LGTM.

Best,
Paul Lam



Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Dong Lin <li...@gmail.com>.
Hey Weihua,

Thanks for proposing the new strategy!

If I understand correctly, the main issue is that different failover
regions can be restarted independently, but they share the same counter
when counting the number of failures in an interval. So the number of
failures a single region can incur before the job fails is less than what
users expect.

Given that regions can be restarted independently, it might be more usable
and intuitive to count the number of failures for each region when
executing the failover strategy. Thus, instead of adding a new failover
strategy, how about we update the existing failure-rate strategy, and
probably other existing strategies as well, to use the following semantics:

- For any given region in the job, its number of failures in
failure-rate-interval should not exceed max-failures-per-interval.
Otherwise, the job will fail without being restarted.
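
A rough sketch of such per-region counting (illustrative only, not actual
Flink code):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Sliding-window failure counter, kept separately per region.
    class PerRegionFailureRateCounter {
        private final long intervalMillis;        // failure-rate-interval
        private final int maxFailuresPerInterval; // max-failures-per-interval
        private final Map<Object, Deque<Long>> failureTimes = new HashMap<>();

        PerRegionFailureRateCounter(long intervalMillis, int maxFailuresPerInterval) {
            this.intervalMillis = intervalMillis;
            this.maxFailuresPerInterval = maxFailuresPerInterval;
        }

        // Records a failure for 'region'; returns true if that region's
        // failures exceed the threshold, i.e. the job should fail.
        boolean exceedsThreshold(Object region, long nowMillis) {
            Deque<Long> times = failureTimes.computeIfAbsent(region, r -> new ArrayDeque<>());
            times.addLast(nowMillis);
            while (!times.isEmpty() && nowMillis - times.peekFirst() > intervalMillis) {
                times.pollFirst(); // drop failures outside the interval
            }
            return times.size() > maxFailuresPerInterval;
        }
    }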

By using this updated semantics, the keyby-connected job will have the same
behavior as in existing Flink when the failure-rate strategy is used. For the
rescale-connected job, in the case you described above, after the TM fails,
each of the 3 regions will increment its failure count from 0 to 1, which is
still less than max-failures-per-interval. Thus the rescale-connected job can
continue to work.

This alternative approach can solve the problem without increasing the
complexity of the failover strategy choice. And this approach does not
require us to check whether two exceptions belong to the same root cause.
Do you think it can work?

Thanks,
Dong



Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Weihua Hu <hu...@gmail.com>.
Hi @Paul Lam, thanks for the reply. I think it makes a lot of sense to
distinguish exceptions, but it might add complexity to restart-policy
maintenance, and some exceptions might be wrapped in a
FlinkRuntimeException or something else.
Maybe we can implement the first version based on time segments.


Can someone grant me edit permissions so that I can create a new FLIP?

Best,
Weihua



Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Paul Lam <pa...@gmail.com>.
In addition, there’s another viable alternative strategy that could be
applied with or without the proposed strategy.

We could group the exceptions that occur in an interval by exception
class. Only a distinct exception class within an interval is counted as one
failure.

The upside is that it’s more fine-grained and wouldn’t increase
unnecessary retry time if the job fails due to different causes.
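
A rough sketch of the grouping idea (a hypothetical helper, not existing
Flink code):

    import java.util.HashSet;
    import java.util.Set;

    // Counts each distinct root-cause exception class at most once per interval.
    class DistinctCauseCounter {
        private final Set<Class<? extends Throwable>> seenCauses = new HashSet<>();

        // Returns true if this cause class is new within the current interval,
        // i.e. it should be counted as one failure.
        boolean countAsFailure(Throwable cause) {
            Throwable root = cause;
            while (root.getCause() != null && root.getCause() != root) {
                root = root.getCause(); // unwrap to the root cause
            }
            return seenCauses.add(root.getClass());
        }

        // Called when the failure-rate interval rolls over.
        void reset() {
            seenCauses.clear();
        }
    }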

Best,
Paul Lam


Re: [DISCUSS]Introduce a time-segment based restart strategy

Posted by Paul Lam <pa...@gmail.com>.
Hi Weihua,

+1 for the new restart strategy you suggested.

We’re also using the failure-rate strategy as the cluster-wide default and
faced the same problem, which we solved with a similar approach.

FYI, we added a freeze-period config option to the failure-rate strategy.
The freeze period prevents counting further errors after the first
failure happens, so that a burst of errors would not exhaust the
number of allowed failures.
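
Roughly, the freeze-period counting works like this (a sketch of our
internal extension; this option does not exist in upstream Flink):

    // Failures within the freeze period after a counted failure are
    // ignored for the failure-rate accounting.
    class FreezePeriodCounter {
        private final long freezePeriodMillis;
        private Long lastCountedMillis = null;
        private int countedFailures = 0;

        FreezePeriodCounter(long freezePeriodMillis) {
            this.freezePeriodMillis = freezePeriodMillis;
        }

        void onFailure(long nowMillis) {
            if (lastCountedMillis == null
                    || nowMillis - lastCountedMillis >= freezePeriodMillis) {
                countedFailures++;              // count only the first failure of a burst
                lastCountedMillis = nowMillis;  // start a new freeze window
            }
        }

        int countedFailures() {
            return countedFailures;
        }
    }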

Best,
Paul Lam
