Posted to user@flink.apache.org by Zhu Zhu <re...@gmail.com> on 2019/09/12 11:11:13 UTC

[SURVEY] How many people are using customized RestartStrategy(s)

Hi everyone,

I wanted to reach out to you and ask how many of you are using a customized
RestartStrategy[1] in production jobs.

We are currently developing the new Flink scheduler[2], which interacts
with restart strategies in a different way. We have to redesign the
interfaces for the new restart strategies (the so-called
RestartBackoffTimeStrategy). Existing customized RestartStrategy
implementations will no longer work with the new scheduler.
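
To give a rough idea of the shape such a strategy could take, here is a
minimal sketch. It assumes a strategy that is only notified of failures,
asked whether a restart is allowed, and asked how long to wait. The names
BackoffTimeStrategySketch and FailureRateSketch are made up for
illustration; they are not the actual Flink interfaces, whose methods may
differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical interface for illustration only; not the actual Flink API.
interface BackoffTimeStrategySketch {
    boolean canRestart();

    long getBackoffTimeMillis();

    void notifyFailure(Throwable cause);
}

// Failure-rate policy: restarts are allowed as long as no more than
// maxFailuresPerInterval failures were seen within the last intervalMillis.
final class FailureRateSketch implements BackoffTimeStrategySketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final long backoffMillis;
    private final LongSupplier clock; // injected so the policy can be tested
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis,
                      long backoffMillis, LongSupplier clock) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
        this.backoffMillis = backoffMillis;
        this.clock = clock;
    }

    @Override
    public boolean canRestart() {
        evictOlderThanInterval(clock.getAsLong());
        return failureTimes.size() <= maxFailuresPerInterval;
    }

    @Override
    public long getBackoffTimeMillis() {
        return backoffMillis;
    }

    @Override
    public void notifyFailure(Throwable cause) {
        long now = clock.getAsLong();
        evictOlderThanInterval(now);
        failureTimes.addLast(now);
    }

    // Drop failure timestamps that fall outside the measuring interval.
    private void evictOlderThanInterval(long now) {
        while (!failureTimes.isEmpty()
                && now - failureTimes.peekFirst() > intervalMillis) {
            failureTimes.removeFirst();
        }
    }
}
```

In this sketch a scheduler would call notifyFailure for each failure, then
check canRestart and wait getBackoffTimeMillis before restarting. Unlike the
existing RestartStrategy, the strategy itself would not trigger the restart.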

We want to know whether we should keep a way to plug in a customized
RestartBackoffTimeStrategy, so that existing customized RestartStrategy
implementations can be migrated.

I'd appreciate it if you could share your situation if you are using a
customized RestartStrategy. That will be valuable for us in making decisions.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
[2] https://issues.apache.org/jira/browse/FLINK-10429

Thanks,
Zhu Zhu

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
We will then stick with the decision not to support customized restart
strategies in Flink 1.10.

Thanks, Steven, for the input!

Thanks,
Zhu Zhu

On Thu, Sep 26, 2019 at 12:13 AM Steven Wu <st...@gmail.com> wrote:

> Zhu Zhu, that is correct.
>
> On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> Hi Steven,
>>
>> To conclude: since we will have a meter metric[1] for restarts, a
>> customized restart strategy is not needed in your case.
>> Is that right?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <st...@gmail.com> wrote:
>>
>>> Zhu Zhu,
>>>
>>> Sorry, I was using different terminology. Yes, the Flink meter is what I was
>>> talking about regarding "fullRestarts" for threshold-based alerting.
>>>
>>> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> Steven,
>>>>
>>>> As I understand it, a Flink counter only stores its accumulated count and
>>>> reports that value. Are you using an external counter directly?
>>>> Maybe Flink's Meter/MeterView is what you need? It stores the count and
>>>> calculates the rate, and it reports both its "count" and its "rate" to
>>>> external metric services.
>>>>
>>>> The counter "task_failures" only works if the individual failover
>>>> strategy is enabled. However, it is not a public interface and its use
>>>> is not recommended, as fine-grained recovery (region failover) now
>>>> supersedes it.
>>>> I've opened a ticket[1] to add a metric that shows failovers and respects
>>>> fine-grained recovery.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <st...@gmail.com> wrote:
>>>>
>>>>>
>>>>> When we set up an alert like "fullRestarts > 1" for some rolling window,
>>>>> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
>>>>> below 1 after the first full restart, so the alert condition will always
>>>>> be true after the first job restart. If we can apply a derivative to the
>>>>> Gauge value, I guess the alert can probably work. I can explore whether
>>>>> that is an option.
>>>>>
>>>>> Yeah. Understood that "fullRestarts" won't increment when fine-grained
>>>>> recovery happens. I think a "task_failures" counter already exists in Flink.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>
>>>>>> Steven,
>>>>>>
>>>>>> Thanks for the information. If we can determine that this is a common
>>>>>> issue, we can solve it in Flink core.
>>>>>> To get there, I have two questions I need your help with:
>>>>>> 1. Why is a gauge not good for alerting? The metric "fullRestarts" is a
>>>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>>>> Gauge<Long> to external services in different ways? Or does anything
>>>>>> else differ depending on the metric type?
>>>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>>>> the "fullRestarts" count? If so, I believe we will have such a counter
>>>>>> metric in 1.10, since the existing "fullRestarts" metric value is not the
>>>>>> number of restarts when fine-grained recovery (added in 1.9.0) is enabled.
>>>>>>     "fullRestarts" reveals how many times the entire job graph has been
>>>>>> restarted. If fine-grained recovery is enabled, the graph would not be
>>>>>> restarted when task failures happen, and the "fullRestarts" value will
>>>>>> not increment in such cases.
>>>>>>
>>>>>> I'd appreciate it if you could help with these questions so that we can
>>>>>> make better decisions for Flink.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <st...@gmail.com> wrote:
>>>>>>
>>>>>>> Zhu Zhu,
>>>>>>>
>>>>>>> Flink's fullRestarts metric is a Gauge, which is not good for alerting
>>>>>>> on. We publish an equivalent Counter metric for alerting purposes.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Steven for the feedback!
>>>>>>>> Could you share more information about the metrics you add in your
>>>>>>>> customized restart strategy?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <st...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We do use a config like "restart-strategy:
>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>>>>>> beyond the ones Flink provides.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks everyone for the input.
>>>>>>>>>>
>>>>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>>>>> interface as it is not explicitly documented.
>>>>>>>>>> As the feedback from this survey indicates it is not used, I'll
>>>>>>>>>> conclude that we do not need to support customized RestartStrategy for the
>>>>>>>>>> new scheduler in Flink 1.10.
>>>>>>>>>>
>>>>>>>>>> Other usages are still supported, including all the strategies
>>>>>>>>>> and the ways of configuring them described in
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>>>
>>>>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>>>> themselves and use it via a configuration like "restart-strategy:
>>>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>>>
>>>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>>>> with the new scheduler.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zhu,
>>>>>>>>>>>>
>>>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>>>
>>>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> Oytun Tez
>>>>>>>>>>>>
>>>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using
>>>>>>>>>>>>> a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>>>>> interacts with restart strategies in a different way. We have to redesign
>>>>>>>>>>>>> the interfaces for the new restart strategies (the so-called
>>>>>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>>>>>>>>> implementations will no longer work with the new scheduler.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We want to know whether we should keep a way to plug in a
>>>>>>>>>>>>> customized RestartBackoffTimeStrategy, so that existing customized
>>>>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd appreciate it if you could share your situation if you are
>>>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for us in making
>>>>>>>>>>>>> decisions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Steven Wu <st...@gmail.com>.
Zhu Zhu, that is correct.

On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu <re...@gmail.com> wrote:

> Hi Steven,
>
> As a conclusion, since we will have a meter metric[1] for restarts,
> customized restart strategy is not needed in your case.
> Is that right?
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu
>
> On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <st...@gmail.com> wrote:
>
>> Zhu Zhu,
>>
>> Sorry, I was using different terminology. yes, Flink meter is what I was
>> talking about regarding "fullRestarts" for threshold based alerting.
>>
>> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> Steven,
>>>
>>> In my mind, Flink counter only stores its accumulated count and reports
>>> that value. Are you using an external counter directly?
>>> Maybe Flink Meter/MeterView is what you need? It stores the count and
>>> calculates the rate. And it will report its "count" as well as "rate" to
>>> external metric services.
>>>
>>> The counter "task_failures" only works if the individual failover
>>> strategy is enabled. However, it is not a public interface and is not
>>> suggested to use, as the fine grained recovery (region failover) now
>>> supersedes it.
>>> I've opened a ticket[1] to add a metric to show failovers that respects
>>> fine grained recovery.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <st...@gmail.com> wrote:
>>>
>>>>
>>>> When we setup alert like "fullRestarts > 1" for some rolling window, we
>>>> want to use counter. if it is a Gauge, "fullRestarts" will never go below 1
>>>> after a first full restart. So alert condition will always be true after
>>>> first job restart. If we can apply a derivative to the Gauge value, I guess
>>>> alert can probably work. I can explore if that is an option or not.
>>>>
>>>> Yeah. Understood that "fullRestart" won't increment when fine grained
>>>> recovery happened. I think "task_failures" counter already exists in Flink.
>>>>
>>>>
>>>>
>>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>
>>>>> Steven,
>>>>>
>>>>> Thanks for the information. If we can determine this a common issue,
>>>>> we can solve it in Flink core.
>>>>> To get to that state, I have two questions which need your help:
>>>>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>>> Gauge<Long> to external services in different ways? Or anything else can be
>>>>> different due to the metric type?
>>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>>> the "fullRestart" count? If so, I believe we will have such a counter
>>>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>>>> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>>>>>     "fullRestart" reveals how many times entire job graph has been
>>>>> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
>>>>> would not be restarted when task failures happen and the "fullRestart"
>>>>> value will not increment in such cases.
>>>>>
>>>>> I'd appreciate if you can help with these questions and we can make
>>>>> better decisions for Flink.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <st...@gmail.com> wrote:
>>>>>
>>>>>> Zhu Zhu,
>>>>>>
>>>>>> Flink fullRestart metric is a Gauge, which is not good for alerting
>>>>>> on. We publish an equivalent Counter metric for alerting purpose.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Steven for the feedback!
>>>>>>> Could you share more information about the metrics you add in your
>>>>>>> customized restart strategy?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <st...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We do use config like "restart-strategy:
>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>>>>>>> metrics than the Flink provided ones.
>>>>>>>>
>>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks everyone for the input.
>>>>>>>>>
>>>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>>>> interface as it is not explicitly documented.
>>>>>>>>> As it is not used from the feedbacks of this survey, I'll conclude
>>>>>>>>> that we do not need to support customized RestartStrategy for the new
>>>>>>>>> scheduler in Flink 1.10
>>>>>>>>>
>>>>>>>>> Other usages are still supported, including all the strategies and
>>>>>>>>> configuring ways described in
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>>
>>>>>>>>>> Sorry for not having stated it clearly. When saying "customized
>>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>>> by themselves and use it by configuring like "restart-strategy:
>>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>>
>>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>>> with the new scheduler.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Zhu,
>>>>>>>>>>>
>>>>>>>>>>> We are using custom restart strategy like this:
>>>>>>>>>>>
>>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Oytun Tez
>>>>>>>>>>>
>>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using
>>>>>>>>>>>> a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>>
>>>>>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>>>>>>> work any more with the new scheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> We want to know whether we should keep the way
>>>>>>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd appreciate it if you could share your situation if you are
>>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for us in making
>>>>>>>>>>>> decisions.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>>
>>>>>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Hi Steven,

To conclude: since we will have a meter metric[1] for restarts, a
customized restart strategy is not needed in your case.
Is that right?

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <st...@gmail.com> wrote:

> Zhu Zhu,
>
> Sorry, I was using different terminology. yes, Flink meter is what I was
> talking about regarding "fullRestarts" for threshold based alerting.
>
> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> Steven,
>>
>> In my mind, Flink counter only stores its accumulated count and reports
>> that value. Are you using an external counter directly?
>> Maybe Flink Meter/MeterView is what you need? It stores the count and
>> calculates the rate. And it will report its "count" as well as "rate" to
>> external metric services.
>>
>> The counter "task_failures" only works if the individual failover
>> strategy is enabled. However, it is not a public interface and is not
>> suggested to use, as the fine grained recovery (region failover) now
>> supersedes it.
>> I've opened a ticket[1] to add a metric to show failovers that respects
>> fine grained recovery.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <st...@gmail.com> wrote:
>>
>>>
>>> When we setup alert like "fullRestarts > 1" for some rolling window, we
>>> want to use counter. if it is a Gauge, "fullRestarts" will never go below 1
>>> after a first full restart. So alert condition will always be true after
>>> first job restart. If we can apply a derivative to the Gauge value, I guess
>>> alert can probably work. I can explore if that is an option or not.
>>>
>>> Yeah. Understood that "fullRestart" won't increment when fine grained
>>> recovery happened. I think "task_failures" counter already exists in Flink.
>>>
>>>
>>>
>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> Steven,
>>>>
>>>> Thanks for the information. If we can determine this a common issue, we
>>>> can solve it in Flink core.
>>>> To get to that state, I have two questions which need your help:
>>>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>> Gauge<Long> to external services in different ways? Or anything else can be
>>>> different due to the metric type?
>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>> the "fullRestart" count? If so, I believe we will have such a counter
>>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>>> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>>>>     "fullRestart" reveals how many times entire job graph has been
>>>> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
>>>> would not be restarted when task failures happen and the "fullRestart"
>>>> value will not increment in such cases.
>>>>
>>>> I'd appreciate if you can help with these questions and we can make
>>>> better decisions for Flink.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <st...@gmail.com> wrote:
>>>>
>>>>> Zhu Zhu,
>>>>>
>>>>> Flink fullRestart metric is a Gauge, which is not good for alerting
>>>>> on. We publish an equivalent Counter metric for alerting purpose.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Steven for the feedback!
>>>>>> Could you share more information about the metrics you add in you
>>>>>> customized restart strategy?
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <st...@gmail.com> wrote:
>>>>>>
>>>>>>> We do use config like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>>>>>> metrics than the Flink provided ones.
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks everyone for the input.
>>>>>>>>
>>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>>> interface as it is not explicitly documented.
>>>>>>>> As it is not used from the feedbacks of this survey, I'll conclude
>>>>>>>> that we do not need to support customized RestartStrategy for the new
>>>>>>>> scheduler in Flink 1.10
>>>>>>>>
>>>>>>>> Other usages are still supported, including all the strategies and
>>>>>>>> configuring ways described in
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>> .
>>>>>>>>
>>>>>>>> Feel free to share in this thread if you has any concern for it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>
>>>>>>>>> Sorry for not have stated it clearly. When saying "customized
>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>> by themselves and use it by configuring like "restart-strategy:
>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>
>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>> with the new scheduler.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Zhu,
>>>>>>>>>>
>>>>>>>>>> We are using custom restart strategy like this:
>>>>>>>>>>
>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Oytun Tez
>>>>>>>>>>
>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>
>>>>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>>>>>> work any more with the new scheduler.
>>>>>>>>>>>
>>>>>>>>>>> We want to know whether we should keep the way
>>>>>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>>>>
>>>>>>>>>>> I'd appreciate if you can share the status if you are
>>>>>>>>>>> using customized RestartStrategy. That will be valuable for use to make
>>>>>>>>>>> decisions.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>
>>>>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Steven Wu <st...@gmail.com>.
Zhu Zhu,

Sorry, I was using different terminology. Yes, the Flink Meter is what I was
talking about regarding "fullRestarts" for threshold-based alerting.

On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <re...@gmail.com> wrote:

> Steven,
>
> In my mind, Flink counter only stores its accumulated count and reports
> that value. Are you using an external counter directly?
> Maybe Flink Meter/MeterView is what you need? It stores the count and
> calculates the rate. And it will report its "count" as well as "rate" to
> external metric services.
>
> The counter "task_failures" only works if the individual failover strategy
> is enabled. However, it is not a public interface and is not suggested to
> use, as the fine grained recovery (region failover) now supersedes it.
> I've opened a ticket[1] to add a metric to show failovers that respects
> fine grained recovery.
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu
>
> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <st...@gmail.com> wrote:
>
>>
>> When we setup alert like "fullRestarts > 1" for some rolling window, we
>> want to use counter. if it is a Gauge, "fullRestarts" will never go below 1
>> after a first full restart. So alert condition will always be true after
>> first job restart. If we can apply a derivative to the Gauge value, I guess
>> alert can probably work. I can explore if that is an option or not.
>>
>> Yeah. Understood that "fullRestart" won't increment when fine grained
>> recovery happened. I think "task_failures" counter already exists in Flink.
>>
>>
>>
>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> Steven,
>>>
>>> Thanks for the information. If we can determine this a common issue, we
>>> can solve it in Flink core.
>>> To get to that state, I have two questions which need your help:
>>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>> Gauge<Long> to external services in different ways? Or anything else can be
>>> different due to the metric type?
>>> 2. Is the "number of restarts" what you actually need, rather than
>>> the "fullRestart" count? If so, I believe we will have such a counter
>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>>>     "fullRestart" reveals how many times entire job graph has been
>>> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
>>> would not be restarted when task failures happen and the "fullRestart"
>>> value will not increment in such cases.
>>>
>>> I'd appreciate if you can help with these questions and we can make
>>> better decisions for Flink.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <st...@gmail.com> wrote:
>>>
>>>> Zhu Zhu,
>>>>
>>>> Flink fullRestart metric is a Gauge, which is not good for alerting on.
>>>> We publish an equivalent Counter metric for alerting purpose.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>
>>>>> Thanks Steven for the feedback!
>>>>> Could you share more information about the metrics you add in you
>>>>> customized restart strategy?
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <st...@gmail.com> wrote:
>>>>>
>>>>>> We do use config like "restart-strategy:
>>>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>>>>> metrics than the Flink provided ones.
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks everyone for the input.
>>>>>>>
>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>> interface as it is not explicitly documented.
>>>>>>> As it is not used from the feedbacks of this survey, I'll conclude
>>>>>>> that we do not need to support customized RestartStrategy for the new
>>>>>>> scheduler in Flink 1.10
>>>>>>>
>>>>>>> Other usages are still supported, including all the strategies and
>>>>>>> configuring ways described in
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>> .
>>>>>>>
>>>>>>> Feel free to share in this thread if you has any concern for it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>
>>>>>>>> Sorry for not have stated it clearly. When saying "customized
>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>> by themselves and use it by configuring like "restart-strategy:
>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>
>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>> with the new scheduler.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Zhu,
>>>>>>>>>
>>>>>>>>> We are using custom restart strategy like this:
>>>>>>>>>
>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Oytun Tez
>>>>>>>>>
>>>>>>>>> *M O T A W O R D*
>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>>
>>>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>>>>> work any more with the new scheduler.
>>>>>>>>>>
>>>>>>>>>> We want to know whether we should keep the way
>>>>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>>>
>>>>>>>>>> I'd appreciate if you can share the status if you are
>>>>>>>>>> using customized RestartStrategy. That will be valuable for use to make
>>>>>>>>>> decisions.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Steven,

As I understand it, a Flink Counter only stores its accumulated count and
reports that value. Are you using an external counter directly?
Maybe a Flink Meter/MeterView is what you need? It stores the count and
calculates the rate, and it reports both its "count" and its "rate" to
external metric services.
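
To illustrate the count/rate distinction described above, here is a minimal,
self-contained Java sketch. It is not Flink's MeterView; the class
RestartMeterSketch and its methods (markRestart, tick, restartsInWindow,
ratePerTick) are hypothetical names made up for illustration, and the fixed
"tick" sampling interval is an assumption. It keeps a cumulative count and
derives a windowed value from periodic samples of that count:

```java
// Illustrative sketch only: NOT Flink's MeterView, just the same general idea
// of "store the cumulative count, derive a rate from periodic samples".
final class RestartMeterSketch {
    private final long[] samples;  // ring buffer of cumulative counts, one per tick
    private final int windowTicks; // number of ticks the rate is computed over
    private long count;            // cumulative number of restarts, never reset
    private int head;              // index of the most recent sample

    RestartMeterSketch(int windowTicks) {
        if (windowTicks <= 0) {
            throw new IllegalArgumentException("windowTicks must be positive");
        }
        this.windowTicks = windowTicks;
        // One extra slot so the oldest sample is exactly windowTicks ticks old.
        this.samples = new long[windowTicks + 1];
    }

    /** Records one restart; only the cumulative count changes. */
    void markRestart() {
        count++;
    }

    /** Called once per (assumed) fixed interval to sample the cumulative count. */
    void tick() {
        head = (head + 1) % samples.length;
        samples[head] = count;
    }

    /** Cumulative value: stays at or above 1 forever after the first restart. */
    long getCount() {
        return count;
    }

    /** Restarts seen between the newest sample and the one windowTicks earlier. */
    long restartsInWindow() {
        int oldest = (head + 1) % samples.length;
        return samples[head] - samples[oldest];
    }

    /** Average restarts per tick over the window. */
    double ratePerTick() {
        return (double) restartsInWindow() / windowTicks;
    }
}
```

With a window of 3 ticks, one restart followed by four ticks leaves getCount()
at 1 while restartsInWindow() drops back to 0, which is the behaviour a
threshold alert over a rolling window relies on.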

The counter "task_failures" only works if the individual failover strategy
is enabled. However, it is not a public interface and using it is not
recommended, since fine-grained recovery (region failover) now supersedes it.
I've opened a ticket[1] to add a metric that shows failovers and respects
fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <st...@gmail.com> wrote:

>
> When we setup alert like "fullRestarts > 1" for some rolling window, we
> want to use counter. if it is a Gauge, "fullRestarts" will never go below 1
> after a first full restart. So alert condition will always be true after
> first job restart. If we can apply a derivative to the Gauge value, I guess
> alert can probably work. I can explore if that is an option or not.
>
> Yeah. Understood that "fullRestart" won't increment when fine grained
> recovery happened. I think "task_failures" counter already exists in Flink.
>
>
>
> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> Steven,
>>
>> Thanks for the information. If we can determine this a common issue, we
>> can solve it in Flink core.
>> To get to that state, I have two questions which need your help:
>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>> Gauge<Long>. Does the metric reporter you use report Counter and
>> Gauge<Long> to external services in different ways? Or anything else can be
>> different due to the metric type?
>> 2. Is the "number of restarts" what you actually need, rather than
>> the "fullRestart" count? If so, I believe we will have such a counter
>> metric in 1.10, since the previous "fullRestart" metric value is not the
>> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>>     "fullRestart" reveals how many times entire job graph has been
>> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
>> would not be restarted when task failures happen and the "fullRestart"
>> value will not increment in such cases.
>>
>> I'd appreciate if you can help with these questions and we can make
>> better decisions for Flink.
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <st...@gmail.com> wrote:
>>
>>> Zhu Zhu,
>>>
>>> Flink fullRestart metric is a Gauge, which is not good for alerting on.
>>> We publish an equivalent Counter metric for alerting purpose.
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> Thanks Steven for the feedback!
>>>> Could you share more information about the metrics you add in you
>>>> customized restart strategy?
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <st...@gmail.com> wrote:
>>>>
>>>>> We do use config like "restart-strategy:
>>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>>>> metrics than the Flink provided ones.
>>>>>
>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>
>>>>>> Thanks everyone for the input.
>>>>>>
>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>> interface as it is not explicitly documented.
>>>>>> As it is not used from the feedbacks of this survey, I'll conclude
>>>>>> that we do not need to support customized RestartStrategy for the new
>>>>>> scheduler in Flink 1.10
>>>>>>
>>>>>> Other usages are still supported, including all the strategies and
>>>>>> configuring ways described in
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>> .
>>>>>>
>>>>>> Feel free to share in this thread if you has any concern for it.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Zhu Zhu <re...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>>
>>>>>>> Thanks Oytun for the reply!
>>>>>>>
>>>>>>> Sorry for not have stated it clearly. When saying "customized
>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>> by themselves and use it by configuring like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>
>>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>>> the new scheduler.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>>
>>>>>>>> Hi Zhu,
>>>>>>>>
>>>>>>>> We are using custom restart strategy like this:
>>>>>>>>
>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Oytun Tez
>>>>>>>>
>>>>>>>> *M O T A W O R D*
>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>
>>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>>>> work any more with the new scheduler.
>>>>>>>>>
>>>>>>>>> We want to know whether we should keep the way
>>>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>>
>>>>>>>>> I'd appreciate if you can share the status if you are
>>>>>>>>> using customized RestartStrategy. That will be valuable for use to make
>>>>>>>>> decisions.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Steven,

As I understand it, a Flink Counter only stores its accumulated count and
reports that value. Are you using an external counter directly?
Maybe Flink's Meter/MeterView is what you need. It stores the count and
calculates the rate, and it reports both its "count" and its "rate" to
external metric services.
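
To make the count-versus-rate distinction concrete, here is a minimal sketch
in plain Java. It is not Flink's Meter or MeterView code; the class
MeterSketch, its method names, and the fixed sampling interval are all
assumptions made for illustration. It takes cumulative counts sampled at a
fixed interval and derives both the total count and a per-second rate over a
recent window.

```java
// Illustrative sketch only: plain Java, no Flink dependency.
// MeterSketch and its method names are hypothetical.
final class MeterSketch {

    /** Latest cumulative count; cumulativeCounts[i] is the value at sampling tick i. */
    static long count(long[] cumulativeCounts) {
        return cumulativeCounts[cumulativeCounts.length - 1];
    }

    /** Events per second over (at most) the last windowTicks sampling intervals. */
    static double ratePerSecond(long[] cumulativeCounts, int intervalSeconds, int windowTicks) {
        int newest = cumulativeCounts.length - 1;
        int oldest = Math.max(0, newest - windowTicks);
        int spanSeconds = (newest - oldest) * intervalSeconds;
        if (spanSeconds == 0) {
            return 0.0; // a single sample carries no rate information
        }
        long delta = cumulativeCounts[newest] - cumulativeCounts[oldest];
        return (double) delta / spanSeconds;
    }
}
```

With counts {0, 4, 4, 4} sampled every 5 seconds and a window of 2 ticks, the
count stays at 4 while the rate falls to 0, which is the kind of signal a
windowed alert can use.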

The "task_failures" counter only works if the individual failover strategy
is enabled. However, it is not a public interface and using it is not
recommended, as fine-grained recovery (region failover) now supersedes it.
I've opened a ticket[1] to add a metric that shows failovers and respects
fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

Steven Wu <st...@gmail.com> wrote on Tue, Sep 24, 2019 at 6:41 AM:

>
> When we setup alert like "fullRestarts > 1" for some rolling window, we
> want to use counter. if it is a Gauge, "fullRestarts" will never go below 1
> after a first full restart. So alert condition will always be true after
> first job restart. If we can apply a derivative to the Gauge value, I guess
> alert can probably work. I can explore if that is an option or not.
>
> Yeah. Understood that "fullRestart" won't increment when fine grained
> recovery happened. I think "task_failures" counter already exists in Flink.
>
>
>
> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> Steven,
>>
>> Thanks for the information. If we can determine this a common issue, we
>> can solve it in Flink core.
>> To get to that state, I have two questions which need your help:
>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>> Gauge<Long>. Does the metric reporter you use report Counter and
>> Gauge<Long> to external services in different ways? Or anything else can be
>> different due to the metric type?
>> 2. Is the "number of restarts" what you actually need, rather than
>> the "fullRestart" count? If so, I believe we will have such a counter
>> metric in 1.10, since the previous "fullRestart" metric value is not the
>> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>>     "fullRestart" reveals how many times entire job graph has been
>> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
>> would not be restarted when task failures happen and the "fullRestart"
>> value will not increment in such cases.
>>
>> I'd appreciate if you can help with these questions and we can make
>> better decisions for Flink.
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <st...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:
>>
>>> Zhu Zhu,
>>>
>>> Flink fullRestart metric is a Gauge, which is not good for alerting on.
>>> We publish an equivalent Counter metric for alerting purpose.
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> Thanks Steven for the feedback!
>>>> Could you share more information about the metrics you add in you
>>>> customized restart strategy?
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> Steven Wu <st...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>>>
>>>>> We do use config like "restart-strategy:
>>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>>>> metrics than the Flink provided ones.
>>>>>
>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>
>>>>>> Thanks everyone for the input.
>>>>>>
>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>> interface as it is not explicitly documented.
>>>>>> As it is not used from the feedbacks of this survey, I'll conclude
>>>>>> that we do not need to support customized RestartStrategy for the new
>>>>>> scheduler in Flink 1.10
>>>>>>
>>>>>> Other usages are still supported, including all the strategies and
>>>>>> configuring ways described in
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>> .
>>>>>>
>>>>>> Feel free to share in this thread if you has any concern for it.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Zhu Zhu <re...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>>
>>>>>>> Thanks Oytun for the reply!
>>>>>>>
>>>>>>> Sorry for not have stated it clearly. When saying "customized
>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>> by themselves and use it by configuring like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>
>>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>>> the new scheduler.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>>
>>>>>>>> Hi Zhu,
>>>>>>>>
>>>>>>>> We are using custom restart strategy like this:
>>>>>>>>
>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Oytun Tez
>>>>>>>>
>>>>>>>> *M O T A W O R D*
>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>
>>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>>>> work any more with the new scheduler.
>>>>>>>>>
>>>>>>>>> We want to know whether we should keep the way
>>>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>>
>>>>>>>>> I'd appreciate if you can share the status if you are
>>>>>>>>> using customized RestartStrategy. That will be valuable for use to make
>>>>>>>>> decisions.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Steven Wu <st...@gmail.com>.
When we set up an alert like "fullRestarts > 1" over some rolling window, we
want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1
after the first full restart, so the alert condition will always be true
after the first job restart. If we can apply a derivative to the Gauge value,
I guess the alert can probably work. I can explore whether that is an option.
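
As a rough illustration of the point above, the following plain-Java sketch
compares alerting on the raw cumulative value with alerting on its change
across the window (the "derivative"). The class RestartAlertSketch and its
methods are hypothetical names used only for this example; they are not part
of Flink or of any particular alerting system.

```java
// Illustrative sketch only: plain Java, no Flink dependency.
// RestartAlertSketch and its method names are hypothetical.
final class RestartAlertSketch {

    /** Fires on the latest raw value, as with a cumulative gauge that only grows. */
    static boolean alertOnRawValue(long[] windowSamples, long threshold) {
        long latest = windowSamples[windowSamples.length - 1];
        return latest > threshold;
    }

    /** Fires on the increase across the window: newest sample minus oldest sample. */
    static boolean alertOnDelta(long[] windowSamples, long threshold) {
        long delta = windowSamples[windowSamples.length - 1] - windowSamples[0];
        return delta > threshold;
    }
}
```

For a job that restarted twice long ago and has been stable since, the window
samples might be {2, 2, 2, 2}. With a threshold of 1, the raw-value check
keeps firing, while the delta check stays quiet until new restarts happen
inside the window.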

Yeah, understood that "fullRestart" won't increment when fine-grained
recovery happens. I think a "task_failures" counter already exists in Flink.



On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <re...@gmail.com> wrote:

> Steven,
>
> Thanks for the information. If we can determine this a common issue, we
> can solve it in Flink core.
> To get to that state, I have two questions which need your help:
> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
> Gauge<Long>. Does the metric reporter you use report Counter and
> Gauge<Long> to external services in different ways? Or anything else can be
> different due to the metric type?
> 2. Is the "number of restarts" what you actually need, rather than
> the "fullRestart" count? If so, I believe we will have such a counter
> metric in 1.10, since the previous "fullRestart" metric value is not the
> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>     "fullRestart" reveals how many times entire job graph has been
> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
> would not be restarted when task failures happen and the "fullRestart"
> value will not increment in such cases.
>
> I'd appreciate if you can help with these questions and we can make better
> decisions for Flink.
>
> Thanks,
> Zhu Zhu
>
> Steven Wu <st...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:
>
>> Zhu Zhu,
>>
>> Flink fullRestart metric is a Gauge, which is not good for alerting on.
>> We publish an equivalent Counter metric for alerting purpose.
>>
>> Thanks,
>> Steven
>>
>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> Thanks Steven for the feedback!
>>> Could you share more information about the metrics you add in you
>>> customized restart strategy?
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Steven Wu <st...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>>
>>>> We do use config like "restart-strategy:
>>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>>> metrics than the Flink provided ones.
>>>>
>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>
>>>>> Thanks everyone for the input.
>>>>>
>>>>> The RestartStrategy customization is not recognized as a public
>>>>> interface as it is not explicitly documented.
>>>>> As it is not used from the feedbacks of this survey, I'll conclude
>>>>> that we do not need to support customized RestartStrategy for the new
>>>>> scheduler in Flink 1.10
>>>>>
>>>>> Other usages are still supported, including all the strategies and
>>>>> configuring ways described in
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>> .
>>>>>
>>>>> Feel free to share in this thread if you has any concern for it.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> Zhu Zhu <re...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>
>>>>>> Thanks Oytun for the reply!
>>>>>>
>>>>>> Sorry for not have stated it clearly. When saying "customized
>>>>>> RestartStrategy", we mean that users implement an
>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy* by
>>>>>> themselves and use it by configuring like "restart-strategy:
>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>
>>>>>> The usage of restart strategies you mentioned will keep working with
>>>>>> the new scheduler.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>
>>>>>>> Hi Zhu,
>>>>>>>
>>>>>>> We are using custom restart strategy like this:
>>>>>>>
>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>> Oytun Tez
>>>>>>>
>>>>>>> *M O T A W O R D*
>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>
>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>>> work any more with the new scheduler.
>>>>>>>>
>>>>>>>> We want to know whether we should keep the way
>>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>
>>>>>>>> I'd appreciate if you can share the status if you are
>>>>>>>> using customized RestartStrategy. That will be valuable for use to make
>>>>>>>> decisions.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Steven,

Thanks for the information. If we can determine that this is a common issue,
we can solve it in Flink core.
To get there, I have two questions that I need your help with:
1. Why is a gauge not good for alerting? The metric "fullRestart" is a
Gauge<Long>. Does the metric reporter you use report Counter and
Gauge<Long> to external services in different ways? Or can anything else
differ because of the metric type?
2. Is the "number of restarts" what you actually need, rather than
the "fullRestart" count? If so, I believe we will have such a counter
metric in 1.10, since the "fullRestart" metric value is no longer the number
of restarts when fine-grained recovery (a feature added in 1.9.0) is enabled.
    "fullRestart" reveals how many times the entire job graph has been
restarted. If fine-grained recovery is enabled, the graph is not restarted
when task failures happen, and the "fullRestart" value does not increment in
such cases.

I'd appreciate it if you could help with these questions so that we can make
better decisions for Flink.

Thanks,
Zhu Zhu

Steven Wu <st...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:

> Zhu Zhu,
>
> Flink fullRestart metric is a Gauge, which is not good for alerting on. We
> publish an equivalent Counter metric for alerting purpose.
>
> Thanks,
> Steven
>
> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> Thanks Steven for the feedback!
>> Could you share more information about the metrics you add in you
>> customized restart strategy?
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <st...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>
>>> We do use config like "restart-strategy:
>>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>>> metrics than the Flink provided ones.
>>>
>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> Thanks everyone for the input.
>>>>
>>>> The RestartStrategy customization is not recognized as a public
>>>> interface as it is not explicitly documented.
>>>> As it is not used from the feedbacks of this survey, I'll conclude that
>>>> we do not need to support customized RestartStrategy for the new scheduler
>>>> in Flink 1.10
>>>>
>>>> Other usages are still supported, including all the strategies and
>>>> configuring ways described in
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>> .
>>>>
>>>> Feel free to share in this thread if you has any concern for it.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> Zhu Zhu <re...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>
>>>>> Thanks Oytun for the reply!
>>>>>
>>>>> Sorry for not have stated it clearly. When saying "customized
>>>>> RestartStrategy", we mean that users implement an
>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy* by
>>>>> themselves and use it by configuring like "restart-strategy:
>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>
>>>>> The usage of restart strategies you mentioned will keep working with
>>>>> the new scheduler.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>
>>>>>> Hi Zhu,
>>>>>>
>>>>>> We are using custom restart strategy like this:
>>>>>>
>>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>>>>>> Time.minutes(10)));
>>>>>>
>>>>>>
>>>>>> ---
>>>>>> Oytun Tez
>>>>>>
>>>>>> *M O T A W O R D*
>>>>>> The World's Fastest Human Translation Platform.
>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>
>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>>> the interfaces for the new restart strategies (so called
>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>>> work any more with the new scheduler.
>>>>>>>
>>>>>>> We want to know whether we should keep the way
>>>>>>> to customized RestartBackoffTimeStrategy so that existing customized
>>>>>>> RestartStrategy can be migrated.
>>>>>>>
>>>>>>> I'd appreciate if you can share the status if you are
>>>>>>> using customized RestartStrategy. That will be valuable for use to make
>>>>>>> decisions.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Steven Wu <st...@gmail.com>.
Zhu Zhu,

Flink's fullRestart metric is a Gauge, which is not good for alerting on. We
publish an equivalent Counter metric for alerting purposes.

Thanks,
Steven

On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <re...@gmail.com> wrote:

> Thanks Steven for the feedback!
> Could you share more information about the metrics you add in you
> customized restart strategy?
>
> Thanks,
> Zhu Zhu
>
> Steven Wu <st...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>
>> We do use config like "restart-strategy:
>> org.foobar.MyRestartStrategyFactoryFactory". Mainly to add additional
>> metrics than the Flink provided ones.
>>
>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> Thanks everyone for the input.
>>>
>>> The RestartStrategy customization is not recognized as a public
>>> interface as it is not explicitly documented.
>>> As it is not used from the feedbacks of this survey, I'll conclude that
>>> we do not need to support customized RestartStrategy for the new scheduler
>>> in Flink 1.10
>>>
>>> Other usages are still supported, including all the strategies and
>>> configuring ways described in
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>> .
>>>
>>> Feel free to share in this thread if you has any concern for it.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Zhu Zhu <re...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>
>>>> Thanks Oytun for the reply!
>>>>
>>>> Sorry for not having stated it clearly. By "customized
>>>> RestartStrategy", we mean that users implement
>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>> themselves and enable it with a setting like "restart-strategy:
>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>
>>>> The usage of restart strategies you mentioned will keep working with
>>>> the new scheduler.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>
>>>>> Hi Zhu,
>>>>>
>>>>> We are using a custom restart strategy like this:
>>>>>
>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>>>>> Time.minutes(10)));
>>>>>
>>>>>
>>>>> ---
>>>>> Oytun Tez
>>>>>
>>>>> *M O T A W O R D*
>>>>> The World's Fastest Human Translation Platform.
>>>>> oytun@motaword.com — www.motaword.com
>>>>>
>>>>>
>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>
>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>> interacts with restart strategies in a different way. We have to re-design
>>>>>> the interfaces for the new restart strategies (so called
>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>>> work any more with the new scheduler.
>>>>>>
>>>>>> We want to know whether we should keep a way to customize
>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>> RestartStrategy implementations can be migrated.
>>>>>>
>>>>>> I'd appreciate it if you could share your status if you are
>>>>>> using a customized RestartStrategy. That will be valuable for us in making
>>>>>> decisions.
>>>>>>
>>>>>> [1]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Thanks Steven for the feedback!
Could you share more information about the metrics you add in your
customized restart strategy?

Thanks,
Zhu Zhu

On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <st...@gmail.com> wrote:

> We do use config like "restart-strategy:
> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
> beyond the ones Flink provides.
>
> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:
>
>> Thanks everyone for the input.
>>
>> Customizing RestartStrategy is not considered a public interface, as it
>> is not explicitly documented.
>> As the feedback from this survey indicates it is not in use, I'll conclude
>> that we do not need to support customized RestartStrategy for the new
>> scheduler in Flink 1.10.
>>
>> Other usages are still supported, including all the strategies and
>> configuration methods described in
>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>> .
>>
>> Feel free to share in this thread if you have any concerns about it.
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> Thanks Oytun for the reply!
>>>
>>> Sorry for not having stated it clearly. By "customized
>>> RestartStrategy", we mean that users implement
>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>> themselves and enable it with a setting like "restart-strategy:
>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>
>>> The usage of restart strategies you mentioned will keep working with the
>>> new scheduler.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>
>>>> Hi Zhu,
>>>>
>>>> We are using a custom restart strategy like this:
>>>>
>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>>>> Time.minutes(10)));
>>>>
>>>>
>>>> ---
>>>> Oytun Tez
>>>>
>>>> *M O T A W O R D*
>>>> The World's Fastest Human Translation Platform.
>>>> oytun@motaword.com — www.motaword.com
>>>>
>>>>
>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>> customized RestartStrategy[1] in production jobs.
>>>>>
>>>>> We are currently developing the new Flink scheduler[2] which interacts
>>>>> with restart strategies in a different way. We have to re-design the
>>>>> interfaces for the new restart strategies (so called
>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>>> work any more with the new scheduler.
>>>>>
>>>>> We want to know whether we should keep a way to customize
>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>> RestartStrategy implementations can be migrated.
>>>>>
>>>>> I'd appreciate it if you could share your status if you are using a
>>>>> customized RestartStrategy. That will be valuable for us in making decisions.
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Steven Wu <st...@gmail.com>.
We do use config like "restart-strategy:
org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
beyond the ones Flink provides.

On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <re...@gmail.com> wrote:

> Thanks everyone for the input.
>
> Customizing RestartStrategy is not considered a public interface, as it
> is not explicitly documented.
> As the feedback from this survey indicates it is not in use, I'll conclude
> that we do not need to support customized RestartStrategy for the new
> scheduler in Flink 1.10.
>
> Other usages are still supported, including all the strategies and
> configuration methods described in
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
> .
>
> Feel free to share in this thread if you have any concerns about it.
>
> Thanks,
> Zhu Zhu
>
> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> Thanks Oytun for the reply!
>>
>> Sorry for not having stated it clearly. By "customized
>> RestartStrategy", we mean that users implement
>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>> themselves and enable it with a setting like "restart-strategy:
>> org.foobar.MyRestartStrategyFactoryFactory".
>>
>> The usage of restart strategies you mentioned will keep working with the
>> new scheduler.
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>
>>> Hi Zhu,
>>>
>>> We are using a custom restart strategy like this:
>>>
>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>>> Time.minutes(10)));
>>>
>>>
>>> ---
>>> Oytun Tez
>>>
>>> *M O T A W O R D*
>>> The World's Fastest Human Translation Platform.
>>> oytun@motaword.com — www.motaword.com
>>>
>>>
>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I wanted to reach out to you and ask how many of you are using a
>>>> customized RestartStrategy[1] in production jobs.
>>>>
>>>> We are currently developing the new Flink scheduler[2] which interacts
>>>> with restart strategies in a different way. We have to re-design the
>>>> interfaces for the new restart strategies (so called
>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>>> work any more with the new scheduler.
>>>>
>>>> We want to know whether we should keep a way to customize
>>>> RestartBackoffTimeStrategy so that existing customized
>>>> RestartStrategy implementations can be migrated.
>>>>
>>>> I'd appreciate it if you could share your status if you are using a
>>>> customized RestartStrategy. That will be valuable for us in making decisions.
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Thanks everyone for the input.

Customizing RestartStrategy is not considered a public interface, as it
is not explicitly documented.
As the feedback from this survey indicates it is not in use, I'll conclude
that we do not need to support customized RestartStrategy for the new
scheduler in Flink 1.10.

Other usages are still supported, including all the strategies and
configuration methods described in
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
.

Feel free to share in this thread if you have any concerns about it.

Thanks,
Zhu Zhu

On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <re...@gmail.com> wrote:

> Thanks Oytun for the reply!
>
> Sorry for not having stated it clearly. By "customized
> RestartStrategy", we mean that users implement
> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
> themselves and enable it with a setting like "restart-strategy:
> org.foobar.MyRestartStrategyFactoryFactory".
>
> The usage of restart strategies you mentioned will keep working with the
> new scheduler.
>
> Thanks,
> Zhu Zhu
>
> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>
>> Hi Zhu,
>>
>> We are using a custom restart strategy like this:
>>
>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>> Time.minutes(10)));
>>
>>
>> ---
>> Oytun Tez
>>
>> *M O T A W O R D*
>> The World's Fastest Human Translation Platform.
>> oytun@motaword.com — www.motaword.com
>>
>>
>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I wanted to reach out to you and ask how many of you are using a
>>> customized RestartStrategy[1] in production jobs.
>>>
>>> We are currently developing the new Flink scheduler[2] which interacts
>>> with restart strategies in a different way. We have to re-design the
>>> interfaces for the new restart strategies (so called
>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>>> work any more with the new scheduler.
>>>
>>> We want to know whether we should keep a way to customize
>>> RestartBackoffTimeStrategy so that existing customized
>>> RestartStrategy implementations can be migrated.
>>>
>>> I'd appreciate it if you could share your status if you are using a
>>> customized RestartStrategy. That will be valuable for us in making decisions.
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Zhu Zhu <re...@gmail.com>.
Thanks Oytun for the reply!

Sorry for not having stated it clearly. By "customized
RestartStrategy", we mean that users implement
*org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
themselves and enable it with a setting like "restart-strategy:
org.foobar.MyRestartStrategyFactoryFactory".

The usage of restart strategies you mentioned will keep working with the
new scheduler.

Thanks,
Zhu Zhu

On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:

> Hi Zhu,
>
> We are using a custom restart strategy like this:
>
> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
> Time.minutes(10)));
>
>
> ---
> Oytun Tez
>
> *M O T A W O R D*
> The World's Fastest Human Translation Platform.
> oytun@motaword.com — www.motaword.com
>
>
> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I wanted to reach out to you and ask how many of you are using a
>> customized RestartStrategy[1] in production jobs.
>>
>> We are currently developing the new Flink scheduler[2] which interacts
>> with restart strategies in a different way. We have to re-design the
>> interfaces for the new restart strategies (so called
>> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
>> work any more with the new scheduler.
>>
>> We want to know whether we should keep a way to customize
>> RestartBackoffTimeStrategy so that existing customized
>> RestartStrategy implementations can be migrated.
>>
>> I'd appreciate it if you could share your status if you are using a
>> customized RestartStrategy. That will be valuable for us in making decisions.
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>
>> Thanks,
>> Zhu Zhu
>>
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Posted by Oytun Tez <oy...@motaword.com>.
Hi Zhu,

We are using a custom restart strategy like this:

environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
Time.minutes(10)));
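
For readers unfamiliar with the failure-rate strategy: per the Flink
documentation, the three arguments above are the maximum number of failures
per interval (2), the interval over which failures are measured (1 minute),
and the delay between restart attempts (10 minutes). The sketch below is a
simplified model of the counting rule only. FailureRateSketch and its method
are hypothetical names, it uses no Flink classes, and it does not model the
restart delay.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of a "max N failures per interval" rule. */
final class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final Duration failureInterval;
    // Timestamps (in milliseconds) of failures still inside the measuring window.
    private final Deque<Long> failureTimesMillis = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, Duration failureInterval) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failureInterval = failureInterval;
    }

    /** Records a failure at nowMillis and returns true if a restart is still allowed. */
    boolean recordFailureAndCheck(long nowMillis) {
        long windowStart = nowMillis - failureInterval.toMillis();
        // Forget failures that fell out of the measuring window.
        while (!failureTimesMillis.isEmpty() && failureTimesMillis.peekFirst() <= windowStart) {
            failureTimesMillis.pollFirst();
        }
        failureTimesMillis.addLast(nowMillis);
        return failureTimesMillis.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        FailureRateSketch rule = new FailureRateSketch(2, Duration.ofMinutes(1));
        System.out.println(rule.recordFailureAndCheck(0L));      // true: 1 failure in window
        System.out.println(rule.recordFailureAndCheck(10_000L)); // true: 2 failures in window
        System.out.println(rule.recordFailureAndCheck(20_000L)); // false: 3 failures in window
    }
}
```

In this model the third failure inside one minute exceeds the rate, which is
the point at which a job would stop being restarted.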


---
Oytun Tez

*M O T A W O R D*
The World's Fastest Human Translation Platform.
oytun@motaword.com — www.motaword.com


On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <re...@gmail.com> wrote:

> Hi everyone,
>
> I wanted to reach out to you and ask how many of you are using a
> customized RestartStrategy[1] in production jobs.
>
> We are currently developing the new Flink scheduler[2] which interacts
> with restart strategies in a different way. We have to re-design the
> interfaces for the new restart strategies (so called
> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
> work any more with the new scheduler.
>
> We want to know whether we should keep a way to customize
> RestartBackoffTimeStrategy so that existing customized
> RestartStrategy implementations can be migrated.
>
> I'd appreciate it if you could share your status if you are using a
> customized RestartStrategy. That will be valuable for us in making decisions.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
> [2] https://issues.apache.org/jira/browse/FLINK-10429
>
> Thanks,
> Zhu Zhu
>
