Posted to dev@flink.apache.org by Till Rohrmann <tr...@apache.org> on 2019/08/30 14:07:11 UTC

[SURVEY] Is the default restart delay of 0s causing problems?

Hi everyone,

I wanted to reach out to you and ask whether decreasing the default delay
to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
user reported that he would like to increase the default value because it
can cause restart storms in case of systematic faults [2].

The downside of increasing the default delay would be a slightly increased
restart time if this config option is not explicitly set.
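
To make the trade-off a bit more concrete, here is a rough back-of-the-envelope
sketch in Python (not Flink code). It assumes a systematic fault that makes
every attempt fail after the same fixed time; the function name
`restart_attempts` and the 50 ms time-to-failure are illustrative assumptions,
not measurements from a real job.

```python
def restart_attempts(window_ms: int, delay_ms: int, time_to_failure_ms: int) -> int:
    """Count job (re)starts that begin within window_ms under a persistent fault.

    Hypothetical model: every attempt fails after time_to_failure_ms, then the
    restart strategy waits delay_ms before the next attempt starts.
    """
    if time_to_failure_ms <= 0:
        raise ValueError("time_to_failure_ms must be positive")
    cycle_ms = time_to_failure_ms + delay_ms
    # Attempt k starts at k * cycle_ms; count every k with k * cycle_ms < window_ms.
    return -(-window_ms // cycle_ms)  # ceiling division on integers


if __name__ == "__main__":
    # Compare the delays discussed here over a one-minute window.
    for delay_ms in (0, 1_000, 10_000):
        print(delay_ms, restart_attempts(60_000, delay_ms, 50))
```

Under these assumed numbers, a delay of 0 s allows 1200 starts per minute,
while 1 s allows 58 and 10 s allows 6, so even a small delay bounds how hard a
failing job can hit an external system.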

[1] https://issues.apache.org/jira/browse/FLINK-9158
[2] https://issues.apache.org/jira/browse/FLINK-11218

Cheers,
Till

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Till Rohrmann <tr...@apache.org>.
The FLIP-62 discuss thread can be found here [1].

[1]
https://lists.apache.org/thread.html/9602b342602a0181fcb618581f3b12e692ed2fad98c59fd6c1caeabd@%3Cdev.flink.apache.org%3E

Cheers,
Till

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Till Rohrmann <tr...@apache.org>.
Thanks everyone for the input again. I'll then conclude this survey thread
and start a discuss thread to set the default restart delay to 1s.

@Arvid, I agree that better documentation on how to tune Flink with sane
settings for certain scenarios would be super helpful. However, as you've said it
is somewhat hijacking the discussion and I would exclude it from my
proposed changes. The best thing to do would be to start a separate
discussion/effort for it.

Concerning the restart strategy configuration options, they are currently
only documented here [1]. I'm about to change it with this PR [2].

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html
[2] https://github.com/apache/flink/pull/9562

Cheers,
Till

On Tue, Sep 3, 2019 at 8:21 AM Arvid Heise <ar...@data-artisans.com> wrote:

> Hi all,
>
> just wanted to share my experience with configurations with you. For
> non-expert users, configuring Flink can be very daunting. The list of
> common properties already helps a lot [1], but it's not clear how they
> depend on each other, and settings common to specific use cases are not
> listed.
>
> If we can give somewhat clear starting recommendations for the most
> common use cases (batch small/large cluster, streaming high throughput/low
> latency), I think users would be able to start much more quickly with a
> somewhat well-configured system and fine-tune the settings later. For
> example, Kafka Streams has a section on how to set the parameters for
> maximum resilience [2].
>
> I'd propose to leave the current configuration page as a reference page,
> but also have a recommended configuration settings page that's directly
> linked in the first section, such that new users are not overwhelmed.
>
> Sorry if this response is hijacking the discussion.
> Btw, is the restart-strategy configuration missing from the main configuration
> page? Is this a conscious decision?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#common-options
> [2]
> https://docs.confluent.io/current/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency
>
> On Tue, Sep 3, 2019 at 5:10 AM Zhu Zhu <re...@gmail.com> wrote:
>
>> 1s looks good to me.
>> And I think the conclusion about when a user should override the delay is
>> worth documenting.
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <st...@gmail.com> wrote on Tue, Sep 3, 2019 at 4:42 AM:
>>
>>> 1s sounds like a good tradeoff to me.
>>>
>>> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Thanks a lot for all your feedback. I see there is a slight tendency
>>>> towards having a non-zero default delay so far.
>>>>
>>>> However, Yu has brought up some valid points. Maybe I can shed some
>>>> light on a).
>>>>
>>>> Before FLINK-9158 we set the default delay to 10s because Flink did not
>>>> support queued scheduling which meant that if one slot was missing/still
>>>> being occupied, then Flink would fail right away with
>>>> a NoResourceAvailableException. In order to prevent this we added the
>>>> delay. This also covered the case when the job was failing because of an
>>>> overloaded external system.
>>>>
>>>> When we finished FLIP-6, we thought that we could improve the user
>>>> experience by decreasing the default delay to 0s because all Flink-related
>>>> problems (slot still occupied, slot missing because of reconnecting TM)
>>>> could be handled by the default slot request timeout which allowed the
>>>> slots to become ready after the scheduling was kicked off. However, we did
>>>> not properly take the case of overloaded external systems into account.
>>>>
>>>> For b) I agree that any default value should be properly documented.
>>>> This was clearly an oversight when FLINK-9158 was merged. Moreover, I
>>>> believe that there won't be a solve-it-all default value. There are
>>>> always cases where one needs to adapt it to one's needs. But this is ok. The
>>>> goal should be to find the default value which works for most cases.
>>>>
>>>> So maybe the middle ground between 10s and 0s could be a solution.
>>>> Setting the default restart delay to 1s should prevent restart storms
>>>> caused by overloaded external systems and still be fast enough to not slow
>>>> down recoveries noticeably in most cases. If one needs a super fast
>>>> recovery, then one should set the delay value to 0s. If one requires a
>>>> longer delay because of a particular infrastructure, then one needs to
>>>> change the value too. What do you think?
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <ca...@gmail.com> wrote:
>>>>
>>>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>>>
>>>>> a) I could see some concerns about setting the delay to zero in the
>>>>> very original JIRA (FLINK-2993
>>>>> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
>>>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we
>>>>> still decided to make the change, so I'm wondering whether the decision
>>>>> also came from any customer requirement? If so, how could we judge whether
>>>>> one requirement overrides the other?
>>>>>
>>>>> b) There could be valid reasons for both default values depending on
>>>>> different use cases, as well as corresponding workarounds (e.g. based on the latest
>>>>> policy, setting the config manually to 10s could resolve the problem
>>>>> mentioned), and from earlier replies to this thread we can see that users have
>>>>> already taken action. Changing it back to non-zero again won't affect such
>>>>> users but might surprise those depending on 0 as the default.
>>>>>
>>>>> Last but not least, no matter what decision we make this time, I'd
>>>>> suggest making it final and documenting it explicitly in our release notes.
>>>>> Checking the 1.5.0 release notes [1] [2], it seems we didn't mention
>>>>> the change to the default restart delay, and we'd better learn from that this
>>>>> time. Thanks.
>>>>>
>>>>> [1]
>>>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>>>> [2]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>>>
>>>>> Best Regards,
>>>>> Yu
>>>>>
>>>>>
>>>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:
>>>>>
>>>>>> +1 on what Zhu Zhu said.
>>>>>>
>>>>>> We also override the default to 10 s.
>>>>>>
>>>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>
>>>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>>>> We once encountered cases where external services were overwhelmed by
>>>>>>> reconnections from frequently restarted tasks.
>>>>>>> As a safer though not optimized option, a default delay larger than
>>>>>>> 0 s is better in my opinion.
>>>>>>>
>>>>>>>
>>>>>>> 未来阳光 <22...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> I think it's better to increase the default value. +1
>>>>>>>>
>>>>>>>>
>>>>>>>> Best.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------ Original Message ------------------
>>>>>>>> From: "Till Rohrmann"<tr...@apache.org>;
>>>>>>>> Sent: Friday, Aug 30, 2019, 10:07 PM
>>>>>>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>>>> delay
>>>>>>>> to `0 s` for the fixed delay restart strategy [1] is causing
>>>>>>>> trouble. A
>>>>>>>> user reported that he would like to increase the default value
>>>>>>>> because it
>>>>>>>> can cause restart storms in case of systematic faults [2].
>>>>>>>>
>>>>>>>> The downside of increasing the default delay would be a slightly
>>>>>>>> increased
>>>>>>>> restart time if this config option is not explicitly set.
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>
>>>>>>>
>
> --
>
> Arvid Heise | Senior Software Engineer
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Till Rohrmann <tr...@apache.org>.
Thanks everyone for the input again. I'll then conclude this survey thread
and start a discuss thread to set the default restart delay to 1s.

@Arvid, I agree that a better documentation how to tune Flink with sane
settings for certain scenarios is super helpful. However, as you've said it
is somewhat hijacking the discussion and I would exclude it from my
proposed changes. The best thing to do would be to start a separate
discussion/effort for it.

Concerning the restart strategy configuration options, they are currently
only documented here [1]. I'm about to change it with this PR [2].

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html
[2] https://github.com/apache/flink/pull/9562

Cheers,
Till

On Tue, Sep 3, 2019 at 8:21 AM Arvid Heise <ar...@data-artisans.com> wrote:

> Hi all,
>
> just wanted to share my experience with configurations with you. For
> non-expert users configurations of Flink can be very daunting. The list of
> common properties is already helping a lot [1], but it's not clear how they
> depend on each other and settings common for specific use cases are not
> listed.
>
> If we can give somewhat clear recommendations for the start for the most
> common use cases (batch small/large cluster, streaming high throughput/low
> latency), I think users would be able start much more quickly with a
> somewhat well-configured system and fine-tune the settings later. For
> example, Kafka Streams has a section on how to set the parameters for
> maximum resilience [2].
>
> I'd propose to leave the current configuration page as a reference page,
> but also have a recommended configuration settings page that's directly
> linked in the first section, such that new users are not overwhelmed.
>
> Sorry if this response is hijacking the discussion.
> Btw, is restart-strategy configuration missing in the main configuration
> page? Is this a conscious decision?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#common-options
> [2]
> https://docs.confluent.io/current/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency
>
> On Tue, Sep 3, 2019 at 5:10 AM Zhu Zhu <re...@gmail.com> wrote:
>
>> 1s looks good to me.
>> And I think the conclusion that when a user should override the delay is
>> worth to be documented.
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <st...@gmail.com> 于2019年9月3日周二 上午4:42写道:
>>
>>> 1s sounds a good tradeoff to me.
>>>
>>> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Thanks a lot for all your feedback. I see there is a slight tendency
>>>> towards having a non zero default delay so far.
>>>>
>>>> However, Yu has brought up some valid points. Maybe I can shed some
>>>> light on a).
>>>>
>>>> Before FLINK-9158 we set the default delay to 10s because Flink did not
>>>> support queued scheduling which meant that if one slot was missing/still
>>>> being occupied, then Flink would fail right away with
>>>> a NoResourceAvailableException. In order to prevent this we added the
>>>> delay. This also covered the case when the job was failing because of an
>>>> overloaded external system.
>>>>
>>>> When we finished FLIP-6, we thought that we could improve the user
>>>> experience by decreasing the default delay to 0s because all Flink related
>>>> problems (slot still occupied, slot missing because of reconnecting TM)
>>>> could be handled by the default slot request time out which allowed the
>>>> slots to become ready after the scheduling was kicked off. However, we did
>>>> not properly take the case of overloaded external systems into account.
>>>>
>>>> For b) I agree that any default value should be properly documented.
>>>> This was clearly an oversight when FLINK-9158 has been merged. Moreover, I
>>>> believe that there won't be the solve it all default value. There are
>>>> always cases where one needs to adapt it to ones needs. But this is ok. The
>>>> goal should be to find the default value which works for most cases.
>>>>
>>>> So maybe the middle ground between 10s and 0s could be a solution.
>>>> Setting the default restart delay to 1s should prevent restart storms
>>>> caused by overloaded external systems and still be fast enough to not slow
>>>> down recoveries noticeably in most cases. If one needs a super fast
>>>> recovery, then one should set the delay value to 0s. If one requires a
>>>> longer delay because of a particular infrastructure, then one needs to
>>>> change the value too. What do you think?
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <ca...@gmail.com> wrote:
>>>>
>>>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>>>
>>>>> a) I could see some concerns about setting the delay to zero in the
>>>>> very original JIRA (FLINK-2993
>>>>> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
>>>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we
>>>>> still decided to make the change, so I'm wondering whether the decision
>>>>> also came from any customer requirement? If so, how could we judge whether
>>>>> one requirement override the other?
>>>>>
>>>>> b) There could be valid reasons for both default values depending on
>>>>> different use cases, as well as relative work around (like based on latest
>>>>> policy, setting the config manually to 10s could resolve the problem
>>>>> mentioned), and from former replies to this thread we could see users have
>>>>> already taken actions. Changing it back to non-zero again won't affect such
>>>>> users but might cause surprises to those depending on 0 as default.
>>>>>
>>>>> Last but not least, no matter what decision we make this time, I'd
>>>>> suggest to make it final and document in our release note explicitly.
>>>>> Checking the 1.5.0 release note [1] [2] it seems we didn't mention about
>>>>> the change on default restart delay and we'd better learn from it this
>>>>> time. Thanks.
>>>>>
>>>>> [1]
>>>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>>>> [2]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>>>
>>>>> Best Regards,
>>>>> Yu
>>>>>
>>>>>
>>>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:
>>>>>
>>>>>> +1 on what Zhu Zhu said.
>>>>>>
>>>>>> We also override the default to 10 s.
>>>>>>
>>>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>>
>>>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>>>> We once encountered cases that external services are overwhelmed by
>>>>>>> reconnections from frequent restarted tasks.
>>>>>>> As a safer though not optimized option, a default delay larger than
>>>>>>> 0 s is better in my opinion.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 30, 2019 at 10:23 PM 未来阳光 <22...@qq.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think it's better to increase the default value. +1
>>>>>>>>
>>>>>>>> Best.
>>>>>>>>
>>>>>>>> ------------------ Original Message ------------------
>>>>>>>> From: "Till Rohrmann"<tr...@apache.org>;
>>>>>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>>>>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>>>> delay
>>>>>>>> to `0 s` for the fixed delay restart strategy [1] is causing
>>>>>>>> trouble. A
>>>>>>>> user reported that he would like to increase the default value
>>>>>>>> because it
>>>>>>>> can cause restart storms in case of systematic faults [2].
>>>>>>>>
>>>>>>>> The downside of increasing the default delay would be a slightly
>>>>>>>> increased
>>>>>>>> restart time if this config option is not explicitly set.
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>
>>>>>>>
>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Arvid Heise <ar...@data-artisans.com>.
Hi all,

I just wanted to share my experience with configuration. For non-expert
users, configuring Flink can be very daunting. The list of common options
already helps a lot [1], but it's not clear how the options depend on each
other, and settings that are common for specific use cases are not listed.

If we can give reasonably clear starting recommendations for the most
common use cases (batch on a small/large cluster, streaming with high
throughput/low latency), I think users would be able to start much more
quickly with a reasonably well-configured system and fine-tune the settings
later. For example, Kafka Streams has a section on how to set the
parameters for maximum resilience [2].

I'd propose leaving the current configuration page as a reference page,
but also adding a page of recommended configuration settings that is linked
directly from the first section, so that new users are not overwhelmed.

Sorry if this response is hijacking the discussion.
Btw, is the restart-strategy configuration missing from the main
configuration page? Is this a conscious decision?

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#common-options
[2]
https://docs.confluent.io/current/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency

On Tue, Sep 3, 2019 at 5:10 AM Zhu Zhu <re...@gmail.com> wrote:

> 1s looks good to me.
> And I think the conclusion about when a user should override the delay is
> worth documenting.
>
> Thanks,
> Zhu Zhu
>
> On Tue, Sep 3, 2019 at 4:42 AM Steven Wu <st...@gmail.com> wrote:
>
>> 1s sounds like a good tradeoff to me.
>>
>> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Thanks a lot for all your feedback. I see there is a slight tendency
>>> towards having a non zero default delay so far.
>>>
>>> However, Yu has brought up some valid points. Maybe I can shed some
>>> light on a).
>>>
>>> Before FLINK-9158 we set the default delay to 10s because Flink did not
>>> support queued scheduling which meant that if one slot was missing/still
>>> being occupied, then Flink would fail right away with
>>> a NoResourceAvailableException. In order to prevent this we added the
>>> delay. This also covered the case when the job was failing because of an
>>> overloaded external system.
>>>
>>> When we finished FLIP-6, we thought that we could improve the user
>>> experience by decreasing the default delay to 0s because all Flink related
>>> problems (slot still occupied, slot missing because of reconnecting TM)
>>> could be handled by the default slot request time out which allowed the
>>> slots to become ready after the scheduling was kicked off. However, we did
>>> not properly take the case of overloaded external systems into account.
>>>
>>> For b) I agree that any default value should be properly documented.
>>> This was clearly an oversight when FLINK-9158 was merged. Moreover, I
>>> believe that there won't be a one-size-fits-all default value. There are
>>> always cases where one needs to adapt it to one's needs. But this is ok.
>>> The goal should be to find the default value which works for most cases.
>>>
>>> So maybe the middle ground between 10s and 0s could be a solution.
>>> Setting the default restart delay to 1s should prevent restart storms
>>> caused by overloaded external systems and still be fast enough to not slow
>>> down recoveries noticeably in most cases. If one needs a super fast
>>> recovery, then one should set the delay value to 0s. If one requires a
>>> longer delay because of a particular infrastructure, then one needs to
>>> change the value too. What do you think?
>>>
>>> Cheers,
>>> Till
>>>
>>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <ca...@gmail.com> wrote:
>>>
>>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>>
>>>> a) I could see some concerns about setting the delay to zero in the
>>>> very original JIRA (FLINK-2993
>>>> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
>>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
>>>> decided to make the change, so I'm wondering whether the decision also came
>>>> from any customer requirement? If so, how could we judge whether one
>>>> requirement override the other?
>>>>
>>>> b) There could be valid reasons for both default values depending on
>>>> different use cases, as well as relative work around (like based on latest
>>>> policy, setting the config manually to 10s could resolve the problem
>>>> mentioned), and from former replies to this thread we could see users have
>>>> already taken actions. Changing it back to non-zero again won't affect such
>>>> users but might cause surprises to those depending on 0 as default.
>>>>
>>>> Last but not least, no matter what decision we make this time, I'd
>>>> suggest to make it final and document in our release note explicitly.
>>>> Checking the 1.5.0 release note [1] [2] it seems we didn't mention about
>>>> the change on default restart delay and we'd better learn from it this
>>>> time. Thanks.
>>>>
>>>> [1]
>>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>>> [2]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>>
>>>> Best Regards,
>>>> Yu
>>>>
>>>>
>>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:
>>>>
>>>>> +1 on what Zhu Zhu said.
>>>>>
>>>>> We also override the default to 10 s.
>>>>>
>>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>>
>>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>>> We once encountered cases that external services are overwhelmed by
>>>>>> reconnections from frequent restarted tasks.
>>>>>> As a safer though not optimized option, a default delay larger than 0
>>>>>> s is better in my opinion.
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 30, 2019 at 10:23 PM 未来阳光 <22...@qq.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think it's better to increase the default value. +1
>>>>>>>
>>>>>>> Best.
>>>>>>>
>>>>>>> ------------------ Original Message ------------------
>>>>>>> From: "Till Rohrmann"<tr...@apache.org>;
>>>>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>>>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>>> delay
>>>>>>> to `0 s` for the fixed delay restart strategy [1] is causing
>>>>>>> trouble. A
>>>>>>> user reported that he would like to increase the default value
>>>>>>> because it
>>>>>>> can cause restart storms in case of systematic faults [2].
>>>>>>>
>>>>>>> The downside of increasing the default delay would be a slightly
>>>>>>> increased
>>>>>>> restart time if this config option is not explicitly set.
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>
>>>>>>

-- 

Arvid Heise | Senior Software Engineer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Zhu Zhu <re...@gmail.com>.
1s looks good to me.
And I think the conclusion about when a user should override the delay is
worth documenting.
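
To make that a bit more concrete, here is a minimal sketch (not an official
recommendation) of how the rule of thumb from this thread could be written
down. The delay values (0 s, 1 s, 10 s) are the ones mentioned in this
thread; the function names, the two scenario flags and the attempts value of
3 are hypothetical and only serve the illustration. The option keys are the
ones of the fixed delay restart strategy in flink-conf.yaml.

```python
def pick_restart_delay(needs_fastest_recovery: bool,
                       fragile_external_system: bool) -> str:
    """Pick a restart delay following the rule of thumb from this thread.

    Assumption: protecting an overloaded external system takes precedence
    over recovery speed when both flags are set.
    """
    if fragile_external_system:
        return "10 s"  # value several users in this thread override to
    if needs_fastest_recovery:
        return "0 s"   # the current default, fastest possible restart
    return "1 s"       # the proposed middle-ground default


def render_flink_conf(delay: str, attempts: int = 3) -> str:
    """Render the fixed delay restart strategy options as config lines.

    The attempts value is a placeholder, not a value from this thread.
    """
    return "\n".join([
        "restart-strategy: fixed-delay",
        f"restart-strategy.fixed-delay.attempts: {attempts}",
        f"restart-strategy.fixed-delay.delay: {delay}",
    ])


# Example: a job that talks to an external service which is easily overloaded.
print(render_flink_conf(pick_restart_delay(False, True)))
```

Users who already set the delay explicitly would not be affected by a change
of the default; only jobs relying on the implicit value would see a
difference.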

Thanks,
Zhu Zhu

On Tue, Sep 3, 2019 at 4:42 AM Steven Wu <st...@gmail.com> wrote:

> 1s sounds like a good tradeoff to me.
>
> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <tr...@apache.org> wrote:
>
>> Thanks a lot for all your feedback. I see there is a slight tendency
>> towards having a non zero default delay so far.
>>
>> However, Yu has brought up some valid points. Maybe I can shed some light
>> on a).
>>
>> Before FLINK-9158 we set the default delay to 10s because Flink did not
>> support queued scheduling which meant that if one slot was missing/still
>> being occupied, then Flink would fail right away with
>> a NoResourceAvailableException. In order to prevent this we added the
>> delay. This also covered the case when the job was failing because of an
>> overloaded external system.
>>
>> When we finished FLIP-6, we thought that we could improve the user
>> experience by decreasing the default delay to 0s because all Flink related
>> problems (slot still occupied, slot missing because of reconnecting TM)
>> could be handled by the default slot request time out which allowed the
>> slots to become ready after the scheduling was kicked off. However, we did
>> not properly take the case of overloaded external systems into account.
>>
>> For b) I agree that any default value should be properly documented. This
>> was clearly an oversight when FLINK-9158 was merged. Moreover, I
>> believe that there won't be a one-size-fits-all default value. There are
>> always cases where one needs to adapt it to one's needs. But this is ok. The
>> goal should be to find the default value which works for most cases.
>>
>> So maybe the middle ground between 10s and 0s could be a solution.
>> Setting the default restart delay to 1s should prevent restart storms
>> caused by overloaded external systems and still be fast enough to not slow
>> down recoveries noticeably in most cases. If one needs a super fast
>> recovery, then one should set the delay value to 0s. If one requires a
>> longer delay because of a particular infrastructure, then one needs to
>> change the value too. What do you think?
>>
>> Cheers,
>> Till
>>
>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <ca...@gmail.com> wrote:
>>
>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>
>>> a) I could see some concerns about setting the delay to zero in the very
>>> original JIRA (FLINK-2993
>>> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
>>> decided to make the change, so I'm wondering whether the decision also came
>>> from any customer requirement? If so, how could we judge whether one
>>> requirement override the other?
>>>
>>> b) There could be valid reasons for both default values depending on
>>> different use cases, as well as relative work around (like based on latest
>>> policy, setting the config manually to 10s could resolve the problem
>>> mentioned), and from former replies to this thread we could see users have
>>> already taken actions. Changing it back to non-zero again won't affect such
>>> users but might cause surprises to those depending on 0 as default.
>>>
>>> Last but not least, no matter what decision we make this time, I'd
>>> suggest to make it final and document in our release note explicitly.
>>> Checking the 1.5.0 release note [1] [2] it seems we didn't mention about
>>> the change on default restart delay and we'd better learn from it this
>>> time. Thanks.
>>>
>>> [1]
>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>
>>> Best Regards,
>>> Yu
>>>
>>>
>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:
>>>
>>>> +1 on what Zhu Zhu said.
>>>>
>>>> We also override the default to 10 s.
>>>>
>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>
>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>> We once encountered cases that external services are overwhelmed by
>>>>> reconnections from frequent restarted tasks.
>>>>> As a safer though not optimized option, a default delay larger than 0
>>>>> s is better in my opinion.
>>>>>
>>>>>
>>>>> On Fri, Aug 30, 2019 at 10:23 PM 未来阳光 <22...@qq.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to increase the default value. +1
>>>>>>
>>>>>> Best.
>>>>>>
>>>>>> ------------------ Original Message ------------------
>>>>>> From: "Till Rohrmann"<tr...@apache.org>;
>>>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>> delay
>>>>>> to `0 s` for the fixed delay restart strategy [1] is causing trouble.
>>>>>> A
>>>>>> user reported that he would like to increase the default value
>>>>>> because it
>>>>>> can cause restart storms in case of systematic faults [2].
>>>>>>
>>>>>> The downside of increasing the default delay would be a slightly
>>>>>> increased
>>>>>> restart time if this config option is not explicitly set.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>
>>>>>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Steven Wu <st...@gmail.com>.
1s sounds like a good tradeoff to me.
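
A rough back-of-the-envelope sketch of that tradeoff, assuming a systematic
fault where the job fails again right after every restart. The 500 ms per
restart cycle and the 60 s observation window are made-up numbers, and the
function name is hypothetical; only the delay values 0 s, 1 s and 10 s come
from this thread.

```python
def restart_attempts_in_window(window_ms: int, restart_cost_ms: int,
                               delay_ms: int) -> int:
    """Count full restart cycles that fit into the observation window.

    One cycle is the time the restart itself takes (restart_cost_ms,
    an assumed value) plus the configured restart delay. Each cycle
    means one more round of reconnections to the external system.
    """
    cycle_ms = restart_cost_ms + delay_ms
    return window_ms // cycle_ms


WINDOW_MS = 60_000       # assumed: look at one minute of a failing job
RESTART_COST_MS = 500    # assumed: time for one restart without any delay

for delay_ms in (0, 1_000, 10_000):
    attempts = restart_attempts_in_window(WINDOW_MS, RESTART_COST_MS, delay_ms)
    print(f"delay {delay_ms / 1000:>4.0f} s -> {attempts} restart attempts per minute")
```

With these assumed numbers the external system would see 120 restart
attempts per minute at 0 s, 40 at 1 s and 5 at 10 s, while each individual
recovery gets slower by exactly the configured delay.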

On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <tr...@apache.org> wrote:

> Thanks a lot for all your feedback. I see there is a slight tendency
> towards having a non zero default delay so far.
>
> However, Yu has brought up some valid points. Maybe I can shed some light
> on a).
>
> Before FLINK-9158 we set the default delay to 10s because Flink did not
> support queued scheduling which meant that if one slot was missing/still
> being occupied, then Flink would fail right away with
> a NoResourceAvailableException. In order to prevent this we added the
> delay. This also covered the case when the job was failing because of an
> overloaded external system.
>
> When we finished FLIP-6, we thought that we could improve the user
> experience by decreasing the default delay to 0s because all Flink related
> problems (slot still occupied, slot missing because of reconnecting TM)
> could be handled by the default slot request time out which allowed the
> slots to become ready after the scheduling was kicked off. However, we did
> not properly take the case of overloaded external systems into account.
>
> For b) I agree that any default value should be properly documented. This
> was clearly an oversight when FLINK-9158 has been merged. Moreover, I
> believe that there won't be the solve it all default value. There are
> always cases where one needs to adapt it to ones needs. But this is ok. The
> goal should be to find the default value which works for most cases.
>
> So maybe the middle ground between 10s and 0s could be a solution. Setting
> the default restart delay to 1s should prevent restart storms caused by
> overloaded external systems and still be fast enough to not slow down
> recoveries noticeably in most cases. If one needs a super fast recovery,
> then one should set the delay value to 0s. If one requires a longer delay
> because of a particular infrastructure, then one needs to change the value
> too. What do you think?
>
> Cheers,
> Till
>
> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <ca...@gmail.com> wrote:
>
>> -1 on increasing the default delay to none zero, with below reasons:
>>
>> a) I could see some concerns about setting the delay to zero in the very
>> original JIRA (FLINK-2993
>> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
>> decided to make the change, so I'm wondering whether the decision also came
>> from any customer requirement? If so, how could we judge whether one
>> requirement override the other?
>>
>> b) There could be valid reasons for both default values depending on
>> different use cases, as well as relative work around (like based on latest
>> policy, setting the config manually to 10s could resolve the problem
>> mentioned), and from former replies to this thread we could see users have
>> already taken actions. Changing it back to non-zero again won't affect such
>> users but might cause surprises to those depending on 0 as default.
>>
>> Last but not least, no matter what decision we make this time, I'd
>> suggest to make it final and document in our release note explicitly.
>> Checking the 1.5.0 release note [1] [2] it seems we didn't mention about
>> the change on default restart delay and we'd better learn from it this
>> time. Thanks.
>>
>> [1]
>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>
>> Best Regards,
>> Yu
>>
>>
>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:
>>
>>> +1 on what Zhu Zhu said.
>>>
>>> We also override the default to 10 s.
>>>
>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>>>
>>>> In our production, we usually override the restart delay to be 10 s.
>>>> We once encountered cases that external services are overwhelmed by
>>>> reconnections from frequent restarted tasks.
>>>> As a safer though not optimized option, a default delay larger than 0 s
>>>> is better in my opinion.
>>>>
>>>>
>>>> 未来阳光 <22...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> I think it's better to increase the default value. +1
>>>>>
>>>>>
>>>>> Best.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------ Original Message ------------------
>>>>> From: "Till Rohrmann"<tr...@apache.org>;
>>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>
>>>>>
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>> delay
>>>>> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
>>>>> user reported that he would like to increase the default value because
>>>>> it
>>>>> can cause restart storms in case of systematic faults [2].
>>>>>
>>>>> The downside of increasing the default delay would be a slightly
>>>>> increased
>>>>> restart time if this config option is not explicitly set.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>
>>>>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Till Rohrmann <tr...@apache.org>.
Thanks a lot for all your feedback. So far I see a slight tendency
towards having a non-zero default delay.

However, Yu has brought up some valid points. Maybe I can shed some light
on a).

Before FLINK-9158 we set the default delay to 10s because Flink did not
support queued scheduling which meant that if one slot was missing/still
being occupied, then Flink would fail right away with
a NoResourceAvailableException. In order to prevent this we added the
delay. This also covered the case when the job was failing because of an
overloaded external system.

When we finished FLIP-6, we thought that we could improve the user
experience by decreasing the default delay to 0s because all Flink-related
problems (slot still occupied, slot missing because of a reconnecting TM)
could be handled by the default slot request timeout, which allows the
slots to become ready after scheduling has been kicked off. However, we did
not properly take the case of overloaded external systems into account.

For b) I agree that any default value should be properly documented. This
was clearly an oversight when FLINK-9158 was merged. Moreover, I
believe that there won't be a one-size-fits-all default value. There are
always cases where one needs to adapt it to one's needs. But this is ok. The
goal should be to find the default value which works for most cases.

So maybe the middle ground between 10s and 0s could be a solution. Setting
the default restart delay to 1s should prevent restart storms caused by
overloaded external systems and still be fast enough to not slow down
recoveries noticeably in most cases. If one needs a super fast recovery,
then one should set the delay value to 0s. If one requires a longer delay
because of a particular infrastructure, then one needs to change the value
too. What do you think?

Cheers,
Till
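
The trade-off between the 0s, 1s and 10s defaults discussed above can be
illustrated with a back-of-the-envelope model. This is only a sketch and not
Flink code: the class name, the method name and the assumption that a
systematically failing job dies 50 ms after every (re)start are all
hypothetical. It counts how many restarts, and therefore how many waves of
reconnections to an external system, fit into one minute for each delay.

```java
// Back-of-the-envelope model, not Flink code: every name here is hypothetical.
final class RestartStormSketch {

    /**
     * Counts how many restarts fall inside a time window when a job fails
     * systematically. One cycle is "run until failure" plus "wait for the
     * configured restart delay".
     */
    static long restartsWithin(long windowMs, long restartDelayMs, long failAfterMs) {
        long cycleMs = restartDelayMs + failAfterMs;
        if (cycleMs <= 0) {
            throw new IllegalArgumentException("cycle length must be positive");
        }
        return windowMs / cycleMs;
    }

    public static void main(String[] args) {
        long windowMs = 60_000; // observe one minute
        long failAfterMs = 50;  // assumed: the job fails 50 ms after each (re)start
        for (long delayMs : new long[] {0, 1_000, 10_000}) {
            System.out.printf("delay=%d ms -> %d restarts per minute%n",
                    delayMs, restartsWithin(windowMs, delayMs, failAfterMs));
        }
    }
}
```

Under these assumed numbers the model yields 1200 restarts per minute with a
0s delay, 57 with 1s and 5 with 10s. Real jobs will differ, but the order of
magnitude shows why even a small non-zero delay dampens a restart storm.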

On Sun, Sep 1, 2019 at 11:56 PM Yu Li <ca...@gmail.com> wrote:

> -1 on increasing the default delay to none zero, with below reasons:
>
> a) I could see some concerns about setting the delay to zero in the very
> original JIRA (FLINK-2993
> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
> decided to make the change, so I'm wondering whether the decision also came
> from any customer requirement? If so, how could we judge whether one
> requirement override the other?
>
> b) There could be valid reasons for both default values depending on
> different use cases, as well as relative work around (like based on latest
> policy, setting the config manually to 10s could resolve the problem
> mentioned), and from former replies to this thread we could see users have
> already taken actions. Changing it back to non-zero again won't affect such
> users but might cause surprises to those depending on 0 as default.
>
> Last but not least, no matter what decision we make this time, I'd suggest
> to make it final and document in our release note explicitly. Checking the
> 1.5.0 release note [1] [2] it seems we didn't mention about the change on
> default restart delay and we'd better learn from it this time. Thanks.
>
> [1]
> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>
> Best Regards,
> Yu
>
>
> On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:
>
>> +1 on what Zhu Zhu said.
>>
>> We also override the default to 10 s.
>>
>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>>
>>> In our production, we usually override the restart delay to be 10 s.
>>> We once encountered cases that external services are overwhelmed by
>>> reconnections from frequent restarted tasks.
>>> As a safer though not optimized option, a default delay larger than 0 s
>>> is better in my opinion.
>>>
>>>
>>> 未来阳光 <22...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I think it's better to increase the default value. +1
>>>>
>>>>
>>>> Best.
>>>>
>>>>
>>>>
>>>>
>>>> ------------------ Original Message ------------------
>>>> From: "Till Rohrmann"<tr...@apache.org>;
>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>
>>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> I wanted to reach out to you and ask whether decreasing the default
>>>> delay
>>>> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
>>>> user reported that he would like to increase the default value because
>>>> it
>>>> can cause restart storms in case of systematic faults [2].
>>>>
>>>> The downside of increasing the default delay would be a slightly
>>>> increased
>>>> restart time if this config option is not explicitly set.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>
>>>> Cheers,
>>>> Till
>>>
>>>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Yu Li <ca...@gmail.com>.
-1 on increasing the default delay to non-zero, for the reasons below:

a) I could see some concerns about setting the delay to zero in the very
original JIRA (FLINK-2993 <https://issues.apache.org/jira/browse/FLINK-2993>)
but later on in FLINK-9158
<https://issues.apache.org/jira/browse/FLINK-9158> we still decided to make
the change, so I'm wondering whether the decision also came from any
customer requirement? If so, how could we judge whether one requirement
overrides the other?

b) There could be valid reasons for both default values depending on
different use cases, as well as corresponding workarounds (e.g. with the
current default, setting the config manually to 10s could resolve the problem
mentioned), and from former replies to this thread we can see that users have
already taken action. Changing it back to non-zero again won't affect such
users but might surprise those depending on 0 as the default.

Last but not least, no matter what decision we make this time, I'd suggest
making it final and documenting it explicitly in our release notes. Checking
the 1.5.0 release notes [1] [2], it seems we didn't mention the change to the
default restart delay, and we'd better learn from that this time. Thanks.

[1]
https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html

Best Regards,
Yu


On Sun, 1 Sep 2019 at 04:33, Steven Wu <st...@gmail.com> wrote:

> +1 on what Zhu Zhu said.
>
> We also override the default to 10 s.
>
> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:
>
>> In our production, we usually override the restart delay to be 10 s.
>> We once encountered cases that external services are overwhelmed by
>> reconnections from frequent restarted tasks.
>> As a safer though not optimized option, a default delay larger than 0 s
>> is better in my opinion.
>>
>>
>> 未来阳光 <22...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>
>>> Hi,
>>>
>>>
>>> I think it's better to increase the default value. +1
>>>
>>>
>>> Best.
>>>
>>>
>>>
>>>
>>> ------------------ Original Message ------------------
>>> From: "Till Rohrmann"<tr...@apache.org>;
>>> Sent: Friday, August 30, 2019, 10:07 PM
>>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>
>>>
>>>
>>> Hi everyone,
>>>
>>> I wanted to reach out to you and ask whether decreasing the default delay
>>> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
>>> user reported that he would like to increase the default value because it
>>> can cause restart storms in case of systematic faults [2].
>>>
>>> The downside of increasing the default delay would be a slightly
>>> increased
>>> restart time if this config option is not explicitly set.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>
>>> Cheers,
>>> Till
>>
>>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Steven Wu <st...@gmail.com>.
+1 on what Zhu Zhu said.

We also override the default to 10 s.

On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <re...@gmail.com> wrote:

> In our production, we usually override the restart delay to be 10 s.
> We once encountered cases that external services are overwhelmed by
> reconnections from frequent restarted tasks.
> As a safer though not optimized option, a default delay larger than 0 s is
> better in my opinion.
>
>
> 未来阳光 <22...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>
>> Hi,
>>
>>
>> I think it's better to increase the default value. +1
>>
>>
>> Best.
>>
>>
>>
>>
>> ------------------ Original Message ------------------
>> From: "Till Rohrmann"<tr...@apache.org>;
>> Sent: Friday, August 30, 2019, 10:07 PM
>> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>
>>
>>
>> Hi everyone,
>>
>> I wanted to reach out to you and ask whether decreasing the default delay
>> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
>> user reported that he would like to increase the default value because it
>> can cause restart storms in case of systematic faults [2].
>>
>> The downside of increasing the default delay would be a slightly increased
>> restart time if this config option is not explicitly set.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>
>> Cheers,
>> Till
>
>

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by Zhu Zhu <re...@gmail.com>.
In our production environment, we usually override the restart delay to 10 s.
We once encountered cases where external services were overwhelmed by
reconnections from frequently restarted tasks.
As a safer, though not optimal, option, a default delay larger than 0 s is
better in my opinion.
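
For readers who want to apply the same override, a fixed-delay restart
strategy with a 10 s delay might be configured in flink-conf.yaml roughly as
follows. The keys are the ones listed in the Flink task failure recovery
documentation of that time; the attempt count of 3 is an arbitrary
assumption and not a value taken from this thread.

```yaml
# flink-conf.yaml (sketch; the attempts value is an assumption)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

Setting the delay to `0 s` instead would restore the immediate-restart
behaviour for jobs that need the fastest possible recovery.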


未来阳光 <22...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:

> Hi,
>
>
> I think it's better to increase the default value. +1
>
>
> Best.
>
>
>
>
> ------------------ Original Message ------------------
> From: "Till Rohrmann"<tr...@apache.org>;
> Sent: Friday, August 30, 2019, 10:07 PM
> To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>
>
>
> Hi everyone,
>
> I wanted to reach out to you and ask whether decreasing the default delay
> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
> user reported that he would like to increase the default value because it
> can cause restart storms in case of systematic faults [2].
>
> The downside of increasing the default delay would be a slightly increased
> restart time if this config option is not explicitly set.
>
> [1] https://issues.apache.org/jira/browse/FLINK-9158
> [2] https://issues.apache.org/jira/browse/FLINK-11218
>
> Cheers,
> Till

Re: [SURVEY] Is the default restart delay of 0s causing problems?

Posted by 未来阳光 <22...@qq.com>.
Hi,


I think it's better to increase the default value. +1


Best.




------------------ Original Message ------------------
From: "Till Rohrmann"<tr...@apache.org>;
Sent: Friday, August 30, 2019, 10:07 PM
To: "dev"<de...@flink.apache.org>; "user"<us...@flink.apache.org>;
Subject: [SURVEY] Is the default restart delay of 0s causing problems?



Hi everyone,

I wanted to reach out to you and ask whether decreasing the default delay
to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
user reported that he would like to increase the default value because it
can cause restart storms in case of systematic faults [2].

The downside of increasing the default delay would be a slightly increased
restart time if this config option is not explicitly set.

[1] https://issues.apache.org/jira/browse/FLINK-9158
[2] https://issues.apache.org/jira/browse/FLINK-11218

Cheers,
Till