Posted to issues@flink.apache.org by tillrohrmann <gi...@git.apache.org> on 2015/12/17 19:55:14 UTC

[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/1468

    [FLINK-3184] [timeouts] Decrease timeouts

    This PR introduces a client side timeout of 60 s and a cluster side timeout of 10 s. Both timeouts can be configured via `akka.client.timeout` and `akka.ask.timeout` in the configuration.
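
As a rough illustration of how the two keys and their defaults could be resolved, here is a minimal Python sketch. It is not Flink's implementation: `parse_duration`, `resolve_timeout`, and the list of supported units are assumptions made for the example. Only the key names and the 60 s / 10 s defaults come from the PR description.

```python
import re

# Defaults as stated in the PR description; the dictionary layout is illustrative.
DEFAULT_TIMEOUTS = {
    "akka.client.timeout": "60 s",  # client side timeout
    "akka.ask.timeout": "10 s",     # cluster side timeout
}

# Units accepted by this sketch (an assumption, not Flink's full list).
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "min": 60.0}


def parse_duration(text):
    """Convert a string such as '10 s' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|min)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def resolve_timeout(config, key):
    """Return the timeout in seconds, falling back to the default for `key`."""
    return parse_duration(config.get(key, DEFAULT_TIMEOUTS[key]))
```

With an empty configuration, `resolve_timeout({}, "akka.ask.timeout")` falls back to the 10 s default, while a user-supplied value for the same key takes precedence.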

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink decreaseAkkaTimeout

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1468
    
----
commit 754c0c408d92e931218a137f388fb77f51df964a
Author: Till Rohrmann <tr...@apache.org>
Date:   2015-12-15T14:15:12Z

    Harmonize config key for number of retries and retry delay

commit dd81da02ca6eaf8e0e38cf4511e26cb553c71f72
Author: Till Rohrmann <tr...@apache.org>
Date:   2015-12-15T16:34:17Z

    Add missing param descriptions to FlinkYarnCluster, remove implicit timeout from ApplicationClient

commit 5e967bf8a9ba066be73905338acfd5deb4894602
Author: Till Rohrmann <tr...@apache.org>
Date:   2015-12-15T16:37:20Z

    [FLINK-3184] [timeouts] Set default cluster side timeout to 10 s and the client side timeout to 60 s.
    
    Adapt Akka failure detector timings to respect new 10 s Akka ask timeout. Add logging statements to JobClientActor
    
    Introduce separation between client and cluster timeout
    
    Sets the cluster timeout to 10 s and the client timeout to 60 s.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by uce <gi...@git.apache.org>.
Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165779642
  
    I've created https://docs.google.com/document/d/1987ydc2rez79Pph7qBbcXwu6XzMU2Hm-nX8kORLoZBM/edit?usp=sharing and added the config renaming as an API-breaking change. I will share this list on the ML.



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165754886
  
    The idea is to decouple the restart logic from the `JobManager` and to make it configurable on a per-job basis. Different strategies are conceivable. For instance, what we have right now is a fixed-delay restart strategy. Additions could be an exponential backoff restart strategy or, later, a scale in/out restart strategy. Furthermore, this allows setting the delays on a per-job basis, which might be relevant for specific SLAs.
    
    But in general it's more like a preliminary step towards the scale in/out restart strategy, I guess.
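
To make the idea of a pluggable, per-job restart strategy concrete, here is a minimal Python sketch. It is only an illustration of the design described above, not the `RestartStrategy` that was later added to Flink: the method name `next_delay`, the class `FixedDelayRestartStrategy`, and the job names are all hypothetical.

```python
from dataclasses import dataclass


class RestartStrategy:
    """Decides whether and when a failed job is restarted (illustrative interface)."""

    def next_delay(self, attempt):
        """Return the seconds to wait before restart `attempt` (1-based),
        or None if the job should not be restarted again."""
        raise NotImplementedError


@dataclass
class FixedDelayRestartStrategy(RestartStrategy):
    """Always waits the same delay, up to a maximum number of attempts."""

    max_attempts: int
    delay_seconds: float

    def next_delay(self, attempt):
        if attempt > self.max_attempts:
            return None  # attempts exhausted, give up
        return self.delay_seconds


# Each job carries its own strategy, so delays can differ per job.
job_strategies = {
    "low-latency-job": FixedDelayRestartStrategy(max_attempts=5, delay_seconds=1.0),
    "batch-job": FixedDelayRestartStrategy(max_attempts=3, delay_seconds=10.0),
}
```

Because the job manager would only call `next_delay`, other strategies could be swapped in without touching the restart loop itself.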



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165752514
  
    This is true. It might be getting a bit ahead of things, but I plan to remove them completely with my next PR. I want to introduce a `RestartStrategy` which can be set on a per-job basis and basically encapsulates the restart logic.



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165752418
  
    I like having lower ask timeouts.
    
    +1 from my side



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by uce <gi...@git.apache.org>.
Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165750742
  
    Good changes.
    
    I like the configuration key changes, but we have to keep in mind that they are API breaking. With 1.0 approaching, I think it's good to do the changes now and not later. So I'm personally +1.
    
    If someone has an objection, we can add a deprecated variant with the old keys and log a warning if necessary.
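
The deprecated-key fallback suggested here could look roughly like the following Python sketch. The function name and the key names in the usage note are hypothetical; the thread does not list the actual old and new keys, and this is not how Flink implements it.

```python
import logging

logger = logging.getLogger("config")


def get_with_deprecated_key(config, new_key, old_key, default=None):
    """Look up `new_key`; fall back to `old_key` with a warning (illustrative)."""
    if new_key in config:
        return config[new_key]
    if old_key in config:
        logger.warning(
            "Config key '%s' is deprecated, please use '%s' instead.",
            old_key,
            new_key,
        )
        return config[old_key]
    return default
```

For example, `get_with_deprecated_key({"example.old.key": "5"}, "example.new.key", "example.old.key")` returns `"5"` and logs one warning, so existing configurations would keep working during a transition period.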




[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171928797
  
    We could set the default execution retry delay to 0 assuming that any longer timeout in a streaming use case would render the job wrong anyway. However, if a longer timeout is acceptable, then we would lose the ability to recover from a lost task manager which usually takes some time to reconnect to the cluster (given that we use all instances).
    
    I would be more in favour of having an exponential backoff strategy as the default. This would give us a quick recovery in case we have enough resources available, but also the possibility to wait for a TM reconnection. We could implement such a restart strategy once PR #1470 is merged.
    
    What do you think?
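
The trade-off described above (retry quickly at first, then wait longer for a task manager to reconnect) can be sketched in Python as follows. The function and all parameter values are assumptions chosen for illustration; the thread does not specify initial delay, multiplier, or cap.

```python
def exponential_backoff_delays(initial_delay, multiplier, max_delay, attempts):
    """Return the delay in seconds before each restart attempt.

    The delay starts at `initial_delay`, is multiplied by `multiplier`
    after every attempt, and is capped at `max_delay`.
    """
    delays = []
    delay = initial_delay
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays
```

With `exponential_backoff_delays(1.0, 2.0, 10.0, 5)` the first restarts happen after short waits and later ones are capped at 10 seconds, so a job with free resources recovers fast while a job waiting for a lost task manager is not restarted in a tight loop.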



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165752725
  
    With the upcoming Mesos integration (and some YARN refactoring), we can probably drop the Akka heartbeats between master and worker as well.



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1468



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by mbalassi <gi...@git.apache.org>.
Github user mbalassi commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171931286
  
    I do like the exponential backoff as the default: it enables streaming but is flexible enough to tolerate temporary resource unavailability.
    
    Simply having 0 as the default does not seem safe to me.



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-172533113
  
    The test failure seems to be unrelated. Merging the PR.



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171931664
  
    Will rebase and then merge this PR.



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171786938
  
    Looks good.
    
    We talked a lot about setting the default execution retry delay to 0 (from its way too high current value). Should we do this here as well? Is that a safe change?



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165752782
  
    What is the `RestartStrategy` about?



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171931227
  
    That sounds like a good idea!
    
    +1 to merge this one then...



[GitHub] flink pull request: [FLINK-3184] [timeouts] Decrease timeouts

Posted by uce <gi...@git.apache.org>.
Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165780008
  
    Ahhh sorry for my stupid comment. The actual config values didn't change.

