You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by tillrohrmann <gi...@git.apache.org> on 2015/10/09 22:18:24 UTC
[GitHub] flink pull request: [FLINK-2804] [runtime] Add blocking job submis...
GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/1249
[FLINK-2804] [runtime] Add blocking job submission support for HA
The JobClientActor is now repsonsible for receiving the JobStatus updates from
a newly elected leader. It uses the LeaderRetrievalService to be notified about
new leaders. The actor can only be used to submit a single job to the JM. Once
it received a job from the Client it tries to send it to the current leader.
If no leader is available, a connection timeout is triggered. If the job could
be sent to the JM, a submission timeout is triggered if the JobClientActor does
not receive a JobSubmitSuccess message within the timeout interval. If the
connection to the leader is lost after having submitted a job, a connection
timeout is triggered if the JobClientActor cannot reconnect to another JM within
the timeout interval. The JobClient simply awaits on the completion of the
returned future to the SubmitJobAndWait message.
Added test cases for JobClientActor exceptions
This PR is based on extended versions of PR #1153 and #1227.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink client-recovery
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1249.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1249
----
commit 958090b5dfb2b9934605e8f2810e521ad6b783e4
Author: Ufuk Celebi <uc...@apache.org>
Date: 2015-09-03T13:13:28Z
[runtime] Add type parameter to ByteStreamStateHandle
commit dc9daef275a75b4418502c760a943d298171b583
Author: Ufuk Celebi <uc...@apache.org>
Date: 2015-09-19T17:53:18Z
[clients, temporary] Submit job in detached mode if recovery enabled
commit beb3d7cd65b7f86ef05f4ce08e771dea24903d0b
Author: Ufuk Celebi <uc...@apache.org>
Date: 2015-09-20T11:08:24Z
[FLINK-2652] [tests] Temporary ignore flakey PartitionRequestClientFactoryTest
commit d26d133728f18052ee1fcca3c6a9486ba00c02a4
Author: Ufuk Celebi <uc...@apache.org>
Date: 2015-09-30T14:38:37Z
[FLINK-2792] [jobmanager, logging] Set actor message log level to TRACE
commit 8017dbdb29b1e79fcb90aad079726e75dac9fe4c
Author: Ufuk Celebi <uc...@apache.org>
Date: 2015-09-01T15:25:46Z
[FLINK-2354] [runtime] Add job graph and checkpoint recovery
commit f376385d9ea6a55b60db288299f4dd15f58e2f3e
Author: Till Rohrmann <tr...@apache.org>
Date: 2015-10-08T22:50:07Z
[FLINK-2354] [runtime] Remove state changing futures in JobManager
Internal actor states must only be modified within the actor thread.
This avoids all the well-known issues coming with concurrency.
Fix RemoveCachedJob by introducing RemoveJob
Fix JobManagerITCase
Add removeJob which maintains the job in the SubmittedJobGraphStore
Make revokeLeadership not remove the jobs from the state backend
Fix shading problem with curator by hiding CuratorFramework in ChaosMonkeyITCase
commit bd4f4d7ef9e74c205e3a3f9a595581e2365584ae
Author: Till Rohrmann <tr...@apache.org>
Date: 2015-10-09T19:48:31Z
Fix YARNHighAvailabilityTest by setting the correct state backend
commit 20c089c201e1468b5f1f315c37551e1e32ec17f8
Author: Ufuk Celebi <uc...@apache.org>
Date: 2015-10-05T12:30:46Z
[FLINK-2805] [blobmanager] Write JARs to file state backend for recovery
Move StateBackend enum to top level and org.apache.flink.runtime.state
Abstract blob store in blob server for recovery
commit 6d81141d3ff585867c25e73961e7da061007affe
Author: Till Rohrmann <tr...@apache.org>
Date: 2015-10-07T23:52:07Z
[FLINK-2804] [runtime] Add blocking job submission support for HA
The JobClientActor is now repsonsible for receiving the JobStatus updates from
a newly elected leader. It uses the LeaderRetrievalService to be notified about
new leaders. The actor can only be used to submit a single job to the JM. Once
it received a job from the Client it tries to send it to the current leader.
If no leader is available, a connection timeout is triggered. If the job could
be sent to the JM, a submission timeout is triggered if the JobClientActor does
not receive a JobSubmitSuccess message within the timeout interval. If the
connection to the leader is lost after having submitted a job, a connection
timeout is triggered if the JobClientActor cannot reconnect to another JM within
the timeout interval. The JobClient simply awaits on the completion of the
returned future to the SubmitJobAndWait message.
Added test cases for JobClientActor exceptions
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] flink pull request: [FLINK-2804] [runtime] Add blocking job submis...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/flink/pull/1249
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] flink pull request: [FLINK-2804] [runtime] Add blocking job submis...
Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/1249#issuecomment-147086347
Thanks for the review @uce. Thanks for pointing out the temporary commit. I will remove it when I merge this PR. No don't close the PRs. I'll commit them individually and close them via the commit messages. Will then simply rebase this one. Maybe we have HA in the master at the end of the day :-)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] flink pull request: [FLINK-2804] [runtime] Add blocking job submis...
Posted by uce <gi...@git.apache.org>.
Github user uce commented on the pull request:
https://github.com/apache/flink/pull/1249#issuecomment-147070413
+1 good to merge. We have to remove dc9daef before merging though (that was a temporary work around, because we didn't support blocking HA submissions. Should I close the other recovery PRs as this one contains all respective additions?
I've tested it out locally and it works fine. I think moving this to the job client actor is better than having it the repeated retrieval in the job client. And the addition with the submission and connection timeouts are also very good.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---