You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/05/18 10:55:16 UTC

[GitHub] [flink] StephanEwen opened a new pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

StephanEwen opened a new pull request #12225:
URL: https://github.com/apache/flink/pull/12225


   ## What is the purpose of the change
   
   This PR makes it possible to use safely use spawned threads in the `OperatorCoordinator`
   
   ### The Previous Issue
   
   Before, calls on the `Context` in the `OperatorCoordinator` went directly synchronously to the `ExcutionGraph`. That introduces two critical problems are:
   
     - It is common that the code in the `OperatorCoordinator` runs in a separate thread (for example, because it executes blocking operations). In fact, the *FLIP-27* Source Enumerators do that. Calling the scheduler from another thread causes the Scheduler to crash (Assertion Error, violation of single threaded property)
     - Calls on the `ExecutionGraph` are removed as part of removing the legacy scheduler. Certain calls do not work any more.
   
   ### Problem Level 1:
   
   The natural solution would be to pass in the scheduler and a main thread executor to interact with it.
   
   However, to do that the scheduler needs to be created before the Operator Coordinators are created. One could do that by creating the Coordinators lazily after the Scheduler.
   
   ### Problem Level 2:
   
   The Scheduler restores the savepoints as part of the scheduler creation, when the `ExecutionGraph` and the `CheckpointCoordinator` are created early in the constructor.
   (Side note: That design is tricky in itself, because it means state is restored before the scheduler is even properly constructed.)
   
   That means the `OperatorCoordinator` needs to exist (or an in placeholder component needs to exist) to accept the restored state.
   
   That brings us to a cyclic dependency:
   
     - `OperatorCoordinator` (context) needs `SchedulerNG` and `MainThreadExecutor`
     - `SchedulerNG` and `MainThreadExecutor` need a constructed `ExecutionGraph`
     - `ExecutionGraph` needs `CheckpointCoordinator`
     - `CheckpointCoordinator` needs `OperatorCoordinator`
   
   
   ## Brief change log (a.k.a. Breaking the Cycle)
   
   The only way we can currently solve this (without a massive scheduler refactoring) is with a form of lazy initialization:
   
     - We eagerly create the `OperatorCoordinators` so they exist for state restore
     - We provide an *uninitialized* `Context` to them
     - When the `SchedulerNG` is started (after leadership is granted) we initialize the `Context` with the (then readily constructed) `SchedulerNG` and `MainThreadExecutor`
   
   ## Suggested Follow-up (A longer-term Solution)
   
   The longer term solution would require a major change in the `DefaultScheduler` and `CheckpointCoordinator` setup. Something like this:
   
     - `DefaultScheduler` (and `ExecutionGraph`) are constructed first
     - `JobMaster` waits for leadership
     - Upon leader grant, `Operator Coordinators` (they are part of the `JobMaster`) are constructed and can reference the `SchedulerNG` and `FencedMainThreadExecutor`
     - `CheckpointCoordinator` is constructed and references `ExecutionGraph` and `OperatorCoordinators`
     - Savepoint or latest checkpoint is restored
   
   
   The implementation in this PR tries to couple parts as loosely as possible to make it easy to implement the above approach later.
   
   ## Verifying this change
   
   This change is an internal refactor of existing components. The contracts are tested by existing unit tests.
   
   The fact that a separate thread can be used in the `OperatorCoordinator` is tested in an INtegration Test Case in a follow-up PR, about end-to-end coordinator exactly-once guarantees.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
     - The serializers: (yes / no / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / no / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
     - The S3 file system connector: (yes / no / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / no)
     - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630108694


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 1f925aacf144019c5e90137c4dcaec637b4f2611 (Fri Oct 16 10:48:55 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630123970


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730",
       "triggerID" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1f925aacf144019c5e90137c4dcaec637b4f2611",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1f925aacf144019c5e90137c4dcaec637b4f2611",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 98f726234e0bce2e0088bed95efd282462dc5fc1 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730) 
   * 1f925aacf144019c5e90137c4dcaec637b4f2611 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] asfgit closed pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #12225:
URL: https://github.com/apache/flink/pull/12225


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630123970


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730",
       "triggerID" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1f925aacf144019c5e90137c4dcaec637b4f2611",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1740",
       "triggerID" : "1f925aacf144019c5e90137c4dcaec637b4f2611",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 98f726234e0bce2e0088bed95efd282462dc5fc1 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730) 
   * 1f925aacf144019c5e90137c4dcaec637b4f2611 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1740) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630123970


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 98f726234e0bce2e0088bed95efd282462dc5fc1 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630123970


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730",
       "triggerID" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 98f726234e0bce2e0088bed95efd282462dc5fc1 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630108694


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 98f726234e0bce2e0088bed95efd282462dc5fc1 (Mon May 18 11:03:10 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #12225: [FLINK-17781][coordination] Use Scheduler and MainThreadExecutor in OperatorCoordinator.Context

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #12225:
URL: https://github.com/apache/flink/pull/12225#issuecomment-630123970


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730",
       "triggerID" : "98f726234e0bce2e0088bed95efd282462dc5fc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1f925aacf144019c5e90137c4dcaec637b4f2611",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1740",
       "triggerID" : "1f925aacf144019c5e90137c4dcaec637b4f2611",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 98f726234e0bce2e0088bed95efd282462dc5fc1 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1730) 
   * 1f925aacf144019c5e90137c4dcaec637b4f2611 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1740) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org