Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/07/22 18:10:00 UTC
[jira] [Commented] (FLINK-9908) Inconsistent state of SlotPool after ExecutionGraph cancellation

    [ https://issues.apache.org/jira/browse/FLINK-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552111#comment-16552111 ] 

ASF GitHub Bot commented on FLINK-9908:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6383

    [FLINK-9908][scheduling] Do not cancel individual scheduling future

    ## What is the purpose of the change
    
    The individual scheduling futures contain the logic that releases a slot if it cannot
    be assigned to the Execution, so we must not cancel them. Otherwise, slots might not be
    returned to the SlotPool, leaving it in an inconsistent state.
    
    This PR is based on #6373.
    
    ## Brief change log
    
    - Do not propagate the overall scheduling future cancellation to the individual slot request futures
    
    ## Verifying this change
    
    - Added `ExecutionGraphSchedulingTest#testCancellationOfIncompleteScheduling`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink improveSlotPoolLogging

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6383
    
----
commit f85ec37cc3ad21998eabad45a6dcb46e8efc62fb
Author: Till Rohrmann <tr...@...>
Date:   2018-07-19T11:07:44Z

    [FLINK-9838][logging] Don't log slot request failures on the ResourceManager

commit 7c703fb3b350ef5b02b01d621c3a16d4bca6f707
Author: Till Rohrmann <tr...@...>
Date:   2018-07-19T11:41:03Z

    [hotfix] Improve logging of SlotPool and SlotSharingManager

commit 414a8d231a5b6cdc2d5db0c1d35a79ff584c1cd0
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:05:05Z

    [FLINK-9908][scheduling] Do not cancel individual scheduling future
    
    Since the individual scheduling futures contain logic to release the slot if it cannot
    be assigned to the Execution, we must not cancel them. Otherwise we might risk that
    slots are not returned to the SlotPool leaving it in an inconsistent state.

----


> Inconsistent state of SlotPool after ExecutionGraph cancellation 
> -----------------------------------------------------------------
>
>                 Key: FLINK-9908
>                 URL: https://issues.apache.org/jira/browse/FLINK-9908
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.1, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.5.2, 1.6.0, 1.7.0
>
>
> If the {{ExecutionGraph}} is scheduled and cancelled concurrently, requested {{Slots}} may not be properly returned to the {{SlotPool}}. This leaves the {{SlotPool}} in an inconsistent state in which it considers some of its slots still occupied even though the respective {{Execution}} has already been cancelled.
> The problem seems to be caused by propagating the cancellation of the overall scheduling future to the individual scheduling futures. If an individual scheduling future is cancelled, the callback that produces its value and also handles the failure case is never invoked, so the slot it would have released stays allocated.
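
The mechanism described above can be reproduced with plain `java.util.concurrent.CompletableFuture`, without any Flink code. The following is only a minimal sketch under stated assumptions: the class `CancelledStageSketch`, the method `occupiedSlotsAfterCancel`, the counter `occupiedSlots`, and the string `"slot-1"` are hypothetical stand-ins and not part of Flink or of the pull request. It relies on one documented property of `CompletableFuture`: a dependent stage that has already been cancelled does not run its function when the source future later completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the scheduling logic; none of these names are Flink classes.
final class CancelledStageSketch {

    /**
     * Simulates one slot request racing with a cancellation.
     * Returns the number of slots the simulated pool still considers occupied.
     */
    static int occupiedSlotsAfterCancel(boolean cancelIndividualFuture) {
        // Simplification: the pool has already marked one slot as occupied for this request.
        AtomicInteger occupiedSlots = new AtomicInteger(1);
        AtomicBoolean executionCancelled = new AtomicBoolean(false);
        CompletableFuture<String> slotRequest = new CompletableFuture<>();

        // The individual scheduling future: its callback hands the slot back
        // if the slot can no longer be assigned to the execution.
        CompletableFuture<Void> individualFuture = slotRequest.thenAccept(slot -> {
            if (executionCancelled.get()) {
                occupiedSlots.decrementAndGet();
            }
        });

        // The execution is cancelled while the slot request is still pending.
        executionCancelled.set(true);
        if (cancelIndividualFuture) {
            // Propagating the cancellation completes the dependent stage early,
            // so the callback above is skipped once the slot arrives.
            individualFuture.cancel(false);
        }

        // The slot request is fulfilled after the cancellation.
        slotRequest.complete("slot-1");
        return occupiedSlots.get();
    }

    public static void main(String[] args) {
        System.out.println(occupiedSlotsAfterCancel(true));  // slot stays occupied
        System.out.println(occupiedSlotsAfterCancel(false)); // slot is released
    }
}
```

In this sketch, cancelling the individual future leaves the counter at 1, while leaving it uncancelled lets the callback run and bring the counter back to 0. That mirrors the direction of the fix: the overall scheduling future may be cancelled, but the individual futures are left to complete so that their release logic runs.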


