Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/09 16:31:00 UTC

[jira] [Commented] (FLINK-9324) SingleLogicalSlot returns completed release future before slot is properly returned

    [ https://issues.apache.org/jira/browse/FLINK-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469038#comment-16469038 ] 

ASF GitHub Bot commented on FLINK-9324:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5980

    [FLINK-9324] Wait for slot release before completing release future in SingleLogicalSlot

    ## What is the purpose of the change
    
    This commit defers the completion of the SingleLogicalSlot's release future
    until the SlotOwner has acknowledged the release. That way the ExecutionGraph will only
    recover after all of its slots have been returned to the SlotPool.
    
    As a side effect, the changes in this commit should reduce the number of redundant release
    calls sent to the SlotOwner, which previously cluttered the debug logs.
    
    ## Brief change log
    
    - Simplify `AllocatedSlot#Payload` interface
    - Don't require calls coming from the SlotPool to wait for the completion of the payload's terminal state future before releasing the slot. The idea is that the SlotPool knows when the slots on the TM are emptied; therefore, we only need to fail the payload.
    - Properly wait for the `SlotOwner` to acknowledge the release of the slot
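    The core idea of the change can be sketched with `CompletableFuture`: instead of completing the release future immediately after issuing the return call, its completion is chained to the owner's acknowledgement. The class and method names below (`SlotOwner#returnSlot`, `releaseSlot`) are simplified stand-ins for the actual Flink interfaces, not the real implementation.

    ```java
    import java.util.concurrent.CompletableFuture;

    public class SlotReleaseSketch {

        // Simplified stand-in for Flink's SlotOwner: returning a slot
        // yields a future that completes once the owner has accepted it.
        public interface SlotOwner {
            CompletableFuture<Void> returnSlot(String slotId);
        }

        public static class SingleLogicalSlot {
            private final String slotId;
            private final SlotOwner owner;
            private final CompletableFuture<Void> releaseFuture = new CompletableFuture<>();

            public SingleLogicalSlot(String slotId, SlotOwner owner) {
                this.slotId = slotId;
                this.owner = owner;
            }

            // Before the fix, the release future was completed right after
            // the return call was issued. After the fix, completion is
            // chained to the SlotOwner's acknowledgement.
            public CompletableFuture<Void> releaseSlot() {
                owner.returnSlot(slotId)
                     .whenComplete((ignored, failure) -> releaseFuture.complete(null));
                return releaseFuture;
            }
        }

        public static void main(String[] args) {
            CompletableFuture<Void> ack = new CompletableFuture<>();
            SingleLogicalSlot slot = new SingleLogicalSlot("slot-1", id -> ack);

            CompletableFuture<Void> release = slot.releaseSlot();
            System.out.println("completed before ack: " + release.isDone());

            ack.complete(null); // SlotOwner acknowledges the release
            System.out.println("completed after ack: " + release.isDone());
        }
    }
    ```

    With this chaining, a caller that waits on the release future (e.g. the ExecutionGraph during recovery) cannot proceed until the owner has actually taken the slot back.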
    
    ## Verifying this change
    
    - Added `SingleLogicalSlotTest`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink waitForSlotRelease

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5980.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5980
    
----
commit 978e7ec2ca4d53f123c66ca01b65f24f905969c0
Author: Till Rohrmann <tr...@...>
Date:   2018-05-09T13:29:36Z

    [FLINK-9324] Wait for slot release before completing release future in SingleLogicalSlot
    
    This commit defers the completion of the SingleLogicalSlot's release future
    until the SlotOwner has acknowledged the release. That way the ExecutionGraph will only
    recover after all of its slots have been returned to the SlotPool.
    
    As a side effect, the changes in this commit should reduce the number of redundant release
    calls sent to the SlotOwner, which previously cluttered the debug logs.

----


> SingleLogicalSlot returns completed release future before slot is properly returned
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-9324
>                 URL: https://issues.apache.org/jira/browse/FLINK-9324
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> The {{SingleLogicalSlot#releaseSlot}} method returns a future which is supposed to be completed once the slot has been returned to the {{SlotOwner}}. Unfortunately, we don't wait for the {{SlotOwner's}} response; instead, we complete the future directly after the call has been made. As a result, the {{ExecutionGraph}} can be restarted during a recovery before all of its slots have been returned to the {{SlotPool}}. Consequently, allocating the new tasks might require more slots than the max parallelism because of collisions with old tasks (in case of slot sharing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)