Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/07/23 14:39:00 UTC

[jira] [Commented] (FLINK-9912) Release TaskExecutors from SlotPool if all slots have been removed

    [ https://issues.apache.org/jira/browse/FLINK-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552938#comment-16552938 ] 

ASF GitHub Bot commented on FLINK-9912:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6394

    [FLINK-9912][JM] Release TaskExecutors if they have no slots registered at SlotPool

    ## What is the purpose of the change
    
    This commit extends the SlotPool's behaviour when failing an allocation by sending a notification
    message to the TaskExecutor about the freed slot. Moreover, it checks whether the affected
    TaskExecutor has more slots registered or not. In the latter case, the TaskExecutor's connection
    will be eagerly closed.
    
    This PR is based on #6389.
    
    ## Brief change log
    
    - send `freeSlot` message to owning `TaskExecutor` of failed `AllocatedSlot`
    - close `TaskExecutor` connection if it no longer has slots registered at the `JobMaster`
    
    ## Verifying this change
    
    - Added `SlotPoolTest#testFreeFailedSlots`, `SlotPoolTest#testFailingAllocationFailsPendingSlotRequests` and `JobMasterTest#testReleasingTaskExecutorIfNoMoreSlotsRegistered`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink releaseTaskExecutors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6394.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6394
    
----
commit 52cd0269951f6ee2c86ca05aa95f6b43dfdd256c
Author: Till Rohrmann <tr...@...>
Date:   2018-07-19T11:07:44Z

    [FLINK-9838][logging] Don't log slot request failures on the ResourceManager

commit eac64952425fcc9ce51c768ac953523116661ef9
Author: Till Rohrmann <tr...@...>
Date:   2018-07-19T11:41:03Z

    [hotfix] Improve logging of SlotPool and SlotSharingManager

commit b474cda88812d63d38e8294b4347ecbc554c4597
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:05:05Z

    [FLINK-9908][scheduling] Do not cancel individual scheduling future
    
    Since the individual scheduling futures contain logic to release the slot if it cannot
    be assigned to the Execution, we must not cancel them. Otherwise we might risk that
    slots are not returned to the SlotPool leaving it in an inconsistent state.

commit f997860c2a0c479ea4036f0a7174b64f2b3acfc9
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:17:11Z

    [FLINK-9909][core] ConjunctFuture does not cancel input futures
    
    If a ConjunctFuture is cancelled, then it won't cancel all of its input
    futures automatically. If a user needs this behaviour, they have to
    implement it explicitly. The reason for this change is that an implicit
    cancellation can have unwanted side effects, because all of the cancelled
    input futures' producers won't be executed.
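
To illustrate the non-propagating cancellation described in this commit message, here is a minimal sketch. It uses the JDK's `CompletableFuture.allOf` as a stand-in for Flink's ConjunctFuture (an assumption for illustration only; the class and method names below are hypothetical and not part of Flink). Cancelling the combined future leaves the inputs untouched unless the caller cancels them explicitly.

```java
import java.util.concurrent.CompletableFuture;

public class ConjunctCancelSketch {

    /**
     * Cancels a combined future and reports the cancellation state of
     * { combined, firstInput, secondInput }.
     */
    public static boolean[] cancelCombined(boolean cancelInputsExplicitly) {
        CompletableFuture<String> first = new CompletableFuture<>();
        CompletableFuture<String> second = new CompletableFuture<>();

        // JDK stand-in for a conjunct future: completes once all inputs complete.
        CompletableFuture<Void> combined = CompletableFuture.allOf(first, second);

        // Cancelling the combined future does not propagate to the inputs.
        combined.cancel(false);

        if (cancelInputsExplicitly) {
            // A caller that wants propagation has to implement it itself.
            first.cancel(false);
            second.cancel(false);
        }

        return new boolean[] {
            combined.isCancelled(), first.isCancelled(), second.isCancelled()
        };
    }

    public static void main(String[] args) {
        boolean[] implicit = cancelCombined(false);
        boolean[] explicit = cancelCombined(true);
        System.out.println("inputs cancelled without explicit call: " + implicit[1]);
        System.out.println("inputs cancelled with explicit call: " + explicit[1]);
    }
}
```

Keeping cancellation explicit means that the producers of the input futures still run unless the caller decides otherwise, which is the side effect this commit message is concerned with.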

commit 30c3eb6bf2e32ea0eb18cc82262966bf716884d6
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:20:53Z

    [hotfix] Fix checkstyle violations in FutureUtils

commit 4f3ec0f88a2c27cbe7f33a82b09b44124e1b34c3
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:34:33Z

    [hotfix] Replace check state condition in Execution#tryAssignResource with if check
    
    Instead of risking an IllegalStateException it is better to check that the
    taskManagerLocationFuture has not been completed yet. If it has, then we also reject
    the assignment of the LogicalSlot to the Execution. That way, we don't risk
    leaking the slot in case of an exception in
    Execution#allocateAndAssignSlotForExecution.

commit 0f8208642d3aa561148e9e7b95c736c932e9f034
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:43:44Z

    [hotfix] Fix checkstyle violations in ExecutionVertex

commit 6ee88195fadea4badfad4a50ad832be5509d78a1
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:46:37Z

    [hotfix] Fix checkstyle violations in ExecutionJobVertex

commit 8193243d61238c2787b8d9b35ac9681709c07ddb
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T18:48:53Z

    [hotfix] Fix checkstyle violations in Execution

commit c7fc51372abe3866a3972e78b590e3791b746c65
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T19:38:42Z

    [FLINK-9910][scheduling] Execution#scheduleForExecution does not cancel slot future
    
    In order to properly give back an allocated slot to the SlotPool, one must not complete
    the result future of Execution#allocateAndAssignSlotForExecution. This commit changes the
    behaviour in Execution#scheduleForExecution accordingly.

commit a58b755750ada229102d3d18cd89767ad7fe3b6d
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T19:57:59Z

    [FLINK-9911][JM] Use SlotPoolGateway to call failAllocation
    
    Since the SlotPool is an actor, we must use the SlotPoolGateway to interact with
    the SlotPool. Otherwise, we might risk an inconsistent state since there are
    multiple threads modifying the component.
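
The idea of funnelling all calls through a gateway can be sketched as follows. This is not Flink's SlotPoolGateway or its RPC framework; it is a hypothetical, simplified stand-in in which a single-threaded executor plays the role of the actor's main thread, and every state change is submitted to it instead of being executed on the caller's thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GatewaySketch {

    // Stand-in for the actor's main thread: all state changes run here.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // Only read and written by tasks running on mainThread after construction.
    private int registeredSlots;

    public GatewaySketch(int initialSlots) {
        this.registeredSlots = initialSlots;
    }

    /**
     * Gateway-style call: the caller gets a future, and the mutation itself
     * is executed on the single main thread.
     */
    public CompletableFuture<Integer> failAllocation() {
        return CompletableFuture.supplyAsync(() -> {
            if (registeredSlots > 0) {
                registeredSlots--;
            }
            return registeredSlots;
        }, mainThread);
    }

    public void shutdown() {
        mainThread.shutdown();
    }

    public static void main(String[] args) {
        GatewaySketch pool = new GatewaySketch(2);
        pool.failAllocation().join();
        System.out.println("slots left: " + pool.failAllocation().join());
        pool.shutdown();
    }
}
```

Because only one thread ever touches the counter, no locking is needed inside the component, at the cost of every caller having to go through the asynchronous entry point.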

commit 0772a52fe859bf00ec5dada2395e0296202ec469
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T20:11:13Z

    [FLINK-9917][JM] Remove superfluous lock from SlotSharingManager
    
    The SlotSharingManager is designed to be used by a single thread. Therefore,
    it is the responsibility of the caller to make sure that there is only a single
    thread at any given time accessing this component. Consequently, the component
    does not need to be synchronized.

commit 75fb9af3ce4245dea9f704e06a3acd87b8dcd8e0
Author: Till Rohrmann <tr...@...>
Date:   2018-07-22T20:58:18Z

    [FLINK-9912][JM] Release TaskExecutors if they have no slots registered at SlotPool
    
    This commit extends the SlotPool's behaviour when failing an allocation by sending a notification
    message to the TaskExecutor about the freed slot. Moreover, it checks whether the affected
    TaskExecutor has more slots registered or not. In the latter case, the TaskExecutor's connection
    will be eagerly closed.

----


> Release TaskExecutors from SlotPool if all slots have been removed
> ------------------------------------------------------------------
>
>                 Key: FLINK-9912
>                 URL: https://issues.apache.org/jira/browse/FLINK-9912
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, it is possible to fail slot allocations in the {{SlotPool}}. Failing an allocation means that the slot is removed from the {{SlotPool}}. If we have removed all slots from a {{TaskExecutor}}, then we should also release/close the connection to this {{TaskExecutor}}. At the moment, this only happens via the heartbeats if the {{TaskExecutor}} has become unreachable.
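
The behaviour requested in this issue can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual SlotPool or JobMaster code: TaskExecutors and allocations are plain strings, and the `freeSlot` notification and the connection close are recorded in a list instead of being sent over RPC.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SlotReleaseSketch {

    // Hypothetical bookkeeping: TaskExecutor id -> allocation ids registered for it.
    private final Map<String, Set<String>> slotsByTaskExecutor = new HashMap<>();

    // Records the messages that a real implementation would send.
    public final List<String> actions = new ArrayList<>();

    public void registerSlot(String taskExecutorId, String allocationId) {
        slotsByTaskExecutor
            .computeIfAbsent(taskExecutorId, id -> new HashSet<>())
            .add(allocationId);
    }

    /**
     * Fails an allocation. Returns true if the owning TaskExecutor has no
     * slots left afterwards and its connection was therefore closed.
     */
    public boolean failAllocation(String taskExecutorId, String allocationId) {
        Set<String> slots = slotsByTaskExecutor.get(taskExecutorId);
        if (slots == null || !slots.remove(allocationId)) {
            return false; // unknown TaskExecutor or slot: nothing to do
        }

        // Notify the owning TaskExecutor that the slot has been freed.
        actions.add("freeSlot " + allocationId + " at " + taskExecutorId);

        if (slots.isEmpty()) {
            // No slots registered any more: eagerly close the connection
            // instead of waiting for a heartbeat timeout.
            slotsByTaskExecutor.remove(taskExecutorId);
            actions.add("disconnect " + taskExecutorId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SlotReleaseSketch pool = new SlotReleaseSketch();
        pool.registerSlot("tm-1", "alloc-a");
        pool.registerSlot("tm-1", "alloc-b");
        pool.failAllocation("tm-1", "alloc-a");
        pool.failAllocation("tm-1", "alloc-b");
        pool.actions.forEach(System.out::println);
    }
}
```

In this sketch the connection is closed only when the last slot of a TaskExecutor is removed; failing one of several slots only triggers the `freeSlot` notification.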



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)