Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/10 13:34:00 UTC
[jira] [Commented] (FLINK-9331) MesosResourceManager sometimes does not request new Containers

    [ https://issues.apache.org/jira/browse/FLINK-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470392#comment-16470392 ] 

ASF GitHub Bot commented on FLINK-9331:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5986

    [FLINK-9331] [mesos] Let MesosResourceManager request new tasks for failed ones

    ## What is the purpose of the change
    
    Allocation requests can go unfulfilled when a Mesos task dies before it could
    register at the RM. To avoid this problem, this commit requests a new Mesos task
    for every failed one. The downside is that in some cases where the JM notices the
    failure of a TM and would fail the job, we request tasks which are not directly needed.
    
    ## Brief change log
    
    - Request a new Mesos task for all failed tasks
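    
    The idea behind this change can be illustrated with a minimal, self-contained
    sketch. It is not the actual `MesosResourceManager` code: the class
    `SketchResourceManager` and all of its methods are hypothetical names invented
    for this example. It only shows the bookkeeping, assuming that every failure
    report for a known task triggers a replacement request, whether or not the
    task had registered yet.
    
    ```java
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    
    /** Hypothetical sketch; none of these names exist in Flink. */
    public class FailedTaskReplacementSketch {
    
        /** Tracks tasks and requests a replacement whenever a known task fails. */
        public static final class SketchResourceManager {
            private final Queue<String> pendingRequests = new ArrayDeque<>();
            private final Set<String> launchedTasks = new HashSet<>();
            private final Set<String> registeredTasks = new HashSet<>();
            private int nextTaskId = 0;
    
            /** Queues a request for a new task and returns its (made-up) id. */
            public String requestNewTask() {
                String taskId = "task-" + nextTaskId++;
                pendingRequests.add(taskId);
                return taskId;
            }
    
            /** Moves a requested task from pending to launched. */
            public void taskLaunched(String taskId) {
                if (pendingRequests.remove(taskId)) {
                    launchedTasks.add(taskId);
                }
            }
    
            /** Records that a launched task has registered at the manager. */
            public void taskRegistered(String taskId) {
                if (launchedTasks.contains(taskId)) {
                    registeredTasks.add(taskId);
                }
            }
    
            /**
             * Handles a failure report. A replacement is requested for every
             * known task, even if it never registered. Returns the id of the
             * replacement request, or null if the failed task is unknown.
             */
            public String taskFailed(String taskId) {
                if (!launchedTasks.remove(taskId)) {
                    return null;
                }
                registeredTasks.remove(taskId);
                return requestNewTask();
            }
    
            public int pendingRequestCount() {
                return pendingRequests.size();
            }
        }
    
        public static void main(String[] args) {
            SketchResourceManager rm = new SketchResourceManager();
            String first = rm.requestNewTask();
            rm.taskLaunched(first);
            // The task dies before it registers; a replacement is still requested.
            String replacement = rm.taskFailed(first);
            System.out.println("replacement requested: " + replacement);
            System.out.println("pending requests: " + rm.pendingRequestCount());
        }
    }
    ```
    
    The sketch also makes the stated downside visible: `taskFailed` always queues
    a replacement, even when nothing would need the new task afterwards.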
    
    ## Verifying this change
    
    - Adapted `MesosResourceManagerTest#testWorkerFailed` to check for the new Mesos task request
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixMesosTasksM

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5986.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5986
    
----
commit e8c2c8fe74d9ab9b0d2a78ad3c18d5eba9c4dc4e
Author: gyao <ga...@...>
Date:   2018-04-27T13:49:31Z

    [FLINK-9190][flip6,yarn] Request new container if container completed unexpectedly.
    
    This closes #5931.

commit da1be90c1c695a1d5e2aa53aed4dad961bd780a8
Author: gyao <ga...@...>
Date:   2018-04-27T13:51:38Z

    [hotfix][yarn] Reduce visibility of fields.

commit ba6f02a406f9834ca44d8f2582775cce0cd02b47
Author: Stephan Ewen <se...@...>
Date:   2018-05-04T16:19:10Z

    [hotfix] [runtime] Minor code cleanups and improved comments.
    
      - Clarify the purposes of AllocationID versus SlotRequestId.
      - Remove an unnecessary method indirection

commit 28d7303d58a0c5feb5791fce2e625eae00c9b54c
Author: Till Rohrmann <tr...@...>
Date:   2018-05-09T13:29:36Z

    [FLINK-9324] Wait for slot release before completing release future in SingleLogicalSlot
    
    This commit completes the SingleLogicalSlot's release future only after the
    SlotOwner has acknowledged the release. That way the ExecutionGraph will only
    recover after all of its slots have been returned to the SlotPool.
    
    As a side effect, the changes in this commit should reduce the number of redundant release
    calls sent to the SlotOwner which cluttered the debug logs.
    
    This closes #5980.

commit d519b61a5a63e517b154be375213144177c9e578
Author: Till Rohrmann <tr...@...>
Date:   2018-05-09T22:45:12Z

    [hotfix] [tests] Harden YarnSessionFIFOITCase#testDetachedMode
    
    Wait for the completion of the submitted job so that we do not kill
    the JM while the TM is still trying to download blobs from it.

commit 845140a2d8f4a29f7379c6d1d99e67ba22030339
Author: Stephan Ewen <se...@...>
Date:   2018-05-05T15:25:15Z

    [hotfix] [runtime] Add toString() methods to SlotSharingManager and contained slot classes.

commit dcf8cd653fc028e2054e9eefd5ddfd4003544ec3
Author: Stephan Ewen <se...@...>
Date:   2018-05-05T15:58:20Z

    [hotfix] [runtime] Add toString() and print methods to SlotPool classes as debugging/diagnostic helpers

commit 8b646f29cae371f18d803a8cba4772fe927d4bd3
Author: Stephan Ewen <se...@...>
Date:   2018-05-05T16:18:36Z

    [FLINK-9330] [runtime] Add periodic logging of SlotPool status
    
    Only happens if log level for the SlotPool is set to DEBUG

commit 7e6bc7775493e06d8004689b090f3ccecf638911
Author: Till Rohrmann <tr...@...>
Date:   2018-05-10T13:02:38Z

    [FLINK-9331] [mesos] Let MesosResourceManager request new tasks for failed ones
    
    Allocation requests can go unfulfilled when a Mesos task dies before it could
    register at the RM. To avoid this problem, this commit requests a new Mesos task
    for every failed one. The downside is that in some cases where the JM notices the
    failure of a TM and would fail the job, we request tasks which are not directly needed.

----


> MesosResourceManager sometimes does not request new Containers
> --------------------------------------------------------------
>
>                 Key: FLINK-9331
>                 URL: https://issues.apache.org/jira/browse/FLINK-9331
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> Similar to FLINK-9190 we also have to request new Mesos tasks if a task is reported to have failed. Otherwise we might run into the same problem that allocation requests are not fulfilled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)