You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/01/16 04:23:16 UTC

[GitHub] [flink] zhuzhurk opened a new pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

zhuzhurk opened a new pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867
 
 
   ## What is the purpose of the change
   
   If a SharedSlotOversubscribedException happens, the MultiTaskSlot will release some of its child SingleTaskSlot. The triggered releasing will trigger a re-allocation of the task slot right inside SingleTaskSlot#release(...). So that a previous allocation in SloSharingManager#allTaskSlots will be replaced by the new allocation because they share the same slotRequestId.
   However, the SingleTaskSlot#release(...) will then invoke MultiTaskSlot#releaseChild to release the previous allocation with the slotRequestId, which will unexpectedly remove the new allocation from the SloSharingManager.
   In this way, slot leak happens because the pending slot request is not tracked by the SloSharingManager and cannot be released when its payload terminates.
   
   Note that the it's not a problem in 1.10/master now since SharedSlotOversubscribedException is removed in FLINK-14314. However, it's still an issue in 1.9.
   However, the fix would still help in master to avoid similar issue to happen in the future.
   
   A test case testNoSlotLeakOnSharedSlotOversubscribedException which exhibits this issue can be found at https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14.
   
   ## Brief change log
   
     - *See the code change.*
   
   
   ## Verifying this change
   
   The change is verified on 1.9 with the test in https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574976242
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 3d94fa8f5a675706dcb974b397452d165bcee987 (Thu Jan 16 04:26:42 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574982744
 
 
   <!--
   Meta data
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/144670708 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   -->
   ## CI report:
   
   * 3d94fa8f5a675706dcb974b397452d165bcee987 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/144670708) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] zhuzhurk commented on a change in pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
zhuzhurk commented on a change in pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#discussion_r370213639
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotSharingManager.java
 ##########
 @@ -612,7 +612,9 @@ private void releaseChild(AbstractID childGroupId) {
 				TaskSlot child = children.remove(childGroupId);
 
 				if (child != null) {
-					allTaskSlots.remove(child.getSlotRequestId());
+					if (child == allTaskSlots.get(child.getSlotRequestId())) {
+						allTaskSlots.remove(child.getSlotRequestId());
+					}
 
 Review comment:
   Yes I think it's better to have a check in master to help to expose the unexpected behavior.
   Thanks a lot for reviewing, improving and merging the changes!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574982744
 
 
   <!--
   Meta data
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/144670708 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:1e891dbe057611d28bc5413bedf499fcca0ae89b Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145775062 TriggerType:PUSH TriggerID:1e891dbe057611d28bc5413bedf499fcca0ae89b
   Hash:1e891dbe057611d28bc5413bedf499fcca0ae89b Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4577 TriggerType:PUSH TriggerID:1e891dbe057611d28bc5413bedf499fcca0ae89b
   -->
   ## CI report:
   
   * 3d94fa8f5a675706dcb974b397452d165bcee987 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/144670708) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387) 
   * 1e891dbe057611d28bc5413bedf499fcca0ae89b Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145775062) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4577) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] zhuzhurk commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
zhuzhurk commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-577508913
 
 
   No test is added because the change is simple and it is not a problem since 1.10 (but still a problem  in 1.9). Here's a test to exhibit the problem. https://github.com/zhuzhurk/flink/commit/19f4de04b124011f00ab508174bebe82f7e093d2

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] tillrohrmann commented on a change in pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#discussion_r370131977
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotSharingManager.java
 ##########
 @@ -612,7 +612,9 @@ private void releaseChild(AbstractID childGroupId) {
 				TaskSlot child = children.remove(childGroupId);
 
 				if (child != null) {
-					allTaskSlots.remove(child.getSlotRequestId());
+					if (child == allTaskSlots.get(child.getSlotRequestId())) {
+						allTaskSlots.remove(child.getSlotRequestId());
+					}
 
 Review comment:
   I would actually make this a precondition in the master branch because the original problem occurred because we were reusing `SlotRequestIds` which were never intended for this purpose. In the `release-1.9` branch, we can keep it like it is.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] hequn8128 commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
hequn8128 commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-577713981
 
 
   @zhuzhurk @tillrohrmann Thanks a lot for the fix and review!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574982744
 
 
   <!--
   Meta data
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   -->
   ## CI report:
   
   * 3d94fa8f5a675706dcb974b397452d165bcee987 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574982744
 
 
   <!--
   Meta data
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/144670708 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:1e891dbe057611d28bc5413bedf499fcca0ae89b Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:1e891dbe057611d28bc5413bedf499fcca0ae89b
   -->
   ## CI report:
   
   * 3d94fa8f5a675706dcb974b397452d165bcee987 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/144670708) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387) 
   * 1e891dbe057611d28bc5413bedf499fcca0ae89b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574982744
 
 
   <!--
   Meta data
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/144670708 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:1e891dbe057611d28bc5413bedf499fcca0ae89b Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/145775062 TriggerType:PUSH TriggerID:1e891dbe057611d28bc5413bedf499fcca0ae89b
   Hash:1e891dbe057611d28bc5413bedf499fcca0ae89b Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4577 TriggerType:PUSH TriggerID:1e891dbe057611d28bc5413bedf499fcca0ae89b
   -->
   ## CI report:
   
   * 3d94fa8f5a675706dcb974b397452d165bcee987 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/144670708) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387) 
   * 1e891dbe057611d28bc5413bedf499fcca0ae89b Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/145775062) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4577) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] tillrohrmann closed pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
tillrohrmann closed pull request #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] zhuzhurk edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
zhuzhurk edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-576558468
 
 
   @tillrohrmann @GJL would you take a look this fix? 1.9.2 is drawing near and I hope we can have this fix in the release.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-574982744
 
 
   <!--
   Meta data
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   Hash:3d94fa8f5a675706dcb974b397452d165bcee987 Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/144670708 TriggerType:PUSH TriggerID:3d94fa8f5a675706dcb974b397452d165bcee987
   -->
   ## CI report:
   
   * 3d94fa8f5a675706dcb974b397452d165bcee987 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/144670708) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4387) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] zhuzhurk commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children

Posted by GitBox <gi...@apache.org>.
zhuzhurk commented on issue #10867: [FLINK-14701][runtime] Fix MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867#issuecomment-576558468
 
 
   @GJL would you take a look this fix when you have time? It is needed for 1.9.2.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services