Posted to issues@flink.apache.org by "wsry (via GitHub)" <gi...@apache.org> on 2023/04/21 10:36:49 UTC

[GitHub] [flink] wsry opened a new pull request, #22448: [FLINK-31386][network] Fix the potential deadlock issue of blocking shuffle

wsry opened a new pull request, #22448:
URL: https://github.com/apache/flink/pull/22448

   ## What is the purpose of the change
   
   Currently, the SortMergeResultPartition may allocate more network buffers than the guaranteed size of the LocalBufferPool. As a result, some result partitions may need to wait for other result partitions to release the over-allocated network buffers before they can continue. However, the result partitions that have allocated more than their guaranteed buffers rely on the processing of input data to trigger data spilling and buffer recycling. Processing the input data in turn relies on the batch reading buffers used by the SortMergeResultPartitionReadScheduler, which may already have been taken by the blocked result partitions that are waiting for buffers. A deadlock then occurs. This patch fixes the deadlock by reserving the guaranteed buffers during initialization.
   
   ## Brief change log
   
     - Reserve the guaranteed buffers for SortMergeResultPartition during initialization.
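
The idea of reserving guaranteed buffers up front can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in this patch: `ReserveOnInitSketch`, `ToyGlobalPool`, `ToyLocalPool`, `reserveGuaranteed`, and `tryTakeExtra` are hypothetical names, and buffers are modelled as plain counters rather than Flink's memory segments. It shows that once every pool has claimed its guaranteed share during setup, a later request for extra buffers by one pool cannot eat into another pool's share.

```java
/** Toy model of "reserve the guaranteed buffers on initialization". Not Flink code. */
class ReserveOnInitSketch {

    /** Hypothetical fixed-size global pool shared by all local pools. */
    static final class ToyGlobalPool {
        private int available;

        ToyGlobalPool(int totalBuffers) {
            this.available = totalBuffers;
        }

        /** Grants up to {@code wanted} buffers and returns the number granted. */
        synchronized int take(int wanted) {
            int granted = Math.min(wanted, available);
            available -= granted;
            return granted;
        }

        synchronized void giveBack(int count) {
            available += count;
        }

        synchronized int available() {
            return available;
        }
    }

    /** Hypothetical local pool with a guaranteed minimum number of buffers. */
    static final class ToyLocalPool {
        private final ToyGlobalPool global;
        private final int guaranteed;
        private int held;

        ToyLocalPool(ToyGlobalPool global, int guaranteed) {
            this.global = global;
            this.guaranteed = guaranteed;
        }

        /** Claims the guaranteed buffers eagerly; fails fast instead of blocking later. */
        void reserveGuaranteed() {
            int granted = global.take(guaranteed - held);
            if (held + granted < guaranteed) {
                global.giveBack(granted); // roll back the partial reservation
                throw new IllegalStateException("not enough buffers for the guaranteed size");
            }
            held += granted;
        }

        /** Opportunistically asks for buffers beyond the guaranteed size. */
        int tryTakeExtra(int wanted) {
            int granted = global.take(wanted);
            held += granted;
            return granted;
        }

        int held() {
            return held;
        }
    }

    public static void main(String[] args) {
        ToyGlobalPool global = new ToyGlobalPool(8);
        ToyLocalPool a = new ToyLocalPool(global, 4);
        ToyLocalPool b = new ToyLocalPool(global, 4);

        // Both pools reserve during setup, before any data is processed.
        a.reserveGuaranteed();
        b.reserveGuaranteed();

        // Pool a cannot over-allocate at b's expense: nothing is left to take.
        int extra = a.tryTakeExtra(2);
        System.out.println(a.held() + " " + b.held() + " " + extra + " " + global.available());
    }
}
```

In this model the failure mode moves from "block later while waiting for another pool" to "fail immediately at setup if the guaranteed size cannot be met", which is the property the description above relies on.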
   
   ## Verifying this change
   
   This change added tests.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] wsry commented on pull request #22448: [FLINK-31386][network] Fix the potential deadlock issue of blocking shuffle

Posted by "wsry (via GitHub)" <gi...@apache.org>.
wsry commented on PR #22448:
URL: https://github.com/apache/flink/pull/22448#issuecomment-1517633083

   This is a cherry-picked PR; I will merge it after the tests pass.




[GitHub] [flink] wsry merged pull request #22448: [FLINK-31386][network] Fix the potential deadlock issue of blocking shuffle

Posted by "wsry (via GitHub)" <gi...@apache.org>.
wsry merged PR #22448:
URL: https://github.com/apache/flink/pull/22448




[GitHub] [flink] flinkbot commented on pull request #22448: [FLINK-31386][network] Fix the potential deadlock issue of blocking shuffle

Posted by "flinkbot (via GitHub)" <gi...@apache.org>.
flinkbot commented on PR #22448:
URL: https://github.com/apache/flink/pull/22448#issuecomment-1517637993

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "f71e3afff45ea49d2ecd060476dc77b90afe3255",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f71e3afff45ea49d2ecd060476dc77b90afe3255",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * f71e3afff45ea49d2ecd060476dc77b90afe3255 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>

