Posted to issues@flink.apache.org by "Yingjie Cao (Jira)" <ji...@apache.org> on 2022/10/14 10:00:00 UTC

[jira] [Assigned] (FLINK-29298) LocalBufferPool request buffer from NetworkBufferPool hanging

     [ https://issues.apache.org/jira/browse/FLINK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yingjie Cao reassigned FLINK-29298:
-----------------------------------

    Assignee: Weijie Guo

> LocalBufferPool request buffer from NetworkBufferPool hanging
> -------------------------------------------------------------
>
>                 Key: FLINK-29298
>                 URL: https://issues.apache.org/jira/browse/FLINK-29298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.16.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>         Attachments: image-2022-09-14-10-52-15-259.png, image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
>
>
> Under heavy buffer contention, a task can occasionally be observed to hang. The thread dump shows that the task thread is blocked forever in requestMemorySegmentBlocking. After investigating the heap dump, I found that the NetworkBufferPool actually has many buffers, but the LocalBufferPool is still unavailable and has not obtained any buffer.
> From reading the code, I am sure that this is a race condition: when the task thread polls the last buffer from the LocalBufferPool and triggers the onGlobalPoolAvailable callback itself, the callback skips this notification (because the LocalBufferPool is still available at that moment). As a result, the LocalBufferPool eventually becomes unavailable and never registers another callback with the NetworkBufferPool.
> The conditions for triggering the problem are relatively strict, but I have found a reliable way to reproduce it. I will try to fix and verify this problem.
> !image-2022-09-14-10-52-15-259.png|width=1021,height=219!
> !image-2022-09-14-10-58-45-987.png|width=997,height=315!
> !image-2022-09-14-11-00-47-309.png|width=453,height=121!
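
The skipped notification described above can be sketched with a small single-threaded toy model. This is not Flink's code: ToyLocalPool, its fields, and its method bodies are hypothetical stand-ins, and the only assumption about the real mechanism is what the description states (the callback fires on the task thread while the pool still looks available). The sketch relies on the fact that CompletableFuture.thenRun on an already-completed future runs the action immediately on the calling thread.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;

public class LostNotificationSketch {

    /** Toy stand-in for a local pool; hypothetical, NOT Flink's LocalBufferPool. */
    static class ToyLocalPool {
        // Hypothetical "global pool has buffers" signal.
        private final CompletableFuture<Void> globalAvailable;
        private final ArrayDeque<Integer> segments = new ArrayDeque<>();
        boolean available = true;
        boolean callbackRegistered = false;

        ToyLocalPool(CompletableFuture<Void> globalAvailable, int initialSegments) {
            this.globalAvailable = globalAvailable;
            for (int i = 0; i < initialSegments; i++) {
                segments.add(i);
            }
        }

        /** Takes one segment; on taking the last one, asks the global pool for more. */
        Integer poll() {
            Integer segment = segments.poll();
            if (segments.isEmpty()) {
                callbackRegistered = true;
                // The future is already complete, so the callback runs right here,
                // on this thread, BEFORE the pool is marked unavailable below.
                globalAvailable.thenRun(this::onGlobalPoolAvailable);
                available = false;
            }
            return segment;
        }

        private void onGlobalPoolAvailable() {
            callbackRegistered = false; // the one-shot callback is consumed
            if (available) {
                return; // pool still looks available: notification is skipped
            }
            segments.add(-1); // would otherwise fetch a segment from the global pool
            available = true;
        }

        /** Unavailable, empty, and with no callback left to wake it up. */
        boolean isStuck() {
            return !available && !callbackRegistered && segments.isEmpty();
        }
    }

    /** Replays the sequence from the description with a single local segment. */
    static ToyLocalPool simulate() {
        ToyLocalPool pool =
                new ToyLocalPool(CompletableFuture.completedFuture(null), 1);
        pool.poll(); // polls the last buffer and triggers the callback itself
        return pool;
    }

    public static void main(String[] args) {
        System.out.println("stuck = " + simulate().isStuck());
    }
}
```

In this toy model the pool ends up unavailable with no pending callback even though the global signal says buffers exist, which mirrors the hang reported here. One possible remedy in the toy model would be to update the availability flag before registering the callback; whether the actual fix takes that shape is not stated in this message.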



--
This message was sent by Atlassian Jira
(v8.20.10#820010)