Posted to issues@flink.apache.org by "Weijie Guo (Jira)" <ji...@apache.org> on 2023/04/04 02:38:00 UTC
[jira] [Comment Edited] (FLINK-31293) Request memory segment from LocalBufferPool may hanging forever.

    [ https://issues.apache.org/jira/browse/FLINK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708190#comment-17708190 ] 

Weijie Guo edited comment on FLINK-31293 at 4/4/23 2:37 AM:
------------------------------------------------------------

master(1.18) via fb6caee13710348a9b53284c2cabbdb2e7aa9739.
release-1.17 via 6a476bee5e452d1f172173ec018939c8a154886c.
release-1.16 via 9582727387d368d1b9e358aedb55c3f2eaae4371.



was (Author: weijie guo):
master(1.18) via fb6caee13710348a9b53284c2cabbdb2e7aa9739.
release-1.16 via
release-1.17 via 

> Request memory segment from LocalBufferPool may hanging forever.
> ----------------------------------------------------------------
>
>                 Key: FLINK-31293
>                 URL: https://issues.apache.org/jira/browse/FLINK-31293
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.0, 1.16.1, 1.18.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.16.2, 1.18.0, 1.17.1
>
>         Attachments: image-2023-03-02-12-23-50-572.png, image-2023-03-02-12-28-48-437.png, image-2023-03-02-12-29-03-003.png
>
>
> In our TPC-DS test, we found that under heavy contention for network memory, some tasks may hang forever.
> From the thread dump, we can see that the task is waiting for the {{LocalBufferPool}} to become available. This is strange, because other tasks have already finished and released their network memory. This behavior is unexpected and implies a bug in the {{LocalBufferPool}} or the {{NetworkBufferPool}}.
> !image-2023-03-02-12-23-50-572.png|width=650,height=153!
> The heap dump shows something odd: there are available buffers in the {{LocalBufferPool}}, yet the pool is considered unavailable. Also note that it currently holds an overdraft buffer.
> !image-2023-03-02-12-28-48-437.png|width=520,height=200!
> !image-2023-03-02-12-29-03-003.png|width=438,height=84!
> TL;DR: This problem is caused by a multi-threaded race related to the introduction of the overdraft buffer.
> Suppose we have two threads, called A and B. For simplicity, {{LocalBufferPool}} is called {{LocalPool}} and {{NetworkBufferPool}} is called {{GlobalPool}}.
> Thread A continuously requests buffers from the {{LocalPool}} in blocking mode.
> Thread B continuously returns buffers to the {{GlobalPool}}.
>  # Thread A takes the last available buffer of {{LocalPool}}. Because {{GlobalPool}} has no buffer at this time, a callback is registered with {{GlobalPool}}.
>  # Thread B returns one buffer to {{GlobalPool}}, but has not yet triggered the callback.
>  # Thread A continues to request a buffer. Because the {{availableMemorySegments}} of {{LocalPool}} is empty, it requests an overdraft buffer instead. Since {{GlobalPool}} now holds the buffer returned in step 2, the request succeeds.
>  # Thread B triggers the callback. Since there is no buffer in {{GlobalPool}} now, the callback is re-registered.
>  # Thread A continues to request a buffer. Because there is no buffer in {{GlobalPool}}, it blocks on {{CompletableFuture#get}}.
>  # Thread B returns another buffer and triggers the recently registered callback. As a result, {{LocalPool}} puts the buffer into {{availableMemorySegments}}. Unfortunately, the current logic of the {{shouldBeAvailable}} method is: if there is an overdraft buffer, {{LocalPool}} is considered unavailable.
>  # Because {{LocalPool}} is never marked available, thread A keeps blocking forever.
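
The end state of the steps above can be replayed in a toy model. This is only a minimal sketch and not Flink's actual code: {{ToyLocalPool}}, {{replaySteps}} and the field {{overdraftSegments}} are hypothetical names, and the availability check is an assumption reconstructed from the wording of step 6. It shows the state seen in the heap dump: one buffer is available, yet the pool reports itself as unavailable because it still holds an overdraft buffer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the state reached in the steps above; NOT Flink's real LocalBufferPool. */
public class OverdraftRaceSketch {

    /** Hypothetical, heavily simplified stand-in for LocalBufferPool. */
    static final class ToyLocalPool {
        final Deque<Object> availableMemorySegments = new ArrayDeque<>();
        int overdraftSegments = 0; // hypothetical counter of held overdraft buffers

        /** Assumed check, per step 6: holding any overdraft buffer means "unavailable". */
        boolean shouldBeAvailable() {
            return !availableMemorySegments.isEmpty() && overdraftSegments == 0;
        }
    }

    /** Replays only the state changes of the interleaving, single-threaded. */
    static ToyLocalPool replaySteps() {
        ToyLocalPool pool = new ToyLocalPool();
        // Steps 1-2: thread A drained the pool, so availableMemorySegments is empty.
        // Step 3: thread A obtains an overdraft buffer from the global pool.
        pool.overdraftSegments++;
        // Steps 4-5: the callback is re-registered and thread A blocks on a future.
        // Step 6: the callback hands a returned buffer to the local pool.
        pool.availableMemorySegments.add(new Object());
        return pool;
    }

    public static void main(String[] args) {
        ToyLocalPool pool = replaySteps();
        System.out.println("available buffers: " + pool.availableMemorySegments.size());
        System.out.println("shouldBeAvailable: " + pool.shouldBeAvailable());
    }
}
```

In this model, nothing would complete the future that thread A is waiting on for as long as the overdraft buffer is held, which matches step 7.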



--
This message was sent by Atlassian Jira
(v8.20.10#820010)