Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2019/12/03 15:19:00 UTC

[jira] [Comment Edited] (FLINK-14872) Potential deadlock for task reading from blocking ResultPartition.

    [ https://issues.apache.org/jira/browse/FLINK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986991#comment-16986991 ] 

Piotr Nowojski edited comment on FLINK-14872 at 12/3/19 3:18 PM:
-----------------------------------------------------------------

{quote}
The only problem is that we may use more required buffers than before.
{quote}
I think this is the blocker, which would cause quite a lot of deployments to start failing. As I wrote above, for the quick fix, I would vote for:
{quote}
For a quick fix, we might want to configure the InputGate for BoundedBlockingSubpartition to always request only the obligatory "1 exclusive buffer per channel + a couple of floating buffers", without any "optional" buffers. Probably we could get away without any floating buffers, as performance will be bottlenecked by reading from files on the sender side.
{quote}
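
As a hedged sketch (not the actual Flink factory code; the class, method and parameter names below are illustrative), the quick fix would amount to sizing the gate's request roughly like this:
{code:java}
// Illustrative sketch only, not the actual Flink factory code: the quick fix
// would configure a gate that consumes a BoundedBlockingSubpartition to
// request only the obligatory buffers (1 exclusive buffer per channel plus a
// couple of floating ones) and no "optional" buffers at all.
final class BlockingGateBufferSketch {

    static int buffersToRequest(int numChannels,
                                int buffersPerChannel,
                                int floatingBuffersPerGate,
                                boolean consumesBlockingPartition) {
        if (consumesBlockingPartition) {
            // obligatory minimum: one exclusive buffer per channel, plus a
            // couple of floating buffers (possibly even zero, since the
            // sender is bottlenecked by reading from files anyway)
            return numChannels + Math.min(floatingBuffersPerGate, 2);
        }
        // pipelined gates keep the current behaviour
        return numChannels * buffersPerChannel + floatingBuffersPerGate;
    }
}
{code}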
{quote}
There are argument checks and we cannot set the number of exclusive buffers per channel to 0 currently.
{quote}
I think it's not only those checks; currently the code always assumes that it can send some exclusive buffers, and this is used to propagate the pending backlog length. With 0 exclusive buffers on the input gates, the upstream producer, after completing a buffer, would have to proactively send a "credit request message" to acquire a credit.
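
To illustrate that point, here is a hedged sketch (not Flink's actual network protocol classes; the interface and method names are hypothetical) of why zero exclusive buffers would need a new message type:
{code:java}
// Illustrative sketch only, not Flink's actual network protocol classes. It
// shows why 0 exclusive buffers changes the credit announcement: today the
// pending backlog length is piggy-backed on buffers sent against available
// credit, so without any exclusive credit the producer would have to
// proactively send a dedicated "credit request" message to the consumer.
interface CreditBasedChannelSketch {
    void sendBufferWithBacklog(byte[] buffer, int backlog); // existing path
    void sendCreditRequest(int backlog);                    // hypothetical new message
}

final class ProducerSketch {
    void onBufferFinished(CreditBasedChannelSketch channel,
                          byte[] buffer, int backlog, int availableCredit) {
        if (availableCredit > 0) {
            // normal case: the data buffer carries the backlog, and the
            // consumer answers by topping up the producer's credit
            channel.sendBufferWithBacklog(buffer, backlog);
        } else {
            // with 0 exclusive buffers the producer may never receive any
            // initial credit, so it has to ask for some explicitly
            channel.sendCreditRequest(backlog);
        }
    }
}
{code}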


was (Author: pnowojski):
{quote}
The only problem is that we may use more required buffers than before.
{quote}
I think this is the blocker of always requesting all of the buffers. As I wrote above, for the quick fix, I would vote for:
{quote}
For a quick fix, we might want to configure the InputGate for BoundedBlockingSubpartition to always request only the obligatory "1 exclusive buffer per channel + a couple of floating buffers", without any "optional" buffers. Probably we could get away without any floating buffers, as performance will be bottlenecked by reading from files on the sender side.
{quote}
{quote}
There are argument checks and we cannot set the number of exclusive buffers per channel to 0 currently.
{quote}
I think it's not only those checks; currently the code always assumes that it can send some exclusive buffers, and this is used to propagate the pending backlog length. With 0 exclusive buffers on the input gates, the upstream producer, after completing a buffer, would have to proactively send a "credit request message" to acquire a credit.

> Potential deadlock for task reading from blocking ResultPartition.
> ------------------------------------------------------------------
>
>                 Key: FLINK-14872
>                 URL: https://issues.apache.org/jira/browse/FLINK-14872
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>            Reporter: Yingjie Cao
>            Priority: Blocker
>             Fix For: 1.10.0
>
>
> Currently, the buffer pool size of an InputGate reading from a blocking ResultPartition is unbounded, which has the potential of using too many buffers; the ResultPartition of the same task may then be unable to acquire enough core buffers, finally leading to a deadlock.
> Consider the following case:
> Core buffers are reserved for the InputGate and the ResultPartition -> The InputGate consumes lots of Buffers (not including the buffers reserved for the ResultPartition) -> Other tasks acquire exclusive buffers for their InputGates and trigger a redistribution of Buffers (the Buffers taken by the previous InputGate can not be released) -> The first task, whose InputGate uses lots of buffers, begins to emit records but can not acquire enough core Buffers (some operators may not emit records immediately, or there is simply nothing to emit) -> Deadlock.
>  
> I think we can fix this problem by limiting the number of Buffers that can be allocated by an InputGate which reads from a blocking ResultPartition.
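
A minimal sketch of the bounding idea described in this report (the class, method and parameter names are illustrative, not the actual Flink code):
{code:java}
// Hedged sketch of the fix proposed above: cap the buffer pool of an
// InputGate that reads from a blocking ResultPartition instead of leaving it
// effectively unbounded. Method and parameter names are illustrative, not
// the actual Flink factory code.
final class GateBufferLimitSketch {

    static int maxBuffersPerGate(int numChannels,
                                 int buffersPerChannel,
                                 int floatingBuffersPerGate,
                                 boolean readsBlockingPartition,
                                 boolean applyProposedFix) {
        if (readsBlockingPartition && !applyProposedFix) {
            // current behaviour: effectively no cap, so the gate can starve
            // the same task's ResultPartition of its core buffers
            return Integer.MAX_VALUE;
        }
        // proposed: bound the gate at its exclusive plus floating buffers
        return numChannels * buffersPerChannel + floatingBuffersPerGate;
    }
}
{code}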



--
This message was sent by Atlassian Jira
(v8.3.4#803005)