Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/06/17 08:22:19 UTC

[GitHub] [flink] pnowojski commented on a diff in pull request #19993: [FLINK-28077][checkpoint] Fix the bug that tasks get stuck during cancellation in ChannelStateWriteRequestExecutorImpl

pnowojski commented on code in PR #19993:
URL: https://github.com/apache/flink/pull/19993#discussion_r899887457


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriteRequest.java:
##########
@@ -109,6 +112,9 @@ static ChannelStateWriteRequest buildFutureWriteRequest(
                     }
                 },
                 throwable -> {
+                    if (!dataFuture.isDone()) {
+                        return;
+                    }

Review Comment:
   I agree with @zentol that this doesn't look good, and I would be afraid it could lead to resource leaks.
   
   It looks to me like the issue is that `dataFuture` is cancelled via the chain `PipelinedSubpartition#release()` <- ... <- `ResultPartition#release` <- ... <- `NettyShuffleEnvironment#close`. That happens after `StreamTask#cleanUp`, which is waiting for this future to complete, so we end up in a deadlock.
   
   We would either need to cancel the future sooner (in `StreamTask#cleanUp`?), or do what @zentol proposed. I think the latter is indeed a good option: we don't need to block waiting for the future. Let's just not completely ignore exceptions here; logging an error should be fine.
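
To make the non-blocking option concrete, here is a rough sketch of what I have in mind. It is not the actual Flink code: `DiscardSketch`, `FakeBuffer`, `discard` and `loggedFailures` are hypothetical names used only for illustration, and `java.util.logging` stands in for whatever logger the class really uses. The idea is to register a callback on the future instead of waiting on it, recycle the buffers if they eventually arrive, and log the failure otherwise.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

final class DiscardSketch {
    private static final Logger LOG = Logger.getLogger(DiscardSketch.class.getName());

    /** Hypothetical stand-in for a recyclable network buffer. */
    static final class FakeBuffer {
        volatile boolean recycled;

        void recycle() {
            recycled = true;
        }
    }

    /** Counts logged failures; exists only so the sketch is observable. */
    static final AtomicInteger loggedFailures = new AtomicInteger();

    /**
     * Discards a pending write without blocking the calling thread.
     * The callback runs immediately if the future is already done,
     * otherwise it runs whenever the future completes or is cancelled.
     */
    static void discard(CompletableFuture<List<FakeBuffer>> dataFuture, Throwable cause) {
        dataFuture.whenComplete(
                (buffers, error) -> {
                    if (error != null) {
                        // Do not swallow the failure silently: log it.
                        loggedFailures.incrementAndGet();
                        LOG.log(Level.SEVERE, "Data future failed while discarding after: " + cause, error);
                        return;
                    }
                    if (buffers != null) {
                        // The data arrived after all, so release it to avoid a leak.
                        buffers.forEach(FakeBuffer::recycle);
                    }
                });
    }
}
```

With something along these lines, the cleanup path never waits for `dataFuture`, so the order in which the future is cancelled relative to `StreamTask#cleanUp` no longer matters, and buffers that show up late are still recycled.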


