You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Thomas Weise (Jira)" <ji...@apache.org> on 2022/11/28 23:37:00 UTC
[jira] [Resolved] (FLINK-29099) Deadlock for Single Subtask in Kinesis Consumer
[ https://issues.apache.org/jira/browse/FLINK-29099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Weise resolved FLINK-29099.
----------------------------------
Fix Version/s: 1.17.0
Resolution: Fixed
> Deadlock for Single Subtask in Kinesis Consumer
> -----------------------------------------------
>
> Key: FLINK-29099
> URL: https://issues.apache.org/jira/browse/FLINK-29099
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Kinesis
> Affects Versions: 1.9.3, 1.10.3, 1.11.6, 1.12.7, 1.13.6, 1.14.5, 1.15.3
> Reporter: seth saperstein
> Priority: Minor
> Labels: connector, consumer, kinesis, pull-request-available
> Fix For: 1.17.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Deadlock is reached as the result of:
> * max lookahead reached for local watermark
> * idle state for subtask
> The lookahead prevents the RecordEmitter from emitting a new record. The idle state prevents the global watermark from being updated.
> To exit this deadlock state, we need to complete the [TODO here|https://github.com/apache/flink/blob/221d70d9930f72147422ea24b399f006ebbfb8d7/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/internals/KinesisDataFetcher.java#L1268] which updates the global watermark while the subtask is marked idle, which will then allow us to emit a record again as the lookahead is no longer reached.
>
> *Context:*
> We reached this scenario at Lyft as a result of prolonged CPU throttling on all FlinkKinesisConsumer threads for multiple minutes.
> Walking through the series of events for a single subtask:
> * prolonged CPU throttling occurs and no logs are seen from any FlinkKinesisConsumer thread for up to 15 minutes
> * after CPU throttling the subtask is marked idle
> * the subtask has reached the lookahead for its local watermark relative to the global watermark
> * WatermarkSyncCallback indicates the subtask as idle and does not update the global watermark
> * emitQueue fills to max
> * RecordEmitter cannot emit records due to the max lookahead
> * Deadlock on subtask
> At this point, we had not realized what had happened and processing of all other shards/subtasks had continued for multiple days. When we finally restarted the application, we saw the following behavior:
> * global watermark recalculated after all subtasks consumed data based on the last kinesis record sequence number
> * global watermark moved back in time multiple days, to when the subtask was first marked idle
> * the single subtask processed data while all others remained idle due to the lookahead
> This would have continued until the subtask had caught up to the others and thus the global watermark is within reach of the lookahead for other subtasks.
>
> *Repro:*
> Too difficult to repro the exact scenario.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)