Posted to dev@flink.apache.org by "seth saperstein (Jira)" <ji...@apache.org> on 2022/08/24 19:06:00 UTC
[jira] [Created] (FLINK-29099) Deadlock for Single Subtask in Kinesis Consumer

seth saperstein created FLINK-29099:
---------------------------------------

             Summary: Deadlock for Single Subtask in Kinesis Consumer
                 Key: FLINK-29099
                 URL: https://issues.apache.org/jira/browse/FLINK-29099
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: 1.15.2, 1.14.5, 1.13.6, 1.12.7, 1.11.6, 1.10.3, 1.9.3
            Reporter: seth saperstein


A single subtask deadlocks when both of the following hold:
 * the max lookahead has been reached for its local watermark
 * the subtask is in the idle state

The lookahead prevents the RecordEmitter from emitting a new record. The idle state prevents the global watermark from being updated.

To exit this deadlock state, we need to complete the [TODO here|https://github.com/apache/flink/blob/221d70d9930f72147422ea24b399f006ebbfb8d7/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/internals/KinesisDataFetcher.java#L1268], which would update the global watermark while the subtask is marked idle. The subtask could then emit records again, because the lookahead would no longer be reached.
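
The interaction between the two conditions can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the connector's code: the class, field, and method names (IdleLookaheadDeadlockSketch, globalWatermarkView, sync, refreshWhileIdle) and the 10-second lookahead are hypothetical, and the model assumes that the global watermark has moved forward elsewhere while the idle subtask kept its old value.

```java
/**
 * Simplified, hypothetical model of one consumer subtask.
 * None of these names are taken from the Kinesis connector.
 */
final class IdleLookaheadDeadlockSketch {

    /** Assumed maximum distance (ms) a record may be ahead of the global watermark. */
    static final long LOOKAHEAD_MS = 10_000L;

    /** The global watermark as last seen by this subtask. */
    long globalWatermarkView;

    /** Whether the subtask is currently marked idle. */
    boolean idle;

    IdleLookaheadDeadlockSketch(long initialGlobalWatermark) {
        this.globalWatermarkView = initialGlobalWatermark;
    }

    /** Emitter rule: emit only while the record is within the lookahead. */
    boolean canEmit(long recordTimestamp) {
        return recordTimestamp <= globalWatermarkView + LOOKAHEAD_MS;
    }

    /**
     * Periodic sync step. With refreshWhileIdle == false an idle subtask keeps
     * its stale view (the behavior described in this issue); with true it
     * picks up the latest global watermark (what the TODO asks for).
     */
    void sync(long latestGlobalWatermark, boolean refreshWhileIdle) {
        if (idle && !refreshWhileIdle) {
            return; // idle: the view of the global watermark is not updated
        }
        globalWatermarkView = latestGlobalWatermark;
    }

    public static void main(String[] args) {
        IdleLookaheadDeadlockSketch subtask = new IdleLookaheadDeadlockSketch(1_000L);
        subtask.idle = true;
        long headOfQueue = 1_000L + LOOKAHEAD_MS + 1; // just beyond the lookahead

        subtask.sync(5_000L, false);
        System.out.println("without refresh: canEmit=" + subtask.canEmit(headOfQueue));

        subtask.sync(5_000L, true);
        System.out.println("with refresh:    canEmit=" + subtask.canEmit(headOfQueue));
    }
}
```

In this model the record at the head of the queue stays blocked for as long as the idle subtask skips the refresh, and becomes emittable as soon as the refresh is allowed while idle.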

 

*Context:*

We reached this scenario at Lyft as a result of prolonged CPU throttling on all FlinkKinesisConsumer threads for multiple minutes.

Walking through the series of events:
 * prolonged CPU throttling occurs and no logs are seen from any FlinkKinesisConsumer thread for up to 15 minutes
 * after the CPU throttling, the subtask is marked idle
 * the subtask has reached the max lookahead for its local watermark relative to the global watermark
 * WatermarkSyncCallback reports the subtask as idle and does not update the global watermark
 * emitQueue fills to its maximum
 * RecordEmitter cannot emit records due to the max lookahead
 * the subtask is deadlocked

At the time, we did not realize what had happened, and processing of all other shards/subtasks continued for multiple days. When we finally restarted the application, we saw the following behavior:
 * the global watermark was recalculated after all subtasks consumed data based on the last Kinesis record sequence number
 * the global watermark moved back in time multiple days, to when the subtask was first marked idle
 * the single subtask processed data while all others remained idle due to the lookahead

This would have continued until the subtask caught up to the others, bringing the global watermark back within the lookahead of the other subtasks.
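
The catch-up phase after the restart can be sketched in the same spirit. The model below assumes that the global watermark is the minimum of the subtasks' local watermarks, which is consistent with the behavior described above but is stated here as an assumption. All names and the 10-minute lookahead are hypothetical.

```java
/**
 * Hypothetical model of the post-restart behavior: one subtask is days
 * behind, so the global watermark falls back and the others must wait.
 */
final class RestartCatchUpSketch {

    /**
     * Assumption: the global watermark is the minimum of all local watermarks.
     * Expects at least one subtask.
     */
    static long globalWatermark(long[] localWatermarks) {
        long min = Long.MAX_VALUE;
        for (long watermark : localWatermarks) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    /** A subtask may emit only if its next record is within the lookahead. */
    static boolean[] mayEmit(long[] nextRecordTimestamps, long globalWatermark, long lookaheadMs) {
        boolean[] result = new boolean[nextRecordTimestamps.length];
        for (int i = 0; i < nextRecordTimestamps.length; i++) {
            result[i] = nextRecordTimestamps[i] <= globalWatermark + lookaheadMs;
        }
        return result;
    }

    public static void main(String[] args) {
        long threeDaysMs = 3L * 24 * 60 * 60 * 1000;
        long lookaheadMs = 10L * 60 * 1000; // assumed 10-minute lookahead

        // Subtask 0 stalled at time 0; subtasks 1 and 2 kept going for three days.
        long[] localWatermarks = {0L, threeDaysMs, threeDaysMs};
        long[] nextRecords = {1_000L, threeDaysMs + 1_000L, threeDaysMs + 1_000L};

        long global = globalWatermark(localWatermarks);
        boolean[] allowed = mayEmit(nextRecords, global, lookaheadMs);
        for (int i = 0; i < allowed.length; i++) {
            System.out.println("subtask " + i + " may emit: " + allowed[i]);
        }
    }
}
```

With these inputs only subtask 0 is allowed to emit. The other subtasks become eligible again once subtask 0's local watermark, and with it the minimum, comes within the lookahead of their next records.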
