You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Thomas Weise (Jira)" <ji...@apache.org> on 2022/11/28 23:37:00 UTC

[jira] [Resolved] (FLINK-29099) Deadlock for Single Subtask in Kinesis Consumer

     [ https://issues.apache.org/jira/browse/FLINK-29099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Weise resolved FLINK-29099.
----------------------------------
    Fix Version/s: 1.17.0
       Resolution: Fixed

> Deadlock for Single Subtask in Kinesis Consumer
> -----------------------------------------------
>
>                 Key: FLINK-29099
>                 URL: https://issues.apache.org/jira/browse/FLINK-29099
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis
>    Affects Versions: 1.9.3, 1.10.3, 1.11.6, 1.12.7, 1.13.6, 1.14.5, 1.15.3
>            Reporter: seth saperstein
>            Priority: Minor
>              Labels: connector, consumer, kinesis, pull-request-available
>             Fix For: 1.17.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Deadlock is reached as the result of:
>  * max lookahead reached for local watermark
>  * idle state for subtask
> The lookahead prevents the RecordEmitter from emitting a new record. The idle state prevents the global watermark from being updated.
> To exit this deadlock state, we need to complete the [TODO here|https://github.com/apache/flink/blob/221d70d9930f72147422ea24b399f006ebbfb8d7/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/internals/KinesisDataFetcher.java#L1268] which updates the global watermark while the subtask is marked idle, which will then allow us to emit a record again as the lookahead is no longer reached.
>  
> *Context:*
> We reached this scenario at Lyft as a result of prolonged CPU throttling on all FlinkKinesisConsumer threads for multiple minutes.
> Walking through the series of events for a single subtask:
>  * prolonged CPU throttling occurs and no logs are seen from any FlinkKinesisConsumer thread for up to 15 minutes
>  * after CPU throttling the subtask is marked idle
>  * the subtask has reached the lookahead for its local watermark relative to the global watermark
>  * WatermarkSyncCallback indicates the subtask as idle and does not update the global watermark
>  * emitQueue fills to max
>  * RecordEmitter cannot emit records due to the max lookahead
>  * Deadlock on subtask
> At this point, we had not realized what had happened and processing of all other shards/subtasks had continued for multiple days. When we finally restarted the application, we saw the following behavior:
>  * global watermark recalculated after all subtasks consumed data based on the last kinesis record sequence number
>  * global watermark moved back in time multiple days, to when the subtask was first marked idle
>  * the single subtask processed data while all others remained idle due to the lookahead
> This would have continued until the subtask had caught up to the others and thus the global watermark is within reach of the lookahead for other subtasks.
>  
> *Repro:*
> Too difficult to repro the exact scenario.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)