You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Lijie Wang (Jira)" <ji...@apache.org> on 2023/04/07 04:33:00 UTC
[jira] [Comment Edited] (FLINK-31628) ArrayIndexOutOfBoundsException in watermark processing

    [ https://issues.apache.org/jira/browse/FLINK-31628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709580#comment-17709580 ] 

Lijie Wang edited comment on FLINK-31628 at 4/7/23 4:32 AM:
------------------------------------------------------------

This is caused by FLINK-30544, in which we introduced a heap to help find the minimum watermark. This error occurs when the channel status changes like this:

phase 1: channel becomes IDLE (receives WatermarkStatus.IDLE)
phase 2: channel becomes ACTIVE (receives WatermarkStatus.ACTIVE), but current watermark of the channel is less than the last output watermark (last watermark sent to downstream tasks)
phase 3: channel becomes IDLE again (receives WatermarkStatus.IDLE again)

In phase 1, we remove the channel from the heap because it is idle, should no longer participate in the calculation of the minimum watermark. In phase 2, although the channel becomes active, but its watermark is expired(less than the last output watermark), so we don't add it back to the heap. And then in phase 3, we try to remove the channel again, but the channel is not in the heap, which causes the above problem.


was (Author: wanglijie95):
This is caused by FLINK-30544, in which we introduced a heap to help find the minimum watermark. This error occurs when the channel status changes like this:

phase 1. channel becomes IDLE (receives WatermarkStatus.IDLE)
phase 2. channel becomes ACTIVE (receives WatermarkStatus.ACTIVE), but current watermark of the channel is less than the last output watermark (last watermark sent to downstream tasks)
phase 3. channel becomes IDLE again (receives WatermarkStatus.IDLE again)

In phase 1, we remove the channel from the heap because it is idle, should no longer participate in the calculation of the minimum watermark. In phase 2, although the channel becomes active, but its watermark is expired(less than the last output watermark), so we don't add it back to the heap. And then in phase 3, we try to remove the channel again, but the channel is not in the heap, which causes the above problem.

> ArrayIndexOutOfBoundsException in watermark processing
> ------------------------------------------------------
>
>                 Key: FLINK-31628
>                 URL: https://issues.apache.org/jira/browse/FLINK-31628
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.17.0
>         Environment: Kubernetes with Flink operator 1.4.0.
>            Reporter: Michael Helmling
>            Assignee: Lijie Wang
>            Priority: Major
>             Fix For: 1.18.0, 1.17.1
>
>
> After upgrading a job from Flink 1.16.1 to 1.17.0, my task managers throw the following exception:
>  
>  
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: Index -2147483648 out of bounds for length 5
> 	at org.apache.flink.streaming.runtime.watermarkstatus.HeapPriorityQueue.removeInternal(HeapPriorityQueue.java:155)
> 	at org.apache.flink.streaming.runtime.watermarkstatus.HeapPriorityQueue.remove(HeapPriorityQueue.java:100)
> 	at org.apache.flink.streaming.runtime.watermarkstatus.StatusWatermarkValve$InputChannelStatus.removeFrom(StatusWatermarkValve.java:300)
> 	at org.apache.flink.streaming.runtime.watermarkstatus.StatusWatermarkValve$InputChannelStatus.access$200(StatusWatermarkValve.java:266)
> 	at org.apache.flink.streaming.runtime.watermarkstatus.StatusWatermarkValve.markWatermarkUnaligned(StatusWatermarkValve.java:222)
> 	at org.apache.flink.streaming.runtime.watermarkstatus.StatusWatermarkValve.inputWatermarkStatus(StatusWatermarkValve.java:140)
> 	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:153)
> 	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
> 	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
> 	at org.apache.flink.streaming.runtime.io.StreamMultipleInputProcessor.processInput(StreamMultipleInputProcessor.java:85)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
> 	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
> 	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
> 	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> 	at java.base/java.lang.Thread.run(Unknown Source){code}
> I never saw this before. The job has multiple Kafka inputs, but doesn't use watermark alignment.
>  
>  
> Initially reported [on Slack|https://apache-flink.slack.com/archives/C03G7LJTS2G/p1679908171461309], where a relation to FLINK-28853 was suspected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)