Posted to issues@flink.apache.org by "Hangxiang Yu (Jira)" <ji...@apache.org> on 2023/04/28 03:03:00 UTC

[jira] [Commented] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scale down via autoscaler

    [ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717440#comment-17717440 ] 

Hangxiang Yu commented on FLINK-31963:
--------------------------------------

We also saw a similar exception when manually scaling down with unaligned checkpoints enabled.
This issue is related to unaligned checkpoint rescaling.
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 54
at org.apache.flink.runtime.io.network.partition.PipelinedResultPartition.getCheckpointedSubpartition(PipelinedResultPartition.java:183)
at org.apache.flink.runtime.checkpoint.channel.ResultSubpartitionRecoveredStateHandler.getSubpartition(RecoveredChannelStateHandler.java:222)
at org.apache.flink.runtime.checkpoint.channel.ResultSubpartitionRecoveredStateHandler.lambda$calculateMapping$1(RecoveredChannelStateHandler.java:237)
at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
at java.util.Spliterators$IntArraySpliterator.forEachRemaining(Spliterators.java:1032)
at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.apache.flink.runtime.checkpoint.channel.ResultSubpartitionRecoveredStateHandler.calculateMapping(RecoveredChannelStateHandler.java:238)
at java.util.HashMap.computeIfAbsent(HashMap.java:1126)
at org.apache.flink.runtime.checkpoint.channel.ResultSubpartitionRecoveredStateHandler.getMappedChannels(RecoveredChannelStateHandler.java:227)
at org.apache.flink.runtime.checkpoint.channel.ResultSubpartitionRecoveredStateHandler.getBuffer(RecoveredChannelStateHandler.java:182)
at org.apache.flink.runtime.checkpoint.channel.ResultSubpartitionRecoveredStateHandler.getBuffer(RecoveredChannelStateHandler.java:157)
at org.apache.flink.runtime.checkpoint.channel.ChannelStateChunkReader.readChunk(SequentialChannelStateReaderImpl.java:198)
at org.apache.flink.runtime.checkpoint.channel.SequentialChannelStateReaderImpl.readSequentially(SequentialChannelStateReaderImpl.java:107)
at org.apache.flink.runtime.checkpoint.channel.SequentialChannelStateReaderImpl.read(SequentialChannelStateReaderImpl.java:93)
at org.apache.flink.runtime.checkpoint.channel.SequentialChannelStateReaderImpl.readOutputData(SequentialChannelStateReaderImpl.java:79)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:704)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:683)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:650)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:954)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:923)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:568)
at java.lang.Thread.run(Thread.java:834)
{code}
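
For readers unfamiliar with this failure shape, here is a minimal sketch of how such an exception can arise in general. It is not Flink's code: the class RescaleIndexSketch, its Partition type, and the resolve method are hypothetical names, and the new parallelism of 50 is an assumed value chosen only so that index 54 (taken from the exception message above) falls outside the array. The idea is that subpartition indices recorded before the scale-down are looked up in an array sized for the smaller, post-rescale parallelism.

```java
import java.util.Arrays;

public class RescaleIndexSketch {

    // Hypothetical stand-in for a partition that holds one slot per downstream subtask.
    static final class Partition {
        final String[] subpartitions;

        Partition(int parallelism) {
            subpartitions = new String[parallelism];
            for (int i = 0; i < parallelism; i++) {
                subpartitions[i] = "subpartition-" + i;
            }
        }

        // Unchecked array lookup: a stale index raises ArrayIndexOutOfBoundsException.
        String get(int index) {
            return subpartitions[index];
        }
    }

    // Resolves each recovered index through a stream, similar in shape to the
    // stream-and-collect frames visible in the stack trace.
    static String[] resolve(Partition partition, int[] recoveredIndices) {
        return Arrays.stream(recoveredIndices)
                .mapToObj(partition::get)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        Partition afterScaleDown = new Partition(50); // assumed new parallelism
        int[] staleIndices = {3, 54};                 // 54 comes from the exception message
        try {
            resolve(afterScaleDown, staleIndices);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("stale index rejected: " + e.getMessage());
        }
    }
}
```

Whether the real cause is a stale index in the recovered channel state or an incorrectly computed mapping cannot be told from the trace alone; the sketch only shows why an index valid at the old parallelism fails at the new one.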

> java.lang.ArrayIndexOutOfBoundsException when scale down via autoscaler
> -----------------------------------------------------------------------
>
>                 Key: FLINK-31963
>                 URL: https://issues.apache.org/jira/browse/FLINK-31963
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator, Runtime / Checkpointing
>    Affects Versions: 1.17.0
>         Environment: Flink: 1.17.0
> FKO: 1.4.0
> StateBackend: RocksDB (Generic Incremental Checkpoint & Unaligned Checkpoint enabled)
>            Reporter: Tan Kim
>            Priority: Critical
>              Labels: stability
>         Attachments: jobmanager_error.txt, taskmanager_error.txt
>
>
> I'm testing Autoscaler through Kubernetes Operator and I'm facing the following issue.
> As you know, when a job is scaled down through the autoscaler, the job manager and task manager go down and then back up again.
> When this happens, an index out of bounds exception is thrown and the state is not restored from a checkpoint.
> [~gyfora] told me via the Flink Slack troubleshooting channel that this is likely an issue with Unaligned Checkpoint and not an issue with the autoscaler, but I'm opening a ticket with Gyula for more clarification.
> Please see the attached JM and TM error logs.
> Thank you.


