Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2023/04/28 09:52:00 UTC

[jira] [Comment Edited] (FLINK-31963) java.lang.ArrayIndexOutOfBoundsException when scale down via autoscaler

    [ https://issues.apache.org/jira/browse/FLINK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717586#comment-17717586 ] 

Piotr Nowojski edited comment on FLINK-31963 at 4/28/23 9:51 AM:
-----------------------------------------------------------------

Yes, I agree it looks like a problem with unaligned checkpoints. [~tanee.kim] could you clarify a couple of things?
* Can you reproduce this issue? Or did it happen only once? Or maybe a couple of times, but not always? If you can reproduce, can you post steps to reproduce?
* Could you share a job graph for which this error happened?
* What are the parallelism values of all the tasks before and after the rescale?
* Which task/subtask is this error being thrown from?


was (Author: pnowojski):
Yes, I agree it looks like a problem with unaligned checkpoints. Could you clarify a couple of things?
* Can you reproduce this issue? Or did it happen only once? Or maybe a couple of times, but not always? If you can reproduce, can you post steps to reproduce?
* Could you share a job graph for which this error happened?
* What are the parallelism values of all the tasks before and after the rescale?
* Which task/subtask is this error being thrown from?

> java.lang.ArrayIndexOutOfBoundsException when scale down via autoscaler
> -----------------------------------------------------------------------
>
>                 Key: FLINK-31963
>                 URL: https://issues.apache.org/jira/browse/FLINK-31963
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0
>         Environment: Flink: 1.17.0
> FKO: 1.4.0
> StateBackend: RocksDB (Generic Incremental Checkpoint & Unaligned Checkpoint enabled)
>            Reporter: Tan Kim
>            Priority: Critical
>              Labels: stability
>         Attachments: jobmanager_error.txt, taskmanager_error.txt
>
>
> I'm testing Autoscaler through Kubernetes Operator and I'm facing the following issue.
> As you know, when a job is scaled down through the autoscaler, the job manager and task manager go down and then back up again.
> When this happens, an index out of bounds exception is thrown and the state is not restored from a checkpoint.
> [~gyfora] told me via the Flink Slack troubleshooting channel that this is likely an issue with Unaligned Checkpoint and not an issue with the autoscaler, but I'm opening a ticket with Gyula for more clarification.
> Please see the attached JM and TM error logs.
> Thank you.


