You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Wilson Wu (Jira)" <ji...@apache.org> on 2022/10/05 00:44:00 UTC
[jira] [Updated] (FLINK-29365) Millisecond behind latest jumps after Flink 1.15.2 upgrade

     [ https://issues.apache.org/jira/browse/FLINK-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilson Wu updated FLINK-29365:
------------------------------
    Attachment: 2022-09-14T17_00_00.000Z - 2022-09-14T18_03_44.089Z_14.4-15.2.numbers
                2022-09-15T12_49_54.686Z - 2022-09-15T14_57_44.089Z_15.2-15.2.numbers

> Millisecond behind latest jumps after Flink 1.15.2 upgrade
> ----------------------------------------------------------
>
>                 Key: FLINK-29365
>                 URL: https://issues.apache.org/jira/browse/FLINK-29365
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis
>    Affects Versions: 1.15.2
>         Environment: Redeployment from 1.14.4 to 1.15.2
>            Reporter: Wilson Wu
>            Priority: Major
>         Attachments: 2022-09-14T17_00_00.000Z - 2022-09-14T18_03_44.089Z_14.4-15.2.numbers, 2022-09-15T12_49_54.686Z - 2022-09-15T14_57_44.089Z_15.2-15.2.numbers, Screen Shot 2022-09-19 at 2.50.56 PM.png
>
>
> (First time filling a ticket in Flink community, please let me know if there are any guidelines I need to follow)
> I noticed a very strange behavior with a recent version bump from Flink 1.14.4 to 1.15.2. My project consumes around 30K records per second from a sharded kinesis stream, and during the version upgrade, it will follow the best practice to first trigger a savepoint from the running job, start the new job from the savepoint and then remove the old job. So far so good, and the above logic has been tested multiple times without any issue for 1.14.4. Usually, after the version upgrade, our job will have a few minutes delay for millisecond behind latest, but it will catch up with the speed quickly(within 30mins). Our savepoint is around one hundred MBs big, and our job DAG will become 90 - 100% busy with some backpressure when we redeploy but after 10-20 minutes it goes back to normal.
> Then the strange thing happened, when I tried to redeploy with 1.15.2 upgrade from a running 1.14.4 job, I can see a savepoint has been created and the new job is running, all the metrics look fine, except suddenly [millisecond behind the latest|https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html] jumps to 10 hours!! and it takes days for my application to catch up with the kinesis stream latest record. I don't understand why it jumps from 0 second to 10+ hours when we restart the new job. The only main change I introduced with version bump is to change [failOnError|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/connector/kinesis/sink/KinesisStreamsSink.html] from true to false, but I don't think this is the root cause.
> I tried to redeploy the new 1.15.2 job by changing our parallelism, redeploying a job from 1.15.2 does not introduce a big delay, so I assume the issue above only happens when we bump version from 1.14.4 to 1.15.2(note the attached screenshot)? I did try to bump it twice and I see the same 10hrs+ jump in delay, we do not have changes related to any timezones.
> Please let me know if this can be filled as a bug, as I do not have a running project with all the kinesis setup available that can reproduce the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)