Posted to issues@flink.apache.org by "Julien Tournay (Jira)" <ji...@apache.org> on 2023/04/03 09:03:00 UTC
[jira] [Comment Edited] (FLINK-31144) Slow scheduling on large-scale batch jobs
[ https://issues.apache.org/jira/browse/FLINK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704538#comment-17704538 ]
Julien Tournay edited comment on FLINK-31144 at 4/3/23 9:02 AM:
----------------------------------------------------------------
Hey [~martijnvisser] this is great! Thanks for sharing :)
Spotify is in a good position to identify inefficiencies and possible optimizations on high parallelism jobs.
I'm hopeful we can contribute to improving the performance further in the future!
was (Author: jto):
Hey [~martijnvisser] this is great! Thanks for sharing :)
Spotify is in a good position to identify inefficiencies and possible optimizations on high parallelism jobs.
I'm hopefully we can contribute to improve he performances further in the future!
> Slow scheduling on large-scale batch jobs
> ------------------------------------------
>
> Key: FLINK-31144
> URL: https://issues.apache.org/jira/browse/FLINK-31144
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.17.0, 1.15.3, 1.16.1
> Reporter: Julien Tournay
> Assignee: Junrui Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.17.0
>
> Attachments: Screenshot 2023-03-13 at 14.22.27.png, flink-1.17-snapshot-1676473798013.nps, image-2023-02-21-10-29-49-388.png
>
>
> When executing a complex job graph at high parallelism, `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can get slow and cause long pauses during which the JobManager becomes unresponsive and all the TaskManagers just wait. I've attached a VisualVM snapshot to illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps]
> At Spotify we have complex batch jobs where this issue can cause pauses of 40+ minutes and make the overall execution 30% slower or more.
> More importantly, this prevents us from running said jobs on larger clusters, as adding resources to the cluster worsens the issue.
> We have successfully tested a modified Flink version in which the body of `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` is commented out and the method simply returns an empty collection, and confirmed that it solves the issue.
> In the same spirit as a recent change ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102]), there could be a mechanism in place to detect when Flink runs into this specific issue and just skip the call to `getInputLocationFutures` ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108]).
> I'm not familiar enough with the internals of Flink to propose a more advanced fix; however, it seems that a configurable threshold on the number of consumer vertices, above which the preferred location is not computed, would do. If this solution is good enough, I'd be happy to submit a PR.
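To make the threshold idea above concrete, here is a minimal, self-contained sketch. It is not Flink's actual code: the class `PreferredLocationSketch`, the nested `InputGroup` type, and the `maxConsumerVertices` setting are all hypothetical names made up for illustration, and locations are plain strings rather than Flink's own types. It only shows the shape of the proposal: skip the locality computation for any input whose consumer count exceeds a configured limit.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical sketch of a consumer-count threshold; not part of Flink. */
class PreferredLocationSketch {

    /** Hypothetical stand-in for one group of consumed inputs. */
    static final class InputGroup {
        final int consumerVertexCount;        // how many vertices consume this group
        final List<String> producerLocations; // where the producers run

        InputGroup(int consumerVertexCount, List<String> producerLocations) {
            this.consumerVertexCount = consumerVertexCount;
            this.producerLocations = producerLocations;
        }
    }

    // Assumed configurable limit; the option name and default would be up to Flink.
    private final int maxConsumerVertices;

    PreferredLocationSketch(int maxConsumerVertices) {
        this.maxConsumerVertices = maxConsumerVertices;
    }

    /** Collects producer locations, ignoring groups with too many consumers. */
    Collection<String> preferredLocations(List<InputGroup> inputs) {
        List<String> result = new ArrayList<>();
        for (InputGroup group : inputs) {
            if (group.consumerVertexCount > maxConsumerVertices) {
                // Above the threshold: skip the lookup instead of paying for it.
                continue;
            }
            result.addAll(group.producerLocations);
        }
        return result;
    }
}
```

With a threshold of 0 every input that has at least one consumer is skipped and the method returns an empty collection, which matches the workaround tested in the modified Flink build described above.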
--
This message was sent by Atlassian Jira
(v8.20.10#820010)