Posted to issues@flink.apache.org by "Kyungmin Kim (Jira)" <ji...@apache.org> on 2023/04/24 05:19:00 UTC

[jira] [Comment Edited] (FLINK-31898) Flink k8s autoscaler does not work as expected

    [ https://issues.apache.org/jira/browse/FLINK-31898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715575#comment-17715575 ] 

Kyungmin Kim edited comment on FLINK-31898 at 4/24/23 5:18 AM:
---------------------------------------------------------------

[~Zhanghao Chen] 

Hi, the metric data I provided above has expired, so here is information from a new experiment.

The experiment was run between *10:30 and 12:00*, and I used Kafka as the data source.

!image-2023-04-24-13-27-17-478.png|width=249,height=216!!image-2023-04-24-13-28-15-462.png|width=463,height=86!!image-2023-04-24-13-31-06-420.png|width=301,height=125!

Please note when the scaling to a new parallelism happens and how the lag count fluctuates.

As I described already, each task can process 10 TPS and I'm producing about 35 RPS.

I'm not sure why the autoscaler does not stop when it reaches parallelism 4 (the 4->8 scaling happens around 11:20).
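
For context, here is the back-of-the-envelope arithmetic behind my expectation. This is just a rough sketch with my own numbers and my own reading of the utilization/boundary settings, not the operator's actual decision code:
{code:java}
public class ParallelismCheck {
    public static void main(String[] args) {
        double inputRate = 35.0;        // records/s I produce into Kafka
        double perTaskRate = 10.0;      // records/s a single task can process (intentionally throttled)
        double targetUtilization = 0.6; // kubernetes.operator.job.autoscaler.target.utilization
        double boundary = 0.3;          // kubernetes.operator.job.autoscaler.target.utilization.boundary

        // Minimum parallelism that can keep up with the input rate.
        int minParallelism = (int) Math.ceil(inputRate / perTaskRate);               // 4
        // Expected busyness at that parallelism.
        double utilizationAtMin = inputRate / (minParallelism * perTaskRate);        // 0.875
        // My assumption: no further scale-up while utilization stays below target + boundary.
        boolean withinUpperBound = utilizationAtMin <= targetUtilization + boundary; // 0.875 <= ~0.9

        System.out.printf("min parallelism=%d, expected utilization=%.3f, within band=%b%n",
                minParallelism, utilizationAtMin, withinUpperBound);
    }
}
{code}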

 

Busy time for the map operator:

!image-2023-04-24-13-41-43-040.png|width=268,height=201!

In/out per second for the map operator:

!image-2023-04-24-13-42-40-124.png|width=686,height=250!

In/out per second for the source operator:

!image-2023-04-24-14-18-12-450.png|width=673,height=248!

Some other metrics that might help:

!image-2023-04-24-13-43-49-431.png|width=307,height=265!!image-2023-04-24-13-44-17-479.png|width=277,height=265!

Sorry for sending so much data. Let me know if you need more metrics.

Last but not least, here is my Flink operator configuration:
{code:java}
pipeline.max-parallelism: "8"
kubernetes.operator.job.autoscaler.enabled: "true"
kubernetes.operator.job.autoscaler.stabilization.interval: 6m
kubernetes.operator.job.autoscaler.metrics.window: 3m
kubernetes.operator.job.autoscaler.target.utilization: "0.6"
kubernetes.operator.job.autoscaler.target.utilization.boundary: "0.3"
kubernetes.operator.job.autoscaler.restart.time: 2m
kubernetes.operator.job.autoscaler.catch-up.duration: 6m
kubernetes.operator.job.autoscaler.scale-down.max-factor: "0.3" {code}
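
For reference, this is my reading of how the timing settings above should space out scaling decisions (an assumption on my side, not taken from the operator code):
{code:java}
import java.time.Duration;
import java.time.LocalTime;

public class ScalingTimeline {
    public static void main(String[] args) {
        Duration stabilization = Duration.ofMinutes(6); // ...stabilization.interval
        Duration metricsWindow = Duration.ofMinutes(3); // ...metrics.window

        // Hypothetical rescale time, taken from the charts above.
        LocalTime rescaledAt = LocalTime.of(11, 20);
        // My understanding: after a rescale the autoscaler first waits out the stabilization
        // interval and then needs a full metrics window before the next decision.
        LocalTime earliestNextDecision = rescaledAt.plus(stabilization).plus(metricsWindow);
        System.out.println("Earliest next decision I would expect: " + earliestNextDecision); // 11:29
    }
}
{code}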
Thank you so much and have a nice day!



> Flink k8s autoscaler does not work as expected
> ----------------------------------------------
>
>                 Key: FLINK-31898
>                 URL: https://issues.apache.org/jira/browse/FLINK-31898
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Kyungmin Kim
>            Priority: Major
>         Attachments: image-2023-04-24-10-54-58-083.png, image-2023-04-24-13-27-17-478.png, image-2023-04-24-13-28-15-462.png, image-2023-04-24-13-31-06-420.png, image-2023-04-24-13-41-43-040.png, image-2023-04-24-13-42-40-124.png, image-2023-04-24-13-43-49-431.png, image-2023-04-24-13-44-17-479.png, image-2023-04-24-14-18-12-450.png
>
>
> Hi, I'm using the Flink k8s autoscaler to automatically deploy jobs with the proper parallelism.
> I was using version 1.4, but I found that it does not scale down properly because TRUE_PROCESSING_RATE becomes NaN when the tasks are idle.
> In the main branch, I saw that the code was fixed to set TRUE_PROCESSING_RATE to positive infinity, which makes the scaleFactor very low, so I'm now experimentally running my job with a Docker image built from the main branch of the Flink-k8s-operator repository.
> It now scales down properly, but the problem is that it does not converge to the optimal parallelism: it scales down well but then jumps back up to a high parallelism.
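>
> To double-check my understanding of why the NaN rate blocked scale-down, here is a toy sketch of the plain double arithmetic involved (my own illustration, not the operator's actual code):
> {code:java}
> public class TrueRateSketch {
>     public static void main(String[] args) {
>         double targetRate = 40.0; // hypothetical input rate in records/s
>
>         // On 1.4 an idle task reports TRUE_PROCESSING_RATE as NaN, and any ratio
>         // computed from it is NaN as well, so no meaningful scale-down factor.
>         double scaleFactorNaN = targetRate / Double.NaN;               // NaN
>
>         // On main the rate is set to positive infinity instead, so the ratio
>         // collapses to 0 and a (bounded) scale-down becomes possible.
>         double scaleFactorInf = targetRate / Double.POSITIVE_INFINITY; // 0.0
>
>         System.out.println(scaleFactorNaN + " vs " + scaleFactorInf);
>     }
> }
> {code}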
>  
> Below are the experimental setup and a figure of the resulting parallelism changes.
>  * about 40 RPS
>  * each task can process 10 TPS (intended throttling)
> !image-2023-04-24-10-54-58-083.png|width=999,height=266!
> Even using the default configuration leads to the same result. What more can I do? Thank you.
>  


