Posted to issues@flink.apache.org by "Liu (Jira)" <ji...@apache.org> on 2023/04/28 06:44:00 UTC

[jira] [Comment Edited] (FLINK-30773) Add API for rescaling of jobs based on per-vertex parallelism overrides

    [ https://issues.apache.org/jira/browse/FLINK-30773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717449#comment-17717449 ] 

Liu edited comment on FLINK-30773 at 4/28/23 6:43 AM:
------------------------------------------------------

[~mxm] The rescaling API is useful in many situations, such as:
 * Increasing the parallelism when the job load is high. Because latency is important for streaming systems, the rescaling API can help us resolve this quickly.
 * Decreasing the parallelism when the job does not need so many resources. On a large cluster, this can save a lot of resources and therefore a lot of money.
 * Reducing the stop-the-world time when changing the job's parallelism. This is important for long-running streaming computations.

In fact, we have implemented a rescaling API at our company for the default scheduler. The general idea is similar to yours. I understand your considerations about state migration for operators, but waiting for that may take a long time. A rescaling API would be valuable for many users. Could we consider implementing it in the near term?

Looking forward to your reply. Thanks.

One more thing: I wonder whether we can make rescaling a common API for all schedulers?


was (Author: jiangang):
[~mxm] The rescaling API is useful in many situations, such as:
 * Increasing the parallelism when the job load is high. Because latency is important for streaming systems, the rescaling API can help us resolve this quickly.
 * Decreasing the parallelism when the job does not need so many resources. On a large cluster, this can save a lot of resources and therefore a lot of money.
 * Reducing the stop-the-world time when changing the job's parallelism. This is important for long-running streaming computations.

In fact, we have implemented a rescaling API at our company. The general idea is similar to yours. I understand your considerations about state migration for operators, but waiting for that may take a long time. A rescaling API would be valuable for many users. Could we consider implementing it in the near term?

Looking forward to your reply. Thanks.

> Add API for rescaling of jobs based on per-vertex parallelism overrides
> -----------------------------------------------------------------------
>
>                 Key: FLINK-30773
>                 URL: https://issues.apache.org/jira/browse/FLINK-30773
>             Project: Flink
>          Issue Type: New Feature
>          Components: Autoscaler, Runtime / Coordination, Runtime / REST
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>         Attachments: meces.patch
>
>
> FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism overrides map. This feature is already used today by the Autoscaler of the Flink Kubernetes operator. However, it requires a full restart of the Flink job and only supports the application deployment mode.
> In a K8s environment, this is inefficient because all pods for a deployment will be surrendered. Upon restart, they have to be re-acquired. In addition to being slow, this can also lead to situations where resource constraints prevent a restart from executing properly.
> Ideally, we would want the following to happen on receiving a rescale request:
>  # Rescale API receives a request with a parallelism overrides map (vertexId => parallelism) for a jobId
>  # Compute the number of required task slots using the overrides and the current JobGraph
>  ## If the total number of task slots for the cluster is less than the required number of task slots of the rescale, acquire the missing task slots. Otherwise, do nothing
>  ## Wait for new task slots to become available
>  ## Abort rescale request on timeout
>  # Redeploy the JobGraph / Tasks with the updated parallelisms
>  # Surrender any unused task slots in case of scaling down
>  
> There is an existing "Rescaling" API which is currently disabled. We have to evaluate whether reusing it makes sense.
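
The slot accounting in steps 1, 2 and 5 of the issue description can be sketched roughly as follows. This is only an illustration, not Flink's implementation: the class and method names (RescaleSketch, applyOverrides, requiredSlots, missingSlots, surplusSlots) are hypothetical, vertex IDs are plain strings, and the slot count assumes that all vertices share a single slot sharing group, so the job needs as many slots as its highest vertex parallelism. Waiting for slots, the timeout, and the redeployment itself are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the slot bookkeeping for a rescale request.
// None of these names are part of Flink's API.
class RescaleSketch {

    // Step 1: merge the overrides map (vertexId => parallelism) into the
    // job's current per-vertex parallelisms. Unknown vertices and
    // non-positive parallelisms are rejected.
    static Map<String, Integer> applyOverrides(
            Map<String, Integer> current, Map<String, Integer> overrides) {
        Map<String, Integer> updated = new HashMap<>(current);
        for (Map.Entry<String, Integer> e : overrides.entrySet()) {
            if (!current.containsKey(e.getKey())) {
                throw new IllegalArgumentException("Unknown vertex: " + e.getKey());
            }
            if (e.getValue() < 1) {
                throw new IllegalArgumentException("Parallelism must be >= 1");
            }
            updated.put(e.getKey(), e.getValue());
        }
        return updated;
    }

    // Step 2: required task slots. ASSUMPTION: one slot sharing group,
    // so the requirement is the maximum vertex parallelism.
    static int requiredSlots(Map<String, Integer> parallelisms) {
        return parallelisms.values().stream()
                .mapToInt(Integer::intValue)
                .max()
                .orElse(0);
    }

    // Step 2a: slots that still have to be acquired (0 if enough exist).
    static int missingSlots(int required, int totalSlots) {
        return Math.max(0, required - totalSlots);
    }

    // Step 5: slots that can be surrendered after scaling down.
    static int surplusSlots(int required, int totalSlots) {
        return Math.max(0, totalSlots - required);
    }

    public static void main(String[] args) {
        Map<String, Integer> current = new HashMap<>();
        current.put("source", 2);
        current.put("map", 4);
        current.put("sink", 2);

        Map<String, Integer> overrides = new HashMap<>();
        overrides.put("map", 8); // scale up the "map" vertex

        int totalSlots = 4; // slots currently held by the cluster
        int required = requiredSlots(applyOverrides(current, overrides));
        System.out.println("required=" + required
                + " missing=" + missingSlots(required, totalSlots)
                + " surplus=" + surplusSlots(required, totalSlots));
    }
}
```

With the example values, scaling "map" from 4 to 8 on a cluster holding 4 slots requires 8 slots, so 4 more would have to be acquired before redeploying. Scaling the same vertex down to 1 would leave 2 of the 4 slots free to surrender.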



--
This message was sent by Atlassian Jira
(v8.20.10#820010)