Posted to issues@flink.apache.org by "gyfora (via GitHub)" <gi...@apache.org> on 2023/03/06 12:14:18 UTC

[GitHub] [flink-kubernetes-operator] gyfora opened a new pull request, #544: [FLINK-31318] Improve stability during backlog processing

gyfora opened a new pull request, #544:
URL: https://github.com/apache/flink-kubernetes-operator/pull/544

   ## What is the purpose of the change
   
   When a job is catching up (processing backlog), the best thing the autoscaler can do is remain stable as long as the job can keep up with the required / target data rate.
   
   Currently, however, the operator may scale down some underutilized vertices, which triggers a restart and creates even more backlog, putting the job in a vicious cycle.
   
   It also does not make sense to scale vertices up even further for minimal utilization gains if the current parallelism is enough to process the backlog in a timely manner.
   
   ## Brief change log
   
     - *Compute the CURRENT_PROCESS_RATE metric for sources and use it together with LAG to detect the backlog processing state*
     - *While processing backlog, set the scale-down utilization threshold to 0% and the scale-up threshold to 100% to avoid triggering unnecessary scaling actions*
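
The two change-log items can be illustrated with a minimal sketch. It is not the operator's actual code: the class name `BacklogSketch`, both method names, and the rule "lag divided by current processing rate, compared against a time threshold" are assumptions inferred from the metric names and the 5 minute threshold that appears in the review diff.

```java
import java.time.Duration;

/** Hypothetical sketch of backlog detection; names and logic are assumptions, not the operator's API. */
final class BacklogSketch {

    /**
     * Assumed rule: the job is processing backlog if draining the pending records (lag)
     * at the current processing rate (records per second) would take longer than the threshold.
     */
    static boolean isProcessingBacklog(double lag, double currentProcessRate, Duration lagThreshold) {
        if (lag <= 0) {
            return false; // nothing pending, so no backlog
        }
        if (Double.isNaN(currentProcessRate) || currentProcessRate <= 0) {
            // Assumption: pending records with no measurable progress count as backlog.
            return true;
        }
        double secondsToDrain = lag / currentProcessRate;
        return secondsToDrain > lagThreshold.toSeconds();
    }

    /**
     * Returns {scaleDownThreshold, scaleUpThreshold} as utilization fractions.
     * During backlog processing the band is widened to 0%..100%, so utilization
     * alone does not trigger a scaling action.
     */
    static double[] utilizationBounds(boolean backlog, double normalLower, double normalUpper) {
        return backlog ? new double[] {0.0, 1.0} : new double[] {normalLower, normalUpper};
    }
}
```

For example, with 600,000 pending records and a processing rate of 1,000 records per second, draining takes 600 seconds, which exceeds a 5 minute threshold, so the sketch reports backlog processing and widens the utilization band.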
   
   ## Verifying this change
   
   New unit tests added.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not documented
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] mxm commented on a diff in pull request #544: [FLINK-31318] Improve stability during backlog processing

Posted by "mxm (via GitHub)" <gi...@apache.org>.
mxm commented on code in PR #544:
URL: https://github.com/apache/flink-kubernetes-operator/pull/544#discussion_r1127723106


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/config/AutoScalerOptions.java:
##########
@@ -121,6 +121,13 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key) {
                     .withDescription(
                             "Expected restart time to be used until the operator can determine it reliably from history.");
 
+    public static final ConfigOption<Duration> BACKLOG_PROCESSING_LAG_THRESHOLD =
+            autoScalerConfig("backlog-processing.lag-threshold")
+                    .durationType()
+                    .defaultValue(Duration.ofMinutes(5))
+                    .withDescription(
+                            "Lag threshold for detecting backlog processing which will trigger more conservative scaling decisions to improve stability.");

Review Comment:
   ```suggestion
                               "Lag threshold which will prevent unnecessary scalings while removing the pending messages responsible for the lag.");
   ```
   
   Maybe that's clearer?





[GitHub] [flink-kubernetes-operator] gyfora merged pull request #544: [FLINK-31318] Improve stability during backlog processing

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora merged PR #544:
URL: https://github.com/apache/flink-kubernetes-operator/pull/544




[GitHub] [flink-kubernetes-operator] mxm commented on a diff in pull request #544: [FLINK-31318] Improve stability during backlog processing

Posted by "mxm (via GitHub)" <gi...@apache.org>.
mxm commented on code in PR #544:
URL: https://github.com/apache/flink-kubernetes-operator/pull/544#discussion_r1126574669


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/ScalingMetricEvaluator.java:
##########
@@ -60,30 +62,66 @@ public class ScalingMetricEvaluator {
 
     private Clock clock = Clock.systemDefaultZone();
 
+    protected static final long BACKLOG_PROC_LAG_SECONDS_THRESHOLD =
+            Duration.ofMinutes(5).toSeconds();

Review Comment:
   We should make that configurable as `MAX_LATENCY` or `MAX_PROCESSING_LATENCY`.





[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #544: [FLINK-31318] Improve stability during backlog processing

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora commented on code in PR #544:
URL: https://github.com/apache/flink-kubernetes-operator/pull/544#discussion_r1127701017


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/ScalingMetricEvaluator.java:
##########
@@ -60,30 +62,66 @@ public class ScalingMetricEvaluator {
 
     private Clock clock = Clock.systemDefaultZone();
 
+    protected static final long BACKLOG_PROC_LAG_SECONDS_THRESHOLD =
+            Duration.ofMinutes(5).toSeconds();

Review Comment:
   Added a config for this
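
As a rough illustration of replacing the hard-coded constant with a configurable value, the sketch below reads the threshold from a plain map and falls back to 5 minutes. The class `LagThresholdConfigSketch`, the method `lagThresholdSeconds`, the use of `Map<String, String>`, and ISO-8601 duration strings are all assumptions made for this example. The real change uses Flink's `ConfigOption<Duration>` as shown in the `AutoScalerOptions` diff, whose full key prefix and value syntax are not visible here.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical sketch: a configurable lag threshold with a default, using only the JDK. */
final class LagThresholdConfigSketch {

    // Key suffix taken from the diff; the prefix added by autoScalerConfig(...) is not shown there.
    static final String KEY = "backlog-processing.lag-threshold";

    // Same default as the previously hard-coded constant.
    static final Duration DEFAULT = Duration.ofMinutes(5);

    /** Returns the configured threshold in seconds, or the default if the key is absent. */
    static long lagThresholdSeconds(Map<String, String> conf) {
        String raw = conf.get(KEY);
        // Assumption: values are ISO-8601 durations such as "PT10M".
        Duration threshold = (raw == null) ? DEFAULT : Duration.parse(raw);
        return threshold.toSeconds();
    }
}
```

With an empty map the sketch returns 300 seconds; with the key set to `PT10M` it returns 600.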


