You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "gyfora (via GitHub)" <gi...@apache.org> on 2023/04/26 06:37:59 UTC

[GitHub] [flink-kubernetes-operator] gyfora opened a new pull request, #575: [FLINK-31885] Trigger event on autoscaler error

gyfora opened a new pull request, #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575

   ## What is the purpose of the change
   
   Currently when there is an error in the atuoscaler logic due to some misconfiguration / invalid metrics etc we only log the problem in the operator side but don't show this to the user. From the user's perspective the autoscaler simply doesnt do anything in these cases if they dont have access to the operator logs.
   
   This PR triggers a CR event in case an error happens in the autoscaler logic.
   
   ## Brief change log
   
    - Add error when exception happens
    - Add test
   
   ## Verifying this change
   
   new unit test added
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1179350480


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -113,7 +115,13 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
                     scalingExecutor.scaleResource(resource, autoScalerInfo, conf, evaluatedMetrics);
             autoScalerInfo.replaceInKubernetes(kubernetesClient);
             return specAdjusted;
-        } catch (Exception e) {
+        } catch (Throwable e) {
+            eventRecorder.triggerEvent(
+                    resource,
+                    EventRecorder.Type.Warning,
+                    EventRecorder.Reason.AutoscalerError,
+                    EventRecorder.Component.Operator,
+                    e.getMessage());
             LOG.error("Error while scaling resource", e);

Review Comment:
   sure ^^



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] rodmeneses commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "rodmeneses (via GitHub)" <gi...@apache.org>.
rodmeneses commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1179330753


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -81,17 +83,17 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
         var conf = ctx.getObserveConfig();
         var resource = ctx.getResource();
 
-        if (resource.getSpec().getJob() == null || !conf.getBoolean(AUTOSCALER_ENABLED)) {
-            LOG.info("Job autoscaler is disabled");
-            return false;
-        }
+        try {
+            if (resource.getSpec().getJob() == null || !conf.getBoolean(AUTOSCALER_ENABLED)) {
+                LOG.info("Job autoscaler is disabled");

Review Comment:
   should we use `LOG.warn` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] rodmeneses commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "rodmeneses (via GitHub)" <gi...@apache.org>.
rodmeneses commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1179342548


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -113,7 +115,13 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
                     scalingExecutor.scaleResource(resource, autoScalerInfo, conf, evaluatedMetrics);
             autoScalerInfo.replaceInKubernetes(kubernetesClient);
             return specAdjusted;
-        } catch (Exception e) {
+        } catch (Throwable e) {
+            eventRecorder.triggerEvent(
+                    resource,
+                    EventRecorder.Type.Warning,
+                    EventRecorder.Reason.AutoscalerError,
+                    EventRecorder.Component.Operator,
+                    e.getMessage());

Review Comment:
   Maybe we can add a unit test to cover this, just to be 100% sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] mxm commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "mxm (via GitHub)" <gi...@apache.org>.
mxm commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1179255242


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java:
##########
@@ -83,6 +83,8 @@ public static void computeDataRateMetrics(
             scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond);
         } else {
             LOG.error("Cannot compute true processing rate without numRecordsInPerSecond");
+            scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, Double.NaN);
+            scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, Double.NaN);

Review Comment:
   Unrelated change? Why is this necessary?



##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -113,7 +115,13 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
                     scalingExecutor.scaleResource(resource, autoScalerInfo, conf, evaluatedMetrics);
             autoScalerInfo.replaceInKubernetes(kubernetesClient);
             return specAdjusted;
-        } catch (Exception e) {
+        } catch (Throwable e) {
+            eventRecorder.triggerEvent(
+                    resource,
+                    EventRecorder.Type.Warning,
+                    EventRecorder.Reason.AutoscalerError,
+                    EventRecorder.Component.Operator,
+                    e.getMessage());
             LOG.error("Error while scaling resource", e);

Review Comment:
   Can we log before triggering the event just in case...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] rodmeneses commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "rodmeneses (via GitHub)" <gi...@apache.org>.
rodmeneses commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1179340070


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -113,7 +115,13 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
                     scalingExecutor.scaleResource(resource, autoScalerInfo, conf, evaluatedMetrics);
             autoScalerInfo.replaceInKubernetes(kubernetesClient);
             return specAdjusted;
-        } catch (Exception e) {
+        } catch (Throwable e) {
+            eventRecorder.triggerEvent(
+                    resource,
+                    EventRecorder.Type.Warning,
+                    EventRecorder.Reason.AutoscalerError,
+                    EventRecorder.Component.Operator,
+                    e.getMessage());

Review Comment:
   if an NPE occurs, `e.gestMessage()` will be null. We need to make sure that `triggerEvent` and the downstream caller are not throwing NPE because of this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1180228387


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -113,7 +115,13 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
                     scalingExecutor.scaleResource(resource, autoScalerInfo, conf, evaluatedMetrics);
             autoScalerInfo.replaceInKubernetes(kubernetesClient);
             return specAdjusted;
-        } catch (Exception e) {
+        } catch (Throwable e) {
+            eventRecorder.triggerEvent(
+                    resource,
+                    EventRecorder.Type.Warning,
+                    EventRecorder.Reason.AutoscalerError,
+                    EventRecorder.Component.Operator,
+                    e.getMessage());

Review Comment:
   Shouldn't be a problem and we do this in other places as well. I will add a simple test even if unrelated to the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1180222200


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/JobAutoScalerImpl.java:
##########
@@ -81,17 +83,17 @@ public boolean scale(FlinkResourceContext<? extends AbstractFlinkResource<?, ?>>
         var conf = ctx.getObserveConfig();
         var resource = ctx.getResource();
 
-        if (resource.getSpec().getJob() == null || !conf.getBoolean(AUTOSCALER_ENABLED)) {
-            LOG.info("Job autoscaler is disabled");
-            return false;
-        }
+        try {
+            if (resource.getSpec().getJob() == null || !conf.getBoolean(AUTOSCALER_ENABLED)) {
+                LOG.info("Job autoscaler is disabled");

Review Comment:
   It's not really a warning, this is based on user config. Usually it's off



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] gyfora merged pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora merged PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #575: [FLINK-31885] Trigger event on autoscaler error

Posted by "gyfora (via GitHub)" <gi...@apache.org>.
gyfora commented on code in PR #575:
URL: https://github.com/apache/flink-kubernetes-operator/pull/575#discussion_r1179349838


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/metrics/ScalingMetrics.java:
##########
@@ -83,6 +83,8 @@ public static void computeDataRateMetrics(
             scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, numRecordsInPerSecond);
         } else {
             LOG.error("Cannot compute true processing rate without numRecordsInPerSecond");
+            scalingMetrics.put(ScalingMetric.TRUE_PROCESSING_RATE, Double.NaN);
+            scalingMetrics.put(ScalingMetric.CURRENT_PROCESSING_RATE, Double.NaN);

Review Comment:
   I was testing the error scenarios and realized that if some metrics were missing from during collection, the scaling evaluator threw a nullpointer exception. This is to put nan instead of just not putting anything



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org