You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/12/02 12:27:58 UTC

[GitHub] [flink-kubernetes-operator] pvary opened a new pull request, #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

pvary opened a new pull request, #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464

   ## What is the purpose of the change
   
   Make it possible to recover from the missing HA errors
   
   ## Brief change log
   
   - Created a new Exception signalling recoverable errors
   - Made sure that the state of the cluster/job is not changed when a recoverable error happens
   - Added unit test
   
   ## Verifying this change
   This change added tests and can be verified as follows:
   - Added a unit test recovering from missing HA
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] mbalassi merged pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

mbalassi merged PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gyfora commented on code in PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#discussion_r1039555267


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -125,7 +125,7 @@ protected Optional<UpgradeMode> getAvailableUpgradeMode(
 
         if (status.getJobManagerDeploymentStatus() == JobManagerDeploymentStatus.MISSING
                 || status.getJobManagerDeploymentStatus() == JobManagerDeploymentStatus.ERROR) {
-            throw new DeploymentFailedException(
+            throw new RecoveryFailureException(
                     "JobManager deployment is missing and HA data is not available to make stateful upgrades. "
                             + "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
                             + "Manual restore required.",

Review Comment:
   This almost certainly means that the operator cannot restore this anymore. The user needs to delete the CR and recover using initialSavepoontPath at this stage



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] morhidi commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

morhidi commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1335708826

   @pvary please also rebase it to the latest containing @gaborgsomogyi's e2e changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] morhidi commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

morhidi commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1336324416

   > > @pvary please also rebase it to the latest containing @gaborgsomogyi's e2e changes
   > 
   > @morhidi: I am not sure I understand this comment. You might refer to my refactoring of the e2e tests which groups them to 8 executor. If this is the case, then on Friday we had to revert it, because it caused continuous failures. The decision was to revert it for the release and find fix it when we have more time.
   
   @gaborgsomogyi has added another e2e related fix since then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gyfora commented on code in PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#discussion_r1039563266


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -125,7 +125,7 @@ protected Optional<UpgradeMode> getAvailableUpgradeMode(
 
         if (status.getJobManagerDeploymentStatus() == JobManagerDeploymentStatus.MISSING
                 || status.getJobManagerDeploymentStatus() == JobManagerDeploymentStatus.ERROR) {
-            throw new DeploymentFailedException(
+            throw new RecoveryFailureException(
                     "JobManager deployment is missing and HA data is not available to make stateful upgrades. "
                             + "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
                             + "Manual restore required.",

Review Comment:
   There could be some error cases where we still recover I guess , but we would have to check. In that case that would recover automatically. But missing I think is final in this case 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gyfora commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1335572073

   I would not necessarily call missing HA metadata exceptions recoverable. In most cases they are actually fatal and as many error msgs suggest, they require deletion and recreation of the CR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gyfora commented on a diff in pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gyfora commented on code in PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#discussion_r1038512952


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/exception/RecoverableDeploymentFailureException.java:
##########
@@ -0,0 +1,25 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.kubernetes.operator.exception;
+
+/** Exception to signal non-terminal deployment failure. */
+public class RecoverableDeploymentFailureException extends DeploymentFailedException {

Review Comment:
   Let’s not extend DeploymentFailedException , the deployment does not fail in these cases (the jobmanager deployment I mean)
   
   we could simply call FlinkResourceException or something more generic 



##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java:
##########
@@ -139,10 +142,13 @@ public UpdateControl<FlinkDeployment> reconcile(FlinkDeployment flinkApp, Contex
                 configManager.getOperatorConfiguration(), flinkApp, previousDeployment, true);
     }
 
-    private void handleDeploymentFailed(FlinkDeployment flinkApp, DeploymentFailedException dfe) {
+    private void handleDeploymentFailed(
+            FlinkDeployment flinkApp, DeploymentFailedException dfe, boolean updateJobStatus) {
         LOG.error("Flink Deployment failed", dfe);
-        flinkApp.getStatus().setJobManagerDeploymentStatus(JobManagerDeploymentStatus.ERROR);
-        flinkApp.getStatus().getJobStatus().setState(JobStatus.RECONCILING.name());
+        if (updateJobStatus) {

Review Comment:
   These should be two separate methods, the update status flag does not have any well defined meaning I feel



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gaborgsomogyi commented on a diff in pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gaborgsomogyi commented on code in PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#discussion_r1039549680


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -125,7 +125,7 @@ protected Optional<UpgradeMode> getAvailableUpgradeMode(
 
         if (status.getJobManagerDeploymentStatus() == JobManagerDeploymentStatus.MISSING
                 || status.getJobManagerDeploymentStatus() == JobManagerDeploymentStatus.ERROR) {
-            throw new DeploymentFailedException(
+            throw new RecoveryFailureException(
                     "JobManager deployment is missing and HA data is not available to make stateful upgrades. "
                             + "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
                             + "Manual restore required.",

Review Comment:
   How true is this? `Manual restore required.`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gaborgsomogyi commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gaborgsomogyi commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1337285718

   Not sure is this the last or prev run but it fails:
   ```
   Error:  Errors: 
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    NativeFlinkServiceTest.testGetLastCheckpoint:287 » RecoveryFailure Latest chec...
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] gaborgsomogyi commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

gaborgsomogyi commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1337566677

   Still there are errors:
   ```
   Error:  Errors: 
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    ApplicationReconcilerTest.testUpgrade:211 » RecoveryFailure JobManager deploym...
   Error:    NativeFlinkServiceTest.testGetLastCheckpoint:287 » RecoveryFailure Latest chec...
   [INFO] 
   Error:  Tests run: 326, Failures: 0, Errors: 5, Skipped: 0
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] pvary commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

pvary commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1336091361

   > @pvary please also rebase it to the latest containing @gaborgsomogyi's e2e changes
   
   @morhidi: I am not sure I understand this comment. You might refer to my refactoring of the e2e tests which groups them to 8 executor. If this is the case, then on Friday we had to revert it, because it caused continuous failures. The decision was to revert it for the release and find fix it when we have more time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] mbalassi commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

mbalassi commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1339405085

   I think we have reached consensus on this for the 1.3 release, so I a merging. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-kubernetes-operator] mbalassi commented on pull request #464: [FLINK-30268] HA metadata and other cluster submission related errors should not throw DeploymentFailedException

Posted by GitBox <gi...@apache.org>.

mbalassi commented on PR #464:
URL: https://github.com/apache/flink-kubernetes-operator/pull/464#issuecomment-1335207172

   Thanks @pvary. `DeploymentRecoveryTest#verifyApplicationJmRecovery` is failing, PTAL on Monday.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org