Posted to dev@flink.apache.org by "Gyula Fora (Jira)" <ji...@apache.org> on 2022/05/05 10:39:00 UTC

[jira] [Created] (FLINK-27500) Validation error handling inside controller blocks reconciliation

Gyula Fora created FLINK-27500:
----------------------------------

             Summary: Validation error handling inside controller blocks reconciliation
                 Key: FLINK-27500
                 URL: https://issues.apache.org/jira/browse/FLINK-27500
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.0.0
            Reporter: Gyula Fora


Currently, when using the operator without the webhook (validating only within the controller), the way we handle validation errors completely blocks reconciliation.

The reason is that validation happens between observe and reconcile, and an error short-circuits the controller flow. This skips the reconciler, which would otherwise be able to execute actions such as rollbacks, deployment recovery, etc.

We also return an UpdateControl without a reschedule after an error, which makes this even worse.

There are a few ways to get around this, some more complex than others. One possible solution:

If a validation error occurs, simply use the "old" FlinkDeployment object in the rest of the controller loop. We can restore the old valid deployment from the lastReconciledSpec field; we just need to make sure to only update the status at the end. From the observer's and reconciler's perspective, this would work as if the new broken spec had never been submitted.

Going this way, we have to avoid repeatedly reporting the same validation error as we reschedule again and again.
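
A minimal sketch of the fallback described above, to make the control flow concrete. It is not the operator's code: Spec, Status, Deployment, validate, reconcileOnce and the events list are simplified, hypothetical stand-ins, and only the lastReconciledSpec name comes from this issue. The observe and reconcile calls are left as a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Simplified, hypothetical stand-ins for the operator's types; none of this is the real API.
class ValidationFallbackSketch {

    static final class Spec {
        final int parallelism;
        Spec(int parallelism) { this.parallelism = parallelism; }
    }

    static final class Status {
        Spec lastReconciledSpec; // last spec that passed validation and was reconciled
        String error;            // currently reported validation error, or null
    }

    static final class Deployment {
        Spec spec;
        final Status status = new Status();
    }

    // Error reports emitted so far; stands in for events or log lines.
    final List<String> events = new ArrayList<>();

    // Hypothetical validation rule, just enough to produce an invalid spec.
    static Optional<String> validate(Spec spec) {
        return spec.parallelism > 0
                ? Optional.empty()
                : Optional.of("parallelism must be positive");
    }

    // One pass of the controller loop. Returns the spec that observe/reconcile would act on,
    // or null if the spec is invalid and there is no earlier valid spec to fall back to.
    Spec reconcileOnce(Deployment dep) {
        String error = validate(dep.spec).orElse(null);
        Spec effective = (error == null) ? dep.spec : dep.status.lastReconciledSpec;

        // observe(effective) and reconcile(effective) would run here when effective is
        // non-null, so rollbacks and recovery still happen while the new spec is broken.

        // The status is only written at the end of the pass.
        if (error != null && !error.equals(dep.status.error)) {
            events.add(error); // report a new error once, not on every reschedule
        }
        dep.status.error = error;
        if (error == null) {
            dep.status.lastReconciledSpec = dep.spec;
        }
        return effective;
    }

    public static void main(String[] args) {
        ValidationFallbackSketch controller = new ValidationFallbackSketch();
        Deployment dep = new Deployment();
        dep.spec = new Spec(2);
        controller.reconcileOnce(dep);             // valid spec is reconciled
        dep.spec = new Spec(0);                    // a broken spec is submitted
        Spec used = controller.reconcileOnce(dep); // falls back to the last valid spec
        controller.reconcileOnce(dep);             // rescheduled pass, same error
        System.out.println(used.parallelism + " " + controller.events.size()); // prints "2 1"
    }
}
```

In this sketch the error is compared against the one already stored in the status, which is one simple way to suppress repeated reports. A real implementation would also need to decide what to do when no valid spec has ever been reconciled.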



--
This message was sent by Atlassian Jira
(v8.20.7#820007)