You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@flink.apache.org by GitBox <gi...@apache.org> on 2022/02/24 05:23:22 UTC

[GitHub] [flink-kubernetes-operator] tweise opened a new pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

tweise opened a new pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21


   First part following resource dependencies in the reconciler logic: Wait for the JM deployment to be ready before trying to access the JM. This will be built upon for improving the error detection/handling later on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814303652



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       could we move this whole thing into a `boolean checkAndCacheDeployment(...)` -or something similar- method?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815363350



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -114,6 +126,39 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        return checkDeployment(flinkApp, context);
+    }
+
+    private UpdateControl<FlinkDeployment> checkDeployment(
+            FlinkDeployment flinkApp, Context context) {
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {

Review comment:
       It is a unique reference that avoids caching entire resource objects and duplicate key construction. Why should the field not exist? Did you encounter it somewhere?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814049730



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {
+            return context.getSecondaryResource(Deployment.class)
+                    .map(
+                            deployment -> {
+                                LOG.info(
+                                        "JobManager deployment {} in namespace {} is ready",
+                                        flinkApp.getMetadata().getName(),
+                                        flinkApp.getMetadata().getNamespace());
+                                jobManagerDeployments.put(flinkApp, deployment);

Review comment:
       Note that that there is other logic in `JobStatusObserver` that will exit when not ready to watch, confusingly returning true. I did not want to touch that in the PR. My goal is to implement the terminal error check when the deployment never shows up, for example due to invalid service account.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1050032866


   > I wonder if caching the deployment is even necessary, it adds complexity and can lead to cases where the deployment disappears and we do not notice. The reconcile loops should not run too often for this to stress kubernetes, what do you think @tweise ?
   
   We actually don't need to cache the whole objects for this purpose. However, I would like to limit the checking for secondary resources unless really necessary as that may become a scalability concern. How frequently we check the job status should ideally be independent from this logic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815371000



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -114,6 +126,39 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        return checkDeployment(flinkApp, context);
+    }
+
+    private UpdateControl<FlinkDeployment> checkDeployment(
+            FlinkDeployment flinkApp, Context context) {
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                DeploymentStatus status = deployment.get().getStatus();
+                DeploymentSpec spec = deployment.get().getSpec();
+                if (status != null
+                        && status.getAvailableReplicas() != null
+                        && spec.getReplicas().intValue() == status.getReplicas()
+                        && spec.getReplicas().intValue() == status.getAvailableReplicas()) {
+                    LOG.info(
+                            "JobManager deployment {} in namespace {} is ready",
+                            flinkApp.getMetadata().getName(),
+                            flinkApp.getMetadata().getNamespace());
+                    jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());

Review comment:
       Maybe, the job status check would fail in that case and we could handle it there. We can address this as follow-up. There is a lot of work that needs to go into the reconciliation logic based on what I noticed while working on this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815398490



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -145,11 +190,16 @@ private void updateForReconciliationError(FlinkDeployment flinkApp, String err)
     @Override
     public List<EventSource> prepareEventSources(
             EventSourceContext<FlinkDeployment> eventSourceContext) {
-        // TODO: start status updated
-        //        return List.of(new PerResourcePollingEventSource<>(
-        //                new FlinkResourceSupplier, context.getPrimaryCache(), POLL_PERIOD,
-        //                FlinkApplication.class));
-        return Collections.emptyList();
+        // reconcile when job manager deployment and REST API are ready
+        SharedIndexInformer<Deployment> deploymentInformer =
+                kubernetesClient
+                        .apps()
+                        .deployments()
+                        .inAnyNamespace()

Review comment:
       Not sure if there is a nicer way (i need to check the operator sdk code) but you could add a setConfiguration method to the FlinkOperator and call it from the constructor of the FlinkControllerConfig




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] wangyang0918 commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1050768766


   I am getting to this PR now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1049509120


   @gyfora PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] morhidi commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
morhidi commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1049740679


   > I wonder if caching the deployment is even necessary, it adds complexity and can lead to cases where the deployment disappears and we do not notice. The reconcile loops should not run too often for this to stress kubernetes, what do you think @tweise ?
   
   I agree with @gyfora here. The refresh loop is 1 min at the moment, shouldn't be a big deal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814046991



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {
+            return context.getSecondaryResource(Deployment.class)
+                    .map(
+                            deployment -> {
+                                LOG.info(
+                                        "JobManager deployment {} in namespace {} is ready",
+                                        flinkApp.getMetadata().getName(),
+                                        flinkApp.getMetadata().getNamespace());
+                                jobManagerDeployments.put(flinkApp, deployment);

Review comment:
       The current implementation checks job status only after the deployment was observed. I think my earlier reply was misleading.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814140541



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       We will still need a delay here as even after the pod is ready it still takes a bit for the REST server to listen at the port. This is how I currently see the job status check still hitting a timeout. BTW @gyfora job status checks blocking the operator threads isn't so nice! 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r813558909



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -88,17 +92,23 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
                 operatorNamespace,
                 kubernetesClient,
                 true);
+        jobManagerDeployments.remove(flinkApp);

Review comment:
       Should we also remove from this map when we cancel the flink job in the Flink service?

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {
+            return context.getSecondaryResource(Deployment.class)
+                    .map(
+                            deployment -> {
+                                LOG.info(
+                                        "JobManager deployment {} in namespace {} is ready",
+                                        flinkApp.getMetadata().getName(),
+                                        flinkApp.getMetadata().getNamespace());
+                                jobManagerDeployments.put(flinkApp, deployment);

Review comment:
       Shouldn't we check and put into the jobManagerDeployments befre we try to observe? It seems now we introduce an extra refesh between every reconcile and observe.

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -88,17 +92,23 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
                 operatorNamespace,
                 kubernetesClient,
                 true);
+        jobManagerDeployments.remove(flinkApp);

Review comment:
       Otherwise after an upgrade the observer might hit the same issue




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1052920892


   @gyfora I think after the refactor this looks much better now. Minor improvements can be dealt with as follow-up. I think there will be significant work necessary to make the reconciliation more robust. I'm wondering if it should be implemented as a state machine. The job upgrade part needs to be looked at. What happens when the previous job never came up due to a code issue and user pushes an update to fix the problem. This should be possible and we need to have test coverage for it also.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1050048822


   > I tested your PR locally and hit this error, when deleting the CR @tweise did you notice?
   > 
   > ```
   > 2022-02-24 11:44:24,277 o.a.f.k.o.c.FlinkDeploymentController [INFO ] [default.basic-example] Cleaning up application cluster basic-example
   > Exception in thread "pool-3-thread-4" java.lang.IllegalStateException: Cached custom resource must be present at this point
   > 	at io.javaoperatorsdk.operator.processing.event.EventProcessor.lambda$isCacheReadyForInstantReconciliation$1(EventProcessor.java:248)
   > ```
   
   I also see it now. It is related to the deployment event watching as it goes away when disabling `prepareEventSources`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814394166



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       The annotation does not seem to change anything wrt the original error. There is a separate reschedule interval for this condition, which in absent more sophisticated readiness check I have set to 5s for now - that seems to avoids the timeout exception. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1051719280


   Sounds good @tweise !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] morhidi commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
morhidi commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r813767532



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {

Review comment:
       We could also move the whole logic into JobStatusObserver passing the context object.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise merged pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise merged pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r813561729



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -88,17 +92,23 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
                 operatorNamespace,
                 kubernetesClient,
                 true);
+        jobManagerDeployments.remove(flinkApp);

Review comment:
       Good catch, will look into that tomorrow.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r813561088



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {
+            return context.getSecondaryResource(Deployment.class)
+                    .map(
+                            deployment -> {
+                                LOG.info(
+                                        "JobManager deployment {} in namespace {} is ready",
+                                        flinkApp.getMetadata().getName(),
+                                        flinkApp.getMetadata().getNamespace());
+                                jobManagerDeployments.put(flinkApp, deployment);

Review comment:
       We do check before we observe. The deployment check is executed once, it does not introduce an extra refresh. It also works when the operator is restarted and finds existing deployments.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815374225



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -145,11 +190,16 @@ private void updateForReconciliationError(FlinkDeployment flinkApp, String err)
     @Override
     public List<EventSource> prepareEventSources(
             EventSourceContext<FlinkDeployment> eventSourceContext) {
-        // TODO: start status updated
-        //        return List.of(new PerResourcePollingEventSource<>(
-        //                new FlinkResourceSupplier, context.getPrimaryCache(), POLL_PERIOD,
-        //                FlinkApplication.class));
-        return Collections.emptyList();
+        // reconcile when job manager deployment and REST API are ready
+        SharedIndexInformer<Deployment> deploymentInformer =
+                kubernetesClient
+                        .apps()
+                        .deployments()
+                        .inAnyNamespace()

Review comment:
       It's possible to reconstruct the config like this:
   ```
           FlinkControllerConfig config = new FlinkControllerConfig(this);
           Set<String> namespaces = config.getNamespaces();
   ```
   But that should not be necessary. @gyfora maybe?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1053122420


   I will have to take a look again after the refactor, but there is also in the ticket opened in the meantime: https://issues.apache.org/jira/browse/FLINK-26377
   
   cc @Aitozi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815377063



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -88,17 +92,23 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
                 operatorNamespace,
                 kubernetesClient,
                 true);
+        jobManagerDeployments.remove(flinkApp);

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] wangyang0918 commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815415064



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -114,6 +126,39 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        return checkDeployment(flinkApp, context);
+    }
+
+    private UpdateControl<FlinkDeployment> checkDeployment(
+            FlinkDeployment flinkApp, Context context) {
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {

Review comment:
       I do not find the `selfLink` in my minikube with version `v1.23.0`. Could we simply use the `metadata.name` here?
   
   ![image](https://user-images.githubusercontent.com/15904523/155877551-7bc72a22-5c4d-4130-a3ad-04e03aa26b43.png)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814044563



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {

Review comment:
       It would scatter the resource watching logic. I would prefer to keep it here since it needs to work in tandem with `prepareEventSources`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1051721189


   There are still some outstanding minor comments but after that I think its good to go +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1050687538


   @wangyang0918 if you have some time to look at this we could use a bit of extra feedback :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] morhidi commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
morhidi commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814213346



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       In this case, the reconcile loop will kick off after 60 seconds. Can be a simple workaround until we find a solution for this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r813564416



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {

Review comment:
       Maybe we could extract the checking and loading of the deployment from the secondaryResource into a new method in the controller to keep the main reconcile flow as simple as possible

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {
+            return context.getSecondaryResource(Deployment.class)
+                    .map(
+                            deployment -> {
+                                LOG.info(
+                                        "JobManager deployment {} in namespace {} is ready",
+                                        flinkApp.getMetadata().getName(),
+                                        flinkApp.getMetadata().getNamespace());
+                                jobManagerDeployments.put(flinkApp, deployment);

Review comment:
       Oh I think I see what you mean. Do you mean that we check after the initial reconcolitiation? (right after the job was deployed).
   
   My suggestion was to move before line 108 just to be safe, but maybe it's not necessary

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -144,7 +170,16 @@ private void updateForReconciliationError(FlinkDeployment flinkApp, String err)
         //        return List.of(new PerResourcePollingEventSource<>(
         //                new FlinkResourceSupplier, context.getPrimaryCache(), POLL_PERIOD,
         //                FlinkApplication.class));

Review comment:
       I think we can remove these comments now that we have some logic here :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814393387



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       I moved it into into separate method for now. Will see how it can be arranged better when I have the error/timeout check in place. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815374112



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -145,11 +190,16 @@ private void updateForReconciliationError(FlinkDeployment flinkApp, String err)
     @Override
     public List<EventSource> prepareEventSources(
             EventSourceContext<FlinkDeployment> eventSourceContext) {
-        // TODO: start status updated
-        //        return List.of(new PerResourcePollingEventSource<>(
-        //                new FlinkResourceSupplier, context.getPrimaryCache(), POLL_PERIOD,
-        //                FlinkApplication.class));
-        return Collections.emptyList();
+        // reconcile when job manager deployment and REST API are ready
+        SharedIndexInformer<Deployment> deploymentInformer =
+                kubernetesClient
+                        .apps()
+                        .deployments()
+                        .inAnyNamespace()

Review comment:
       I would prefer that. Need to figure out how to get the controller config from the context and set multiple namespaces.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1052945882


   > I think this PR is a good step to avoid unnecessary reconciliation. But it seems that I still find some `java.util.concurrent.TimeoutException` when accessing the rest service even after `PORT_READY_DELAY_SECONDS`. Maybe my minikube is a little slow or JobManager does not have enough resources(0.5 cpu).
   
   I changed the delay to 10s and added a port readiness check. The 10s may still not be enough in some situations for the REST service to get ready (after the port is ready). That is covered by the retry though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] wangyang0918 commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814713388



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -145,11 +190,16 @@ private void updateForReconciliationError(FlinkDeployment flinkApp, String err)
     @Override
     public List<EventSource> prepareEventSources(
             EventSourceContext<FlinkDeployment> eventSourceContext) {
-        // TODO: start status updated
-        //        return List.of(new PerResourcePollingEventSource<>(
-        //                new FlinkResourceSupplier, context.getPrimaryCache(), POLL_PERIOD,
-        //                FlinkApplication.class));
-        return Collections.emptyList();
+        // reconcile when job manager deployment and REST API are ready
+        SharedIndexInformer<Deployment> deploymentInformer =
+                kubernetesClient
+                        .apps()
+                        .deployments()
+                        .inAnyNamespace()
+                        .withLabel("type", "flink-native-kubernetes")

Review comment:
       Using constants in `org.apache.flink.kubernetes.utils.Constants` could make us aware of upstream changes.

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -114,6 +126,39 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        return checkDeployment(flinkApp, context);
+    }
+
+    private UpdateControl<FlinkDeployment> checkDeployment(
+            FlinkDeployment flinkApp, Context context) {
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {

Review comment:
       Why do we use `selfLink` here? I am afraid the `selfLink` field does not always exist.

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -145,11 +190,16 @@ private void updateForReconciliationError(FlinkDeployment flinkApp, String err)
     @Override
     public List<EventSource> prepareEventSources(
             EventSourceContext<FlinkDeployment> eventSourceContext) {
-        // TODO: start status updated
-        //        return List.of(new PerResourcePollingEventSource<>(
-        //                new FlinkResourceSupplier, context.getPrimaryCache(), POLL_PERIOD,
-        //                FlinkApplication.class));
-        return Collections.emptyList();
+        // reconcile when job manager deployment and REST API are ready
+        SharedIndexInformer<Deployment> deploymentInformer =
+                kubernetesClient
+                        .apps()
+                        .deployments()
+                        .inAnyNamespace()

Review comment:
       Maybe we do not need to watch all the namespaces if `FLINK_OPERATOR_WATCH_NAMESPACES` configured.

##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -114,6 +126,39 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        return checkDeployment(flinkApp, context);
+    }
+
+    private UpdateControl<FlinkDeployment> checkDeployment(
+            FlinkDeployment flinkApp, Context context) {
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                DeploymentStatus status = deployment.get().getStatus();
+                DeploymentSpec spec = deployment.get().getSpec();
+                if (status != null
+                        && status.getAvailableReplicas() != null
+                        && spec.getReplicas().intValue() == status.getReplicas()
+                        && spec.getReplicas().intValue() == status.getAvailableReplicas()) {
+                    LOG.info(
+                            "JobManager deployment {} in namespace {} is ready",
+                            flinkApp.getMetadata().getName(),
+                            flinkApp.getMetadata().getNamespace());
+                    jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());

Review comment:
       Do we need to remove the current flinkApp from cache `jobManagerDeployments` when the replicas is not enough? For example, the JobManager crashed for a while.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815374463



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {

Review comment:
       Refactored the logic and `JobStatusObserver` can be removed / consolidated with JobReconciler in a follow-up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r815468641



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -114,6 +126,39 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        return checkDeployment(flinkApp, context);
+    }
+
+    private UpdateControl<FlinkDeployment> checkDeployment(
+            FlinkDeployment flinkApp, Context context) {
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {

Review comment:
       I see, it appears that selflink was deprecated. We need a unique identifier across namespaces. `UID` should do the trick.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] morhidi commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
morhidi commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814212639



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       Try @ControllerConfiguration(generationAwareEventProcessing = true)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1049518325


   I wonder if caching the deployment is even necessary, it adds complexity and can lead to cases where the deployment disappears and we do not notice. The reconcile loops should not run too often for this to stress kubernetes, what do you think @tweise ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] morhidi commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
morhidi commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1049726514


   I tested your PR locally and hit this error, when deleting the CR @tweise did you notice?
   
   ```
   2022-02-24 11:44:24,277 o.a.f.k.o.c.FlinkDeploymentController [INFO ] [default.basic-example] Cleaning up application cluster basic-example
   Exception in thread "pool-3-thread-4" java.lang.IllegalStateException: Cached custom resource must be present at this point
   	at io.javaoperatorsdk.operator.processing.event.EventProcessor.lambda$isCacheReadyForInstantReconciliation$1(EventProcessor.java:248)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] gyfora commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
gyfora commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814308356



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +119,22 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.containsKey(flinkApp)) {

Review comment:
       I also agree with @morhidi that this probably fits best into the observer itself because that is the component responsible for checking the running jobs already. The observer could also prepare the event sources that we would just pass in the controller.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on a change in pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on a change in pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#discussion_r814394166



##########
File path: flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
##########
@@ -109,6 +118,19 @@ public DeleteControl cleanup(FlinkDeployment flinkApp, Context context) {
         } catch (Exception e) {
             throw new ReconciliationException(e);
         }
+
+        if (!jobManagerDeployments.contains(flinkApp.getMetadata().getSelfLink())) {
+            Optional<Deployment> deployment = context.getSecondaryResource(Deployment.class);
+            if (deployment.isPresent()) {
+                LOG.info(
+                        "JobManager deployment {} in namespace {} is ready",
+                        flinkApp.getMetadata().getName(),
+                        flinkApp.getMetadata().getNamespace());
+                jobManagerDeployments.add(flinkApp.getMetadata().getSelfLink());
+                // reschedule for immediate job status check
+                return UpdateControl.updateStatus(flinkApp).rescheduleAfter(0);

Review comment:
       The annotation does not seem to change anything wrt the original error. There is a separate reschedule interval for this condition, which absent of more sophisticated readiness check I have set to 5s for now - that seems to avoids the timeout exception. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] wangyang0918 commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
wangyang0918 commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1053451398


   @tweise Except for the `selfLink` part, I agree to leave some other improvements in the follow-up tickets.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink-kubernetes-operator] tweise commented on pull request #21: [FLINK-26261] Wait for JobManager deployment ready before accessing REST API

Posted by GitBox <gi...@apache.org>.
tweise commented on pull request #21:
URL: https://github.com/apache/flink-kubernetes-operator/pull/21#issuecomment-1051663403


   @gyfora this is working pretty well now, I also added a port readiness check. But the current implementation with control logic scattered over job status observer and reconcilers makes it pretty hard to implement this in a cleaner way. I'm working on some refactoring and hopefully that will lead to simplification. Hope to have it done soon to avoid larger merge conflicts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org