Posted to issues@flink.apache.org by "Peter Vary (Jira)" <ji...@apache.org> on 2022/12/06 15:05:00 UTC
[jira] [Commented] (FLINK-30315) Add more information about image pull failures to the operator log
[ https://issues.apache.org/jira/browse/FLINK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643909#comment-17643909 ]
Peter Vary commented on FLINK-30315:
------------------------------------
The {{ContainerStateWaiting}} contains the message that we want.
The issue is that:
- For {{ErrImagePull}} we have the correct message: {{Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}
- For {{ImagePullBackOff}} we only have this message: {{Back-off pulling image "flink:1.14"}} which is not that useful
Based on this, I think we have the following options:
# Throw {{DeploymentFailedException}} at {{ErrImagePull}} and provide the enhanced message. Cons: this throws an error on the first image pull error, whereas previously we retried at least once. (I am not sure this is that important, as we continue to monitor the state of the deployment and act on the state changes anyway.)
# Store the {{ErrImagePull}} message in the state and provide it when we fail on {{ImagePullBackOff}}
I would like to hear your opinions about these options, and I am interested in any alternatives you have in mind.
If there are no other opinions, I would go for option 1.
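To make the difference between the two options concrete, here is a minimal, self-contained sketch. It does not use the operator's real classes: {{Waiting}}, {{Option2Checker}} and the local {{DeploymentFailedException}} are hypothetical stand-ins, and the assumption that option 1 keeps failing on {{ImagePullBackOff}} as today is mine.

```java
/** Hypothetical sketch only; these are not the operator's real classes. */
class ImagePullSketch {

    /** Stand-in for a container waiting state: a reason plus a message. */
    static final class Waiting {
        final String reason;
        final String message;

        Waiting(String reason, String message) {
            this.reason = reason;
            this.message = message;
        }
    }

    /** Local stand-in for the operator's DeploymentFailedException. */
    static final class DeploymentFailedException extends RuntimeException {
        DeploymentFailedException(String message) {
            super(message);
        }
    }

    /**
     * Option 1: fail on the first ErrImagePull, which carries the detailed
     * message. Assumption: ImagePullBackOff still fails as before.
     */
    static void checkOption1(Waiting waiting) {
        if ("ErrImagePull".equals(waiting.reason)
                || "ImagePullBackOff".equals(waiting.reason)) {
            throw new DeploymentFailedException(waiting.message);
        }
    }

    /** Option 2: remember the ErrImagePull message, fail only on ImagePullBackOff. */
    static final class Option2Checker {
        private String lastPullError; // null until an ErrImagePull was observed

        void check(Waiting waiting) {
            if ("ErrImagePull".equals(waiting.reason)) {
                // Do not fail yet; only record the cause for later.
                lastPullError = waiting.message;
            } else if ("ImagePullBackOff".equals(waiting.reason)) {
                String detail = lastPullError == null ? "" : " (" + lastPullError + ")";
                throw new DeploymentFailedException(waiting.message + detail);
            }
        }
    }

    public static void main(String[] args) {
        Waiting err = new Waiting(
                "ErrImagePull",
                "Failed to pull image \"flink:1.14\": rpc error: code = Unknown desc = context deadline exceeded");
        Waiting backOff = new Waiting(
                "ImagePullBackOff", "Back-off pulling image \"flink:1.14\"");

        Option2Checker checker = new Option2Checker();
        checker.check(err); // recorded, no exception yet
        try {
            checker.check(backOff);
        } catch (DeploymentFailedException e) {
            System.out.println("Option 2: " + e.getMessage());
        }

        try {
            checkOption1(err);
        } catch (DeploymentFailedException e) {
            System.out.println("Option 1: " + e.getMessage());
        }
    }
}
```

In this sketch option 1 needs no extra state, which is the main reason it looks simpler. Option 2 keeps the existing retry behaviour, but it has to carry the last pull error between observations.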
> Add more information about image pull failures to the operator log
> ------------------------------------------------------------------
>
> Key: FLINK-30315
> URL: https://issues.apache.org/jira/browse/FLINK-30315
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Peter Vary
> Priority: Major
>
> When there is an image pull error, this is what we see in the operator log:
> {code:java}
> org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: Back-off pulling image "flink:1.14"
> at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.checkContainerBackoff(AbstractFlinkDeploymentObserver.java:194)
> at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeJmDeployment(AbstractFlinkDeploymentObserver.java:150)
> at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:84)
> at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
> at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
> at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
> at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
> at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
> at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
> at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
> at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
> at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
> at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
> at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source) {code}
> This is the information we have on the Kubernetes side:
> {code}
> Normal Scheduled 2m19s default-scheduler Successfully assigned default/quickstart-base-86787586cd-lb7j6 to minikube
> Warning Failed 20s kubelet Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded
> *Warning Failed 20s kubelet Error*: ErrImagePull
> Normal BackOff 19s kubelet Back-off pulling image "flink:1.14"
> *Warning Failed 19s kubelet Error*: ImagePullBackOff
> Normal Pulling 7s (x2 over 2m19s) kubelet Pulling image "flink:1.14"
> {code}
> It would be good to add the additional message (in this case {{Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}) to the message of the {{DeploymentFailedException}} for traceability.