Posted to issues@flink.apache.org by "Peter Vary (Jira)" <ji...@apache.org> on 2022/12/06 15:05:00 UTC
[jira] [Commented] (FLINK-30315) Add more information about image pull failures to the operator log

    [ https://issues.apache.org/jira/browse/FLINK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643909#comment-17643909 ] 

Peter Vary commented on FLINK-30315:
------------------------------------

The {{ContainerStateWaiting}} contains the message that we want.
The issue is that:
 - For {{ErrImagePull}} we have the correct message: {{Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}
 - For {{ImagePullBackOff}} we only have this message: {{Back-off pulling image "flink:1.14"}} which is not that useful

Based on this, I think we have the following options:
 # Throw {{DeploymentFailedException}} at {{ErrImagePull}} and provide the enhanced message. Cons: this throws an error on the first image pull failure, whereas previously we retried at least once. (I am not sure this matters much, as we continue to monitor the state of the deployment and act on state changes anyway.)
 # Store the {{ErrImagePull}} message in the state and include it when we later fail on {{ImagePullBackOff}}
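
To make the difference between the two options concrete, here is a minimal sketch. It is not the operator's actual code: {{Waiting}}, {{DeploymentFailed}}, {{checkOption1}}, {{checkOption2}} and the in-memory map are hypothetical stand-ins for {{ContainerStateWaiting}}, {{DeploymentFailedException}} and whatever state store we would really use.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: every name in this class is a hypothetical stand-in, not operator code. */
class ImagePullMessageSketch {

    /** Simplified stand-in for the reason/message pair of a waiting container state. */
    static final class Waiting {
        final String reason;
        final String message;

        Waiting(String reason, String message) {
            this.reason = reason;
            this.message = message;
        }
    }

    /** Stand-in for DeploymentFailedException. */
    static final class DeploymentFailed extends RuntimeException {
        DeploymentFailed(String message) {
            super(message);
        }
    }

    /** Option 1: fail as soon as ErrImagePull is seen, using its detailed message. */
    static void checkOption1(Waiting waiting) {
        if ("ErrImagePull".equals(waiting.reason)) {
            throw new DeploymentFailed(waiting.message);
        }
    }

    /** Assumed in-memory store for option 2, keyed by deployment name. */
    static final Map<String, String> lastPullError = new HashMap<>();

    /**
     * Option 2: remember the ErrImagePull message and only fail on ImagePullBackOff,
     * appending the remembered detail when there is one.
     */
    static void checkOption2(String deployment, Waiting waiting) {
        if ("ErrImagePull".equals(waiting.reason)) {
            lastPullError.put(deployment, waiting.message);
        } else if ("ImagePullBackOff".equals(waiting.reason)) {
            String detail = lastPullError.get(deployment);
            throw new DeploymentFailed(
                    detail == null ? waiting.message : waiting.message + ": " + detail);
        }
    }
}
```

With option 1 the exception carries the kubelet's pull error directly. With option 2 the first {{ErrImagePull}} observation is only recorded, and the later {{ImagePullBackOff}} failure carries both messages, at the cost of keeping extra state.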

I would like to hear your opinions on these options, and I am interested in any alternatives you have in mind.



If there are no objections, I would go for option 1.

> Add more information about image pull failures to the operator log
> ------------------------------------------------------------------
>
>                 Key: FLINK-30315
>                 URL: https://issues.apache.org/jira/browse/FLINK-30315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Peter Vary
>            Priority: Major
>
> When there is an image pull error, this is what we see in the operator log:
> {code:java}
> org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: Back-off pulling image "flink:1.14"
>  	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.checkContainerBackoff(AbstractFlinkDeploymentObserver.java:194)
>  	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeJmDeployment(AbstractFlinkDeploymentObserver.java:150)
>  	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:84)
>  	at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
>  	at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
>  	at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
>  	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
>  	at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
>  	at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
>  	at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
>  	at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
>  	at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
>  	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
>  	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
>  	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
>  	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
>  	at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
>  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>  	at java.base/java.lang.Thread.run(Unknown Source) {code}
> This is the information we have on the Kubernetes side:
> {code}
> Normal   Scheduled  2m19s               default-scheduler  Successfully assigned default/quickstart-base-86787586cd-lb7j6 to minikube
> Warning  Failed     20s                 kubelet            Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded
> *Warning  Failed     20s                 kubelet            Error*: ErrImagePull
> Normal   BackOff    19s                 kubelet            Back-off pulling image "flink:1.14"
> *Warning  Failed     19s                 kubelet            Error*: ImagePullBackOff
> Normal   Pulling    7s (x2 over 2m19s)  kubelet            Pulling image "flink:1.14"
> {code}
> It would be good to add the additional message (in this case {{Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}) to the message of the {{DeploymentFailedException}} for traceability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)