Posted to issues@flink.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/03/22 11:38:00 UTC

[jira] [Updated] (FLINK-26804) Operator e2e tests sporadically fail: DEPLOYED_NOT_READY

     [ https://issues.apache.org/jira/browse/FLINK-26804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-26804:
-----------------------------------
    Labels: pull-request-available  (was: )

> Operator e2e tests sporadically fail: DEPLOYED_NOT_READY
> --------------------------------------------------------
>
>                 Key: FLINK-26804
>                 URL: https://issues.apache.org/jira/browse/FLINK-26804
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Márton Balassi
>            Assignee: Márton Balassi
>            Priority: Major
>              Labels: pull-request-available
>
> I managed to introduce a sporadic failure scenario for the e2e tests via my fix for FLINK-26715. Since the operator only checks on the job every couple of seconds, the job might still be observed as being in the DEPLOYED_NOT_READY state even after it has successfully completed checkpoints.
> {code:bash}
> Run ls e2e-tests/test_*.sh | while read script_test;do \
> Running e2e-tests/test_kubernetes_application_ha.sh
> persistentvolumeclaim/flink-example-statemachine created
> Error from server (InternalError): error when creating "e2e-tests/data/cr.yaml": Internal error occurred: failed calling webhook "vflinkdeployments.flink.apache.org": failed to call webhook: Post "https://flink-operator-webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.106.63.26:443: connect: connection refused
> Command: kubectl apply -f e2e-tests/data/cr.yaml failed. Retrying...
> flinkdeployment.flink.apache.org/flink-example-statemachine created
> persistentvolumeclaim/flink-example-statemachine unchanged
> Error from server (NotFound): deployments.apps "flink-example-statemachine" not found
> Command: kubectl get deploy/flink-example-statemachine failed. Retrying...
> NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
> flink-example-statemachine   0/1     1            0           1s
> deployment.apps/flink-example-statemachine condition met
> Waiting for jobmanager pod flink-example-statemachine-7fcf55c88b-h5r7r ready.
> pod/flink-example-statemachine-7fcf55c88b-h5r7r condition met
> Waiting for log "Rest endpoint listening at"...
> Log "Rest endpoint listening at" shows up.
> Waiting for log "Completed checkpoint [0-9]+ for job"...
> Log "Completed checkpoint [0-9]+ for job" shows up.
> Successfully verified that flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus is in READY state.
> Successfully verified that flinkdep/flink-example-statemachine.status.jobStatus.state is in RUNNING state.
> Kill the flink-example-statemachine-7fcf55c88b-h5r7r
> Defaulted container "flink-main-container" out of: flink-main-container, artifacts-fetcher (init)
> Waiting for log "Restoring job 00000000000000000000000000000000 from Checkpoint"...
> Log "Restoring job 00000000000000000000000000000000 from Checkpoint" shows up.
> Waiting for log "Completed checkpoint [0-9]+ for job"...
> Log "Completed checkpoint [0-9]+ for job" shows up.
> Status verification for flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus failed. It is DEPLOYED_NOT_READY instead of READY.
> Debugging failed e2e test:
> {code}
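
Because the operator refreshes the status only every couple of seconds, a one-shot status check right after the checkpoint log appears can race with the next observation. One way to tolerate that lag is to poll the status until it matches or a timeout expires. The following is only a sketch under stated assumptions: wait_for_status is a hypothetical helper and is not taken from the operator's e2e scripts, and the timeout and interval values are arbitrary.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the operator's e2e scripts).
# Usage: wait_for_status EXPECTED TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until its output equals EXPECTED or the timeout expires.
wait_for_status() {
  local expected="$1" timeout="$2" interval="$3"
  shift 3
  local deadline actual
  deadline=$(( $(date +%s) + timeout ))
  actual=""
  while true; do
    # A failing status command is treated as "status not available yet".
    actual="$("$@" 2>/dev/null)" || actual=""
    if [ "$actual" = "$expected" ]; then
      echo "Status is ${expected}."
      return 0
    fi
    # Give up once the deadline has passed; otherwise wait and poll again.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      break
    fi
    sleep "$interval"
  done
  echo "Status is ${actual:-<unknown>} instead of ${expected} after ${timeout}s." >&2
  return 1
}
```

With the resource and status field named in the log above, a call could look like `wait_for_status READY 60 2 kubectl get flinkdep/flink-example-statemachine -o jsonpath='{.status.jobManagerDeploymentStatus}'`. The 60-second timeout and 2-second interval are assumptions and would need to be tuned to the operator's actual observation period.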



--
This message was sent by Atlassian Jira
(v8.20.1#820001)