Posted to dev@flink.apache.org by "Márton Balassi (Jira)" <ji...@apache.org> on 2022/03/22 11:00:00 UTC

[jira] [Created] (FLINK-26804) Operator e2e tests sporadically fail: DEPLOYED_NOT_READY

Márton Balassi created FLINK-26804:
--------------------------------------

Summary: Operator e2e tests sporadically fail: DEPLOYED_NOT_READY
Key: FLINK-26804
URL: https://issues.apache.org/jira/browse/FLINK-26804
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Márton Balassi
Assignee: Márton Balassi

I managed to introduce a sporadic failure in the e2e tests with my fix for FLINK-26715. Since the operator only checks on the job every couple of seconds, the job might still be observed in the DEPLOYED_NOT_READY state even after it has successfully completed checkpoints.

{code:bash}
Run ls e2e-tests/test_*.sh | while read script_test;do \
Running e2e-tests/test_kubernetes_application_ha.sh
persistentvolumeclaim/flink-example-statemachine created
Error from server (InternalError): error when creating "e2e-tests/data/cr.yaml": Internal error occurred: failed calling webhook "vflinkdeployments.flink.apache.org": failed to call webhook: Post "https://flink-operator-webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.106.63.26:443: connect: connection refused
Command: kubectl apply -f e2e-tests/data/cr.yaml failed. Retrying...
flinkdeployment.flink.apache.org/flink-example-statemachine created
persistentvolumeclaim/flink-example-statemachine unchanged
Error from server (NotFound): deployments.apps "flink-example-statemachine" not found
Command: kubectl get deploy/flink-example-statemachine failed. Retrying...
NAME READY UP-TO-DATE AVAILABLE AGE
flink-example-statemachine 0/1 1 0 1s
deployment.apps/flink-example-statemachine condition met
Waiting for jobmanager pod flink-example-statemachine-7fcf55c88b-h5r7r ready.
pod/flink-example-statemachine-7fcf55c88b-h5r7r condition met
Waiting for log "Rest endpoint listening at"...
Log "Rest endpoint listening at" shows up.
Waiting for log "Completed checkpoint [0-9]+ for job"...
Log "Completed checkpoint [0-9]+ for job" shows up.
Successfully verified that flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus is in READY state.
Successfully verified that flinkdep/flink-example-statemachine.status.jobStatus.state is in RUNNING state.
Kill the flink-example-statemachine-7fcf55c88b-h5r7r
Defaulted container "flink-main-container" out of: flink-main-container, artifacts-fetcher (init)
Waiting for log "Restoring job 00000000000000000000000000000000 from Checkpoint"...
Log "Restoring job 00000000000000000000000000000000 from Checkpoint" shows up.
Waiting for log "Completed checkpoint [0-9]+ for job"...
Log "Completed checkpoint [0-9]+ for job" shows up.
Status verification for flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus failed. It is DEPLOYED_NOT_READY instead of READY.
Debugging failed e2e test:
{code}
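
One possible way to make the status verification tolerant of the operator's check interval is to poll the status until it matches or a timeout expires, rather than checking it once. The sketch below is only an illustration under that assumption: wait_for_status, its arguments, and the timeout and interval values are hypothetical and are not taken from the existing e2e scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the operator's e2e scripts): poll a status
# value until it matches the expected one, instead of checking it only once.
# Usage: wait_for_status <expected> <timeout_seconds> <interval_seconds> <command...>
wait_for_status() {
  local expected=$1
  local timeout=$2
  local interval=$3
  shift 3
  local deadline=$(( $(date +%s) + timeout ))
  local actual=""
  while true; do
    # Run the caller-supplied command; a failing command counts as "no value yet".
    actual=$("$@" 2>/dev/null) || actual=""
    if [ "$actual" = "$expected" ]; then
      echo "Status is ${expected}."
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s: status is '${actual}' instead of ${expected}." >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example call (assumed values: 60s timeout, 2s polling interval):
# wait_for_status READY 60 2 \
#   kubectl get flinkdep/flink-example-statemachine \
#   -o jsonpath='{.status.jobManagerDeploymentStatus}'
```

With a timeout comfortably larger than the operator's check interval, a status that is still DEPLOYED_NOT_READY right after the checkpoint log appears would get a few more chances to become READY before the test is failed.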

--
This message was sent by Atlassian Jira
(v8.20.1#820001)