Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/18 17:52:41 UTC

[GitHub] [airflow] lindsable opened a new issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

lindsable opened a new issue #17693:
URL: https://github.com/apache/airflow/issues/17693


   **Apache Airflow version**: 2.1.2
   **OS**: Custom Docker Image built from python:3.8-slim
   **Apache Airflow Provider versions**:
   - apache-airflow-providers-cncf-kubernetes==2.0.1
   - apache-airflow-providers-ftp==2.0.0
   - apache-airflow-providers-http==2.0.0
   - apache-airflow-providers-imap==2.0.0
   
   **Deployment**: Self-managed on EKS; manifests are templated and rendered with krane; DAGs are mounted via a PVC and EFS
   
   **What happened**: The task remains in the queued state indefinitely (or until the scheduler is restarted).
   ![Screen Shot 2021-08-18 at 12 17 28 PM](https://user-images.githubusercontent.com/47788186/129936170-e16e1362-24ca-4ce9-b2f7-978f2642d388.png)
   
   **What you expected to happen**: The task starts running.
   
   **How to reproduce it**: I believe it happens when the node a worker pod is running on is removed from the cluster. I've attached both the scheduler/Kubernetes executor logs and the Kubernetes (EKS) logs.
   [scheduler-executor-logs.csv](https://github.com/apache/airflow/files/7008737/scheduler-executor-logs.csv)
   [cloud-watch-eks-logs.csv](https://github.com/apache/airflow/files/7009128/cloud-watch-eks-logs.csv)
   
   ```
   2021-08-17 13:07:39.000,1 node_lifecycle_controller.go:1127] node ip-10-0-151-210.ec2.internal hasn't been updated for 40.010072107s. Last MemoryPressure is: &NodeCondition{Type:MemoryPressure,Status:False,LastHeartbeatTime:2021-08-17 13:06:16 +0000 UTC,LastTransitionTime:2021-08-14 14:36:53 +0000 UTC,Reason:KubeletHasSufficientMemory,Message:kubelet has sufficient memory available,}"
   2021-08-17 13:07:39.000, 1 node_lifecycle_controller.go:1127] node ip-10-0-151-210.ec2.internal hasn't been updated for 40.010146198s. Last PIDPressure is: &NodeCondition{Type:PIDPressure,Status:False,LastHeartbeatTime:2021-08-17 13:06:16 +0000 UTC,LastTransitionTime:2021-08-14 14:36:53 +0000 UTC,Reason:KubeletHasSufficientPID,Message:kubelet has sufficient PID available,}"
   2021-08-17 13:07:39.000,1 node_lifecycle_controller.go:1127] node ip-10-0-151-210.ec2.internal hasn't been updated for 40.010092188s. Last DiskPressure is: &NodeCondition{Type:DiskPressure,Status:False,LastHeartbeatTime:2021-08-17 13:06:16 +0000 UTC,LastTransitionTime:2021-08-14 14:36:53 +0000 UTC,Reason:KubeletHasNoDiskPressure,Message:kubelet has no disk pressure,}"
   2021-08-17 13:07:40.000, 1 node_tree.go:100] Removed node ""ip-10-0-151-210.ec2.internal"" in group ""us-east-1:\x00:us-east-1c"" from NodeTree"
   2021-08-17 13:07:40.000,  1 controller_utils.go:182] Recording status change NodeNotReady event message for node ip-10-0-151-210.ec2.internal
   2021-08-17 13:07:40.000,1 controller_utils.go:122] Update ready status of pods on node [ip-10-0-151-210.ec2.internal]
   2021-08-17 13:07:40.000, 1 event.go:278] Event(v1.ObjectReference{Kind:""Node"", Namespace:"""", Name:""ip-10-0-151-210.ec2.internal"", UID:""bbf44ddc-78fa-4ba1-ad56-01e751dfb4b0"", APIVersion:""v1"", ResourceVersion:"""", FieldPath:""""}): type: 'Normal' reason: 'NodeNotReady' Node ip-10-0-151-210.ec2.internal status is now: NodeNotReady"
   2021-08-17 13:07:40.000, 1 node_lifecycle_controller.go:181] deleting node since it is no longer present in cloud provider: ip-10-0-151-210.ec2.internal
   2021-08-17 13:07:40.000, 1 event.go:278] Event(v1.ObjectReference{Kind:""Node"", Namespace:"""", Name:""ip-10-0-151-210.ec2.internal"", UID:""bbf44ddc-78fa-4ba1-ad56-01e751dfb4b0"", APIVersion:"""", ResourceVersion:"""", FieldPath:""""}): type: 'Normal' reason: 'Deleting node ip-10-0-151-210.ec2.internal because it does not exist in the cloud provider' Node ip-10-0-151-210.ec2.internal event: DeletingNode"
   2021-08-17 13:07:40.333,"{""kind"":""Event"",""apiVersion"":""audit.k8s.io/v1"",""level"":""RequestResponse"",""auditID"":""ab54e784-d671-4374-a83c-6eaa3ca7088f"",""stage"":""ResponseComplete"",""requestURI"":""/api/v1/namespaces/etl/pods/etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35/status"",""verb"":""update"",""user"":{""username"":""system:serviceaccount:kube-system:node-controller"",""uid"":""6a8594ad-a8b6-11e9-818d-128c99dcc1f2"",""groups"":[""system:serviceaccounts"",""system:serviceaccounts:kube-system"",""system:authenticated""]},""sourceIPs"":[""172.16.113.40""],""userAgent"":""kube-controller-manager/v1.18.16 (linux/amd64) kubernetes/7737de1/system:serviceaccount:kube-system:node-controller"",""objectRef"":{""resource"":""pods"",""namespace"":""etl"",""name"":""etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""uid"":""12194e36-46d9-466f-897a-8a73b96446f0"",""apiVersion"":""v1"","
 "resourceVersion"":""580055875"",""subresource"":""status""},""responseStatus"":{""metadata"":{},""code"":200},""requestObject"":{""kind"":""Pod"",""apiVersion"":""v1"",""metadata"":{""name"":""etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""namespace"":""etl"",""selfLink"":""/api/v1/namespaces/etl/pods/etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""uid"":""12194e36-46d9-466f-897a-8a73b96446f0"",""resourceVersion"":""580055875"",""creationTimestamp"":""2021-08-17T13:05:53Z"",""labels"":{""airflow-worker"":""16375530"",""airflow_version"":""2.1.2"",""app"":""airflowable"",""component"":""worker"",""dag_id"":""etl_fedora_course_collections"",""execution_date"":""2021-08-17T12_00_00_plus_00_00"",""kubernetes_executor"":""True"",""task_id"":""create_stage_table_from_current_table"",""try_number"":""1""},""annotations"":{""dag_id"":""etl_fedora_course_collections"",""execution_date"":""2021-0
 8-17T12:00:00+00:00"",""kubernetes.io/psp"":""eks.privileged"",""task_id"":""create_stage_table_from_current_table"",""try_number"":""1""},""managedFields"":[{""manager"":""OpenAPI-Generator"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:05:53Z"",""fieldsType"":""FieldsV1"",""fieldsV1"":{""f:metadata"":{""f:annotations"":{""."":{},""f:dag_id"":{},""f:execution_date"":{},""f:task_id"":{},""f:try_number"":{}},""f:labels"":{""."":{},""f:airflow-worker"":{},""f:airflow_version"":{},""f:app"":{},""f:component"":{},""f:dag_id"":{},""f:execution_date"":{},""f:kubernetes_executor"":{},""f:task_id"":{},""f:try_number"":{}}},""f:spec"":{""f:affinity"":{""."":{},""f:podAntiAffinity"":{""."":{},""f:requiredDuringSchedulingIgnoredDuringExecution"":{}}},""{""manager"":""kube-scheduler"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:05:53Z"",""fieldsType"":""FieldsV1"",""fieldsV1"":{""f:status"":{""f:conditions"":{""."":{},""k:{\""type\"":\""Pod
 Scheduled\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:message"":{},""f:reason"":{},""f:status"":{},""f:type"":{}}}}}},{""manager"":""kubelet"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:06:56Z"",""fieldsType"":""FieldsV1"",""fieldsV1"":{""f:status"":{""f:conditions"":{""k:{\""type\"":\""ContainersReady\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:status"":{},""f:type"":{}},""k:{\""type\"":\""Initialized\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:status"":{},""f:type"":{}},""k:{\""type\"":\""Ready\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:status"":{},""f:type"":{}}},""f:containerStatuses"":{},""f:hostIP"":{},""f:phase"":{},""f:podIP"":{},""f:podIPs"":{""."":{},""k:{\""ip\"":\""10.0.149.225\""}"":{""."":{},""f:ip"":{}}},""f:startTime"":{}}}}]},""spec"":{""volumes"":[{""name"":""airflow-dags"",""persistentVolumeClaim"":{""claimName"":""airflow-dags"
 "}},{""name"":""airflow-logs"",""persistentVolumeClaim"":{""claimName"":""airflow-logs""}},{""name"":""airflow-cluster-admin-token-7hxjd"",""secret"":{""secretName"":""airflow-cluster-admin-token-7hxjd"",""defaultMode"":420}}],""containers"":[{""name"":""base"",""image"":""898529291541.dkr.ecr.us-east-1.amazonaws.com/airflowable:git-89418dc3d56a"",""args"":[""airflow"",""tasks"",""run"",""etl_fedora_course_collections"",""create_stage_table_from_current_table"",""2021-08-17T12:00:00+00:00"",""--local"",""--pool"",""default_pool"",""--subdir"",""/usr/local/airflow/dags/main/federated_db_etl.py""],""ports"":[{""containerPort"":8080,""protocol"":""TCP""}],""envFrom"":[{""secretRef"":{""name"":""airflowable""}},{""configMapRef"":{""name"":""airflow-env-config""}}],""env"":[{""name"":""AIRFLOW__CORE__EXECUTOR"",""value"":""LocalExecutor""},{""name"":""SERVICE_NAME"",""value"":""airflowable""},{""name"":""AIRFLOW__METRICS__STATSD_HOST"",""valueFrom"":{""fieldRef"":{""apiVersion"":""v1"","
 "fieldPath"":""status.hostIP""}}},{""name"":""AIRFLOW_IS_K8S_EXECUTOR_POD"",""value"":""True""}],""resources"":{""limits"":{""cpu"":""500m"",""memory"":""500Mi""},""requests"":{""cpu"":""500m"",""memory"":""500Mi""}},""volumeMounts"":[{""name"":""airflow-dags"",""mountPath"":""/usr/local/airflow/dags""},{""name"":""airflow-logs"",""mountPath"":""/usr/local/airflow/logs""},{""name"":""airflow-cluster-admin-token-7hxjd"",""readOnly"":true,""mountPath"":""/var/run/secrets/kubernetes.io/serviceaccount""}],""terminationMessagePath"":""/dev/termination-log"",""terminationMessagePolicy"":""File"",""imagePullPolicy"":""IfNotPresent""}],""restartPolicy"":""Never"",""terminationGracePeriodSeconds"":30,""dnsPolicy"":""ClusterFirst"",""serviceAccountName"":""airflow-cluster-admin"",""serviceAccount"":""airflow-cluster-admin"",""nodeName"":""ip-10-0-151-210.ec2.internal"",""securityContext"":{},""affinity"":{""podAntiAffinity"":{""requiredDuringSchedulingIgnoredDuringExecution"":[{""labelSelecto
 r"":{""matchExpressions"":[{""key"":""app"",""operator"":""In"",""values"":[""kafka""]}]},""topologyKey"":""kubernetes.io/hostname""}]}},""schedulerName"":""default-scheduler"",""tolerations"":[{""key"":""node.kubernetes.io/not-ready"",""operator"":""Exists"",""effect"":""NoExecute"",""tolerationSeconds"":300},{""key"":""node.kubernetes.io/unreachable"",""operator"":""Exists"",""effect"":""NoExecute"",""tolerationSeconds"":300}],""priority"":0,""enableServiceLinks"":true},""status"":{""phase"":""Running"",""conditions"":[{""type"":""Initialized"",""status"":""True"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:06:53Z""},{""type"":""Ready"",""status"":""False"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:07:40Z""},{""type"":""ContainersReady"",""status"":""True"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:06:56Z""},{""type"":""PodScheduled"",""status"":""True"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:06:53Z""
 }],""hostIP"":""10.0.151.210"",""podIP"":""10.0.149.225"",""podIPs"":[{""ip"":""10.0.149.225""}],""startTime"":""2021-08-17T13:06:53Z"",""containerStatuses"":[{""name"":""base"",""state"":{""running"":{""startedAt"":""2021-08-17T13:06:55Z""}},""lastState"":{},""ready"":true,""restartCount"":0,""image"":""898529291541.dkr.ecr.us-east-1.amazonaws.com/airflowable:git-89418dc3d56a"",""imageID"":""docker-pullable://898529291541.dkr.ecr.us-east-1.amazonaws.com/airflowable@sha256:a9bd2de5662a8be6033902632c8782e0cf69985095ddf2d0bbd4b08b383fea90"",""containerID"":""docker://ee7184e150c0bef601300b3d2ba4fc79c47978bbc6688dcd563d3f138eb3440a"",""started"":true}],""qosClass"":""Guaranteed""}},""responseObject"":{""kind"":""Pod"",""apiVersion"":""v1"",""metadata"":{""name"":""etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""namespace"":""etl"",""selfLink"":""/api/v1/namespaces/etl/pods/etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6
 206054a4baab8428eabdb6a35/status"",""uid"":""12194e36-46d9-466f-897a-8a73b96446f0"",""resourceVersion"":""580057074"",""creationTimestamp"":""2021-08-17T13:05:53Z"",""labels"":{""airflow-worker"":""16375530"",""airflow_version"":""2.1.2"",""app"":""airflowable"",""component"":""worker"",""dag_id"":""etl_fedora_course_collections"",""execution_date"":""2021-08-17T12_00_00_plus_00_00"",""kubernetes_executor"":""True"",""task_id"":""create_stage_table_from_current_table"",""try_number"":""1""},""annotations"":{""dag_id"":""etl_fedora_course_collections"",""execution_date"":""2021-08-17T12:00:00+00:00"",""kubernetes.io/psp"":""eks.privileged"",""task_id"":""create_stage_table_from_current_table"",""try_number"":""1""},""managedFields"":[{""manager"":""OpenAPI-Generator"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:05:53Z"",""fieldsType"":""FieldsV1"",
   2021-08-17 13:08:31.000,I0817 13:08:31.810773       1 gc_controller.go:185] Found orphaned Pod etl/etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35 assigned to the Node ip-10-0-151-210.ec2.internal. Deleting.
   2021-08-17 13:08:31.892,"{""kind"":""Event"",""apiVersion"":""audit.k8s.io/v1"",""level"":""RequestResponse"",""auditID"":""30456b9d-f13f-4e90-ae34-bee294a5f675"",""stage"":""ResponseComplete"",""requestURI"":""/api/v1/namespaces/etl/pods/etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""verb"":""delete"",""user"":{""username"":""system:serviceaccount:kube-system:pod-garbage-collector"",""uid"":""692d32b6-a8b6-11e9-818d-128c99dcc1f2"",""groups"":[""system:serviceaccounts"",""system:serviceaccounts:kube-system"",""system:authenticated""]},""sourceIPs"":[""172.16.113.40""],""userAgent"":""kube-controller-manager/v1.18.16 (linux/amd64) kubernetes/7737de1/system:serviceaccount:kube-system:pod-garbage-collector"",""objectRef"":{""resource"":""pods"",""namespace"":""etl"",""name"":""etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""apiVersion"":""v1""},""responseStatus"":{""metadata"":{},""code""
 :200},""requestObject"":{""kind"":""DeleteOptions"",""apiVersion"":""v1"",""gracePeriodSeconds"":0},""responseObject"":{""kind"":""Pod"",""apiVersion"":""v1"",""metadata"":{""name"":""etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""namespace"":""etl"",""selfLink"":""/api/v1/namespaces/etl/pods/etlfedoracoursecollectionscreatestagetablefromcurrenttable.92e18b6206054a4baab8428eabdb6a35"",""uid"":""12194e36-46d9-466f-897a-8a73b96446f0"",""resourceVersion"":""580058245"",""creationTimestamp"":""2021-08-17T13:05:53Z"",""deletionTimestamp"":""2021-08-17T13:08:31Z"",""deletionGracePeriodSeconds"":0,""labels"":{""airflow-worker"":""16375530"",""airflow_version"":""2.1.2"",""app"":""airflowable"",""component"":""worker"",""dag_id"":""etl_fedora_course_collections"",""execution_date"":""2021-08-17T12_00_00_plus_00_00"",""kubernetes_executor"":""True"",""task_id"":""create_stage_table_from_current_table"",""try_number"":""1""},""annotations"":{""
 dag_id"":""etl_fedora_course_collections"",""execution_date"":""2021-08-17T12:00:00+00:00"",""kubernetes.io/psp"":""eks.privileged"",""task_id"":""create_stage_table_from_current_table"",""try_number"":""1""},""managedFields"":[{""manager"":""OpenAPI-Generator"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:05:53Z"",""fieldsType"":""FieldsV1"",""fieldsV1"":{""f:metadata"":{""f:annotations"":{""."":{},""f:dag_id"":{},""f:execution_date"":{},""f:task_id"":{},""f:try_number"":{}},""f:labels"":{""."":{},""f:airflow-worker"":{},""f:airflow_version"":{},""f:app"":{},""f:component"":{},""f:dag_id"":{},""f:execution_date"":{},""f:kubernetes_executor"":{},""f:task_id"":{},""f:try_number"":{}}},""f:spec"":{""f:affinity"":{""."":{},""f:podAntiAffinity"":{""."":{},""f:requiredDuringSchedulingIgnoredDuringExecution"":{}}},""f:containers"":{""k:{\""name\"":\""base\""}"":{""."":{},""f:args"":{},""f:env"":{""."":{},""k:{\""name\"":\""AIRFLOW_IS_K8S_EXECUTOR_POD\""}"":{""."
 ":{},""f:name"":{},""f:value"":{}},""k:{\""name\"":\""AIRFLOW__CORE__EXECUTOR\""}"":{""."":{},""f:name"":{},""f:value"":{}},""k:{\""name\"":\""AIRFLOW__METRICS__STATSD_HOST\""}"":{""."":{},""f:name"":{},""f:valueFrom"":{""."":{},""f:fieldRef"":{""."":{},""f:apiVersion"":{},""f:fieldPath"":{}}}},""k:{\""name\"":\""SERVICE_NAME\""}"":{""."":{},""f:name"":{},""f:value"":{}}},""f:envFrom"":{},""f:image"":{},""f:imagePullPolicy"":{},""f:name"":{},""f:ports"":{""."":{},""k:{\""containerPort\"":8080,\""protocol\"":\""TCP\""}"":{""."":{},""f:containerPort"":{},""f:protocol"":{}}},""f:resources"":{""."":{},""f:limits"":{""."":{},""f:cpu"":{},""f:memory"":{}},""f:requests"":{""."":{},""f:cpu"":{},""f:memory"":{}}},""f:terminationMessagePath"":{},""f:terminationMessagePolicy"":{},""f:volumeMounts"":{""."":{},""k:{\""mountPath\"":\""/usr/local/airflow/dags\""}"":{""."":{},""f:mountPath"":{},""f:name"":{}},""k:{\""mountPath\"":\""/usr/local/airflow/logs\""}"":{""."":{},""f:mountPath"":{},""f:nam
 e"":{}}}}},""f:dnsPolicy"":{},""f:enableServiceLinks"":{},""f:restartPolicy"":{},""f:schedulerName"":{},""f:securityContext"":{},""f:serviceAccount"":{},""f:serviceAccountName"":{},""f:terminationGracePeriodSeconds"":{},""f:volumes"":{""."":{},""k:{\""name\"":\""airflow-dags\""}"":{""."":{},""f:name"":{},""f:persistentVolumeClaim"":{""."":{},""f:claimName"":{}}},""k:{\""name\"":\""airflow-logs\""}"":{""."":{},""f:name"":{},""f:persistentVolumeClaim"":{""."":{},""f:claimName"":{}}}}}}},{""manager"":""kube-scheduler"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:05:53Z"",""fieldsType"":""FieldsV1"",""fieldsV1"":{""f:status"":{""f:conditions"":{""."":{},""k:{\""type\"":\""PodScheduled\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:message"":{},""f:reason"":{},""f:status"":{},""f:type"":{}}}}}},{""manager"":""kubelet"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:06:56Z"",""fieldsType"":""FieldsV1"",""fieldsV1
 "":{""f:status"":{""f:conditions"":{""k:{\""type\"":\""ContainersReady\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:status"":{},""f:type"":{}},""k:{\""type\"":\""Initialized\""}"":{""."":{},""f:lastProbeTime"":{},""f:lastTransitionTime"":{},""f:status"":{},""f:type"":{}},""k:{\""type\"":\""Ready\""}"":{""."":{},""f:lastProbeTime"":{},""f:type"":{}}},""f:containerStatuses"":{},""f:hostIP"":{},""f:phase"":{},""f:podIP"":{},""f:podIPs"":{""."":{},""k:{\""ip\"":\""10.0.149.225\""}"":{""."":{},""f:ip"":{}}},""f:startTime"":{}}}},{""manager"":""kube-controller-manager"",""operation"":""Update"",""apiVersion"":""v1"",""time"":""2021-08-17T13:07:40Z"",""fieldsType"":""FieldsV1"",""fieldsV1"":{""f:status"":{""f:conditions"":{""k:{\""type\"":\""Ready\""}"":{""f:lastTransitionTime"":{},""f:status"":{}}}}}}]},""spec"":{""volumes"":[{""name"":""airflow-dags"",""persistentVolumeClaim"":{""claimName"":""airflow-dags""}},{""name"":""airflow-logs"",""persistentVolumeClaim""
 :{""claimName"":""airflow-logs""}},{""name"":""airflow-cluster-admin-token-7hxjd"",""secret"":{""secretName"":""airflow-cluster-admin-token-7hxjd"",""defaultMode"":420}}],""containers"":[{""name"":""base"",""image"":""898529291541.dkr.ecr.us-east-1.amazonaws.com/airflowable:git-89418dc3d56a"",""args"":[""airflow"",""tasks"",""run"",""etl_fedora_course_collections"",""create_stage_table_from_current_table"",""2021-08-17T12:00:00+00:00"",""--local"",""--pool"",""default_pool"",""--subdir"",""/usr/local/airflow/dags/main/federated_db_etl.py""],""ports"":[{""containerPort"":8080,""protocol"":""TCP""}],""envFrom"":[{""secretRef"":{""name"":""airflowable""}},{""configMapRef"":{""name"":""airflow-env-config""}}],""env"":[{""name"":""AIRFLOW__CORE__EXECUTOR"",""value"":""LocalExecutor""},{""name"":""SERVICE_NAME"",""value"":""airflowable""},{""name"":""AIRFLOW__METRICS__STATSD_HOST"",""valueFrom"":{""fieldRef"":{""apiVersion"":""v1"",""fieldPath"":""status.hostIP""}}},{""name"":""AIRFLOW_IS
 _K8S_EXECUTOR_POD"",""value"":""True""}],""resources"":{""limits"":{""cpu"":""500m"",""memory"":""500Mi""},""requests"":{""cpu"":""500m"",""memory"":""500Mi""}},""volumeMounts"":[{""name"":""airflow-dags"",""mountPath"":""/usr/local/airflow/dags""},{""name"":""airflow-logs"",""mountPath"":""/usr/local/airflow/logs""},{""name"":""airflow-cluster-admin-token-7hxjd"",""readOnly"":true,""mountPath"":""/var/run/secrets/kubernetes.io/serviceaccount""}],""terminationMessagePath"":""/dev/termination-log"",""terminationMessagePolicy"":""File"",""imagePullPolicy"":""IfNotPresent""}],""restartPolicy"":""Never"",""terminationGracePeriodSeconds"":30,""dnsPolicy"":""ClusterFirst"",""serviceAccountName"":""airflow-cluster-admin"",""serviceAccount"":""airflow-cluster-admin"",""nodeName"":""ip-10-0-151-210.ec2.internal"",""securityContext"":{},""affinity"":{""podAntiAffinity"":{""requiredDuringSchedulingIgnoredDuringExecution"":[{""labelSelector"":{""matchExpressions"":[{""key"":""app"",""operator""
 :""In"",""values"":[""kafka""]}]},""topologyKey"":""kubernetes.io/hostname""}]}},""schedulerName"":""default-scheduler"",""tolerations"":[{""key"":""node.kubernetes.io/not-ready"",""operator"":""Exists"",""effect"":""NoExecute"",""tolerationSeconds"":300},{""key"":""node.kubernetes.io/unreachable"",""operator"":""Exists"",""effect"":""NoExecute"",""tolerationSeconds"":300}],""priority"":0,""enableServiceLinks"":true},""status"":{""phase"":""Running"",""conditions"":[{""type"":""Initialized"",""status"":""True"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:06:53Z""},{""type"":""Ready"",""status"":""False"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:07:40Z""},{""type"":""ContainersReady"",""status"":""True"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:06:56Z""},{""type"":""PodScheduled"",""status"":""True"",""lastProbeTime"":null,""lastTransitionTime"":""2021-08-17T13:06:53Z""}],""hostIP"":""10.0.151.210"",""podIP"":""10.0.149.225"
 ",""podIPs"":[{""ip"":""10.0.149.225""}],""startTime"":""2021-08-17T13:06:53Z"",""containerStatuses"":[{""name"":""base"",""state"":{""running"":{""startedAt"":""2021-08-17T13:06:55Z""}},""lastState"":{},""ready"":true,""restartCount"":0,""image"":""898529291541.dkr.ecr.us-east-1.amazonaws.com/airflowable:git-89418dc3d56a"",""imageID"":""docker-pullable://898529291541.dkr.ecr.us-east-1.amazonaws.com/airflowable@sha256:a9bd2de5662a8be6033902632c8782e0cf69985095ddf2d0bbd4b08b383fea90"",""containerID"":""docker://ee7184e150c0bef601300b3d2ba4fc79c47978bbc6688dcd563d3f138eb3440a"",""started"":true}],""qosClass"":""Guaranteed""}},""requestReceivedTimestamp"":""2021-08-17T13:08:31.813033Z"",""stageTimestamp"":""2021-08-17T13:08:31.869272Z"",""annotations"":{""authorization.k8s.io/decision"":""allow"",""authorization.k8s.io/reason"":""RBAC: allowed by ClusterRoleBinding \""system:controller:pod-garbage-collector\"" of ClusterRole \""system:controller:pod-garbage-collector\"" to ServiceAccou
 nt \""pod-garbage-collector/kube-system\""""}}"
   
   ```
   
   **Are you willing to submit a PR?** I think a state and event combination is missing from the `process_state` function in the `KubernetesJobWatcher`. I would be willing to pair on a PR to fix it.
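A minimal sketch of the missing combination, written as a standalone function rather than a patch to the watcher. `classify_pod_event` and the state strings are hypothetical names; the branches other than `Running` are simplified placeholders, not a copy of Airflow's code. The phases are the standard Kubernetes pod phases and `DELETED` is the watch event type seen when the pod garbage collector removes the pod (as in the log above, where the deleted pod's phase is still `Running`).

```python
from typing import Optional

# Illustrative state labels (assumption); Airflow keeps its real constants
# in airflow.utils.state.State.
UP_FOR_RESCHEDULE = "up_for_reschedule"
FAILED = "failed"
SUCCESS = "success"


def classify_pod_event(phase: str, event_type: str) -> Optional[str]:
    """Hypothetical: map a pod phase plus watch event type to a task state.

    Returns None when the watcher should keep waiting for further events.
    """
    if phase == "Pending":
        # Deleted before it ever started: safe to reschedule.
        return UP_FOR_RESCHEDULE if event_type == "DELETED" else None
    if phase == "Failed":
        return FAILED
    if phase == "Succeeded":
        return SUCCESS
    if phase == "Running":
        # The combination from this issue: the pod object is DELETED (here by
        # the pod garbage collector after its node vanished) while its last
        # recorded phase is still Running. Without this branch nothing is
        # reported and the task stays queued.
        return FAILED if event_type == "DELETED" else None
    # Unknown phases: keep waiting.
    return None
```

With this mapping, `classify_pod_event("Running", "DELETED")` yields `"failed"`, while an ordinary `MODIFIED` event on a running pod still yields `None`.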
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #17693:
URL: https://github.com/apache/airflow/issues/17693#issuecomment-901312069


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] ephraimbuddy commented on issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #17693:
URL: https://github.com/apache/airflow/issues/17693#issuecomment-904380248


   2.1.3 has been released. Can you try this on 2.1.3? Similar issues were fixed in that release.





[GitHub] [airflow] lindsable commented on issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
lindsable commented on issue #17693:
URL: https://github.com/apache/airflow/issues/17693#issuecomment-904651962


   I'll try it out! When I upgrade, the folders in my DAGs folder aren't being loaded as modules, so it might take a day to get 2.1.3 deployed to production.





[GitHub] [airflow] lindsable edited a comment on issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
lindsable edited a comment on issue #17693:
URL: https://github.com/apache/airflow/issues/17693#issuecomment-915363464


   I was just able to confirm that this is still an issue in 2.1.3. I'll update the issue with the logs. The scenario is a node being removed from the cluster while a worker pod is already scheduled on it. The watcher then receives an event whose type is DELETED while the pod status is still Running. In this case I believe we should set the state to up_for_retry.





[GitHub] [airflow] lindsable commented on issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
lindsable commented on issue #17693:
URL: https://github.com/apache/airflow/issues/17693#issuecomment-915363464


   I was just able to confirm that this is still an issue in 2.1.3. I'll update the issue with the logs. The scenario is a node being removed from the cluster while a worker pod is already scheduled on it. The watcher then receives an event whose type is DELETED while the pod status is still Running. In this case I believe we should set the state to up_for_retry. The optimal solution would be to get the name of the new pod from the Kubernetes API and follow that one instead, but up_for_retry is better than nothing.





[GitHub] [airflow] lindsable edited a comment on issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
lindsable edited a comment on issue #17693:
URL: https://github.com/apache/airflow/issues/17693#issuecomment-915363464


   I was just able to confirm that this is still an issue in 2.1.3. I'll update the issue with the logs. The scenario is a node being removed from the cluster while a worker pod is already scheduled on it. The watcher then receives an event whose type is DELETED while the pod status is still Running. In this case I believe we should set the state to failed so that the task goes up for retry.
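Reporting `failed` only produces a retry when the task still has retries left. A simplified, hypothetical sketch of that decision (`state_after_failure` and its parameters are illustrative assumptions, not Airflow's implementation):

```python
def state_after_failure(try_number: int, retries: int) -> str:
    """Hypothetical: decide what a failed attempt becomes.

    try_number is the 1-based attempt that just failed; retries is the number
    of retries configured on the task, so retries + 1 attempts are allowed.
    """
    if try_number <= retries:
        # Attempts remain: the scheduler can run the task again.
        return "up_for_retry"
    # Retries exhausted (or none configured): the failure is final.
    return "failed"
```

For example, the deleted pod in the original report is labeled `try_number` 1, so with `retries=1` the call `state_after_failure(1, 1)` returns `"up_for_retry"`; with `retries=0` the same failure would be final.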





[GitHub] [airflow] ephraimbuddy closed issue #17693: Kubernetes Executor: Tasks Stuck in Queued State indefinitely (or until scheduler restart).

Posted by GitBox <gi...@apache.org>.
ephraimbuddy closed issue #17693:
URL: https://github.com/apache/airflow/issues/17693


   

