You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/28 21:35:08 UTC

[GitHub] [airflow] rpfernandezjr opened a new issue #15580: KubernetesExecutor not deleting worker pods on completion.

rpfernandezjr opened a new issue #15580:
URL: https://github.com/apache/airflow/issues/15580


   **Apache Airflow version**:  2.0.2
   
   **Kubernetes version**: 1.17.17
   
   - **Cloud provider or hardware configuration**: GKE
   - **OS** (e.g. from /etc/os-release):  CentOS Linux release 7.8.2003 (Core)
   - **kernel**: 4.19.150+ x86
   
   **What happened**:
   A dag is kicked off via the airflow webserver, the scheduler kicks off a new worker pod for the task, and that task is flagged as `success` , however the worker pod that was executed stays in a `CrashLoopBackOff` state, and is never removed.
   
   
   **What you expected to happen**:
   once the task is flagged as `success` , i expect the worker pod to get cleaned up. 
   
   Here's the entry from the database where the task was flagged as successful after the initial pod run.
   ```
     task_id   | try_number | pid |  state  |          start_date           |           end_date            
   ------------+------------+-----+---------+-------------------------------+-------------------------------
    start_task |          1 |   1 | success | 2021-04-28 20:55:37.117977+00 | 2021-04-28 20:55:37.424436+00
   ```
   
   Here's a snippet of the scheduler logs, it continues to log this in a loop 
   ```
   [2021-04-28 21:19:00,140] {kubernetes_executor.py:335} DEBUG - Syncing KubernetesExecutor
   [2021-04-28 21:19:00,140] {kubernetes_executor.py:261} DEBUG - KubeJobWatcher alive, continuing
   [2021-04-28 21:19:00,141] {dag_processing.py:385} DEBUG - Received message of type DagParsingStat
   [2021-04-28 21:19:00,142] {dag_processing.py:385} DEBUG - Received message of type DagParsingStat
   [2021-04-28 21:19:00,152] {scheduler_job.py:1399} DEBUG - Next timed event is in 1.834328
   [2021-04-28 21:19:00,152] {scheduler_job.py:1401} DEBUG - Ran scheduling loop in 0.05 seconds
   [2021-04-28 21:19:01,154] {scheduler_job.py:1591} DEBUG - Running SchedulerJob._create_dagruns_for_dags with retries. Try 1 of 3
   [2021-04-28 21:19:01,165] {scheduler_job.py:1573} DEBUG - Running SchedulerJob._get_dagmodels_and_create_dagruns with retries. Try 1 of 3
   [2021-04-28 21:19:01,187] {scheduler_job.py:940} DEBUG - No tasks to consider for execution.
   [2021-04-28 21:19:01,189] {base_executor.py:150} DEBUG - 1 running task instances
   [2021-04-28 21:19:01,189] {base_executor.py:151} DEBUG - 0 in queue
   [2021-04-28 21:19:01,189] {base_executor.py:152} DEBUG - 31 open slots
   [2021-04-28 21:19:01,190] {base_executor.py:161} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
   [2021-04-28 21:19:01,190] {kubernetes_executor.py:510} DEBUG - self.running: {TaskInstanceKey(dag_id='raf-k8-dag', task_id='start_task', execution_date=datetime.datetime(2021, 4, 28, 20, 55, 32, 53209, tzinfo=Timezone('UTC')), try_number=1)}
   ```
   
   here is the pod what was created `dagfk8dagstarttask...` , stays in a crashing state.
   ```
   NAME                                                 READY   STATUS             RESTARTS   AGE
   airflow-0                                            2/2     Running            0          17h
   dagfk8dagstarttask.372040981c7a4a2d8e3df6f01e656f50  0/1     CrashLoopBackOff   9          24m
   ```
   
   currently have these 2 settings in our `airflow.cfg` config
   ```
   delete_worker_pods = True
   delete_worker_pods_on_failure = False
   ```
   I've tried filliping `delete_worker_pods_on_failure` to both True and False and get the same results in my runs.
   
   
   here's the description of the worker pod that is crashing
   ```
   Name:                 dagk8dagstarttask.372040981c7a4a2d8e3df6f01e656f50
   Namespace:            default
   Priority:             500
   Priority Class Name:  mid-priority
   Node:                 gke-udp-xxxxx-gke-node-pool-1a26adcb-92ca/10.100.0.45
   Start Time:           Wed, 28 Apr 2021 15:55:33 -0500
   Labels:               airflow-worker=372
                         airflow_version=2.0.2
                         dag_id=raf-k8-dag
                         execution_date=2021-04-28T20_55_32.053209_plus_00_00
                         kubernetes_executor=True
                         task_id=start_task
                         try_number=1
   Annotations:          dag_id: raf-k8-dag
                         execution_date: 2021-04-28T20:55:32.053209+00:00
                         kubernetes.io/limit-ranger: LimitRanger plugin set: cpu, memory request for container base; cpu limit for container base
                         task_id: start_task
                         try_number: 1
   Status:               Running
   IP:                   10.102.1.96
   IPs:
     IP:  10.102.1.96
   Containers:
     base:
       Container ID:  docker://7d0fc4d8a1e799ce215a08a4216682f8a0fb33d735cb81c22ae4bd4410f3b78f
       Image:         gcr.io/xxxxxxxx/default/airflow:latest
       Image ID:      docker-pullable://gcr.io/xxxxxxxxxxx/default/airflow@sha256:66fe0f4e1185698c93ebfba05d837f1fc764cd3ec70492a6a058a01efa558598
       Port:          <none>
       Host Port:     <none>
       Args:
         airflow
         tasks
         run
         raf-k8-dag
         start_task
         2021-04-28T20:55:32.053209+00:00
         --local
         --pool
         default_pool
         --subdir
         /opt/airflow/dags/k8-test.py
       State:          Waiting
         Reason:       CrashLoopBackOff
       Last State:     Terminated
         Reason:       Completed
         Exit Code:    0
         Started:      Wed, 28 Apr 2021 16:01:34 -0500
         Finished:     Wed, 28 Apr 2021 16:01:37 -0500
       Ready:          False
       Restart Count:  6
       Limits:
         cpu:  750m
       Requests:
         cpu:     150m
         memory:  512Mi
       Environment:
         AIRFLOW_IS_K8S_EXECUTOR_POD:  True
       Mounts:
         /var/run/secrets/kubernetes.io/serviceaccount from default-token-hd2mp (ro)
   Conditions:
     Type              Status
     Initialized       True
     Ready             False
     ContainersReady   False
     PodScheduled      True
   Volumes:
     default-token-hd2mp:
       Type:        Secret (a volume populated by a Secret)
       SecretName:  default-token-hd2mp
       Optional:    false
   QoS Class:       Burstable
   Node-Selectors:  <none>
   Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
   Events:
     Type     Reason     Age                   From               Message
     ----     ------     ----                  ----               -------
     Normal   Scheduled  8m5s                  default-scheduler  Successfully assigned default/dagk8dagstarttask.372040981c7a4a2d8e3df6f01e656f50 to gke-udp-xxxxx-gke-node-pool-1a26adcb-92ca
     Normal   Started    7m8s (x4 over 8m3s)   kubelet            Started container base
     Normal   Pulling    6m21s (x5 over 8m4s)  kubelet            Pulling image "gcr.io/xxxxxxxxxxx/default/airflow:latest"
     Normal   Pulled     6m21s (x5 over 8m3s)  kubelet            Successfully pulled image "gcr.io/xxxxxxxxxxx/default/airflow:latest"
     Normal   Created    6m21s (x5 over 8m3s)  kubelet            Created container base
     Warning  BackOff    3m (x22 over 7m54s)   kubelet            Back-off restarting failed container
   ```
   
   here's the dag that I'm running too, pretty basic
   ```python
   import logging
   import os
   import time
   import sys
   
   from airflow import DAG
   from airflow.example_dags.libs.helper import print_stuff
   from airflow.operators.python import PythonOperator
   from airflow.settings import AIRFLOW_HOME
   from airflow.utils.dates import days_ago
   from kubernetes.client import models as k8s
   
   default_args = {
       'owner': 'airflow',
   }
   
   log = logging.getLogger(__name__)
   
   with DAG(
           dag_id='raf-k8-dag',
           default_args=default_args,
           schedule_interval=None,
           start_date=days_ago(1),
           tags=['raf-dag-tag'],
   ) as dag:
   
       def print_me():
           msg = "start-task ran ok"
           print(msg)
           log.info(msg)
           return 0
   
       start_task = PythonOperator(
           task_id="start_task",
           python_callable=print_me
       )
   
       start_task
   ```
   
   **How to reproduce it**:
   Run the dag listed above.
   
   
   
   **Anything else we need to know**:
    All signs point to the code (a basic print/log statement) are being executed when the worker runs, however we don't see the output of the worker logs via the airflow webserver. 
   
   seems very similar to this issues: https://github.com/apache/airflow/issues/13917  which was closed without a resolution.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-828795202


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kipkoan edited a comment on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
kipkoan edited a comment on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-864126661


   @rpfernandezjr did you ever find a solution to this? I'm seeing the same thing with Airflow 2.1.0 and the latest docker image.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kipkoan commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
kipkoan commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-864126661


   @rpfernandezjr did you ever find a solution to this? I'm seeing the same thing with the 2.1.0 docker image.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] nwsparks commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
nwsparks commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-909276155


   can confirm that setting `restartPolicy: Never` fixes this. Make sure its under spec: and not in the container: section.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rpfernandezjr commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
rpfernandezjr commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-829699581


   Was messing with this today some more. If I set `AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE`  and point that to this template file
   
   ```
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: dummy-name
   spec:
     serviceAccountName: default
     restartPolicy: Never
     containers:
       - name: base
         image: dummy_image
         imagePullPolicy: IfNotPresent
         ports: []
         command: []
   ```
   The important setting from the template file being ` restartPolicy: Never`
   
   Is this the correct way of making the pods not stay in the crash loop? 
   
   It doesn't seems like it should be the right solution, and more of a hack, but I'm not entirely sure. Especially if i want to set this setting;
   
   ```
   delete_worker_pods_on_failure = False
   ```
   
   so that my pods don't get deleted when something goes sideways.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] bensonnd commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
bensonnd commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-829522449


   Commenting to follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rpfernandezjr commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
rpfernandezjr commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-911966784


   @kipkoan  - no i just ended up using `restartPolicy: Never` in my pod template spec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] nwsparks commented on issue #15580: KubernetesExecutor not deleting worker pods on completion.

Posted by GitBox <gi...@apache.org>.
nwsparks commented on issue #15580:
URL: https://github.com/apache/airflow/issues/15580#issuecomment-909276155


   can confirm that setting `restartPolicy: Never` fixes this. Make sure its under spec: and not in the container: section.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org