Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/15 18:32:25 UTC

[GitHub] [airflow] collinmcnulty commented on issue #16625: Task is not retried when worker pod fails to start

collinmcnulty commented on issue #16625:
URL: https://github.com/apache/airflow/issues/16625#issuecomment-920275278


   I can reproduce this issue like this:
   
   Use this DAG on Airflow 2.1.1:
   ```
   from datetime import timedelta
   
   from kubernetes.client import models as k8s
   
   from airflow import DAG
   from airflow.operators.bash import BashOperator
   from airflow.utils.dates import days_ago
   
   
   with DAG(
       dag_id="pending",
       schedule_interval=None,
       start_date=days_ago(2),
   ) as dag:
       BashOperator(
           task_id="forever_pending",
           bash_command="date; sleep 30; date",
           retries=3,
           retry_delay=timedelta(seconds=30),
           executor_config={
               "pod_override": k8s.V1Pod(
                   spec=k8s.V1PodSpec(
                       containers=[
                           k8s.V1Container(
                               name="base",
                               volume_mounts=[
                                   k8s.V1VolumeMount(mount_path="/foo/", name="vol")
                               ],
                           )
                       ],
                       volumes=[
                           k8s.V1Volume(
                               name="vol",
                               persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                                   claim_name="missing"
                               ),
                           )
                       ],
                   )
               ),
           },
       )
   ```
   
   And here is the scheduler log from around the failure (newest entries first):
   
   ```
   [2021-09-15 17:48:56,352] {scheduler_job.py:873} WARNING - Set 1 task instances to state=failed as their associated DagRun was not in RUNNING state
   2021-09-15T17:48:56.134716Z info watchFileEvents: "/etc/certs": MODIFY|ATTRIB
   2021-09-15T17:48:56.134808Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.729675760": MODIFY|ATTRIB
   [2021-09-15 17:48:47,821] {dagrun.py:429} ERROR - Marking run <DagRun pending @ 2021-09-15 17:43:28.990599+00:00: manual__2021-09-15T17:43:28.990599+00:00, externally triggered: True> failed
   [2021-09-15 17:48:47,769] {scheduler_job.py:1258} ERROR - Executor reports task instance <TaskInstance: pending.forever_pending 2021-09-15 17:43:28.990599+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
   [2021-09-15 17:48:47,769] {scheduler_job.py:1265} INFO - Setting task instance <TaskInstance: pending.forever_pending 2021-09-15 17:43:28.990599+00:00 [queued]> state to failed as reported by executor
   [2021-09-15 17:48:47,761] {kubernetes_executor.py:549} INFO - Changing state of (TaskInstanceKey(dag_id='pending', task_id='forever_pending', execution_date=datetime.datetime(2021, 9, 15, 17, 43, 28, 990599, tzinfo=tzlocal()), try_number=1), 'failed', 'pendingforeverpending.cc4a625ffe0d4da88709098daba98d87', 'astronomer-magnificent-aurora-4284', '1751732637') to failed
   [2021-09-15 17:48:47,761] {scheduler_job.py:1229} INFO - Executor reports execution of pending.forever_pending execution_date=2021-09-15 17:43:28.990599+00:00 exited with status failed for try_number 1
   [2021-09-15 17:48:47,759] {kubernetes_executor.py:372} INFO - Attempting to finish pod; pod_id: pendingforeverpending.cc4a625ffe0d4da88709098daba98d87; state: failed; annotations: {'dag_id': 'pending', 'task_id': 'forever_pending', 'execution_date': '2021-09-15T17:43:28.990599+00:00', 'try_number': '1'}
   [2021-09-15 17:48:46,695] {kubernetes_executor.py:149} INFO - Event: pendingforeverpending.cc4a625ffe0d4da88709098daba98d87 had an event of type DELETED
   [2021-09-15 17:48:46,695] {kubernetes_executor.py:200} INFO - Event: Failed to start pod pendingforeverpending.cc4a625ffe0d4da88709098daba98d87
   [2021-09-15 17:48:46,692] {kubernetes_executor.py:149} INFO - Event: pendingforeverpending.cc4a625ffe0d4da88709098daba98d87 had an event of type MODIFIED
   [2021-09-15 17:48:46,692] {kubernetes_executor.py:203} INFO - Event: pendingforeverpending.cc4a625ffe0d4da88709098daba98d87 Pending
   [2021-09-15 17:48:46,676] {kubernetes_executor.py:625} ERROR - Pod "pendingforeverpending.cc4a625ffe0d4da88709098daba98d87" has been pending for longer than 300 seconds.It will be deleted and set to failed.
   2021-09-15T17:47:50.966665Z info watchFileEvents: notifying
   2021-09-15T17:47:47.079744Z info watchFileEvents: notifying
   2021-09-15T17:47:40.966397Z info watchFileEvents: "/etc/certs": MODIFY|ATTRIB
   2021-09-15T17:47:40.966527Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.426627327": MODIFY|ATTRIB
   2021-09-15T17:47:37.079501Z info watchFileEvents: "/etc/certs": MODIFY|ATTRIB
   2021-09-15T17:47:37.079624Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.729675760": MODIFY|ATTRIB
   [2021-09-15 17:47:07,909] {scheduler_job.py:1841} INFO - Resetting orphaned tasks for active dag runs
   [2021-09-15 17:47:00,347] {scheduler_job.py:1841} INFO - Resetting orphaned tasks for active dag runs
   2021-09-15T17:46:35.978572Z info watchFileEvents: notifying
   2021-09-15T17:46:25.978277Z info watchFileEvents: "/etc/certs": MODIFY|ATTRIB
   2021-09-15T17:46:25.978421Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.426627327": MODIFY|ATTRIB
   2021-09-15T17:46:21.074893Z info watchFileEvents: notifying
   2021-09-15T17:46:11.074610Z info watchFileEvents: "/etc/certs": MODIFY|ATTRIB
   2021-09-15T17:46:11.074754Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.729675760": MODIFY|ATTRIB
   2021-09-15T17:45:11.006936Z info watchFileEvents: notifying
   2021-09-15T17:45:01.006688Z info watchFileEvents: "/etc/certs": MODIFY|ATTRIB
   2021-09-15T17:45:01.006777Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.426627327": MODIFY|ATTRIB
   2021-09-15T17:45:01.006787Z info watchFileEvents: "/etc/certs/..2021_09_06_06_43_21.426627327": MODIFY|ATTRIB
   ```
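
Since the log above is newest-first and interleaved with `watchFileEvents` lines that do not come from Airflow, it can help to filter and reorder it before reading. The following is only a rough sketch: the function name `chronological_airflow_lines` and the regex are my own assumptions about the `[timestamp] {file:line} LEVEL - message` layout seen above, not anything provided by Airflow, and the sample messages are abbreviated from the log.

```python
import re
from datetime import datetime

# Assumed layout, taken from the lines above:
# "[2021-09-15 17:48:47,761] {file.py:123} LEVEL - message"
AIRFLOW_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"\{(?P<src>[^}]+)\} (?P<level>[A-Z]+) - (?P<msg>.*)$"
)


def chronological_airflow_lines(raw_log: str):
    """Return (timestamp, source, level, message) tuples, oldest first.

    Hypothetical helper: lines that do not match the Airflow layout
    (for example the "watchFileEvents" lines) are skipped.
    """
    entries = []
    for line in raw_log.splitlines():
        match = AIRFLOW_LINE.match(line.strip())
        if match is None:
            continue  # not an Airflow-formatted line
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
        entries.append((ts, match["src"], match["level"], match["msg"]))
    # sorted() is stable, so lines sharing a timestamp keep their input order
    return sorted(entries, key=lambda entry: entry[0])


# Messages below are shortened versions of lines from the log above.
sample = """\
[2021-09-15 17:48:47,769] {scheduler_job.py:1265} INFO - Setting task instance state to failed as reported by executor
2021-09-15T17:47:50.966665Z info watchFileEvents: notifying
[2021-09-15 17:48:46,676] {kubernetes_executor.py:625} ERROR - Pod has been pending for longer than 300 seconds.
"""

for ts, src, level, msg in chronological_airflow_lines(sample):
    print(ts.time(), level, src, msg)
```

Read in that order, the sequence is: the pending-pod timeout fires, the pod is deleted, the executor reports the task as failed, and the scheduler fails the task instance and the DagRun without scheduling a retry.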
   

