Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/21 10:11:32 UTC

[GitHub] [airflow] eejbyfeldt opened a new issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

eejbyfeldt opened a new issue #13808:
URL: https://github.com/apache/airflow/issues/13808


   **Apache Airflow version**:
   
   2.0.0
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   
   1.15.7
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AWS
   
   **What happened**:
   
   We run two schedulers, and both were restarted simultaneously.
   
   After the restart, the first scheduler correctly adopts the orphaned tasks:
   ```
   [2021-01-20 16:28:06,744] {kubernetes_executor.py:626} INFO - attempting to adopt pod dag1sometask-bcf1adde250
   [2021-01-20 16:28:06,762] {kubernetes_executor.py:147} INFO - Event: dag1sometask-bcf1adde250 had an event of type ADDED
   [2021-01-20 16:28:06,762] {kubernetes_executor.py:210} INFO - Event: dag1sometask-bcf1adde250 is Running
   ```
   
   Then, 15 minutes later, the second scheduler incorrectly resets the state of this TaskInstance:
   ```
   [2021-01-20 16:43:10,383] {scheduler_job.py:1764} INFO - Marked 1 SchedulerJob instances as failed
   [2021-01-20 16:43:10,591] {scheduler_job.py:1805} INFO - Reset the following 1 orphaned TaskInstances:
   	<TaskInstance: dag1.sometask 2021-01-19 01:00:00+00:00 [running]>
   ```
   
   **What you expected to happen**:
   
   One of the schedulers adopts the orphaned task and the other does nothing with it.
   
   **How to reproduce it**:
   
   Run 2 schedulers and have tasks running using the KubernetesExecutor (see the sketch below for a minimal example DAG).
   Restart both schedulers simultaneously.
   Wait for a task to fail due to an external state change of its task instance.
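   
   For anyone trying to reproduce this, a minimal sketch of the kind of long-sleeping DAG described later in this thread could look like the following (the DAG id and task id are made up; any task that sleeps long enough to outlive a scheduler restart should do):
   ```
   # Hypothetical reproduction DAG (all names made up): a single task that
   # sleeps long enough to still be running while both schedulers restart.
   import time
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   
   
   def sleep_an_hour():
       # Log progress so an externally triggered SIGTERM shows up in the task log.
       for minute in range(60):
           print(f"I'm sleeping for {minute}/60 minutes")
           time.sleep(60)
   
   
   with DAG(
       dag_id="orphan_repro_dag",
       start_date=datetime(2021, 1, 1),
       schedule_interval="@hourly",
       catchup=False,
   ) as dag:
       PythonOperator(task_id="long_sleep", python_callable=sleep_an_hour)
   ```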
   
   **Anything else we need to know**:
   
   The following configuration was used for the scheduler:
   ```
   scheduler_heartbeat_sec = 15
   scheduler_health_check_threshold = 30
   orphaned_tasks_check_interval = 300.0
   ```
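   
   As a sanity check, the effective values can be read back from the environment the scheduler runs in; a minimal sketch, assuming these options live under the `[scheduler]` section as in the 2.0 configuration reference:
   ```
   # Print the effective scheduler settings (expected values from this report
   # are shown in the comments). Run inside the scheduler's environment.
   from airflow.configuration import conf
   
   print(conf.getint("scheduler", "scheduler_heartbeat_sec"))            # 15
   print(conf.getint("scheduler", "scheduler_health_check_threshold"))   # 30
   print(conf.getfloat("scheduler", "orphaned_tasks_check_interval"))    # 300.0
   ```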
   
   We have not seen the issue every time we restart the schedulers, but it has happened more than once.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] atrbgithub commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
atrbgithub commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-796580922


   @ephraimbuddy @dimberman @eejbyfeldt 
   
   We're seeing an issue similar to this, but in our case we never have two schedulers running. 
   
   We see this after moving to 2.0.1. We have a dag which contains a subdag; both tasks are launched using the k8s executor. The task in the subdag does nothing but sleep for an hour or so. If we kill the scheduler pod while the task is running, everything is fine: when the scheduler pod returns, the task continues to run.
   
   However, when we redeploy Airflow and the scheduler is recreated (with a new Docker image), we see that the task is terminated:
   
   ```
   [2021-03-11 08:29:32,491] {{logging_mixin.py:104}} INFO - I'm sleeping for 10/60 minutes
   [2021-03-11 08:30:32,501] {{logging_mixin.py:104}} INFO - I'm sleeping for 11/60 minutes
   [2021-03-11 08:30:55,688] {{local_task_job.py:187}} WARNING - State of this instance has been externally set to queued. Terminating instance.
   [2021-03-11 08:30:55,690] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 16
   [2021-03-11 08:30:55,691] {{taskinstance.py:1240}} ERROR - Received SIGTERM. Terminating subprocesses.
   [2021-03-11 08:30:55,719] {{taskinstance.py:1456}} ERROR - Task received SIGTERM signal
   ```
   
   This seems to be new behaviour in 2.0.1. The scheduler has the following logs:
   
   ```
   [2021-03-11 08:30:51,524] {scheduler_job.py:1898} INFO - Reset the following 2 orphaned TaskInstances:
           <TaskInstance: kubernetes_executor_worker_adhoc_dag.worker_pod_adhoc_job 2021-03-11 08:19:21.448002+00:00 [running]>
           <TaskInstance: kubernetes_executor_worker_adhoc_dag.worker_pod_adhoc_job.kubernetes_executor_worker_adhoc_task 2021-03-11 08:19:21.448002+00:00 [running]>
   ```
   
   These relate to the task which was killed above. Is this expected? We're using the recreate strategy in k8s, which means the old scheduler is stopped before the new one is created; they are never running at the same time.
   
   Is there perhaps a config option to prevent this from happening?
   
   
   
   
   





[GitHub] [airflow] eejbyfeldt commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
eejbyfeldt commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-801152433


   > @ephraimbuddy @dimberman @eejbyfeldt @kaxil
   > 
   > We have created a PR for this which seems to resolve it for us:
   > #14795
   > 
   > Would it be possible to review this? I'm not sure if it will resolve it for this particular issue, but certainly when a single scheduler is used, and that pod is terminated over and over again, without this fix the tasks are not correctly adopted after the first restart.
   
   Since reporting this, we have also seen the problem while running a single scheduler. So it is possible that we wrongly assumed it was related to running two schedulers because that setup made it more prominent for us.
   
   I have had a look at the PR and I agree that using `external_executor_id` inside `try_adopt_task_instances` while updating `queued_by_job_id` outside of it seems problematic, but I am not familiar enough with the code to say what the correct solution is.
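   
   To make that concern concrete, below is a deliberately simplified, hypothetical sketch (not the actual Airflow code; all class and function names are approximations) of why deciding adoption on one field while selecting and resetting "orphans" on another can reset a healthy, running task:
   ```
   # Hypothetical illustration only, NOT the real scheduler_job.py / executor code.
   from dataclasses import dataclass
   from typing import List, Optional
   
   
   @dataclass
   class FakeTI:
       key: str
       state: str = "running"
       queued_by_job_id: Optional[int] = None
       external_executor_id: Optional[str] = None
   
   
   class FakeExecutor:
       """Adopts a TI only if it matches bookkeeping the executor already knows."""
   
       def __init__(self, known_ids):
           self.known_ids = set(known_ids)
   
       def try_adopt_task_instances(self, tis: List[FakeTI]) -> List[FakeTI]:
           # Returns the TIs it could NOT adopt (mirrors the shape of the real API).
           return [ti for ti in tis if ti.external_executor_id not in self.known_ids]
   
   
   def adopt_or_reset(tis, executor, my_job_id, failed_job_ids):
       # 1. "Orphans": running TIs whose owning scheduler job is now marked failed.
       orphaned = [ti for ti in tis
                   if ti.state == "running" and ti.queued_by_job_id in failed_job_ids]
       # 2. The executor adopts what it can, using its own criteria.
       not_adopted = executor.try_adopt_task_instances(orphaned)
       # 3. Adopted TIs are re-owned by this scheduler; the rest get reset, which is
       #    the "Reset the following N orphaned TaskInstances" log seen above.
       for ti in orphaned:
           if ti in not_adopted:
               ti.state = "scheduled"
           else:
               ti.queued_by_job_id = my_job_id
   
   
   ti = FakeTI(key="dag1.sometask", queued_by_job_id=1, external_executor_id=None)
   adopt_or_reset([ti], FakeExecutor(known_ids={"some-adopted-id"}),
                  my_job_id=2, failed_job_ids={1})
   print(ti.state)  # "scheduled": a still-running task gets reset, as in this issue
   ```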
   
   Thanks for helping to make some progress towards solving this issue. 





[GitHub] [airflow] dimberman commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
dimberman commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-779402190


   @ephraimbuddy would you be interested in trying to reproduce this?





[GitHub] [airflow] eejbyfeldt commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
eejbyfeldt commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-764710319


   > Did that task run correctly or was it marked as failed?
   
   It failed. In the logs of the task we had the following:
   ```
   State of this instance has been externally set to queued. Terminating instance.
   Sending Signals.SIGTERM to GPID 16
   Received SIGTERM. Terminating subprocesses.
   Task received SIGTERM signal
   ...
   ```
   So it seems the `LocalTaskJob` noticed the state change and terminated the task.
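   
   For context, a rough, hypothetical sketch of the kind of check the `LocalTaskJob` heartbeat appears to make (not the actual local_task_job.py code):
   ```
   # Hypothetical sketch, NOT the real local_task_job.py. Once another scheduler
   # flips the task's state in the metadata DB away from "running", the local
   # supervisor terminates the still-running process group with SIGTERM, which
   # produces exactly the messages quoted above.
   import os
   import signal
   
   
   def heartbeat_callback(db_state: str, process_group_id: int) -> None:
       if db_state != "running":
           print(f"State of this instance has been externally set to {db_state}. "
                 "Terminating instance.")
           os.killpg(process_group_id, signal.SIGTERM)
   ```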
   





[GitHub] [airflow] eejbyfeldt commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
eejbyfeldt commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-771672924


   We did not run with that specific fix, only with the one proposed in https://github.com/apache/airflow/issues/13677.
   
   Also, both schedulers were running when the task failures occurred and continued running afterwards, so neither scheduler was killed/restarted due to the livenessProbe.
   
   @grepthat If you saw this issue without that fix, was that the real root cause or were you just hitting this bug more often when you had more scheduler restarts?








[GitHub] [airflow] boring-cyborg[bot] commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-764527556


   Thanks for opening your first issue here! Be sure to follow the issue template!
   








[GitHub] [airflow] atrbgithub edited a comment on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
atrbgithub edited a comment on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-796580922


   @ephraimbuddy @dimberman @eejbyfeldt 
   
   We're seeing an issue similar to this, but in our case we never have two schedulers running. 
   
   We see this after moving to 2.0.1. We have a dag which contains a subdag; both tasks are launched using the k8s executor. The task in the subdag does nothing but sleep for an hour or so. If we kill the scheduler pod while the task is running, everything is fine: when the scheduler pod returns, the task continues to run.
   
   However, when we redeploy Airflow and the scheduler is recreated (with a new Docker image), we see that the task is terminated:
   
   ```
   [2021-03-11 08:29:32,491] {{logging_mixin.py:104}} INFO - I'm sleeping for 10/60 minutes
   [2021-03-11 08:30:32,501] {{logging_mixin.py:104}} INFO - I'm sleeping for 11/60 minutes
   [2021-03-11 08:30:55,688] {{local_task_job.py:187}} WARNING - State of this instance has been externally set to queued. Terminating instance.
   [2021-03-11 08:30:55,690] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 16
   [2021-03-11 08:30:55,691] {{taskinstance.py:1240}} ERROR - Received SIGTERM. Terminating subprocesses.
   [2021-03-11 08:30:55,719] {{taskinstance.py:1456}} ERROR - Task received SIGTERM signal
   ```
   
   This seems to be new behaviour in 2.0.1. The scheduler has the following logs:
   
   ```
   [2021-03-11 08:30:51,524] {scheduler_job.py:1898} INFO - Reset the following 2 orphaned TaskInstances:
           <TaskInstance: kubernetes_executor_worker_adhoc_dag.worker_pod_adhoc_job 2021-03-11 08:19:21.448002+00:00 [running]>
           <TaskInstance: kubernetes_executor_worker_adhoc_dag.worker_pod_adhoc_job.kubernetes_executor_worker_adhoc_task 2021-03-11 08:19:21.448002+00:00 [running]>
   ```
   
   ```
   [2021-03-11 08:30:51,646] {scheduler_job.py:1063} INFO - Setting the following tasks to queued state:
           <TaskInstance: kubernetes_executor_worker_adhoc_dag.worker_pod_adhoc_job 2021-03-11 08:19:21.448002+00:00 [scheduled]>
           <TaskInstance: kubernetes_executor_worker_adhoc_dag.worker_pod_adhoc_job.kubernetes_executor_worker_adhoc_task 2021-03-11 08:19:21.448002+00:00 [scheduled]>
   ```
   
   These relate to the task which was killed above. Is this expected? We're using the recreate strategy in k8s, which means the old scheduler is stopped before the new one is created; they are never running at the same time.
   
   Is there perhaps a config option to prevent this from happening?
   
   Edit: We also see this if we manually scale the scheduler down to 0 for a few minutes and then scale it back up to 1.
   
   
   





[GitHub] [airflow] grepthat commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
grepthat commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-771609549


   We saw this happening due to https://github.com/apache/airflow/pull/13705 - did you run an Airflow version with this fix included?





[GitHub] [airflow] kaxil commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-764699292


   Did that task run correctly or was it marked as failed?








[GitHub] [airflow] ephraimbuddy commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-779404724


   > @ephraimbuddy would you be interested in trying to reproduce this?
   
   Yes, I will try to reproduce it.








[GitHub] [airflow] kaxil closed issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
kaxil closed issue #13808:
URL: https://github.com/apache/airflow/issues/13808


   





[GitHub] [airflow] atrbgithub commented on issue #13808: Task incorrectly marked as orphaned when using 2 schedulers

Posted by GitBox <gi...@apache.org>.
atrbgithub commented on issue #13808:
URL: https://github.com/apache/airflow/issues/13808#issuecomment-799267633


   @ephraimbuddy @dimberman @eejbyfeldt @kaxil 
   
   We have created a PR for this which seems to resolve it for us:
   https://github.com/apache/airflow/pull/14795
   
   Would it be possible to review this? I'm not sure whether it will resolve this particular issue, but when a single scheduler is used and its pod is terminated over and over again, tasks are certainly not correctly adopted after the first restart without this fix.







