Posted to commits@airflow.apache.org by "Joseph Harris (JIRA)" <ji...@apache.org> on 2017/12/19 11:31:01 UTC

[jira] [Created] (AIRFLOW-1941) Scheduler / executor loses tasks on restart when enforcing parallelism limit

Joseph Harris created AIRFLOW-1941:
--------------------------------------

             Summary: Scheduler / executor loses tasks on restart when enforcing parallelism limit
                 Key: AIRFLOW-1941
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1941
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.8.1, 1.9.0
         Environment: Linux
            Reporter: Joseph Harris


When running the scheduler with a limited number of cycles, e.g.
{{airflow scheduler -n 30}}
and with {{parallelism = 32}} set in airflow.cfg:

The Executor checks that {{len(self.running) < self.parallelism}} before calling {{execute_async()}}:
https://github.com/apache/incubator-airflow/blob/master/airflow/executors/base_executor.py#L98
When {{self.running}} stays full for an extended period, the scheduler can exit without ever having dispatched the remaining tasks in {{self.queued_tasks}}. When it restarts, those lost tasks in {{self.queued_tasks}} don't get scheduled again and get stuck in the 'queued' state until manually kicked.
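
For context, a condensed paraphrase of the heartbeat logic around that line (simplified, not verbatim from the linked file; names follow {{BaseExecutor}} as of 1.8/1.9):
{code:python}
# Condensed paraphrase of BaseExecutor.heartbeat(), not verbatim.
def heartbeat(self):
    if not self.parallelism:
        open_slots = len(self.queued_tasks)  # parallelism == 0 means unlimited
    else:
        open_slots = self.parallelism - len(self.running)
    sorted_queue = sorted(
        self.queued_tasks.items(), key=lambda x: x[1][1], reverse=True)
    # Only `open_slots` queued tasks are handed to execute_async();
    # the rest stay behind in self.queued_tasks.
    for _ in range(min(open_slots, len(sorted_queue))):
        key, (command, _, queue, ti) = sorted_queue.pop(0)
        self.queued_tasks.pop(key)
        self.running[key] = command
        self.execute_async(key=key, command=command, queue=queue)
    self.sync()
    # If self.running never drains (e.g. zombie jobs), open_slots stays
    # at 0 and queued tasks sit here until the scheduler process exits.
{code}
Note that {{parallelism == 0}} already means unlimited, which is relevant to the third suggestion below.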

We experienced this when exited tasks with clashing PIDs caused the CeleryExecutor's {{self.running}} to fill up with zombie jobs that could never complete.


* The Executor should not hold 'queued' tasks for an extended period of time, since it may exit for any reason. The parallelism constraint should be checked alongside the other task dependencies.
* When shutting down 'gracefully', the scheduler should at least log a warning if any tasks remain in {{self.queued_tasks}} (a sketch follows this list).
* Parallelism should be set to infinity if a queue-based/distributed executor is being used (more risky).
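
A minimal sketch of what the second suggestion could look like; the method name, its placement in the scheduler's shutdown path, and the logger attribute are illustrative, not taken from the code base:
{code:python}
# Hypothetical sketch of the suggested shutdown warning. The method name
# and self.logger attribute are illustrative; in 1.8/1.9 the scheduler
# job holds the executor as self.executor.
def _warn_on_leftover_queued_tasks(self):
    leftover = self.executor.queued_tasks
    if leftover:
        self.logger.warning(
            "Scheduler exiting with %s task(s) still queued in the "
            "executor; they will stay 'queued' until manually kicked: %s",
            len(leftover), sorted(leftover.keys()))
{code}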

This may be a common cause of tasks getting stuck in the 'queued' state when running Celery.
Although AIRFLOW-900 is resolved in 1.9.0, this issue is still present: the scheduler remains at risk of exiting without having scheduled the queued tasks.


