You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Robert Kozikowski (JIRA)" <ji...@apache.org> on 2016/09/19 19:49:20 UTC

[jira] [Updated] (AIRFLOW-515) Airflow is stuck without signs of life

     [ https://issues.apache.org/jira/browse/AIRFLOW-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kozikowski updated AIRFLOW-515:
--------------------------------------
    Summary: Airflow is stuck without signs of life  (was: Airflow system is stuck without signs of life)

> Airflow is stuck without signs of life
> --------------------------------------
>
>                 Key: AIRFLOW-515
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-515
>             Project: Apache Airflow
>          Issue Type: Bug
>         Environment: airflow==1.7.1.3 Dockerfile: https://gist.github.com/kozikow/9290c44f03fba8d0f1299f355ef96b51
> kubernetes config (running within minikube): https://gist.github.com/kozikow/49c812f11d1e4fd43e465e8090619274
>            Reporter: Robert Kozikowski
>
> h1. Symptoms
> Triggering a dag on the web node via the trigger_dag does not start it. I triggered the same dag moment before and it worked. I did not apply any changes in the mean time. 
> Very likely some deadlock on scheduler node deadlocked the whole system. 
> Triggering the dag (airflow trigger_dag on the web node) only adds it to postgres table dag_run. The run does not appear in the flower or the main web gui. System was stuck for 50 minutes without any signs of life, and no logs coming from either scheduler, worker or webserver.
> In the end, system got unblocked after force-restarting scheduler and worker. Is there any way I should health-check my scheduler to avoid similar situation in production?  
> h1. Recent logs
> Logs were collected at 19:10, the system was stuck for almost an hour.
> h2. scheduler
> {noformat}
> /usr/lib/python3.5/site-packages/sqlalchemy/sql/default_comparator.py:153: SAWarning: The IN-predicate on "task_instance.execution_date" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate.  Consider alternative strategies for improved performance.
>   'strategies for improved performance.' % expr)
> [2016-09-19 18:15:41,171] {jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
> [2016-09-19 18:15:41,172] {jobs.py:744} INFO - Loop took: 0.37502 seconds
> [2016-09-19 18:15:41,189] {models.py:305} INFO - Finding 'running' jobs without a recent heartbeat
> [2016-09-19 18:15:41,189] {models.py:311} INFO - Failing jobs without heartbeat after 2016-09-19 18:13:26.189579
> [2016-09-19 18:15:41,212] {celery_executor.py:64} INFO - [celery] queuing ('repos_to_process_dag', 'update_repos_to_process', datetime.datetime(2016, 9, 19, 18, 15, 35, 200082)) through celery, queue=default
> [2016-09-19 18:15:45,817] {jobs.py:574} INFO - Prioritizing 0 queued jobs
> [2016-09-19 18:15:45,831] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
> [2016-09-19 18:15:45,881] {jobs.py:726} INFO - Starting 2 scheduler jobs
> {noformat}
> h2. web
> {noformat}
> [2016-09-19 18:10:12,213] {__init__.py:36} INFO - Using executor CeleryExecutor
> [2016-09-19 18:10:12,597] {__init__.py:36} INFO - Using executor CeleryExecutor
> [2016-09-19 18:10:12,663] {__init__.py:36} INFO - Using executor CeleryExecutor
> [2016-09-19 18:10:12,790] {__init__.py:36} INFO - Using executor CeleryExecutor
> [2016-09-19 18:10:13,210] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
> [2016-09-19 18:10:13,742] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
> [2016-09-19 18:10:13,815] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
> [2016-09-19 18:10:13,912] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
> {noformat}
> h2. worker
> {noformat}
> Logging into: /usr/local/airflow/logs/repos_to_process_dag/update_repos_to_process/2016-09-19T18:15:35.200082
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)