Posted to commits@airflow.apache.org by "Brent Driskill (Jira)" <ji...@apache.org> on 2019/11/30 18:07:00 UTC

[jira] [Created] (AIRFLOW-6134) Scheduler hanging every 45 minutes

Brent Driskill created AIRFLOW-6134:
---------------------------------------

             Summary: Scheduler hanging every 45 minutes
                 Key: AIRFLOW-6134
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10.6
            Reporter: Brent Driskill


We have been running Airflow successfully for the past few months. However, starting on the morning of 11/27, the scheduler hung for an unknown reason. After restarting it, it continued to hang every 30-45 minutes. We have temporarily implemented a health check that restarts it at that interval, but the scheduler remains unreliable.
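The health check mentioned above can be sketched roughly as follows. This is a minimal sketch, not our production script; it assumes the standard Airflow 1.10.x metadata `job` table with `job_type`, `state`, and `latest_heartbeat` columns, and the timeout value is arbitrary:

```python
from datetime import datetime, timedelta

# Threshold after which we consider the scheduler hung; tune to taste.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)

# Query against the Airflow metadata DB for the most recent scheduler
# heartbeat (sketch, matching the 1.10.x schema).
LATEST_HEARTBEAT_SQL = """
SELECT MAX(latest_heartbeat)
FROM job
WHERE job_type = 'SchedulerJob' AND state = 'running'
"""

def scheduler_is_healthy(latest_heartbeat, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Return True if the scheduler heartbeat is recent enough.

    latest_heartbeat is the datetime returned by LATEST_HEARTBEAT_SQL,
    or None when no running SchedulerJob row exists.
    """
    if latest_heartbeat is None:
        return False
    now = now or datetime.utcnow()
    return now - latest_heartbeat <= timeout
```

The restart side is just a supervisor/cron job that bounces the scheduler process whenever this returns False.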

The last logs during the hang are the following (this block is logged over and over; I assume it comes from a thread other than the hung one):
{code:java}
17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
17:56:44[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
{code}
The last logs before that loop were the following:
{code:java}
17:56:38[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
17:56:39[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
17:56:39[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
{code}
Running py-spy against the hung process shows it stuck in the following place:
{code:java}
Thread 4566 (idle): "MainThread"    
connect (psycopg2/__init__.py:130)    
connect (sqlalchemy/engine/default.py:482)    
connect (sqlalchemy/engine/strategies.py:114)    
__connect (sqlalchemy/pool/base.py:639)    
__init__ (sqlalchemy/pool/base.py:437)    
_create_connection (sqlalchemy/pool/base.py:308)    
_do_get (sqlalchemy/pool/impl.py:136)    
checkout (sqlalchemy/pool/base.py:492)    
_checkout (sqlalchemy/pool/base.py:760)    
connect (sqlalchemy/pool/base.py:363)    
_wrap_pool_connect (sqlalchemy/engine/base.py:2276)    
_contextual_connect (sqlalchemy/engine/base.py:2242)    
_optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)    
__enter__ (contextlib.py:112)    
_run_visitor (sqlalchemy/engine/base.py:2048)    
create_all (sqlalchemy/sql/schema.py:4316)    
prepare_models (celery/backends/database/session.py:54)    
session_factory (celery/backends/database/session.py:59)    
ResultSession (celery/backends/database/__init__.py:99)    
_get_task_meta_for (celery/backends/database/__init__.py:122)    
_inner (celery/backends/database/__init__.py:53)    
get_task_meta (celery/backends/base.py:386)    
_get_task_meta (celery/result.py:412)    
state (celery/result.py:473)    
fetch_celery_task_state (airflow/executors/celery_executor.py:106)    
mapstar (multiprocessing/pool.py:44)    
worker (multiprocessing/pool.py:121)    
run (multiprocessing/process.py:99)    
_bootstrap (multiprocessing/process.py:297)    
_launch (multiprocessing/popen_fork.py:74)    
__init__ (multiprocessing/popen_fork.py:20)    
_Popen (multiprocessing/context.py:277)    
start (multiprocessing/process.py:112)    
_repopulate_pool (multiprocessing/pool.py:241)    
__init__ (multiprocessing/pool.py:176)    
Pool (multiprocessing/context.py:119)    
sync (airflow/executors/celery_executor.py:245)    
heartbeat (airflow/executors/base_executor.py:136)    
_execute_helper (airflow/jobs/scheduler_job.py:1445)    
_execute (airflow/jobs/scheduler_job.py:1356)    
run (airflow/jobs/base_job.py:222)    
scheduler (airflow/bin/cli.py:1042)    
wrapper (airflow/utils/cli.py:74)    
<module> (airflow:37)
{code}
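Reading the trace, the hang is inside `psycopg2.connect` while the Celery database result backend opens a fresh connection from a newly forked pool worker (`fetch_celery_task_state`). By default libpq waits indefinitely to establish a connection, so a single stuck connection attempt can block the whole executor sync. One mitigation would be to pass libpq's `connect_timeout` parameter so the connect fails fast instead of hanging; the helper below is our own sketch (the function name is hypothetical), shown as plain string handling so it works with any key/value DSN:

```python
def add_connect_timeout(dsn, timeout_seconds=10):
    """Append libpq's connect_timeout to a key/value DSN if it is absent.

    connect_timeout is a standard libpq connection parameter; with it set,
    psycopg2.connect() raises an OperationalError on a dead host instead of
    blocking indefinitely.
    """
    if "connect_timeout" in dsn:
        return dsn
    return "{} connect_timeout={}".format(dsn.strip(), timeout_seconds)

# Usage (sketch):
#   psycopg2.connect(add_connect_timeout("dbname=airflow host=postgres"))
```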
We are using Postgres as our Celery result_backend and running the CeleryExecutor.
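If the root cause is indeed a result-backend connection that never times out, one workaround we may try (an assumption on our part, not a verified fix) is to pass `connect_timeout` through the result backend URL, since SQLAlchemy's psycopg2 dialect forwards unrecognized URL query parameters to the driver. Host and credentials below are placeholders:

```ini
[celery]
# Hypothetical workaround: fail fast on result-backend connections instead
# of hanging indefinitely in psycopg2.connect.
result_backend = db+postgresql://airflow:***@postgres:5432/airflow?connect_timeout=10
```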



--
This message was sent by Atlassian Jira
(v8.3.4#803005)