Posted to commits@airflow.apache.org by "Brent Driskill (Jira)" <ji...@apache.org> on 2019/11/30 18:07:00 UTC
[jira] [Created] (AIRFLOW-6134) Scheduler hanging every 45 minutes
Brent Driskill created AIRFLOW-6134:
---------------------------------------
Summary: Scheduler hanging every 45 minutes
Key: AIRFLOW-6134
URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: 1.10.6
Reporter: Brent Driskill
We had been running Airflow successfully for the past few months, but on the morning of 11/27 the scheduler hung for an unknown reason. After a restart it continued to hang every 30-45 minutes. We have temporarily added a health check that restarts it on that interval, but the scheduler remains unreliable.
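For reference, a minimal sketch of the staleness logic behind such a health check (hypothetical names; in practice the heartbeat timestamp would be read from the scheduler's row in the metadata database's job table):
{code:python}
from datetime import datetime, timedelta

def scheduler_is_stale(latest_heartbeat, now=None,
                       threshold=timedelta(minutes=5)):
    """Return True when the scheduler heartbeat is older than threshold.

    latest_heartbeat is assumed to come from the metadata DB (e.g. the
    latest_heartbeat column of the running SchedulerJob row).
    """
    now = now or datetime.utcnow()
    return (now - latest_heartbeat) > threshold
{code}
A cron job can evaluate this and restart the scheduler service whenever it returns True.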
The last logs during the hang are the following (these lines are logged over and over; I assume they come from the other thread):
{code:java}
[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
{code}
The last logs before that loop were the following:
{code:java}
[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
{code}
Running py-spy against the scheduler process, I see it hung in the following place:
{code:java}
Thread 4566 (idle): "MainThread"
connect (psycopg2/__init__.py:130)
connect (sqlalchemy/engine/default.py:482)
connect (sqlalchemy/engine/strategies.py:114)
__connect (sqlalchemy/pool/base.py:639)
__init__ (sqlalchemy/pool/base.py:437)
_create_connection (sqlalchemy/pool/base.py:308)
_do_get (sqlalchemy/pool/impl.py:136)
checkout (sqlalchemy/pool/base.py:492)
_checkout (sqlalchemy/pool/base.py:760)
connect (sqlalchemy/pool/base.py:363)
_wrap_pool_connect (sqlalchemy/engine/base.py:2276)
_contextual_connect (sqlalchemy/engine/base.py:2242)
_optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)
__enter__ (contextlib.py:112)
_run_visitor (sqlalchemy/engine/base.py:2048)
create_all (sqlalchemy/sql/schema.py:4316)
prepare_models (celery/backends/database/session.py:54)
session_factory (celery/backends/database/session.py:59)
ResultSession (celery/backends/database/__init__.py:99)
_get_task_meta_for (celery/backends/database/__init__.py:122)
_inner (celery/backends/database/__init__.py:53)
get_task_meta (celery/backends/base.py:386)
_get_task_meta (celery/result.py:412)
state (celery/result.py:473)
fetch_celery_task_state (airflow/executors/celery_executor.py:106)
mapstar (multiprocessing/pool.py:44)
worker (multiprocessing/pool.py:121)
run (multiprocessing/process.py:99)
_bootstrap (multiprocessing/process.py:297)
_launch (multiprocessing/popen_fork.py:74)
__init__ (multiprocessing/popen_fork.py:20)
_Popen (multiprocessing/context.py:277)
start (multiprocessing/process.py:112)
_repopulate_pool (multiprocessing/pool.py:241)
__init__ (multiprocessing/pool.py:176)
Pool (multiprocessing/context.py:119)
sync (airflow/executors/celery_executor.py:245)
heartbeat (airflow/executors/base_executor.py:136)
_execute_helper (airflow/jobs/scheduler_job.py:1445)
_execute (airflow/jobs/scheduler_job.py:1356)
run (airflow/jobs/base_job.py:222)
scheduler (airflow/bin/cli.py:1042)
wrapper (airflow/utils/cli.py:74)
<module> (airflow:37)
{code}
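Per the trace, the pool worker is blocked inside psycopg2.connect, apparently with no timeout, so the executor's sync/heartbeat never returns. As a general mitigation sketch (this is not Airflow's actual code; fetch_state and the names around it are hypothetical stand-ins for fetch_celery_task_state), each state fetch can be guarded with a timeout so a stuck connection attempt fails fast instead of hanging the whole heartbeat:
{code:python}
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FetchTimeout

def fetch_state(task_id):
    # Stand-in for the per-task state fetch; in the real failure this
    # call blocks forever inside psycopg2.connect.
    time.sleep(0.01)
    return (task_id, "SUCCESS")

def fetch_states(task_ids, timeout=5.0):
    """Fetch each task's state; a timed-out fetch maps to (task_id, None)."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {t: pool.submit(fetch_state, t) for t in task_ids}
    results = []
    for task_id, fut in futures.items():
        try:
            results.append(fut.result(timeout=timeout))
        except FetchTimeout:
            results.append((task_id, None))
    # wait=False so a stuck worker thread does not block the caller;
    # a truly hung thread can still delay interpreter shutdown.
    pool.shutdown(wait=False)
    return results
{code}
This only unblocks the caller's loop; the underlying fix would still need a connect timeout on the result-backend connection itself.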
We are using Postgres as our results_backend, with the CeleryExecutor.
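One workaround I am considering (an assumption on my part; the [celery] sync_parallelism option exists in recent 1.10.x releases and controls the size of the multiprocessing pool used by CeleryExecutor.sync() above, but please verify it applies to 1.10.6):
{code:none}
# airflow.cfg -- sketch, not a confirmed fix
[celery]
# Limit the worker pool that fetches task states from the result
# backend, reducing the number of forked processes that each open a
# fresh Postgres connection on every heartbeat.
sync_parallelism = 1
{code}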
--
This message was sent by Atlassian Jira
(v8.3.4#803005)