Posted to commits@airflow.apache.org by "Ash Berlin-Taylor (Jira)" <ji...@apache.org> on 2019/12/06 16:37:00 UTC
[jira] [Assigned] (AIRFLOW-6134) Scheduler hanging every 45 minutes, workers hanging on every job
[ https://issues.apache.org/jira/browse/AIRFLOW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ash Berlin-Taylor reassigned AIRFLOW-6134:
------------------------------------------
Assignee: Ash Berlin-Taylor
> Scheduler hanging every 45 minutes, workers hanging on every job
> ----------------------------------------------------------------
>
> Key: AIRFLOW-6134
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler, worker
> Affects Versions: 1.10.6
> Reporter: Brent Driskill
> Assignee: Ash Berlin-Taylor
> Priority: Major
>
> We have been running Airflow successfully for the past few months. However, on the morning of 11/27 the scheduler hung for an unknown reason. After we restarted it, it continued to hang every 30-45 minutes. We have temporarily put a health check in place that restarts it at that interval, but the scheduler remains unreliable.
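A watchdog like the one described above can poll the webserver's /health endpoint (available since Airflow 1.10.2) and restart the scheduler when its status goes unhealthy. A minimal sketch, assuming the webserver listens on localhost:8080 (the URL and function names here are illustrative, not from the ticket):

```python
import json
import urllib.request

# Assumed webserver address; adjust for the actual deployment.
HEALTH_URL = "http://localhost:8080/health"

def parse_health(payload):
    # payload is the decoded JSON body of GET /health, e.g.
    # {"metadatabase": {"status": "healthy"},
    #  "scheduler": {"status": "healthy", "latest_scheduler_heartbeat": "..."}}
    return payload.get("scheduler", {}).get("status") == "healthy"

def scheduler_healthy(url=HEALTH_URL, timeout=10):
    # Returns True only if the webserver reports a healthy scheduler heartbeat.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_health(json.load(resp))
```

A cron job or supervisor hook could call `scheduler_healthy()` and bounce the scheduler process on a False result.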
> The only logs during the hang are the following three lines, repeated over and over (I assume these come from another thread, not the hung one):
> {code:java}
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
> 17:56:44[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
> {code}
> The last logs before that loop were the following:
> {code:java}
> 17:56:38[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
> 17:56:39[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
> 17:56:39[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
> {code}
> Running py-spy against the hung scheduler process shows it blocked in the following place:
> {code:java}
> Thread 4566 (idle): "MainThread"
> connect (psycopg2/__init__.py:130)
> connect (sqlalchemy/engine/default.py:482)
> connect (sqlalchemy/engine/strategies.py:114)
> __connect (sqlalchemy/pool/base.py:639)
> __init__ (sqlalchemy/pool/base.py:437)
> _create_connection (sqlalchemy/pool/base.py:308)
> _do_get (sqlalchemy/pool/impl.py:136)
> checkout (sqlalchemy/pool/base.py:492)
> _checkout (sqlalchemy/pool/base.py:760)
> connect (sqlalchemy/pool/base.py:363)
> _wrap_pool_connect (sqlalchemy/engine/base.py:2276)
> _contextual_connect (sqlalchemy/engine/base.py:2242)
> _optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)
> __enter__ (contextlib.py:112)
> _run_visitor (sqlalchemy/engine/base.py:2048)
> create_all (sqlalchemy/sql/schema.py:4316)
> prepare_models (celery/backends/database/session.py:54)
> session_factory (celery/backends/database/session.py:59)
> ResultSession (celery/backends/database/__init__.py:99)
> _get_task_meta_for (celery/backends/database/__init__.py:122)
> _inner (celery/backends/database/__init__.py:53)
> get_task_meta (celery/backends/base.py:386)
> _get_task_meta (celery/result.py:412)
> state (celery/result.py:473)
> fetch_celery_task_state (airflow/executors/celery_executor.py:106)
> mapstar (multiprocessing/pool.py:44)
> worker (multiprocessing/pool.py:121)
> run (multiprocessing/process.py:99)
> _bootstrap (multiprocessing/process.py:297)
> _launch (multiprocessing/popen_fork.py:74)
> __init__ (multiprocessing/popen_fork.py:20)
> _Popen (multiprocessing/context.py:277)
> start (multiprocessing/process.py:112)
> _repopulate_pool (multiprocessing/pool.py:241)
> __init__ (multiprocessing/pool.py:176)
> Pool (multiprocessing/context.py:119)
> sync (airflow/executors/celery_executor.py:245)
> heartbeat (airflow/executors/base_executor.py:136)
> _execute_helper (airflow/jobs/scheduler_job.py:1445)
> _execute (airflow/jobs/scheduler_job.py:1356)
> run (airflow/jobs/base_job.py:222)
> scheduler (airflow/bin/cli.py:1042)
> wrapper (airflow/utils/cli.py:74)
> <module> (airflow:37)
> {code}
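Every frame below fetch_celery_task_state in that trace runs inside a worker that multiprocessing.Pool had just forked, so the psycopg2 connect() is happening in a child that inherited the parent's memory, including any locks or sockets held at fork time, which is a classic recipe for a hang. Whatever fix this ticket ultimately adopted, one general way to sidestep fork-inheritance hazards is the "spawn" start method, which launches each worker from a fresh interpreter. A minimal sketch (the `probe`/`poll_states` names are illustrative, not Airflow code):

```python
import multiprocessing

def probe(i):
    # Stand-in for per-task work such as polling a Celery task's state.
    return i * 2

def poll_states(n):
    # "spawn" starts each worker from a fresh interpreter, so children
    # cannot inherit locked mutexes or open database sockets from the
    # parent the way fork()ed Pool workers do.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(probe, range(n))

if __name__ == "__main__":
    print(poll_states(4))  # [0, 2, 4, 6]
```

The trade-off is that spawn is slower per worker than fork, since each child re-imports the parent module instead of sharing its memory.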
> We are using the CeleryExecutor with Postgres as the Celery result_backend.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)