You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Roland de Boo (Jira)" <ji...@apache.org> on 2019/08/20 08:31:00 UTC
[jira] [Comment Edited] (AIRFLOW-401) scheduler gets stuck without
a trace
[ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911106#comment-16911106 ]
Roland de Boo edited comment on AIRFLOW-401 at 8/20/19 8:30 AM:
----------------------------------------------------------------
Airflow 1.10.4, running on Kubernetes (Azure AKS), using KubernetesExecutor.
Scheduler gets stuck, portal shows message "{{The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago."}}
{{py-spy output:}}
{code:java}
Total Samples 40100
GIL: 0.00%, Active: 100.00%, Threads: 1 %Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 29.57s 29.57s poll (multiprocessing/popen_fork.py:28)
0.00% 0.00% 0.000s 0.030s _decref (multiprocessing/managers.py:809)
0.00% 0.00% 0.000s 0.040s one_or_none (sqlalchemy/orm/query.py:3261)
0.00% 0.00% 0.000s 0.030s answer_challenge (multiprocessing/connection.py:732)
0.00% 0.00% 371.3s 371.3s recv_into (OpenSSL/SSL.py:1821)
0.00% 0.00% 0.000s 0.020s __missing__ (sqlalchemy/orm/path_registry.py:303)
0.00% 0.00% 0.000s 0.030s wait_procs (psutil/__init__.py:1667)
0.00% 0.00% 0.000s 0.020s _emit_update_statements (sqlalchemy/orm/persistence.py:996)
0.00% 0.00% 0.000s 0.020s setup (sqlalchemy/orm/interfaces.py:547)
0.00% 0.00% 0.000s 0.020s save_obj (sqlalchemy/orm/persistence.py:236)
0.00% 0.00% 0.000s 0.020s flush (sqlalchemy/orm/session.py:2459)
0.00% 0.00% 0.000s 0.030s recv_bytes (multiprocessing/connection.py:216)
0.00% 0.00% 0.000s 371.3s request (kubernetes/client/rest.py:166)
0.00% 0.00% 0.000s 0.010s __iter__ (sqlalchemy/orm/query.py:3334)
0.00% 0.00% 0.000s 0.010s connection (sqlalchemy/orm/session.py:1124)
0.00% 0.00% 0.000s 0.010s _contextual_connect (sqlalchemy/engine/base.py:2231)
0.00% 0.00% 0.000s 0.010s _connection_from_session (sqlalchemy/orm/query.py:3349)
0.00% 0.00% 0.000s 0.020s execute (sqlalchemy/orm/unitofwork.py:422)
0.00% 0.00% 0.000s 0.010s ping_connection (airflow/utils/sqlalchemy.py:70)
0.00% 0.00% 0.000s 371.3s POST (kubernetes/client/rest.py:266)
0.00% 0.00% 0.000s 0.010s __call__ (sqlalchemy/event/attr.py:297)
0.00% 0.00% 0.000s 0.010s __init__ (sqlalchemy/engine/base.py:125)
0.00% 0.00% 0.000s 0.010s on_terminate (airflow/utils/helpers.py:297)
0.00% 0.00% 0.000s 0.030s end (airflow/utils/dag_processing.py:691)
0.00% 0.00% 0.000s 0.020s wait (psutil/__init__.py:1384)
0.00% 0.00% 0.000s 0.010s check_gone (psutil/__init__.py:1640)
0.00% 0.00% 0.000s 371.3s __call_api (kubernetes/client/api_client.py:168)
0.00% 0.00% 0.000s 0.010s makeRecord (logging/__init__.py:1413)
0.00% 0.00% 0.000s 0.020s commit (sqlalchemy/orm/session.py:494)
0.00% 0.00% 0.000s 371.3s begin (http/client.py:307)
0.00% 0.00% 0.000s 0.020s check_gone (psutil/__init__.py:1632)
0.00% 0.00% 0.000s 0.020s wait (psutil/_pslinux.py:1719)
0.00% 0.00% 0.000s 0.030s Client (multiprocessing/connection.py:493)
0.00% 0.00% 0.010s 0.010s do_commit (sqlalchemy/engine/default.py:505)
0.00% 0.00% 0.000s 0.010s _connection_for_bind (sqlalchemy/orm/session.py:1130)
0.00% 0.00% 0.000s 0.020s _get_context_loader (sqlalchemy/orm/interfaces.py:521)
0.00% 0.00% 0.000s 0.010s _get_bind_args (sqlalchemy/orm/query.py:3371)
0.00% 0.00% 0.020s 0.020s check_timeout (psutil/_psposix.py:71)
0.00% 0.00% 0.000s 0.010s _do_commit (sqlalchemy/engine/base.py:1747)
0.00% 0.00% 0.030s 0.030s _recv (multiprocessing/connection.py:379)
0.00% 0.00% 0.000s 0.040s load_on_pk_identity (sqlalchemy/orm/loading.py:282)
0.00% 0.00% 0.000s 0.010s _commit_impl (sqlalchemy/engine/base.py:761)
0.00% 0.00% 0.000s 0.010s commit (sqlalchemy/orm/session.py:498)
0.00% 0.00% 0.000s 371.3s readinto (socket.py:586)
{code}
was (Author: rdeboo):
Airflow 1.10.4, running on Kubernetes (Azure AKS).
Scheduler gets stuck, portal shows message "{{The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago."}}
{{py-spy output:}}
{code:java}
Total Samples 40100
GIL: 0.00%, Active: 100.00%, Threads: 1 %Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 29.57s 29.57s poll (multiprocessing/popen_fork.py:28)
0.00% 0.00% 0.000s 0.030s _decref (multiprocessing/managers.py:809)
0.00% 0.00% 0.000s 0.040s one_or_none (sqlalchemy/orm/query.py:3261)
0.00% 0.00% 0.000s 0.030s answer_challenge (multiprocessing/connection.py:732)
0.00% 0.00% 371.3s 371.3s recv_into (OpenSSL/SSL.py:1821)
0.00% 0.00% 0.000s 0.020s __missing__ (sqlalchemy/orm/path_registry.py:303)
0.00% 0.00% 0.000s 0.030s wait_procs (psutil/__init__.py:1667)
0.00% 0.00% 0.000s 0.020s _emit_update_statements (sqlalchemy/orm/persistence.py:996)
0.00% 0.00% 0.000s 0.020s setup (sqlalchemy/orm/interfaces.py:547)
0.00% 0.00% 0.000s 0.020s save_obj (sqlalchemy/orm/persistence.py:236)
0.00% 0.00% 0.000s 0.020s flush (sqlalchemy/orm/session.py:2459)
0.00% 0.00% 0.000s 0.030s recv_bytes (multiprocessing/connection.py:216)
0.00% 0.00% 0.000s 371.3s request (kubernetes/client/rest.py:166)
0.00% 0.00% 0.000s 0.010s __iter__ (sqlalchemy/orm/query.py:3334)
0.00% 0.00% 0.000s 0.010s connection (sqlalchemy/orm/session.py:1124)
0.00% 0.00% 0.000s 0.010s _contextual_connect (sqlalchemy/engine/base.py:2231)
0.00% 0.00% 0.000s 0.010s _connection_from_session (sqlalchemy/orm/query.py:3349)
0.00% 0.00% 0.000s 0.020s execute (sqlalchemy/orm/unitofwork.py:422)
0.00% 0.00% 0.000s 0.010s ping_connection (airflow/utils/sqlalchemy.py:70)
0.00% 0.00% 0.000s 371.3s POST (kubernetes/client/rest.py:266)
0.00% 0.00% 0.000s 0.010s __call__ (sqlalchemy/event/attr.py:297)
0.00% 0.00% 0.000s 0.010s __init__ (sqlalchemy/engine/base.py:125)
0.00% 0.00% 0.000s 0.010s on_terminate (airflow/utils/helpers.py:297)
0.00% 0.00% 0.000s 0.030s end (airflow/utils/dag_processing.py:691)
0.00% 0.00% 0.000s 0.020s wait (psutil/__init__.py:1384)
0.00% 0.00% 0.000s 0.010s check_gone (psutil/__init__.py:1640)
0.00% 0.00% 0.000s 371.3s __call_api (kubernetes/client/api_client.py:168)
0.00% 0.00% 0.000s 0.010s makeRecord (logging/__init__.py:1413)
0.00% 0.00% 0.000s 0.020s commit (sqlalchemy/orm/session.py:494)
0.00% 0.00% 0.000s 371.3s begin (http/client.py:307)
0.00% 0.00% 0.000s 0.020s check_gone (psutil/__init__.py:1632)
0.00% 0.00% 0.000s 0.020s wait (psutil/_pslinux.py:1719)
0.00% 0.00% 0.000s 0.030s Client (multiprocessing/connection.py:493)
0.00% 0.00% 0.010s 0.010s do_commit (sqlalchemy/engine/default.py:505)
0.00% 0.00% 0.000s 0.010s _connection_for_bind (sqlalchemy/orm/session.py:1130)
0.00% 0.00% 0.000s 0.020s _get_context_loader (sqlalchemy/orm/interfaces.py:521)
0.00% 0.00% 0.000s 0.010s _get_bind_args (sqlalchemy/orm/query.py:3371)
0.00% 0.00% 0.020s 0.020s check_timeout (psutil/_psposix.py:71)
0.00% 0.00% 0.000s 0.010s _do_commit (sqlalchemy/engine/base.py:1747)
0.00% 0.00% 0.030s 0.030s _recv (multiprocessing/connection.py:379)
0.00% 0.00% 0.000s 0.040s load_on_pk_identity (sqlalchemy/orm/loading.py:282)
0.00% 0.00% 0.000s 0.010s _commit_impl (sqlalchemy/engine/base.py:761)
0.00% 0.00% 0.000s 0.010s commit (sqlalchemy/orm/session.py:498)
0.00% 0.00% 0.000s 371.3s readinto (socket.py:586)
{code}
> scheduler gets stuck without a trace
> ------------------------------------
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
> Issue Type: Bug
> Components: executors, scheduler
> Affects Versions: 1.7.1.3
> Reporter: Nadeem Ahmed Nazeer
> Assignee: Bolke de Bruin
> Priority: Minor
> Labels: celery, kombu
> Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of scheduler service is at 100%. No jobs get submitted and everything comes to a halt. Looks it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both Celery and Local executors but same issue occurs. I am using the -n 3 parameter while starting scheduler.
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed
--
This message was sent by Atlassian Jira
(v8.3.2#803003)