You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Roland de Boo (Jira)" <ji...@apache.org> on 2019/08/20 08:30:00 UTC

[jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911106#comment-16911106 ] 

Roland de Boo commented on AIRFLOW-401:
---------------------------------------

Airflow 1.10.4, running on Kubernetes (Azure AKS).

Scheduler gets stuck, portal shows message "{{The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago."}}

{{py-spy output:}}

 
{code:java}

Total Samples 40100
GIL: 0.00%, Active: 100.00%, Threads: 1  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   29.57s    29.57s   poll (multiprocessing/popen_fork.py:28)
  0.00%   0.00%   0.000s    0.030s   _decref (multiprocessing/managers.py:809)
  0.00%   0.00%   0.000s    0.040s   one_or_none (sqlalchemy/orm/query.py:3261)
  0.00%   0.00%   0.000s    0.030s   answer_challenge (multiprocessing/connection.py:732)
  0.00%   0.00%   371.3s    371.3s   recv_into (OpenSSL/SSL.py:1821)
  0.00%   0.00%   0.000s    0.020s   __missing__ (sqlalchemy/orm/path_registry.py:303)
  0.00%   0.00%   0.000s    0.030s   wait_procs (psutil/__init__.py:1667)
  0.00%   0.00%   0.000s    0.020s   _emit_update_statements (sqlalchemy/orm/persistence.py:996)
  0.00%   0.00%   0.000s    0.020s   setup (sqlalchemy/orm/interfaces.py:547)
  0.00%   0.00%   0.000s    0.020s   save_obj (sqlalchemy/orm/persistence.py:236)
  0.00%   0.00%   0.000s    0.020s   flush (sqlalchemy/orm/session.py:2459)
  0.00%   0.00%   0.000s    0.030s   recv_bytes (multiprocessing/connection.py:216)
  0.00%   0.00%   0.000s    371.3s   request (kubernetes/client/rest.py:166)
  0.00%   0.00%   0.000s    0.010s   __iter__ (sqlalchemy/orm/query.py:3334)
  0.00%   0.00%   0.000s    0.010s   connection (sqlalchemy/orm/session.py:1124)
  0.00%   0.00%   0.000s    0.010s   _contextual_connect (sqlalchemy/engine/base.py:2231)
  0.00%   0.00%   0.000s    0.010s   _connection_from_session (sqlalchemy/orm/query.py:3349)
  0.00%   0.00%   0.000s    0.020s   execute (sqlalchemy/orm/unitofwork.py:422)
  0.00%   0.00%   0.000s    0.010s   ping_connection (airflow/utils/sqlalchemy.py:70)
  0.00%   0.00%   0.000s    371.3s   POST (kubernetes/client/rest.py:266)
  0.00%   0.00%   0.000s    0.010s   __call__ (sqlalchemy/event/attr.py:297)
  0.00%   0.00%   0.000s    0.010s   __init__ (sqlalchemy/engine/base.py:125)
  0.00%   0.00%   0.000s    0.010s   on_terminate (airflow/utils/helpers.py:297)
  0.00%   0.00%   0.000s    0.030s   end (airflow/utils/dag_processing.py:691)
  0.00%   0.00%   0.000s    0.020s   wait (psutil/__init__.py:1384)
  0.00%   0.00%   0.000s    0.010s   check_gone (psutil/__init__.py:1640)
  0.00%   0.00%   0.000s    371.3s   __call_api (kubernetes/client/api_client.py:168)
  0.00%   0.00%   0.000s    0.010s   makeRecord (logging/__init__.py:1413)
  0.00%   0.00%   0.000s    0.020s   commit (sqlalchemy/orm/session.py:494)
  0.00%   0.00%   0.000s    371.3s   begin (http/client.py:307)
  0.00%   0.00%   0.000s    0.020s   check_gone (psutil/__init__.py:1632)
  0.00%   0.00%   0.000s    0.020s   wait (psutil/_pslinux.py:1719)
  0.00%   0.00%   0.000s    0.030s   Client (multiprocessing/connection.py:493)
  0.00%   0.00%   0.010s    0.010s   do_commit (sqlalchemy/engine/default.py:505)
  0.00%   0.00%   0.000s    0.010s   _connection_for_bind (sqlalchemy/orm/session.py:1130)
  0.00%   0.00%   0.000s    0.020s   _get_context_loader (sqlalchemy/orm/interfaces.py:521)
  0.00%   0.00%   0.000s    0.010s   _get_bind_args (sqlalchemy/orm/query.py:3371)
  0.00%   0.00%   0.020s    0.020s   check_timeout (psutil/_psposix.py:71)
  0.00%   0.00%   0.000s    0.010s   _do_commit (sqlalchemy/engine/base.py:1747)
  0.00%   0.00%   0.030s    0.030s   _recv (multiprocessing/connection.py:379)
  0.00%   0.00%   0.000s    0.040s   load_on_pk_identity (sqlalchemy/orm/loading.py:282)
  0.00%   0.00%   0.000s    0.010s   _commit_impl (sqlalchemy/engine/base.py:761)
  0.00%   0.00%   0.000s    0.010s   commit (sqlalchemy/orm/session.py:498)
  0.00%   0.00%   0.000s    371.3s   readinto (socket.py:586)
 {code}
 

> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors, scheduler
>    Affects Versions: 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>              Labels: celery, kombu
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of scheduler service is at 100%. No jobs get submitted and everything comes to a halt. Looks it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both Celery and Local executors but same issue occurs. I am using the -n 3 parameter while starting scheduler. 
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed



--
This message was sent by Atlassian Jira
(v8.3.2#803003)