Posted to commits@airflow.apache.org by "Neil Calabroso (Jira)" <ji...@apache.org> on 2019/09/18 10:29:00 UTC

[jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932286#comment-16932286 ] 

Neil Calabroso commented on AIRFLOW-401:
----------------------------------------

Currently experiencing this issue on `Ubuntu 14.04` with `python 3.6.8`. It started when we upgraded our staging environment from `1.10.1` to `1.10.4`. We're using `LocalExecutor`, and the scheduler process is managed by upstart.

I'm also seeing this warning in the Web UI: "The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago."

For this sample, I got 3 stuck processes:

 
{code:java}
root@airflow-staging/home/ubuntu# ps aux | grep scheduler
airflow  21595  0.2  1.3 469868 109976 ?       S    09:52   0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21602  0.0  1.1 1500268 95992 ?       Tl   09:52   0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21648  0.0  1.1 467796 94628 ?        S    09:52   0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
root     25735  0.0  0.0  10472   920 pts/3    S+   10:24   0:00 grep --color=auto scheduler
{code}
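For reading the STAT column in the `ps aux` listing above, here is a minimal sketch of a decoder. The state and flag meanings are taken from the "PROCESS STATE CODES" section of the Linux ps(1) man page; the helper names `parse_ps_line` and `describe_stat` are hypothetical and exist only for this illustration.

```python
# Meanings copied from the Linux ps(1) man page, "PROCESS STATE CODES".
STATE = {
    "D": "uninterruptible sleep (usually I/O)",
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "T": "stopped by job control signal",
    "t": "stopped by debugger during tracing",
    "Z": "defunct (zombie)",
}
FLAGS = {
    "<": "high priority",
    "N": "low priority",
    "L": "pages locked in memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def parse_ps_line(line):
    """Return (pid, stat, command) from one data line of `ps aux` output.

    Column order assumed: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
    """
    fields = line.split(None, 10)  # the 11th field keeps the full command
    return int(fields[1]), fields[7], fields[10]


def describe_stat(stat):
    """Turn a STAT value such as 'Tl' into a readable description."""
    state = STATE.get(stat[0], "unknown state")
    flags = [FLAGS.get(ch, "unknown flag") for ch in stat[1:]]
    return ", ".join([state] + flags)


if __name__ == "__main__":
    line = ("airflow  21602  0.0  1.1 1500268 95992 ?       Tl   09:52   0:00 "
            "/usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5")
    pid, stat, command = parse_ps_line(line)
    print(pid, stat, describe_stat(stat))
```

Applied to the listing, PIDs 21595 and 21648 are in interruptible sleep (`S`), while PID 21602 is stopped and multi-threaded (`Tl`). The decoder only describes the state; it does not say why the process was stopped.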
 

Running py-spy on each process gives:

 
{code:java}
Collecting samples from 'pid: 21595' (python v3.6.8)
Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1
  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%    5.00s     5.00s   _recv (multiprocessing/connection.py:379)
  0.00% 100.00%   0.000s     5.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s     5.00s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s     5.00s   end (airflow/executors/local_executor.py:233)
  0.00% 100.00%   0.000s     5.00s   <module> (airflow:32)
  0.00% 100.00%   0.000s     5.00s   recv (multiprocessing/connection.py:250)
  0.00% 100.00%   0.000s     5.00s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s     5.00s   end (airflow/executors/local_executor.py:212)
  0.00% 100.00%   0.000s     5.00s   _callmethod (multiprocessing/managers.py:757)
  0.00% 100.00%   0.000s     5.00s   join (<string>:2)
  0.00% 100.00%   0.000s     5.00s   _recv_bytes (multiprocessing/connection.py:407)
  0.00% 100.00%   0.000s     5.00s   _execute_helper (airflow/jobs/scheduler_job.py:1463)
  0.00% 100.00%   0.000s     5.00s   run (airflow/jobs/base_job.py:213)
{code}
 
{code:java}
root@airflow-staging:/home/ubuntu# py-spy --pid 21602
Error: Failed to suspend process
Reason: EPERM: Operation not permitted
{code}
 
{code:java}
Collecting samples from 'pid: 21648' (python v3.6.8)
Total Samples 28381
GIL: 0.00%, Active: 100.00%, Threads: 1
  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   283.8s    283.8s   _try_wait (subprocess.py:1424)
  0.00% 100.00%   0.000s    283.8s   call (subprocess.py:289)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:184)
  0.00% 100.00%   0.000s    283.8s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    283.8s   _bootstrap (multiprocessing/process.py:258)
  0.00% 100.00%   0.000s    283.8s   _execute_helper (airflow/jobs/scheduler_job.py:1347)
  0.00% 100.00%   0.000s    283.8s   execute_work (airflow/executors/local_executor.py:86)
  0.00% 100.00%   0.000s    283.8s   <module> (airflow:32)
  0.00% 100.00%   0.000s    283.8s   _launch (multiprocessing/popen_fork.py:73)
  0.00% 100.00%   0.000s    283.8s   run (airflow/jobs/base_job.py:213)
  0.00% 100.00%   0.000s    283.8s   check_call (subprocess.py:306)
  0.00% 100.00%   0.000s    283.8s   start (multiprocessing/process.py:105)
  0.00% 100.00%   0.000s    283.8s   run (airflow/executors/local_executor.py:116)
  0.00% 100.00%   0.000s    283.8s   wait (subprocess.py:1477)
  0.00% 100.00%   0.000s    283.8s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:277)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:223)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:224)
  0.00% 100.00%   0.000s    283.8s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s    283.8s   __init__ (multiprocessing/popen_fork.py:19)
{code}
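Taken together, the two traces suggest a pattern where the parent waits in a queue `join()` (PID 21595, under `end` in `local_executor.py`) while a worker waits on a child process in `subprocess.check_call` (PID 21648). The sketch below illustrates only that general pattern and is not Airflow's code: it swaps in `queue.Queue` and threads from the standard library for the multiprocessing manager queue seen in the trace, and a `threading.Event` stands in for the subprocess that never returns. All function names are hypothetical.

```python
import queue
import threading


def worker(tasks, release):
    """Take one task, block until released, then mark the task done."""
    tasks.get()
    release.wait()      # stand-in for a subprocess call that does not return
    tasks.task_done()   # join() can only return after this runs


def join_finishes_within(tasks, seconds):
    """Return True if tasks.join() returns within `seconds`."""
    joiner = threading.Thread(target=tasks.join, daemon=True)
    joiner.start()
    joiner.join(timeout=seconds)
    return not joiner.is_alive()


def demo():
    tasks = queue.Queue()
    release = threading.Event()
    tasks.put("task-1")
    threading.Thread(target=worker, args=(tasks, release), daemon=True).start()

    # While the worker is stuck, join() stays blocked and raises no error.
    blocked = not join_finishes_within(tasks, 0.2)

    # Once the worker is released and calls task_done(), join() returns.
    release.set()
    finished = join_finishes_within(tasks, 2.0)
    return blocked, finished


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that `join()` has no timeout of its own: if a worker never marks its task done, the caller waits indefinitely and nothing is logged, which would be consistent with a scheduler that appears stuck without a trace.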
 

We will try to downgrade to `1.10.3` first and see if this problem persists.

 

> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors, scheduler
>    Affects Versions: 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>              Labels: celery, kombu
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service. But after running some tasks, it gets stuck again. I've tried both the Celery and Local executors, but the same issue occurs. I am using the -n 3 parameter when starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)