Posted to commits@airflow.apache.org by "Neil Calabroso (Jira)" <ji...@apache.org> on 2019/09/18 10:29:00 UTC
[jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace
[ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932286#comment-16932286 ]
Neil Calabroso commented on AIRFLOW-401:
----------------------------------------
We are currently experiencing this issue on `Ubuntu 14.04` with `python 3.6.8`. It started when we upgraded our staging environment from `1.10.1` to `1.10.4`. We're using `LocalExecutor`, and the process is managed by upstart.
The Web UI also reports the problem: "The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago."
For this sample, I got 3 stuck processes:
{code:java}
root@airflow-staging:/home/ubuntu# ps aux | grep scheduler
airflow 21595 0.2 1.3 469868 109976 ? S 09:52 0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21602 0.0 1.1 1500268 95992 ? Tl 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21648 0.0 1.1 467796 94628 ? S 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
root 25735 0.0 0.0 10472 920 pts/3 S+ 10:24 0:00 grep --color=auto scheduler
{code}
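For readers less familiar with the `STAT` column: per the `ps` man page, `S` is an interruptible sleep, `T` is a stopped process, and a trailing `l` marks a multi-threaded process. A minimal sketch for picking out stopped scheduler processes from output like the above follows. The helper names `parse_ps_line` and `stopped_pids` are hypothetical, and the sample text is copied from the listing above rather than read from a live system.

```python
# Sample lines copied from the `ps aux` listing above (not a live query).
PS_OUTPUT = """\
airflow 21595 0.2 1.3 469868 109976 ? S 09:52 0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21602 0.0 1.1 1500268 95992 ? Tl 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21648 0.0 1.1 467796 94628 ? S 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
"""


def parse_ps_line(line):
    """Split one `ps aux` row into the fields we care about (hypothetical helper)."""
    # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    return {"pid": int(fields[1]), "stat": fields[7], "command": fields[10]}


def stopped_pids(ps_output):
    """Return the PIDs whose STAT starts with 'T' (stopped)."""
    rows = [parse_ps_line(l) for l in ps_output.splitlines() if l.strip()]
    return [row["pid"] for row in rows if row["stat"].startswith("T")]


print(stopped_pids(PS_OUTPUT))  # [21602]
```

In this sample only pid 21602 is in a stopped state; the other two are sleeping.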
Running py-spy on each process gives:
{code:java}
Collecting samples from 'pid: 21595' (python v3.6.8)
Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 5.00s 5.00s _recv (multiprocessing/connection.py:379)
0.00% 100.00% 0.000s 5.00s wrapper (airflow/utils/cli.py:74)
0.00% 100.00% 0.000s 5.00s scheduler (airflow/bin/cli.py:1013)
0.00% 100.00% 0.000s 5.00s end (airflow/executors/local_executor.py:233)
0.00% 100.00% 0.000s 5.00s <module> (airflow:32)
0.00% 100.00% 0.000s 5.00s recv (multiprocessing/connection.py:250)
0.00% 100.00% 0.000s 5.00s _execute (airflow/jobs/scheduler_job.py:1323)
0.00% 100.00% 0.000s 5.00s end (airflow/executors/local_executor.py:212)
0.00% 100.00% 0.000s 5.00s _callmethod (multiprocessing/managers.py:757)
0.00% 100.00% 0.000s 5.00s join (<string>:2)
0.00% 100.00% 0.000s 5.00s _recv_bytes (multiprocessing/connection.py:407)
0.00% 100.00% 0.000s 5.00s _execute_helper (airflow/jobs/scheduler_job.py:1463)
0.00% 100.00% 0.000s 5.00s run (airflow/jobs/base_job.py:213)
{code}
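The trace for pid 21595 shows the process waiting in `join`, called through a `multiprocessing.managers` proxy from `end` in `local_executor.py`. As an illustration only, and not Airflow code, the sketch below uses the standard library `queue.Queue` with threads as a stand-in to show why a `join()` call can wait indefinitely: it returns only after every item taken from the queue has been acknowledged with `task_done()`. The function name `join_finishes` and the half-second wait are assumptions made for the example, and nothing here establishes that this is the cause of the hang reported above.

```python
import queue
import threading


def join_finishes(acknowledge, wait_sec=0.5):
    """Return True if q.join() returns within wait_sec (hypothetical helper)."""
    q = queue.Queue()
    q.put("task")

    def worker():
        q.get()
        if acknowledge:
            # Without this call, the queue's unfinished-task count never
            # reaches zero and join() keeps waiting.
            q.task_done()

    threading.Thread(target=worker, daemon=True).start()

    # Run join() in its own daemon thread so the caller can give up waiting.
    joiner = threading.Thread(target=q.join, daemon=True)
    joiner.start()
    joiner.join(wait_sec)
    return not joiner.is_alive()


print(join_finishes(acknowledge=True))
print(join_finishes(acknowledge=False))
```

When the worker acknowledges its item, `join()` returns promptly; when it does not, the joining thread is still blocked after the wait.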
{code:java}
root@airflow-staging:/home/ubuntu# py-spy --pid 21602
Error: Failed to suspend process
Reason: EPERM: Operation not permitted
{code}
{code:java}
Collecting samples from 'pid: 21648' (python v3.6.8)
Total Samples 28381
GIL: 0.00%, Active: 100.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 283.8s 283.8s _try_wait (subprocess.py:1424)
0.00% 100.00% 0.000s 283.8s call (subprocess.py:289)
0.00% 100.00% 0.000s 283.8s start (airflow/executors/local_executor.py:184)
0.00% 100.00% 0.000s 283.8s wrapper (airflow/utils/cli.py:74)
0.00% 100.00% 0.000s 283.8s _bootstrap (multiprocessing/process.py:258)
0.00% 100.00% 0.000s 283.8s _execute_helper (airflow/jobs/scheduler_job.py:1347)
0.00% 100.00% 0.000s 283.8s execute_work (airflow/executors/local_executor.py:86)
0.00% 100.00% 0.000s 283.8s <module> (airflow:32)
0.00% 100.00% 0.000s 283.8s _launch (multiprocessing/popen_fork.py:73)
0.00% 100.00% 0.000s 283.8s run (airflow/jobs/base_job.py:213)
0.00% 100.00% 0.000s 283.8s check_call (subprocess.py:306)
0.00% 100.00% 0.000s 283.8s start (multiprocessing/process.py:105)
0.00% 100.00% 0.000s 283.8s run (airflow/executors/local_executor.py:116)
0.00% 100.00% 0.000s 283.8s wait (subprocess.py:1477)
0.00% 100.00% 0.000s 283.8s scheduler (airflow/bin/cli.py:1013)
0.00% 100.00% 0.000s 283.8s _Popen (multiprocessing/context.py:277)
0.00% 100.00% 0.000s 283.8s _Popen (multiprocessing/context.py:223)
0.00% 100.00% 0.000s 283.8s start (airflow/executors/local_executor.py:224)
0.00% 100.00% 0.000s 283.8s _execute (airflow/jobs/scheduler_job.py:1323)
0.00% 100.00% 0.000s 283.8s __init__ (multiprocessing/popen_fork.py:19)
{code}
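The trace for pid 21648 shows the worker blocked in `subprocess.check_call`, which waits until the child process exits. As a hedged sketch of how such a wait can be bounded in general, the standard library accepts a `timeout` argument on `check_call`. This is not what Airflow does and not a proposed fix; the function name `run_with_timeout` and the returned status strings are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd and report 'ok', 'failed', or 'timeout' (hypothetical helper)."""
    try:
        # check_call blocks until the child exits; with a timeout it kills
        # the child and raises TimeoutExpired once the limit is exceeded.
        subprocess.check_call(cmd, timeout=timeout_sec)
        return "ok"
    except subprocess.TimeoutExpired:
        return "timeout"
    except subprocess.CalledProcessError:
        # The child exited on its own, but with a non-zero status.
        return "failed"


# A child that sleeps far longer than the allowed wait.
slow_child = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(slow_child, timeout_sec=0.5))
```

Without the `timeout` argument, the same call would wait for as long as the child stays alive, which matches the ever-growing `TotalTime` in the trace above.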
We will try to downgrade to `1.10.3` first and see if this problem persists.
> scheduler gets stuck without a trace
> ------------------------------------
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
> Issue Type: Bug
> Components: executors, scheduler
> Affects Versions: 1.7.1.3
> Reporter: Nadeem Ahmed Nazeer
> Assignee: Bolke de Bruin
> Priority: Minor
> Labels: celery, kombu
> Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks, it gets stuck. I've tried both the Celery and Local executors, but the same issue occurs. I am using the -n 3 parameter when starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.