Posted to commits@airflow.apache.org by "Leonid Evdokimov (JIRA)" <ji...@apache.org> on 2016/12/08 14:19:58 UTC

[jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732334#comment-15732334 ] 

Leonid Evdokimov commented on AIRFLOW-401:
------------------------------------------

I've seen a similar problem with CeleryExecutor on 1.7.1.3 as well. The observable symptoms were as follows (a rough way to verify them from the shell is sketched after the list):

* There is a process tree with one parent {{airflow scheduler -n 5}} process and two child subprocesses
* The parent process is using 100% CPU, spinning in a {{waitpid(WNOHANG)}} busy-loop
* Both children are stuck on {{recvfrom(<rabbitmq-socket>)}}
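A minimal sketch of how I'd verify these symptoms from the shell (the PIDs are placeholders, strace availability is assumed, and on x86_64 {{waitpid}} shows up as the {{wait4}} syscall):

{code}
# Find the scheduler process tree (one parent, two children)
ps -ef --forest | grep '[a]irflow scheduler'

# Attach to the parent; a tight stream of wait4(..., WNOHANG) calls
# returning 0 matches the busy-loop symptom above
sudo strace -p <parent-pid> -e trace=wait4

# Attach to a child; if it sits indefinitely in recvfrom() on the
# RabbitMQ socket, it matches the stuck-child symptom
sudo strace -p <child-pid> -e trace=recvfrom
{code}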

I used the {{puckel/docker-airflow:1.7.1.3-5}} Docker image to reproduce the bug; the bug _MAY_ be specific to some version of the celery / rabbitmq library.

I'm not 100% sure, but it seems to me that there are two different bugs with similar symptoms in 1.7.1.3.

There is a possible workaround for the bug when running the CeleryExecutor: run the scheduler (and *only* the scheduler) with a strict CPU time limit, so that it gets terminated once it enters the busy-loop. Running {{prlimit --cpu=35:40 -- airflow scheduler -n 10}} solved the issue for me.
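A minimal sketch of wrapping that command in a self-restarting loop (the CPU limits, the {{-n 10}} value, and the restart delay are just examples; a process supervisor such as systemd or supervisord could enforce the same limit instead):

{code}
#!/bin/sh
# Keep restarting the scheduler; prlimit delivers SIGXCPU after 35s of
# CPU time and SIGKILL after 40s, so the busy-looping process is killed.
while true; do
    prlimit --cpu=35:40 -- airflow scheduler -n 10
    sleep 5   # small pause before restarting
done
{code}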

I don't know what sort of inconsistencies may be triggered by killing the scheduler in the middle of execution, so use the recipe at your own risk :)

> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both the Celery and Local executors, but the same issue occurs. I am using the -n 3 parameter while starting the scheduler. 
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)