You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Rick Otten (JIRA)" <ji...@apache.org> on 2017/07/13 14:58:00 UTC

[jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085816#comment-16085816 ] 

Rick Otten commented on AIRFLOW-401:
------------------------------------

Using the LocalExecutor with 1.8rc2, on Ubuntu 16.04,  we are still observing an issue that looks very similar to this.  What it looks like is happening is the scheduler spawns a number of child processes _parallelism = X_ .  When a task runs, it consumes one of these child processes.  When the task finishes (we are mostly using the SSHExecuteOperator and the PostgresOperator in our tasks), the child process is marked as *defunct* by the Operating system.   The Scheduler/LocalExecutor will not reuse that child process.  It is used up.

If there is a break in the tasks to be run, the scheduler restarts itself, which resets all of the defunct child processes back to a ready state.  However, if you have a long running task, mixed with a bunch of short running tasks, the short running tasks will use up all of the available children.  The scheduler then queues all new jobs that come along until that one long running task finishes and the scheduler can restart itself to clear the child pool.

We will try setting up Celery to run our tasks.  Hopefully that will help keep things running when they are expected to run.



> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck_7hours.png, scheduler_stuck.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of scheduler service is at 100%. No jobs get submitted and everything comes to a halt. Looks it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both Celery and Local executors but same issue occurs. I am using the -n 3 parameter while starting scheduler. 
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)