Posted to commits@airflow.apache.org by "Robert DeCaire (Jira)" <ji...@apache.org> on 2019/11/05 14:25:00 UTC

[jira] [Comment Edited] (AIRFLOW-5506) Airflow scheduler stuck

    [ https://issues.apache.org/jira/browse/AIRFLOW-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967563#comment-16967563 ] 

Robert DeCaire edited comment on AIRFLOW-5506 at 11/5/19 2:24 PM:
------------------------------------------------------------------

I ran into this issue, and the cause turned out to be the SLA. It appeared to be repeatedly trying to connect to Postgres and failing each time. We were getting a large number of "unexpected EOF on client connection with an open transaction" errors in the DB logs, in addition to the "Killing PID" messages in the scheduler log. It looks like the scheduler would not start any new tasks until the SLA check resolved successfully, but the failure began midway through a DAG run for no apparent reason and locked the DAG completely. We removed the references to the SLA from the DAG and the problem vanished.

Edit: It is probably worth noting that we weren't doing anything too strenuous. We were running a single DAG with a somewhat complex graph, not triggering a hundred DAG runs at once. This bug completely shut it down after about half of the tasks in the DAG had completed.
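
For reference, the workaround described above (dropping the SLA from the DAG) could be sketched roughly as follows. This is only an illustration, not the DAG from this report: the contents of `default_args` and the helper name `without_sla` are hypothetical. It assumes the SLA was set through `default_args`, which is a common way to pass the operator-level `sla` argument in Airflow 1.10.

```python
from datetime import timedelta


def without_sla(default_args):
    """Return a copy of default_args with the 'sla' key removed.

    Hypothetical helper; the original dict is left untouched.
    """
    return {key: value for key, value in default_args.items() if key != "sla"}


# Hypothetical default_args of the kind passed to DAG(default_args=...).
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),  # the setting removed as a workaround
}

safe_args = without_sla(default_args)
```

`safe_args` would then be passed as `default_args` when constructing the DAG. Any `sla=` set directly on an individual operator would have to be removed separately.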



> Airflow scheduler stuck
> -----------------------
>
>                 Key: AIRFLOW-5506
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5506
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.4, 1.10.5
>            Reporter: t oo
>            Priority: Major
>
> re-post of [https://stackoverflow.com/questions/57713394/airflow-scheduler-stuck] and slack discussion
>  
>  
> I'm testing Airflow, and after I trigger a (seemingly) large number of DAGs at the same time, the scheduler seems to stop scheduling anything and starts killing processes. These are the logs the scheduler prints:
> {{[2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809
> [2019-08-29 11:17:13,544] {scheduler_job.py:214} WARNING - Killing PID 199809
> [2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
> [2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
> [2019-08-29 11:18:15,692] {scheduler_job.py:214} WARNING - Killing PID 5174
> [2019-08-29 11:18:15,693] {scheduler_job.py:214} WARNING - Killing PID 5174
> [2019-08-29 11:18:46,765] {scheduler_job.py:214} WARNING - Killing PID 22410
> [2019-08-29 11:18:46,766] {scheduler_job.py:214} WARNING - Killing PID 22410
> [2019-08-29 11:19:17,845] {scheduler_job.py:214} WARNING - Killing PID 42177
> [2019-08-29 11:19:17,846] {scheduler_job.py:214} WARNING - Killing PID 42177
> ...}}
> I'm using a LocalExecutor with a PostgreSQL backend DB. It seems to happen only after I trigger a large number (>100) of DAGs at about the same time using external triggering, as in:
> {{airflow trigger_dag DAG_NAME}}
> After it finishes killing whatever processes it is killing, it starts executing all of the tasks properly. I don't even know what these processes were, as I can't see them after they are killed...
> Did anyone encounter this kind of behavior? Any idea why that would happen?
>  
>  
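
In the log excerpt quoted above, each PID is reported as killed twice, and a new PID appears roughly every 31 seconds. A small script along these lines could tally that pattern from a scheduler log. It is only a sketch: the regular expression is an assumption based solely on the line format shown in this report, and the function name `summarize_kills` is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# [2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809
KILL_LINE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\].*Killing PID (\d+)"
)


def summarize_kills(lines):
    """Return (kill count per PID, seconds between first kills of successive PIDs)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = KILL_LINE.search(line)
        if not match:
            continue  # ignore anything that is not a "Killing PID" line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        pid = int(match.group(2))
        counts[pid] += 1
        first_seen.setdefault(pid, stamp)
    stamps = sorted(first_seen.values())
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
    return counts, gaps


# The first four lines of the excerpt from this report.
sample = [
    "[2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809",
    "[2019-08-29 11:17:13,544] {scheduler_job.py:214} WARNING - Killing PID 199809",
    "[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992",
    "[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992",
]
counts, gaps = summarize_kills(sample)
```

A steady gap between kills may point at a periodic timeout rather than load, but the script itself only counts and measures; it does not identify which processes were killed or why.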



--
This message was sent by Atlassian Jira
(v8.3.4#803005)