Posted to commits@airflow.apache.org by "Johannes Kaufmann (Jira)" <ji...@apache.org> on 2019/09/04 06:55:00 UTC

[jira] [Commented] (AIRFLOW-4499) scheduler process running (in ps) but not doing anything, not writing to log for 3+hrs and not processing tasks

    [ https://issues.apache.org/jira/browse/AIRFLOW-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922191#comment-16922191 ] 

Johannes Kaufmann commented on AIRFLOW-4499:
--------------------------------------------

We are also facing this issue. It's not the same as AIRFLOW-401 either, since CPU load is 0, so it looks more like the scheduler process is stuck in a zombie-like state.

 

We set -r 1800, but the scheduler periodically fails to restart.
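(For context: in Airflow 1.10 the scheduler's -r/--run-duration option makes the process exit after the given number of seconds so that an external supervisor can restart it. A minimal supervision loop might look like the sketch below; the function name and parameters are illustrative, not an Airflow API. Note that such a loop only helps when the scheduler actually exits, which is exactly what fails in the hang described here.)

```python
import subprocess
import time


def supervise(cmd, run_duration_secs=1800, max_restarts=None, pause_secs=5.0):
    """Re-run `cmd` with -r/--run-duration every time it exits.

    NOTE: this only restarts a scheduler that *exits*; a hung process
    that never returns (the failure mode above) is not detected.
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        # e.g. runs: airflow scheduler -r 1800
        subprocess.run(list(cmd) + ["-r", str(run_duration_secs)])
        restarts += 1
        # Brief pause to avoid a tight crash loop if startup fails.
        time.sleep(pause_secs)
    return restarts


# Usage (illustrative):
#   supervise(["airflow", "scheduler"])
```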

Looking through the syslogs yields the following pattern (I removed the overly verbose beginning of each log line and added some comments on its structure):

 

{{[2019-08-29 03:50:55,196] \{jobs.py:406} INFO - Processing /path/to/airflow/dags/all_purpose_dag.py took 53.097 seconds
[2019-08-29 03:50:55,200] \{settings.py:206} DEBUG - Disposing DB connection pool (PID 19287)

# Start of repeated pattern until we restart the server.
[2019-08-29 03:50:55,375] \{jobs.py:1573} DEBUG - Starting Loop...
[2019-08-29 03:50:55,376] \{jobs.py:1584} DEBUG - Harvesting DAG parsing results
[2019-08-29 03:50:55,376] \{jobs.py:1586} DEBUG - Harvested 0 SimpleDAGs
[2019-08-29 03:50:55,376] \{jobs.py:1621} DEBUG - Heartbeating the executor
[2019-08-29 03:50:55,376] \{base_executor.py:124} DEBUG - 0 running task instances
[2019-08-29 03:50:55,376] \{base_executor.py:125} DEBUG - 0 in queue
[2019-08-29 03:50:55,376] \{base_executor.py:126} DEBUG - 120 open slots
[2019-08-29 03:50:55,376] \{base_executor.py:146} DEBUG - Calling the <class 'airflow.executors.local_executor.LocalExecutor'> sync method
[2019-08-29 03:50:55,377] \{jobs.py:1642} DEBUG - Ran scheduling loop in 0.00 seconds
[2019-08-29 03:50:55,377] \{jobs.py:1645} DEBUG - Sleeping for 1.00 seconds
[2019-08-29 03:50:56,378] \{jobs.py:1663} DEBUG - Sleeping for 1.00 seconds to prevent excessive logging
# End of repeated pattern until we restart the server.

# Here is the (slightly different) pattern again.
[2019-08-29 03:50:57,379] \{jobs.py:1573} DEBUG - Starting Loop...
[2019-08-29 03:50:57,379] \{jobs.py:1584} DEBUG - Harvesting DAG parsing results
[2019-08-29 03:50:57,380] \{jobs.py:1586} DEBUG - Harvested 0 SimpleDAGs
[2019-08-29 03:50:57,380] \{jobs.py:1621} DEBUG - Heartbeating the executor
[2019-08-29 03:50:57,380] \{base_executor.py:124} DEBUG - 0 running task instances
[2019-08-29 03:50:57,380] \{base_executor.py:125} DEBUG - 0 in queue
[2019-08-29 03:50:57,380] \{base_executor.py:126} DEBUG - 120 open slots
[2019-08-29 03:50:57,380] \{base_executor.py:146} DEBUG - Calling the <class 'airflow.executors.local_executor.LocalExecutor'> sync method
[2019-08-29 03:50:57,380] \{jobs.py:1633} DEBUG - Heartbeating the scheduler
[2019-08-29 03:50:57,391] \{jobs.py:193} DEBUG - [heartbeat]}}

> scheduler process running (in ps) but not doing anything, not writing to log for 3+hrs and not processing tasks
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4499
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4499
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.3
>            Reporter: t oo
>            Priority: Critical
>
> Blogs mention this as a long-standing issue, but I could not find an open JIRA for it.
> scheduler process running (in ps -ef) but not doing anything, not writing to log for 3+hrs and not processing tasks
> Band-aid solution:
> introduce a new config value ---> scheduler_restart_mins = x
> and auto-restart the scheduler process if its log file has not been updated within 2*x mins and the scheduler process started more than x mins ago
> env: LocalExecutor
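The band-aid described in the issue could be sketched as a small external watchdog, run periodically (e.g. from cron). Everything below is an illustrative assumption, not Airflow behavior: the log and pidfile paths, and the systemd restart command.

```python
"""Hypothetical watchdog for the band-aid proposed in the issue above:
restart the scheduler when its log file has gone quiet for more than
2*x minutes AND the process has been alive for more than x minutes.
All paths and the restart command below are assumptions; adapt them."""
import os
import subprocess
import time

SCHEDULER_RESTART_MINS = 30.0  # the proposed "x" config value
LOG_FILE = "/path/to/airflow/logs/scheduler.log"     # assumed location
PID_FILE = "/path/to/airflow/airflow-scheduler.pid"  # assumed location


def minutes_since_mtime(path):
    """Minutes since `path` was last modified."""
    return (time.time() - os.stat(path).st_mtime) / 60.0


def process_age_mins(pid):
    """Minutes since the process started (Linux: mtime of /proc/<pid>)."""
    return minutes_since_mtime("/proc/%d" % pid)


def should_restart(log_file, pid, x_mins):
    """True when the log is stale for over 2*x mins and the process is older than x."""
    return minutes_since_mtime(log_file) > 2 * x_mins and process_age_mins(pid) > x_mins


def watchdog_tick():
    """One check; returns True if a restart was triggered."""
    with open(PID_FILE) as f:
        pid = int(f.read().strip())
    if should_restart(LOG_FILE, pid, SCHEDULER_RESTART_MINS):
        # Assumes the scheduler runs under systemd; adapt to your init system.
        subprocess.run(["systemctl", "restart", "airflow-scheduler"], check=True)
        return True
    return False


if __name__ == "__main__" and os.path.exists(PID_FILE):
    watchdog_tick()  # run one check; schedule this script every minute
```

The process-age check guards against a restart loop: a freshly restarted scheduler has not written logs yet, so without it the watchdog would restart it again immediately.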



--
This message was sent by Atlassian Jira
(v8.3.2#803003)