Posted to commits@airflow.apache.org by "Aizhamal Nurmamat kyzy (JIRA)" <ji...@apache.org> on 2019/05/18 03:14:04 UTC

[jira] [Updated] (AIRFLOW-3037) RBAC webserver high CPU usage

     [ https://issues.apache.org/jira/browse/AIRFLOW-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aizhamal Nurmamat kyzy updated AIRFLOW-3037:
--------------------------------------------
         Labels: performance usability webapp  (was: performance usability)
    Component/s:     (was: webserver)
                     (was: webapp)
                 ui

Moving to ui component as part of component refactor. Webapp component will be removed.

> RBAC webserver high CPU usage
> -----------------------------
>
>                 Key: AIRFLOW-3037
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3037
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: ui
>    Affects Versions: 1.10.0
>         Environment: Digital Ocean standard droplet with 1 vCPU, 2gb ram and ubuntu 16.04.3
>            Reporter: kkkkk
>            Priority: Major
>              Labels: performance, usability, webapp
>         Attachments: airflow_webserver_rbac.log
>
>
> The startup process of the RBAC web server workers is doing something very CPU-intensive.
> This makes the web server start up very slowly and causes persistently high CPU usage on low-powered servers.
> For example, on my Digital Ocean standard droplet with 1 vCPU and 2 GB of RAM, the web server (in the default config) took up to 15 min to start.
> The problem is that in the default config Airflow starts 4 workers with a timeout of 120 seconds and refreshes them (starts a new one & kills the oldest one) every 30 seconds.
> If you run just one worker, it takes about 1 minute to start on that server, but if you run four at a time, there aren't enough resources for the workers to start within web_server_worker_timeout. The workers are killed and the cycle repeats until the workers eventually manage to start in time (which for me took between 10 and 20 min).
> Once the workers are running, the CPU load goes down.
> But then Airflow starts refreshing workers every 30 s, which is shorter than the time it takes for a worker to fully start and for the CPU load to drop again.
> This causes a persistent high CPU load.
> Of course, this can be worked around by configuring the webserver appropriately:
>  
> {code}
> workers = 2
> web_server_worker_timeout = 300
> worker_refresh_interval = 1800
> {code}
>  
> I wasn't able to figure out what exactly causes the high CPU load. It may be a bug or it may be unavoidable, but you may want to consider reducing the load so that the new RBAC webserver runs as efficiently in the default config on lower-end systems as the old web server did.
> At the very least, consider explaining this in the RBAC documentation and pointing people to these settings.
> I have attached a log of what is happening, but here are the highlights:
> {quote}[2018-09-09 09:05:02,201] \{{models.py:491}} DEBUG - Loaded DAG <DAG: redacted_dag_name4>
> Running the Gunicorn Server with:
> Workers: 4 sync
> Host: 0.0.0.0:8080
> Timeout: 120
> Logfiles: - -
> =================================================================
> [...]
> [2018-09-09 09:05:26 +0000] [398] [INFO] Starting gunicorn 19.9.0
> [2018-09-09 09:05:26 +0000] [398] [INFO] Listening at: http://0.0.0.0:8080 (398)
> [2018-09-09 09:05:26 +0000] [398] [INFO] Using worker: sync
> [2018-09-09 09:05:26 +0000] [403] [INFO] Booting worker with pid: 403
> [2018-09-09 09:05:26 +0000] [404] [INFO] Booting worker with pid: 404
> [2018-09-09 09:05:26 +0000] [405] [INFO] Booting worker with pid: 405
> [2018-09-09 09:05:26 +0000] [406] [INFO] Booting worker with pid: 406
> [2018-09-09 09:05:27,388] \{{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
> [2018-09-09 09:05:28,951] \{{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
> *[... Repeats many times for 2 min]*
> [2018-09-09 09:07:25,967] \{{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
> [2018-09-09 09:07:27 +0000] [398] [CRITICAL] WORKER TIMEOUT (pid:403)
> [2018-09-09 09:07:27 +0000] [398] [CRITICAL] WORKER TIMEOUT (pid:404)
> [2018-09-09 09:07:27 +0000] [398] [CRITICAL] WORKER TIMEOUT (pid:405)
> [2018-09-09 09:07:27 +0000] [403] [INFO] Worker exiting (pid: 403)
> [2018-09-09 09:07:27,372] \{{settings.py:188}} DEBUG - Disposing DB connection pool (PID 403)
> [2018-09-09 09:07:27 +0000] [404] [INFO] Worker exiting (pid: 404)
> [2018-09-09 09:07:27,440] \{{settings.py:188}} DEBUG - Disposing DB connection pool (PID 404)
> [2018-09-09 09:07:27 +0000] [398] [CRITICAL] WORKER TIMEOUT (pid:406)
> [2018-09-09 09:07:27 +0000] [405] [INFO] Worker exiting (pid: 405)
> [2018-09-09 09:07:27,573] \{{settings.py:188}} DEBUG - Disposing DB connection pool (PID 405)
> [2018-09-09 09:07:27 +0000] [406] [INFO] Worker exiting (pid: 406)
> [2018-09-09 09:07:27,686] \{{settings.py:188}} DEBUG - Disposing DB connection pool (PID 406)
> [2018-09-09 09:07:28,058] \{{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
> [2018-09-09 09:07:28 +0000] [441] [INFO] Booting worker with pid: 441
> [2018-09-09 09:07:29 +0000] [443] [INFO] Booting worker with pid: 443
> [2018-09-09 09:07:29 +0000] [444] [INFO] Booting worker with pid: 444
> [2018-09-09 09:07:29 +0000] [445] [INFO] Booting worker with pid: 445
> [2018-09-09 09:07:30,004] \{{cli.py:717}} DEBUG - [0 / 2] some workers are starting up, waiting...
> [2018-09-09 09:07:31,896] \{{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
> *[... This cycle repeats for 8 min until...]*
> [2018-09-09 09:15:18,347] \{{cli.py:717}} DEBUG - [2 / 4] some workers are starting up, waiting...
> [2018-09-09 09:15:19,381] \{{cli.py:738}} DEBUG - [4 / 4] sleeping for 30s starting doing a refresh...
> [2018-09-09 09:15:49,421] \{{cli.py:692}} DEBUG - [4 / 4] doing a refresh of 1 workers
> [2018-09-09 09:15:49 +0000] [398] [INFO] Handling signal: ttin
> [2018-09-09 09:15:49 +0000] [577] [INFO] Booting worker with pid: 577
> *[... As soon as all workers start, airflow begins refreshing them, which continues to cause high CPU load]*
> [2018-09-09 09:15:49,641] \{{cli.py:717}} DEBUG - [4 / 5] some workers are starting up, waiting...
> *[... waiting continues]*
> [2018-09-09 09:16:04,712] \{{cli.py:717}} DEBUG - [4 / 5] some workers are starting up, waiting...
> [2018-09-09 09:16:05,741] \{{cli.py:724}} DEBUG - [5 / 5] killing 1 workers
> [2018-09-09 09:16:05 +0000] [398] [INFO] Handling signal: ttou
> [2018-09-09 09:16:05 +0000] [553] [INFO] Worker exiting (pid: 553)
> [2018-09-09 09:16:06,871] \{{cli.py:738}} DEBUG - [4 / 4] sleeping for 30s starting doing a refresh...
> [2018-09-09 09:16:36,906] \{{cli.py:692}} DEBUG - [4 / 4] doing a refresh of 1 workers
> [2018-09-09 09:16:36 +0000] [398] [INFO] Handling signal: ttin
> [2018-09-09 09:16:36 +0000] [588] [INFO] Booting worker with pid: 588
> [2018-09-09 09:16:37,176] \{{cli.py:717}} DEBUG - [4 / 5] some workers are starting up, waiting...
> [2018-09-09 09:16:38,239] \{{cli.py:717}} DEBUG - [4 / 5] some workers are starting up, waiting...
> *[...]*
> {quote}
>  
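The timing argument in the report can be sanity-checked with quick shell arithmetic. This is only a sketch: the ~60 s single-worker boot time is the reporter's observation, and the even CPU split between booting workers is an assumption.

```shell
# Rough model of the startup failure: the reporter observed that a single
# worker boots in ~60 s on a 1-vCPU droplet. Assuming the CPU is split
# evenly, 4 concurrently booting workers each take roughly 4x as long.
single_worker_boot_s=60   # observed with one worker (reporter's figure)
workers=4                 # default config
timeout_s=120             # default web_server_worker_timeout

estimated_boot_s=$((single_worker_boot_s * workers))
echo "estimated boot: ${estimated_boot_s}s (timeout: ${timeout_s}s)"
# 240 s exceeds the 120 s timeout, so gunicorn kills the workers
# and the boot cycle starts over.
```

Under these assumptions the estimated boot time is double the timeout, which matches the repeated WORKER TIMEOUT / Booting worker cycle in the attached log.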

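For anyone hitting this, the workaround settings quoted in the report belong in the [webserver] section of airflow.cfg; they can also be supplied as environment-variable overrides via Airflow's AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; convention. A sketch, using the values from the report:

```shell
# Environment-variable form of the workaround settings from the report.
# Airflow maps AIRFLOW__WEBSERVER__<KEY> onto the [webserver] section
# of airflow.cfg, overriding the values in the file.
export AIRFLOW__WEBSERVER__WORKERS=2
export AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300
export AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800

echo "$AIRFLOW__WEBSERVER__WORKERS $AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT $AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL"
```

The env-var form is convenient for containerized deployments where editing airflow.cfg is awkward.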


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)