Posted to dev@airflow.apache.org by Ruiqin Yang <yr...@gmail.com> on 2018/07/17 21:39:34 UTC
[Proposal] Scale Airflow
Hi guys,
I'd like to propose a few improvements that would help Airflow scale:
Scheduler:
1.
- Problem: The scheduler loop becomes slow when the number of running
tasks grows too large, which slows down DAG parsing and the scheduler
loop and creates scheduling delay, AIRFLOW-2156
<https://issues.apache.org/jira/browse/AIRFLOW-2156>
- Proposal: Parallelize celery querying.
- Progress: Dan Davydov (@aoen) has made a change to parallelize
celery querying, and we have been running with it in production for 1+
month. It solved the scheduling delay problem we had in production with
~15k running tasks at peak, and our stress-testing cluster has shown it
can handle ~30k running tasks. We see a 10x+ performance improvement on
celery querying with 16 subprocesses querying celery, and that number is
configurable.
2.
- Problem: The DAG parsing loop is coupled with the scheduler loop, which
makes DAG parsing a bottleneck and creates scheduling delay, AIRFLOW-2760
<https://issues.apache.org/jira/browse/AIRFLOW-2760>
- Proposal: Decouple the DAG parsing loop from the scheduler loop.
- Progress: A prototype worked locally.
3.
- Problem: The scheduler loop becomes slow when the number of tasks
waiting to be queued grows too large, which slows down DAG parsing and
the scheduler loop and creates scheduling delay, AIRFLOW-2761
<https://issues.apache.org/jira/browse/AIRFLOW-2761>
- Proposal: Parallelize celery enqueuing.
- Progress: Not started yet. Planned for Q3.
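To make the "parallelize celery querying" idea in item 1 concrete, here is
a minimal sketch of the chunk-and-fan-out pattern. It is not the actual
change mentioned above. fetch_state, FAKE_BACKEND, chunked and
fetch_states_parallel are hypothetical names made up for illustration, the
backend is an in-memory dict rather than celery, and the sketch uses a
thread pool instead of subprocesses so that it stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the celery result backend (assumption: the real
# lookup is a slow, I/O-bound call made once per running task).
FAKE_BACKEND = {
    f"task_{i}": ("SUCCESS" if i % 3 == 0 else "RUNNING") for i in range(1000)
}


def fetch_state(task_id):
    """Look up one task's state; unknown tasks are reported as PENDING."""
    return FAKE_BACKEND.get(task_id, "PENDING")


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous, roughly equal slices."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_states_for_chunk(task_ids):
    """Work done by a single worker: query every task in its chunk."""
    return [(task_id, fetch_state(task_id)) for task_id in task_ids]


def fetch_states_parallel(task_ids, parallelism=16):
    """Fan the state lookups out over `parallelism` workers and merge results."""
    chunks = chunked(list(task_ids), parallelism)
    states = {}
    if not chunks:
        return states
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for result in pool.map(fetch_states_for_chunk, chunks):
            states.update(result)
    return states


if __name__ == "__main__":
    running = [f"task_{i}" for i in range(100)]
    print(len(fetch_states_parallel(running, parallelism=16)))
```

The same split-into-chunks shape would presumably apply to item 3
(parallel enqueuing), with the per-chunk function sending tasks instead of
reading their state.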
Webserver:
1.
- Problem: The webserver parses the DagBag twice during startup, which
makes startup slow when there is a large number of DAG files, AIRFLOW-2615
<https://issues.apache.org/jira/browse/AIRFLOW-2615>
- Proposal: Remove the redundant DagBag parsing.
- Progress: A first attempt
<https://github.com/apache/incubator-airflow/pull/3506> failed.
Planned for Q3.
2.
- Problem: The webserver parses the DagBag in a single thread, which
makes startup slow when there is a large number of DAG files,
AIRFLOW-2762 <https://issues.apache.org/jira/browse/AIRFLOW-2762>
- Proposal: Parallelize DagBag parsing in the webserver. Because not all
DAGs are picklable, the webserver would lose access to the actual DAG
object, but only the workers should need the actual DAG object.
- Progress: Not started yet. Planned for Q3.
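To illustrate the pickling concern in webserver item 2: objects that hold
lambdas cannot be pickled with the standard pickle module, so a parsing
subprocess could not hand the full object back to its parent. One possible
workaround, sketched below, is to return only a plain, picklable summary.
FakeDag, is_picklable and summarize are hypothetical names for illustration
only; FakeDag is not the Airflow DAG class, and this is not a description
of how the webserver would actually be changed.

```python
import pickle


class FakeDag:
    """Stand-in for a parsed DAG object (assumption: it holds Python callables)."""

    def __init__(self, dag_id, callables):
        self.dag_id = dag_id
        self.callables = callables  # maps task_id -> Python callable


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def summarize(dag):
    """Reduce a DAG-like object to built-in types that always pickle."""
    return {"dag_id": dag.dag_id, "task_ids": sorted(dag.callables)}


if __name__ == "__main__":
    dag = FakeDag("example", {"extract": lambda: 1, "load": lambda: 2})
    print(is_picklable(dag))             # the lambdas block pickling
    print(is_picklable(summarize(dag)))  # the summary holds only str and list
```

A parent process that receives only such summaries can list DAGs and tasks,
but cannot call anything on the real DAG object, which matches the
trade-off described in the proposal.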
Feedback is hugely appreciated.
Cheers,
Kevin Y