Posted to dev@airflow.apache.org by Ruiqin Yang <yr...@gmail.com> on 2018/07/17 21:39:34 UTC

[Proposal] Scale Airflow

Hi guys,

I'd like to propose a few improvements that would help Airflow scale:

Scheduler:

   1.
      - Problem: The scheduler loop becomes slow when the number of running
      tasks grows too large, which slows down the DAG parsing/scheduler loop
      and creates scheduling delay, AIRFLOW-2156
      <https://issues.apache.org/jira/browse/AIRFLOW-2156>
      - Proposal: Parallelize celery querying.
      - Progress: Dan Davydov (@aoen) has made a change to parallelize
      celery querying, and we have been running with it in production for
      over a month. It solved the scheduling delay we had in production with
      ~15k running tasks at peak, and it has handled ~30k running tasks in
      our stress-testing cluster. With 16 subprocesses querying celery we see
      a 10x+ performance improvement on celery querying, and the number of
      subprocesses is configurable.
   2.
      - Problem: The DAG parsing loop is coupled with the scheduler loop,
      which places a bottleneck on DAG parsing and creates scheduling delay,
      AIRFLOW-2760
      <https://issues.apache.org/jira/browse/AIRFLOW-2760>
      - Proposal: Decouple DAG parsing loop and scheduler loop.
      - Progress: Prototype worked locally.
   3.
      - Problem: The scheduler loop becomes slow when the number of tasks
      waiting to be queued grows too large, which slows down the DAG
      parsing/scheduler loop and creates scheduling delay, AIRFLOW-2761
      <https://issues.apache.org/jira/browse/AIRFLOW-2761>
      - Proposal: Parallelize celery enqueuing.
      - Progress: Not started yet. Planned for Q3.
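
To make the "parallelize celery querying" idea in item 1 concrete, here is a
minimal sketch of the fan-out pattern: split the running task ids into chunks
and let a pool of workers look up the state of each chunk. It is not the
actual change. fetch_state, chunk and fetch_states_parallel are hypothetical
names, fetch_state is a stand-in rather than a Celery API, and the sketch uses
threads for simplicity where the change described above uses subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_state(task_id):
    """Hypothetical stand-in for one result-backend lookup (not a Celery API)."""
    # A real lookup would be a network round trip; here we fake a result.
    return "SUCCESS" if task_id % 2 == 0 else "RUNNING"


def chunk(items, n_chunks):
    """Split items into at most n_chunks roughly equal, non-empty slices."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_states_parallel(task_ids, parallelism=16):
    """Look up the state of every task id, one chunk of ids per worker."""

    def fetch_chunk(ids):
        return [(task_id, fetch_state(task_id)) for task_id in ids]

    states = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() preserves chunk order; each result is a list of (id, state).
        for pairs in pool.map(fetch_chunk, chunk(task_ids, parallelism)):
            states.update(pairs)
    return states
```

The parallelism argument plays the role of the configurable subprocess count
mentioned above. Chunking keeps the per-lookup dispatch overhead small when
the number of running tasks is in the tens of thousands.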

Webserver:

   1.
      - Problem: The webserver parses the DagBag twice during startup, which
      makes startup slow when there are many DAG files, AIRFLOW-2615
      <https://issues.apache.org/jira/browse/AIRFLOW-2615>
      - Proposal: Remove the redundant DagBag parsing.
      - Progress: A first attempt
      <https://github.com/apache/incubator-airflow/pull/3506> failed.
      Planned for Q3.
   2.
      - Problem: The webserver parses the DagBag in a single thread, which
      makes startup slow when there are many DAG files,
      AIRFLOW-2762 <https://issues.apache.org/jira/browse/AIRFLOW-2762>
      - Proposal: Parallelize DagBag parsing in the webserver. Because not
      all DAGs are picklable, the webserver will lose access to the actual
      DAG object, but only the worker should need the actual DAG object.
      - Progress: Not started yet. Planned for Q3.
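
To illustrate the picklability point in item 2, here is a rough sketch in
which each parse job returns only plain metadata instead of the DAG object
itself, so results can cross a process boundary. Everything here is
hypothetical: FakeDag is a stand-in and not the Airflow DAG class, and
parse_file and parse_all are illustrative names, not existing Airflow
functions. The example runs with a thread pool. A real setup would more
likely pass a process pool as the executor, which is where picklability
starts to matter.

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor


class FakeDag:
    """Hypothetical stand-in for a DAG object; not the Airflow class."""

    def __init__(self, dag_id, task_ids):
        self.dag_id = dag_id
        self.task_ids = task_ids
        # A locally defined callable is a common reason an object
        # cannot be pickled.
        self.callback = lambda: None


def parse_file(path):
    """Pretend to parse one DAG file; return only picklable metadata."""
    dag_id = os.path.splitext(os.path.basename(path))[0]
    dag = FakeDag(dag_id=dag_id, task_ids=["extract", "load"])
    # Keep plain strings and lists; drop the DAG object and its callables.
    return {"dag_id": dag.dag_id, "task_ids": list(dag.task_ids), "path": path}


def parse_all(paths, executor):
    """Parse every file on the given executor and index summaries by dag_id."""
    return {summary["dag_id"]: summary for summary in executor.map(parse_file, paths)}


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        summaries = parse_all(["/dags/a.py", "/dags/b.py"], pool)
    # The summaries survive a pickle round trip; a FakeDag would not.
    assert pickle.loads(pickle.dumps(summaries)) == summaries
```

The trade-off is the one described above: the webserver ends up holding
metadata about each DAG rather than the DAG object.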

Feedback is hugely appreciated.

Cheers,
Kevin Y