Posted to dev@airflow.apache.org by Tao He <li...@alibaba-inc.com.INVALID> on 2021/07/20 08:58:34 UTC

Proposal about changes in Airflow for better integration with project Vineyard

Hi folks,

We are the team behind project vineyard ( https://github.com/v6d-io/v6d ), which is an immutable
in-memory data manager. We are working on an integration with airflow. To fully leverage the
capability of vineyard to improve the end-to-end performance of airflow workflows, we propose
a set of changes in the DAG, executor, and scheduler to allow third-party providers to inject hooks
into the airflow framework.

As we are not experts on airflow, before starting to work on pull requests, we want to outline our
proposal here for early suggestions and feedback. We have prepared slides to outline the ideas:
https://docs.google.com/presentation/d/1r4gg4RX0PMrhRjpSpoRhe5Ck4FaFm8Xt9DSoZipsNa0/edit?usp=sharing

Briefly, vineyard is a data-sharing engine that aims to provide efficient sharing of intermediate data
in big-data analytical workflows. To allow vineyard to be leveraged by existing airflow jobs (without
non-trivial refactoring of the DAG definition code), our proposal includes:

1. XCom interface: allow operators to push more (optional) metadata to XCom.
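For illustration, a backend along these lines might serialize the real payload into an external store and push back only a small reference plus the extra metadata. Everything below (the `ObjectStore` class, key names) is a hypothetical stand-in sketch, not existing airflow or vineyard API:

```python
import hashlib
import pickle


class ObjectStore:
    """Hypothetical external object store standing in for vineyard."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()[:16]
        self._objects[key] = payload
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ObjectStore()


def serialize_value(value):
    """Push the payload to the store; XCom keeps only a reference plus metadata."""
    payload = pickle.dumps(value)
    return {
        "object_id": store.put(payload),
        # The extra (optional) metadata this item asks XCom to carry:
        "nbytes": len(payload),
        "type": type(value).__name__,
    }


def deserialize_value(ref):
    """Resolve the reference back to the original payload."""
    return pickle.loads(store.get(ref["object_id"]))
```

With such metadata available, a downstream component could reason about an object (size, type, location) without pulling the whole payload through the metadata database.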

2. DAG-level hooks for operators: operators in airflow can define their own `pre_execute` and
   `post_execute` hooks, but at the DAG level it is hard to inject a hook into all operators for
   preprocessing jobs, e.g., preparing the required inputs before executing an operator. We propose
   supporting DAG-level pre/post hooks in airflow.
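Such DAG-level hooks could be applied by wrapping every task's per-operator hooks. The following is a minimal stand-alone sketch; the `Task` class here is a stand-in for an airflow operator, not real airflow API:

```python
class Task:
    """Minimal stand-in for an airflow operator with per-operator hooks."""

    def __init__(self, task_id, pre_execute=None, post_execute=None):
        self.task_id = task_id
        self.pre_execute = pre_execute
        self.post_execute = post_execute

    def run(self, context):
        if self.pre_execute:
            self.pre_execute(context)
        context["log"].append(f"run {self.task_id}")
        if self.post_execute:
            self.post_execute(context)


def add_dag_level_hooks(tasks, dag_pre, dag_post):
    """Inject DAG-wide pre/post hooks around each task's own hooks."""
    for task in tasks:
        task_pre, task_post = task.pre_execute, task.post_execute

        def pre(context, _own=task_pre):
            dag_pre(context)  # DAG-level hook first, e.g. fetch required inputs
            if _own:
                _own(context)

        def post(context, _own=task_post):
            if _own:
                _own(context)
            dag_post(context)  # DAG-level hook last, e.g. publish outputs

        task.pre_execute, task.post_execute = pre, post
```

The point of the proposal is that this wrapping would happen once, inside airflow, rather than in every user's DAG file.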

3. Scheduler: airflow supports specifying a custom XCom backend via `AIRFLOW__CORE__XCOM_BACKEND`.
   It would be great if the scheduler (scheduler_job.py) could be made extensible in the same way.
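The existing XCom knob referred to above is a plain environment variable; the class path below is a hypothetical example, and the ask in this item is an analogous knob for the scheduler:

```shell
# Point airflow at a custom XCom backend class (module path is hypothetical).
export AIRFLOW__CORE__XCOM_BACKEND=vineyard_provider.xcom.VineyardXComBackend
```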

4. Executor: if the airflow scheduler could attach some “data locality information” when enqueuing
   tasks, there would be a performance gain whenever the underlying backend can access local objects
   faster than remote ones (e.g., a ray or vineyard backend). The executor should consider such
   locality information when popping tasks to run from the queue.
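A locality-aware pop could look roughly like the sketch below; the `preferred_host` tag attached by the scheduler is an assumption of this proposal, not an existing airflow field:

```python
from collections import deque


def pop_task(queue: deque, free_host: str):
    """Pop the first task whose locality hint matches the free worker;
    fall back to plain FIFO when no queued task prefers that host."""
    for i, task in enumerate(queue):
        if task.get("preferred_host") == free_host:
            del queue[i]
            return task
    return queue.popleft() if queue else None
```

A real implementation would also need to bound how long a task may wait for its preferred host, so that locality never starves a task of execution.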

If anything here misrepresents how airflow works internally, please correct us. We are seeking
comments from the airflow developers on the whole design as well as on the detailed proposed changes!

We would also like to see airflow moving towards better Kubernetes integration and support. Airflow
has the ability to launch pods (KubernetesExecutor and CeleryKubernetesExecutor), and we believe
that by leveraging the DAG and lineage information, airflow could gain even more in modern cloud
environments orchestrated by Kubernetes!

Thanks in advance!

Best regards,
Tao

Re: Proposal about changes in Airflow for better integration with project Vineyard

Posted by Jarek Potiuk <ja...@potiuk.com>.
Without going into details, I think a lot of what you propose here does not
really require changes in Airflow and can be implemented in a very similar
manner to what is currently being implemented with Ray.

I will not comment on the details yet. I recommend watching the talk from
last week's Airflow Summit -
https://airflowsummit.org/sessions/2021/airflow-ray/ . I think Daniel would
be the best person to comment on how close or far your ideas are from
something we are already working on, but when I saw Daniel's talk, my first
thought was that we will need more integrations like that, and that Airflow
should provide (if it does not already) hooks for such integrations. At
first glance it looks like you are addressing similar problems: a custom
XCom allowed to push more metadata, seamless data sharing between tasks,
and "data locality information for performance" are at the core of Ray's
integration.

I think this might be one of the streams of the discussions on how to
proceed with that.

I will look forward to Daniel's comments before diving a bit deeper, but I
am certainly interested in making Airflow work seamlessly with a number of
similar data-sharing and execution engines. We need to carefully consider
the different needs of such engines, and while the Ray integration is the
first being "tried" in this way, maybe we should already start an Airflow
Improvement Proposal where we will gather the needs and discuss which
approach will be best.

J.

On Tue, Jul 20, 2021 at 10:58 AM Tao He <li...@alibaba-inc.com.invalid>
wrote:

> [quoted text trimmed]


-- 
+48 660 796 129