Posted to dev@airflow.apache.org by "EKC (Erik Cederstrand)" <EK...@novozymes.com> on 2017/02/13 16:01:15 UTC

Celery or Dask?

Hello all,


I'm investigating why some of our DAGs are not being scheduled properly (I ran into https://issues.apache.org/jira/browse/AIRFLOW-342, among other things). Coupled with comments on this list, I'm getting the impression that Celery is a second-class citizen and core developers are mainly using Dask. Is this correct?


If Dask support is simply more mature and more likely to have issues responded to, I'll consider migrating our installation.


Thanks,

Erik

Re: Celery or Dask?

Posted by "EKC (Erik Cederstrand)" <EK...@novozymes.com>.
Thanks to both for correcting my understanding. I'll see what information I can collect on our issues and report back if I get anything coherent.


Kind regards,

Erik

________________________________
From: Jeremiah Lowin <jl...@apache.org>
Sent: Monday, February 13, 2017 6:26:15 PM
To: dev@airflow.incubator.apache.org
Subject: Re: Celery or Dask?

As far as I know I'm the only person using Dask with Airflow at the moment.
I've been using Dask for a variety of other (non-Airflow) tasks and have
found it to be a great tool. However, it's important to note that Celery is
a much more mature project with finer control over how tasks are executed.
In fact Dask's objectives are totally different (I think of it as
"pure-Python Spark") but it happens to expose similar functionality to
Celery through its Distributed subproject.

I added a DaskExecutor to Airflow in my last commit and am working on
improving the unit tests now. I've been running the DaskExecutor in a test
environment with good results, but between the fact that you have to run
Airflow's bleeding-edge master branch to get it and that I'm the only
person kicking its tires (at the moment), I would only recommend using it
if you like to live very dangerously indeed.

In the near future, I can see Dask being a recommended way to scale Airflow
beyond a single machine due to the ease of setting it up -- but not yet.


Re: Celery or Dask?

Posted by Jeremiah Lowin <jl...@apache.org>.
As far as I know I'm the only person using Dask with Airflow at the moment.
I've been using Dask for a variety of other (non-Airflow) tasks and have
found it to be a great tool. However, it's important to note that Celery is
a much more mature project with finer control over how tasks are executed.
In fact Dask's objectives are totally different (I think of it as
"pure-Python Spark") but it happens to expose similar functionality to
Celery through its Distributed subproject.

I added a DaskExecutor to Airflow in my last commit and am working on
improving the unit tests now. I've been running the DaskExecutor in a test
environment with good results, but between the fact that you have to run
Airflow's bleeding-edge master branch to get it and that I'm the only
person kicking its tires (at the moment), I would only recommend using it
if you like to live very dangerously indeed.

In the near future, I can see Dask being a recommended way to scale Airflow
beyond a single machine due to the ease of setting it up -- but not yet.

On Mon, Feb 13, 2017 at 11:04 AM Bolke de Bruin <bd...@gmail.com> wrote:

Dask just landed in master. So no, Celery is the most used option to
scale out.

Always interested in what you are running into, but please be prepared to
provide a lot of info on your setup.

- Bolke

> On 13 Feb 2017, at 17:01, EKC (Erik Cederstrand) <EK...@novozymes.com>
wrote:
>
> Hello all,
>
>
> I'm investigating why some of our DAGs are not being scheduled properly (
ran into https://issues.apache.org/jira/browse/AIRFLOW-342, among other
things). Coupled with comments on this list, I'm getting the impression
that Celery is a second-class citizen and core developers are mainly using
Dask. Is this correct?
>
>
> If Dask support is simply more mature and more likely to have issues
responded to, I'll consider migrating our installation.
>
>
> Thanks,
>
> Erik

Re: Celery or Dask?

Posted by Bolke de Bruin <bd...@gmail.com>.
Dask just landed in master. So no, Celery is the most used option to scale out.

Always interested in what you are running into, but please be prepared to provide a lot of info on your setup.

- Bolke

> On 13 Feb 2017, at 17:01, EKC (Erik Cederstrand) <EK...@novozymes.com> wrote:
> 
> Hello all,
> 
> 
> I'm investigating why some of our DAGs are not being scheduled properly (I ran into https://issues.apache.org/jira/browse/AIRFLOW-342, among other things). Coupled with comments on this list, I'm getting the impression that Celery is a second-class citizen and core developers are mainly using Dask. Is this correct?
> 
> 
> If Dask support is simply more mature and more likely to have issues responded to, I'll consider migrating our installation.
> 
> 
> Thanks,
> 
> Erik