Posted to dev@airflow.apache.org by Jarek Potiuk <ja...@potiuk.com> on 2021/11/25 10:40:10 UTC

[DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Hello Everyone,

I recently had some discussions and thought about features that are
already implemented, planned, or in progress, and I had a thought
that may be worth discussing here.

It's very likely that many of the people involved have had similar
discussions and thoughts, but maybe it's worth spelling it out now so
that we have a common "direction" for the future of Airflow when it
comes to executors.

TL;DR; I think the recent changes, and possibly some future
improvements and optimisations, can lead us to a situation where we
no longer need Celery Executor (nor CeleryKubernetes) and can phase it
out eventually - leaving only Local, Kubernetes and the upcoming
LocalKubernetes one. We might still "support" CeleryExecutor for
backwards compatibility and for people who do not want to run
Kubernetes, but the main reasons why Celery would be preferred over
Kubernetes should be gone soon IMHO.

Why do I think so ?

I think so because I believe the main reasons for having
CeleryExecutor in the first place are largely gone. The main reason
why Celery executor was better than the Kubernetes one was that you
could run more short tasks with far less overhead and latency. However,
we now have ways, either already implemented or easy to optimise, of
significantly reducing the need to run small tasks via "remote"
executors.

The following things already happened:

1) We have Deferrable Operators support. For those, most of the code
(mostly small tasks, or the parts of operators that wait for
something) is already executed in the triggerer.

2) We have a HA scheduler where you could run multiple schedulers with
Local Executor - thus you can get scalability in LocalExecutor for
small tasks.

3) We had some optimisations in DummyOperator where triggering is done
in Scheduler.
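
To illustrate point 1, here is a minimal sketch of why waiting in a triggerer is cheap. It is not Airflow code: the names wait_for_external_event and run_triggerer are made up for illustration, and asyncio.sleep stands in for polling a real external service. The point is only that one event loop can wait on many triggers at once.

```python
import asyncio


async def wait_for_external_event(name: str, delay: float) -> str:
    """Hypothetical trigger: sleeps instead of polling a real service."""
    await asyncio.sleep(delay)
    return f"{name} fired"


async def run_triggerer(triggers: dict) -> list:
    """Wait on every trigger concurrently inside a single event loop."""
    waits = [wait_for_external_event(n, d) for n, d in triggers.items()]
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*waits)


def main() -> list:
    return asyncio.run(
        run_triggerer({"s3_key": 0.02, "http_ok": 0.01, "job_done": 0.03})
    )
```

All three waits overlap, so the total wall time is roughly the longest single wait rather than the sum of the three.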

What can still be done (or is already being done):

* While triggerer does not (I believe) support multiple instances for
now, it has been designed from ground up to support HA/scalability.

* We can rewrite a lot of the operators we have to be Deferrable -
especially those that reach out to external services.

* We can make more "built-in" operators that have some declarative
behaviour rather than imperative "execute" and have them evaluated
directly in Scheduler. We had a discussion about it in
https://github.com/apache/airflow/pull/19361 - but looks like it
should be possible to implement - for example - "DayOfWeek" operator
that would be evaluated in Scheduler and triggering decisions could be
made there. We could probably add quite a number of such "optimized"
operators that could be declarative and evaluated in a scheduler with
virtually 0 overhead.

* With the LocalKubernetes executor coming
https://github.com/apache/airflow/pull/19729 combined with
HA/scalability of scheduler (thus scalability of Local Executors) - it
seems that any reasonable installation will have enough scalability
and capacity to locally execute all the remaining "small tasks" in
Local Executors. We could even try to find a good way of
working out which tasks are "small" and automatically using
LocalExecutor for them - eventually.
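
As a rough illustration of the "declarative operator" idea above, the sketch below models a day-of-week branch as plain data that a scheduler could evaluate itself. Everything here is hypothetical (DayOfWeekRule and evaluate_in_scheduler are names invented for this example, not anything from the PR discussion), so treat it as one possible shape rather than a design.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DayOfWeekRule:
    """Hypothetical declarative branch: data only, no execute() method."""

    weekdays: frozenset  # ISO weekdays, Monday=1 ... Sunday=7
    follow_if_match: str
    follow_otherwise: str


def evaluate_in_scheduler(rule: DayOfWeekRule, logical_date: date) -> str:
    """Pick the downstream task id without starting a worker process."""
    if logical_date.isoweekday() in rule.weekdays:
        return rule.follow_if_match
    return rule.follow_otherwise


# Example rule: branch to a different report on Saturdays and Sundays.
rule = DayOfWeekRule(frozenset({6, 7}), "weekend_report", "weekday_report")
```

Because the rule carries no user code, evaluating it is a set lookup on the logical date, which is what would make the overhead close to zero.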

It seems to me that with those upcoming changes, LocalKubernetes
should be the default executor in the future rather than Celery (which
is now kind-of de facto "default"). We could even likely think about
adding more options of a similar kind for GCP/AWS/Azure - using native
capabilities of those platforms rather than using generic "Kubernetes"
for remote execution. I can imagine using Fargate (AWS team could
contribute it), Cloud Run (Google team), Azure Container Instances
(maybe Microsoft will finally also embrace Airflow :) ). That would
make the Airflow architecture more "Multiple Cloud Native".

Why do I think Celery Executor should be "gone" (possibly not
immediately but possibly with less priority) ?

The problem with Celery is that even with KEDA autoscaling, Celery
Executor has big problems with scaling-in (I also had discussions about
it recently - with the AWS team among others). Celery is complex and we
are using maybe 5% of its capabilities. However, I had a recent
discussion (at PyWaw, where I gave a talk about Airflow dependencies)
with people who are heavily using Celery in their product and
utilise a lot more of those capabilities, and they are rather unhappy
with the problems they have to deal with and with the stability of the
more complex features of Celery.

I'd love to hear what others think on the subject. It would be great
to agree on a common "direction" we are heading in, and a "vision"
of Airflow in the future when it comes to Executors. I have a
feeling that we are just about at a pivotal point where we can all
consciously change our paradigm of thinking about Airflow executors
and prioritise things differently.

J.

Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by James Coder <jc...@gmail.com>.
Just to throw in my 2 cents here, one huge benefit of the Celery executor, and a large reason I use Airflow, is that it allows you to make use of multiple queues. We are in the process of transitioning workflows from on prem to cloud, and the Celery executor allows me to easily send tasks that aren’t ready for the cloud yet to on prem workers, while things that can already be executed in k8s get sent to the appropriate queue there.
I am all for simplifying the deployment, and being able to get rid of the overhead of running a message broker, as well as all the dependencies Celery requires, would be a big plus. Additionally, as Jarek mentions above, the “scaling in” using Celery leaves much to be desired, and the idea of running multiple schedulers with local executors is intriguing, but without multiple queue support it leaves a gap currently filled by Celery.

James

James Coder
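
For readers less familiar with the multi-queue pattern described above, here is a toy model of it. This is not Celery's or Airflow's API: QueueRouter, send and take are invented names, and the queue names on_prem and k8s are just examples. It only shows the routing behaviour, where each task is tagged with a queue and each worker pulls from the queues it subscribes to.

```python
from collections import defaultdict, deque


class QueueRouter:
    """Toy stand-in for a broker with named queues (not Celery's API)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, task_id, queue="default"):
        """Tag a task with the queue its target environment listens on."""
        self._queues[queue].append(task_id)

    def take(self, subscribed_queues):
        """Return the next task from the first subscribed queue with work."""
        for name in subscribed_queues:
            if self._queues[name]:
                return self._queues[name].popleft()
        return None


router = QueueRouter()
router.send("legacy_export", queue="on_prem")  # not ready for the cloud yet
router.send("transform", queue="k8s")
router.send("load", queue="k8s")
```

An on prem worker calling router.take(["on_prem"]) only ever sees legacy_export, which is the gap a scheduler-local executor without queues would leave open.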

Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by Jarek Potiuk <ja...@potiuk.com>.
Very Good comments Ash !
Food for thought indeed - indeed LocalExecutor for multi-tenant is
no-go (thought about it too :). I agree there are different cases and
I agree totally that Celery will stay there for a looong time (maybe
forever).

Maybe "phasing out" is too strong a statement (I deliberately
did not use "deprecated" because it was really not my intention to
"remove it").

I thought more of changing the "thinking" we have in Airflow.

Currently the thinking is (at least in my head):

"if you want auto-scaling solution with support for long and short
tasks - by default go to CeleryKubernetesExecutor"

However I think that we **might** have another target in the future:

"if you want auto-scaling solution with support for long and short
tasks - by default go to Local<N>Executor, or even, if you care about
multi-tenancy, <N>Executor **might** be enough" (where N is
"Kubernetes" today but might be "Fargate/CloudRun/ContainerInstances"
etc.).

J.




Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by Ash Berlin-Taylor <as...@apache.org>.
Hi Jarek,

The triggerer does support multiple instances already.
Also, right now deferrable tasks still need a normal task slot on a
worker to start off before they defer to a trigger.

While I have no love for Celery (or how we mis-use it in Airflow more
accurately), and I agree that we aren't using many of its
capabilities, deprecating/removing the Celery executor doesn't feel
right to me. Yet. And not for a long while either.

First there is the multi-tenancy issue (discussion happening tomorrow
of course) - and if the scheduler is multi-tenant then I wouldn't feel
safe running _any_ user/DAG code on the scheduler node at all, so for
that to be possible we wouldn't be able to use Local Executor at all!
For instance all SLA misses and DAG level callbacks would need to go
via an executor to run on a worker.

Then there is my goal for Airflow: I want us to be better at running
many smaller tasks (which largely rules out Kubernetes due to pod start
up time), and while LocalExecutor would work with that model, I think a
multi-node deployment that doesn't involve running multiple schedulers
should be possible -- being able to scale worker slots (for /actual/
data processing in Airflow, not just kicking off external jobs)
independently of scheduling throughput is desirable to me. After all,
running a scheduler is not free in terms of load on the database.

Essentially by running multiple schedulers with LocalExecutor I worry
that we have built a poor imitation of a distributed job queue (i.e.
Celery) without all the years of experience that Celery has of making
it robust. Also let's not forget that building any kind of distributed
queue is a Difficult Problem and there always have to be tradeoffs.

-ash




Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yep. Definitely - part of AIP-1 :).

Having the Executor extended to run all kinds of  "workloads" is a great idea!

And I love the comments - re Fargate and Batch cases - really cool to
see the different perspectives here.  We definitely need to get more
such discussions :)
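
To make the "Executor runs workloads" idea quoted below a bit more concrete, here is a minimal sketch of one possible shape. All names are hypothetical (TaskWorkload, CallbackWorkload, SketchExecutor, SketchLambdaExecutor), the "table" is just a list, and the Lambda call is simulated with a log entry; none of this is an existing Airflow interface.

```python
from dataclasses import dataclass, field


@dataclass
class TaskWorkload:
    task_id: str


@dataclass
class CallbackWorkload:
    dag_id: str
    kind: str  # e.g. "sla_miss"


@dataclass
class SketchExecutor:
    """Hypothetical base: tasks run directly, callbacks are queued in a table."""

    log: list = field(default_factory=list)
    callback_table: list = field(default_factory=list)

    def run_workload(self, workload):
        # Single entry point, so the scheduler never runs user code itself.
        if isinstance(workload, TaskWorkload):
            self.run_task(workload)
        else:
            self.run_callback(workload)

    def run_task(self, workload):
        self.log.append(f"task:{workload.task_id}")

    def run_callback(self, workload):
        # Default: persist the callback for a separate processor to pick up.
        self.callback_table.append(workload)


class SketchLambdaExecutor(SketchExecutor):
    """Hypothetical subclass: callbacks go to a (simulated) function service."""

    def run_callback(self, workload):
        self.log.append(f"lambda:{workload.dag_id}:{workload.kind}")
```

The subclass only overrides run_callback, which is the benefit of putting callbacks behind the executor interface: where they run becomes a per-executor choice.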

On Fri, Nov 26, 2021 at 3:06 PM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> This split Fargate/Lambda executor idea has some relevance for the AIP-1/multi-tenancy discussion too.
>
> One of the things I had been considering for that is that we need to move DAG-level callbacks out of the scheduler (currently run via the parsing process run on each scheduler) as we can't have scheduler nodes running any user code in multi-tenancy for security reasons.
>
> So my idea here is that we extend the role of the Executor to be "run workloads" -- whether that is "execute this TI" or "run this DAG SLA miss callback". Crucially it _doesn't_ have to run it all the same, so a BaseExecutor could write the callbacks into a DB table that processors could pick up (mechanism TBD.) but, crucially, by having it be part of the Executor interface we can subclass it, and in this Fargate/Lambda example we could have callbacks run in Lambdas!
>
> -a
>
> On Thu, Nov 25 2021 at 23:18:17 +0000, "Oliveira, Niko" <on...@amazon.com.INVALID> wrote:
>
> > We could even likely think about adding more options of similar kind for GCP/AWS/Azure - using native capabilities of those platforms rather than using generic "Kubernetes" as remote execution. I can imagine using Fargate (AWS team could contribute it ), Cloud Run (Google team), Azure Container Instances (maybe Microsoft will finally also embrace Airflow :) ) . That would make the Airflow architecture more "Multiple Cloud Native".
>
> From the AWS side we're very interested and happy to work on something like a Fargate executor; it's on our roadmap either way. But I think a generalized "cloud" or "serverless" executor would make a lot of sense. From AWS alone you may want to execute "small" tasks within a Lambda (quick start up time but small amount of compute and a 15min max run time) and then "medium" to "large" tasks in ECS Fargate or Batch (with longer startup times but more compute available), etc. And the same goes for other cloud provider equivalents. A harmonized and configurable solution could make directing tasks to different execution environments very smooth.
Celery is complex and we are using maybe 5% of it's capabilities (however I had a recent discussion (at PyWaw where I gave talk about Airflow dependencies) with people who are heavily using Celery with their product and utilise a lot more of those capabilities and they are rather unhappy with the problems they have to deal with and stability of more complex features of Celery. I'd love to hear what others think on the subject? It would be great to have some common "direction" we are heading in agreed and "vision" of Airflow in the future when it comes to Executors, and I have a feeling that we are just about a pivotal point where we can all consciously change our paradigm of thinking about Airflow executors and prioritising things differently. J.

Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by Ash Berlin-Taylor <as...@apache.org>.
This split Fargate/Lambda executor idea has some relevance for the 
AIP-1/multi-tenancy discussion too.

One of the things I had been considering for that is that we need to 
move DAG-level callbacks out of the scheduler (currently run via the 
parsing process run on each scheduler) as we can't have scheduler nodes 
running /any/ user code in multi-tenancy for security reasons.

So my idea here is that we extend the role of the Executor to be "run
workloads" -- whether that is "execute this TI" or "run this DAG SLA
miss callback". It _doesn't_ have to run them all the same way, so
a BaseExecutor could write the callbacks into a DB table that
processors could pick up (mechanism TBD). But, crucially, by having it
be part of the Executor interface we can subclass it, and in this
Fargate/Lambda example we could have callbacks run in Lambdas!
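
To make the shape of that interface a bit more concrete, here is a rough
sketch. It is not existing Airflow code: every class and method name below
(ExecuteTask, RunCallback, SketchBaseExecutor, SketchLambdaExecutor,
run_workload) is made up for illustration, the "DB table" is a plain list,
and the Lambda invocation is only recorded rather than actually performed.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class ExecuteTask:
    """Hypothetical workload: run one task instance (TI)."""
    dag_id: str
    task_id: str
    run_id: str


@dataclass(frozen=True)
class RunCallback:
    """Hypothetical workload: run a DAG-level callback, e.g. an SLA miss."""
    dag_id: str
    callback_name: str


Workload = Union[ExecuteTask, RunCallback]


@dataclass
class SketchBaseExecutor:
    # Stand-in for the DB table that processors would pick callbacks up from.
    callback_table: List[RunCallback] = field(default_factory=list)
    # Stand-in for whatever mechanism really launches task instances.
    submitted_tasks: List[ExecuteTask] = field(default_factory=list)

    def run_workload(self, workload: Workload) -> None:
        """Single entry point: dispatch on the kind of workload."""
        if isinstance(workload, ExecuteTask):
            self.execute_task(workload)
        elif isinstance(workload, RunCallback):
            self.run_callback(workload)
        else:
            raise TypeError(f"unsupported workload: {workload!r}")

    def execute_task(self, workload: ExecuteTask) -> None:
        self.submitted_tasks.append(workload)

    def run_callback(self, workload: RunCallback) -> None:
        # Default behaviour: only queue the callback, never run user code here.
        self.callback_table.append(workload)


@dataclass
class SketchLambdaExecutor(SketchBaseExecutor):
    # Records which callbacks would have been sent to a Lambda function.
    invoked_functions: List[str] = field(default_factory=list)

    def run_callback(self, workload: RunCallback) -> None:
        # A real subclass would call the cloud provider's API at this point.
        self.invoked_functions.append(f"{workload.dag_id}:{workload.callback_name}")
```

The point of the sketch is only that the scheduler would call run_workload()
and never care how a given executor chooses to run a callback.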

-a

On Thu, Nov 25 2021 at 23:18:17 +0000, "Oliveira, Niko" 
<on...@amazon.com.INVALID> wrote:
>> We could even likely think about
>> adding more options of similar kind for GCP/AWS/Azure - using native
>> capabilities of those platforms rather than using generic "Kubernetes"
>> as remote execution. I can imagine using Fargate (AWS team could
>> contribute it ), Cloud Run (Google team), Azure Container Instances
>> (maybe Microsoft will finally also embrace Airflow :) ) .  That would
>> make the Airflow architecture more "Multiple Cloud Native".
> 
> From the AWS side we're very interested and happy to work on 
> something like a Fargate executor; it's on our roadmap either way.
> 
> But I think a generalized "cloud" or "serverless" executor would make 
> a lot of sense. From AWS alone you may want to execute "small" tasks 
> within a Lambda (quick start up time but small amount of compute and 
> a 15min max run time) and then "medium" to "large" tasks in ECS 
> Fargate or Batch (with longer startup times but more compute 
> available), etc. And the same goes for other cloud provider 
> equivalents. A harmonized and configurable solution could make 
> directing tasks to different execution environments very smooth.
> 


Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by Leon Smith <le...@bidnamic.com>.
I wrote an AWS Batch executor for our company (can provide it later on if it's
of interest to people) to attempt to move away from Celery, but we started
to hit up against some UI issues on AWS Batch & the number of jobs we
pushed through it.

The real killer, though, was the execution/spin-up time: even at a few
seconds per job, applied to the thousands of sensors we have, it took us
way outside of our acceptable SLA window, so we reverted back to Celery.
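
As a back-of-the-envelope illustration of why a few seconds matters at that
scale, here is a small sketch. The model is deliberately simplistic (jobs
start in fixed-size waves, each wave pays the full spin-up time once), and
the numbers in the usage note are invented for illustration, not our real
figures.

```python
import math


def total_startup_seconds(num_jobs: int, startup_seconds: float) -> float:
    """Spin-up time summed over all jobs, ignoring any parallelism."""
    return num_jobs * startup_seconds


def added_wall_clock_seconds(
    num_jobs: int, startup_seconds: float, max_parallel: int
) -> float:
    """Extra wall-clock time from spin-up alone when at most
    max_parallel jobs can be starting at the same time."""
    waves = math.ceil(num_jobs / max_parallel)
    return waves * startup_seconds
```

For example, with an assumed 5000 sensor jobs, 5 seconds of spin-up each and
100 jobs starting in parallel, added_wall_clock_seconds(5000, 5, 100) gives
250 seconds of pure overhead for a single round of checks, before any sensor
has done useful work.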

Granted, we could look at smart sensors again (they were newly released
and had some issues, if I recall correctly, so we ruled that approach out).
Instead we started to explore a split executor approach, sending jobs to
Batch where we care about isolation, and sensor jobs to a Celery executor or
some SQS & Lambda contraption.

Overall, phasing out Celery sounds scary unless the speed issue is addressed,
& I don't think moving that work into the scheduler is the right path to be
moving down.
If a split executor becomes a thing and there is still a way to send jobs
down a "fast path", I think the use-case for Celery does diminish
considerably.


On Thu, 25 Nov 2021 at 23:18, Oliveira, Niko <on...@amazon.com.invalid>
wrote:

> > We could even likely think about
> > adding more options of similar kind for GCP/AWS/Azure - using native
> > capabilities of those platforms rather than using generic "Kubernetes"
> > as remote execution. I can imagine using Fargate (AWS team could
> > contribute it ), Cloud Run (Google team), Azure Container Instances
> > (maybe Microsoft will finally also embrace Airflow :) ) .  That would
> > make the Airflow architecture more "Multiple Cloud Native".
>
> From the AWS side we're very interested and happy to work on something
> like a Fargate executor; it's on our roadmap either way.
>
> But I think a generalized "cloud" or "serverless" executor would make a
> lot of sense. From AWS alone you may want to execute "small" tasks within a
> Lambda (quick start up time but small amount of compute and a 15min max run
> time) and then "medium" to "large" tasks in ECS Fargate or Batch (with
> longer startup times but more compute available), etc. And the same goes
> for other cloud provider equivalents. A harmonized and configurable
> solution could make directing tasks to different execution environments
> very smooth.
>

Re: [DISCUSS] Shaping the future of executors for Airflow (slowly phasing out Celery ?)

Posted by "Oliveira, Niko" <on...@amazon.com.INVALID>.
> We could even likely think about
> adding more options of similar kind for GCP/AWS/Azure - using native
> capabilities of those platforms rather than using generic "Kubernetes"
> as remote execution. I can imagine using Fargate (AWS team could
> contribute it ), Cloud Run (Google team), Azure Container Instances
> (maybe Microsoft will finally also embrace Airflow :) ) .  That would
> make the Airflow architecture more "Multiple Cloud Native".

From the AWS side we're very interested and happy to work on something like a Fargate executor; it's on our roadmap either way.

But I think a generalized "cloud" or "serverless" executor would make a lot of sense. From AWS alone you may want to execute "small" tasks within a Lambda (quick startup time, but a small amount of compute and a 15-minute max run time) and then "medium" to "large" tasks in ECS Fargate or Batch (with longer startup times but more compute available), etc. The same goes for the equivalents at other cloud providers. A harmonized and configurable solution could make directing tasks to different execution environments very smooth.
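
To illustrate what "directing tasks" could look like, here is a minimal
sketch of a routing rule. It is only an assumption of how such a thing might
be configured: TaskProfile, choose_backend and the memory cut-off are
hypothetical names and values, and only the 15-minute limit comes from the
paragraph above.

```python
from dataclasses import dataclass

# The 15-minute max run time for Lambda mentioned above.
LAMBDA_MAX_RUNTIME_SECONDS = 15 * 60
# Assumed cut-off for a "small amount of compute"; illustrative only,
# not an actual platform limit.
SMALL_TASK_MAX_MEMORY_MB = 1024


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical per-task hints a DAG author could provide."""
    expected_runtime_seconds: int
    memory_mb: int


def choose_backend(profile: TaskProfile) -> str:
    """Return "lambda" for small, short tasks and "fargate" otherwise."""
    fits_runtime = profile.expected_runtime_seconds < LAMBDA_MAX_RUNTIME_SECONDS
    fits_memory = profile.memory_mb <= SMALL_TASK_MAX_MEMORY_MB
    if fits_runtime and fits_memory:
        return "lambda"
    return "fargate"
```

With these assumed thresholds, choose_backend(TaskProfile(30, 256)) picks
"lambda", while a one-hour task or an 8 GB task falls through to "fargate".
A real executor would presumably make both the thresholds and the set of
backends configurable.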
 