Posted to dev@airflow.apache.org by Ping Zhang <pi...@umich.edu> on 2021/12/17 06:47:26 UTC

[DISCUSS] Thoughts on active-active scheduler mode in production cluster

Hi Airflow community,

I would like to share some of my thoughts on the active-active scheduler HA
mode.

I am wondering whether the active-active scheduler mode is really needed to
improve the scheduler performance.

One scheduler host can easily support ~5000 DAGs in our production with
a max scheduling delay of only ~60 seconds (for the largest DAG, ~23K tasks)
after our Next-Gen Scheduler work.

I don't see a need to set up the active-active scheduler for
performance reasons.

Setting up the active-active scheduler mode can only increase the
complexity of cluster operations. There are also restrictions on DB,
including DB types and DB versions.

I do agree that the Airflow scheduler needs better HA. We could use the
active-passive mode. This can greatly simplify the scheduler code, removing
the need for locking in the code and for dealing with potential deadlocks.

We noticed that the majority of our prod incidents come from the database.
The current active-active HA mode might exacerbate the problem.

Would love to hear your thoughts about this.


Best wishes

Ping Zhang

Re: [DISCUSS] Thoughts on active-active scheduler mode in production cluster

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

Really appreciate the thorough information about it. I will dig deep into
those references.

Thanks

Ping


On Sat, Dec 18, 2021 at 12:50 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I believe the scheduler's active/active horizontal scalability was one of
> the last "single points of failure" we addressed for scalability. For
> many years, the scheduler was the only component that could not be scaled.
> We also had a number of reports from other customers that it became a
> bottleneck for them. There were at least two talks at the first Airflow
> Summit where our users described workarounds for their "scheduler
> scalability" problems. I also personally think (and I've seen it for a
> long time) that if your system's scalability depends on a single
> processor or DB connection, this will hit you sooner or later. So it is
> worth having a solution that you can scale.
>
> However, before you make any assumptions from your "current use", I think
> it would be great if you looked at the past discussions and resources to
> see both the context of the change and our quest to make Airflow
> something different than it was before - serving more cases than it did
> before and becoming a much more versatile scheduler that can handle a lot
> more than what you could do with 1.10 (which your experience is mostly
> about).
>
> There was a very extensive discussion and testing as part of the
> AIP-15 when we discussed this (I think it started two years ago) and
> results of the discussion and analysis are captured here:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103092651
> .
> I'd say your observation is very specific to your case and the version
> of Airflow you use, but in a number of other cases this problem started
> to show up. Different users have different DAG structures and sizes at
> which a single scheduler starts to show its limits. And to be honest -
> your case is far from the "biggest" one that we saw. And most
> importantly - not the biggest we want to handle. Our "forward looking"
> is what really brought us as a community to address this in the first
> place.
>
> To be perfectly honest - staying with what Airflow could do a year or
> two ago is not exciting at all. Airflow 2 is all about the future, as
> much as it embraces the past. We are aiming for a MUCH BIGGER scale
> than you can reach with a single scheduler, bigger even than what you
> described. The future of Airflow goes FAR beyond the current use cases.
> Limiting Airflow to what it could do a year ago is not our goal at
> all. We really want to make Airflow a much more generic scheduler that
> handles way more cases - thousands of scheduled tasks per second -
> possibly even handling streaming flows in the future and being able to
> react to changes in fractions of a second. For that, scalability is a
> must, and Ash and the Astronomer team did some very extensive testing
> around the scalability approach we've chosen. We did an extensive
> review of the concept and then of the code, and we performed a very
> detailed walkthrough of the code, where most active committers took
> a very, very deep look into how it was done. We had a lot of
> comments, fixes and improvements (and also a number of fixes afterward
> to make it robust, scalable and future-looking). Finally, I also
> encourage you to take a look at the fantastic talk that Ash gave at
> the Airflow Summit describing the decisions behind the new scheduler
> architecture: https://www.youtube.com/watch?v=DYC4-xElccE. That can
> give you more context on what was implemented there and why.
>
> You can read more about it in this article:
> https://www.astronomer.io/blog/airflow-2-scheduler - including a short
> write-up on the use cases that might benefit from the scalability of
> the scheduler.
>
> So in short - yes, we think (and I believe I speak in the name of all
> the community members that discussed, agreed to and took part in the
> Airflow 2 effort) that the active-active scheduler is a must - if not
> for the current scale and cases (where we think it is already useful) -
> then for all the future cases that we want Airflow to excel at.
>
> I think soon you will start seeing many more cases coming from those investments.
>
> J.

Re: [DISCUSS] Thoughts on active-active scheduler mode in production cluster

Posted by Jarek Potiuk <ja...@potiuk.com>.
I believe the scheduler's active/active horizontal scalability was one of
the last "single points of failure" we addressed for scalability. For
many years, the scheduler was the only component that could not be scaled.
We also had a number of reports from other customers that it became a
bottleneck for them. There were at least two talks at the first Airflow
Summit where our users described workarounds for their "scheduler
scalability" problems. I also personally think (and I've seen it for a
long time) that if your system's scalability depends on a single
processor or DB connection, this will hit you sooner or later. So it is
worth having a solution that you can scale.

However, before you make any assumptions from your "current use", I think
it would be great if you looked at the past discussions and resources to
see both the context of the change and our quest to make Airflow
something different than it was before - serving more cases than it did
before and becoming a much more versatile scheduler that can handle a lot
more than what you could do with 1.10 (which your experience is mostly
about).

There was a very extensive discussion and testing as part of the
AIP-15 when we discussed this (I think it started two years ago) and
results of the discussion and analysis are captured here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103092651.
I'd say your observation is very specific to your case and the version
of Airflow you use, but in a number of other cases this problem started
to show up. Different users have different DAG structures and sizes at
which a single scheduler starts to show its limits. And to be honest -
your case is far from the "biggest" one that we saw. And most
importantly - not the biggest we want to handle. Our "forward looking"
is what really brought us as a community to address this in the first
place.

To be perfectly honest - staying with what Airflow could do a year or
two ago is not exciting at all. Airflow 2 is all about the future, as
much as it embraces the past. We are aiming for a MUCH BIGGER scale
than you can reach with a single scheduler, bigger even than what you
described. The future of Airflow goes FAR beyond the current use cases.
Limiting Airflow to what it could do a year ago is not our goal at
all. We really want to make Airflow a much more generic scheduler that
handles way more cases - thousands of scheduled tasks per second -
possibly even handling streaming flows in the future and being able to
react to changes in fractions of a second. For that, scalability is a
must, and Ash and the Astronomer team did some very extensive testing
around the scalability approach we've chosen. We did an extensive
review of the concept and then of the code, and we performed a very
detailed walkthrough of the code, where most active committers took
a very, very deep look into how it was done. We had a lot of
comments, fixes and improvements (and also a number of fixes afterward
to make it robust, scalable and future-looking). Finally, I also
encourage you to take a look at the fantastic talk that Ash gave at
the Airflow Summit describing the decisions behind the new scheduler
architecture: https://www.youtube.com/watch?v=DYC4-xElccE. That can
give you more context on what was implemented there and why.

You can read more about it in this article:
https://www.astronomer.io/blog/airflow-2-scheduler - including a short
write-up on the use cases that might benefit from the scalability of
the scheduler.

So in short - yes, we think (and I believe I speak in the name of all
the community members that discussed, agreed to and took part in the
Airflow 2 effort) that the active-active scheduler is a must - if not
for the current scale and cases (where we think it is already useful) -
then for all the future cases that we want Airflow to excel at.

I think soon you will start seeing many more cases coming from those investments.

J.
