Posted to dev@airflow.apache.org by "Vandon, Raphael" <va...@amazon.com.INVALID> on 2023/03/23 22:24:40 UTC

[DISCUSS] a cache for Airflow Variables

Hello,
I’d like to submit for discussion the idea of having a cache for Airflow Variables.
The aim is to reduce DAG parsing time and the secrets manager API bill for users who make liberal use of Variable.get in their DAG files.
The drawback, of course, is that caches introduce a delay in how fast changes are propagated.
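
To make the idea concrete, here is a minimal sketch of the general shape of such a cache (the function name, the module-level dict and the TTL knob are all hypothetical illustrations, not the actual implementation - that lives in the PR discussed below):

```python
import time

from airflow.models import Variable

_CACHE: dict = {}      # hypothetical process-local store
TTL_SECONDS = 300      # hypothetical knob: staleness vs. API-call savings

def cached_variable_get(key):
    """Return a Variable value, refetching it at most once per TTL."""
    now = time.time()
    hit = _CACHE.get(key)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]              # cache hit: no DB/secrets-backend call
    value = Variable.get(key)      # cache miss: real backend lookup
    _CACHE[key] = (now, value)
    return value
```

The TTL is exactly where the propagation-delay trade-off mentioned above shows up.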

For improved readability, I kindly invite you to read and comment in the GitHub discussion: https://github.com/apache/airflow/discussions/30265

Cheers


Re: [DISCUSS] a cache for Airflow Variables

Posted by Jarek Potiuk <ja...@potiuk.com>.
Let's. I dismissed my review and added a comment on how I propose to see
it there.


Re: [DISCUSS] a cache for Airflow Variables

Posted by "Vandon, Raphael" <va...@amazon.com.INVALID>.
I like the proposal of having a way to invalidate the cache manually when changes need to be propagated quickly.
I think this is something we can ship without in the first iteration, provided we make the cache disabled by default. Then we could add it as an option in the Airflow CLI, and eventually in the UI, but in my opinion this should be done in incremental commits to keep the change size manageable.

There is an open PR here https://github.com/apache/airflow/pull/30259
I guess we can continue the conversation there?


RE: [DISCUSS] a cache for Airflow Variables

Posted by "Scheffler Jens (XC-DX/ETV5)" <Je...@de.bosch.com.INVALID>.
Hi Jarek,

I like this two-step approach. So I think this is good and not breaking. Reasons:

1) Currently the regular re-parsing eats up (in total) a lot of unneeded time, so any kind of (easy) caching when nothing has changed would be good. This constant re-parsing not only eats up a lot of (cumulative) CPU but also puts very high pressure on keeping DAG parsing fast, because we cannot afford longer parsing killing the scheduler cycle. The DagFileProcessor separates this from the scheduling loop.

2) Our use case: We want to share the DAG code across instances but have certain features turned on/off via parameters. For example, in the test instance we don't want to publish results but still want to run the same DAG for testing. Or we want the option to tune some batch parameters w/o needing to re-deploy code. The only options that I know of are Airflow Variables (which I think are great for controlling general parameters at runtime w/o re-deployment, and which can be auto-provisioned) or ENV variables (whereas ENV variables need to get into the deployment and are static; they cannot be changed at runtime). The sketch below illustrates the pattern.
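
For illustration, a sketch of that pattern (the variable name and the tasks are hypothetical; the top-level Variable.get at parse time is exactly the kind of call the proposed cache would make cheap):

```python
import pendulum

from airflow.decorators import dag, task
from airflow.models import Variable

# Hypothetical toggle, read at parse time; on the test instance an admin
# sets publish_results="false" without re-deploying any code.
PUBLISH_RESULTS = Variable.get("publish_results", default_var="true") == "true"

@dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def shared_pipeline():
    @task
    def compute() -> int:
        return 42

    @task
    def publish(value: int) -> None:
        print(f"publishing {value}")

    result = compute()
    if PUBLISH_RESULTS:   # branch decided per instance, at parse time
        publish(result)

shared_pipeline()
```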

Otherwise I really like the (below) discussion about advanced caching strategies, such that there is (an option) for code to tell when a re-parse actually needs to happen. This would allow more control over when a re-parse happens, so that external factors like excessive imports or lookups do not put the overall parsing and scheduling in danger.
We also wouldn't need to put as many rules into the heads of people starting with Airflow (whom we currently always need to tell what is wrong when they contribute their first DAGs with top-level code) - which potentially puts off new users.

Mit freundlichen Grüßen / Best regards

Jens Scheffler

Deterministik open Loop (XC-DX/ETV5)
Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY | www.bosch.com
Tel. +49 711 811-91508 | Mobil +49 160 90417410 | Threema / Threema Work: KKTVR3F4 | Jens.Scheffler@de.bosch.com

Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
Aufsichtsratsvorsitzender: Prof. Dr. Stefan Asenkerschbaumer; Geschäftsführung: Dr. Stefan Hartung, 
Dr. Christian Fischer, Filiz Albrecht, Dr. Markus Forschner, Dr. Markus Heyn, Dr. Tanja Rückert


Re: [DISCUSS] a cache for Airflow Variables

Posted by Jarek Potiuk <ja...@potiuk.com>.
Hello here.

I would like to propose something.

1) I really like the idea of conditionally disabling parsing - and I
would really love for this idea to be hashed out and discussed more
(especially if those who came up with it have a bit more time than me; I
seem to be involved in more things recently :D). I would really like
this to happen, but I am afraid it might take more time to discuss
some of the consequences; it would also necessarily require our users
to adopt and maintain some kind of exclusion mechanism (and we know
adoption will take a lot of time).

2) On the other hand - we have a very small, very localized and
already tested change that can help a number of users. A change that
only involves running a shared in-memory cache in the DagFileProcessor, that
does not require any changes from the users (maybe except a
configuration parameter change to enable it). It does not really
change operational complexity nor carry any risks (especially if it
is disabled by default). A sketch of the general shape follows below.
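
For a rough idea of what "shared cache memory" could mean here, a minimal sketch under assumptions (a Manager-backed dict visible to forked parser processes; the real PR at https://github.com/apache/airflow/pull/30259 differs in the details):

```python
import multiprocessing as mp
import time

def fetch_from_backend(key):
    print(f"backend call for {key!r}")  # stands in for a DB/secrets lookup
    return f"value-of-{key}"

def lookup(cache, key, ttl=600):
    """Hit the real backend only on a cache miss or expiry."""
    entry = cache.get(key)
    if entry is not None and time.time() - entry[0] < ttl:
        return entry[1]                  # hit: no backend call
    value = fetch_from_backend(key)      # miss: real lookup
    cache[key] = (time.time(), value)
    return value

if __name__ == "__main__":
    cache = mp.Manager().dict()          # shared across parser subprocesses
    child = mp.Process(target=lookup, args=(cache, "k"))
    child.start()
    child.join()
    lookup(cache, "k")  # reuses the child's entry: no second backend call
```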

Or this is at least how I see the options we have. What I would hate
most is for neither of the two to happen, because nobody will want to
spend their time discussing, approving and implementing 1), whereas 2)
will be blocked because 1) is a better (though only ideated) solution.

My proposal is to get this in for the DAG file processor only,
disabled by default - with some extra documentation explaining the
consequences.

Does it bear any risks for us that I am not aware of, if we do so? Is it too
much to ask?

J.




Re: [DISCUSS] a cache for Airflow Variables

Posted by "Vandon, Raphael" <va...@amazon.com.INVALID>.
My initial goal when working on this cache was mostly to shorten DAG parsing times, simply because that's what I was looking at, so I'd be happy with restricting this cache to DAG parsing.
I'm still relatively new to the Airflow codebase, so I don't know all the implications this change has, and I'm grateful for the comments here.

The benefits can be quite noticeable, depending a lot on the context. If the DAG file is simple, then a network call is going to be slow in comparison.
And if the parsing interval is short compared to the DAG execution schedule, then the number of calls to get secrets is going to be dominated by the DAG parsing rather than the executions.

The scenario where this brings the most benefits is many simple DAG files, all querying the same key from the Variables, parsed regularly, and run less often.
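
For a rough illustration of the scale (assuming the default min_file_process_interval of 30 seconds): a file can be re-parsed up to 86,400 / 30 = 2,880 times a day, so for a DAG that runs daily and reads one Variable at top level, roughly 2,880 of its ~2,881 daily secret fetches would come from parsing alone.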

@Hussein says that "the user can implement [their] own secret backend", but it's not an easy task. They'd have to implement it as a wrapper around the custom backend they want to use, since there can only be one custom secret backend. And implementing an in-memory cache that works cross-process just as a custom backend is straight up impossible.
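
To illustrate what such a wrapper might look like - a hypothetical class, wrapping the AWS backend purely as an example; note the cache lives in a single process, which is exactly the cross-process limitation described above:

```python
import time

from airflow.secrets import BaseSecretsBackend
from airflow.providers.amazon.aws.secrets.secrets_manager import SecretsManagerBackend

class CachingSecretsBackend(BaseSecretsBackend):
    """Hypothetical TTL cache wrapped around one concrete backend."""

    def __init__(self, ttl=300, **backend_kwargs):
        self._inner = SecretsManagerBackend(**backend_kwargs)
        self._ttl = ttl
        self._cache = {}  # per-process memory: forked/spawned workers start cold

    def get_variable(self, key):
        hit = self._cache.get(key)
        if hit is not None and time.time() - hit[0] < self._ttl:
            return hit[1]
        value = self._inner.get_variable(key)
        self._cache[key] = (time.time(), value)
        return value
```

Registering something like this via the [secrets] backend option is possible, but as noted, the dict above is per-process, so every forked parser or worker process starts cold.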

About the secure caching: since I'm only caching in memory, I didn't do anything in that regard, but we already have something in place to encrypt secrets when they are saved in the metastore, using cryptography.fernet.
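
As a reminder of that mechanism (this is the standard cryptography.fernet API that Airflow's fernet_key setting feeds; the key below is generated on the spot):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # Airflow keeps such a key in [core] fernet_key
f = Fernet(key)

token = f.encrypt(b"my-variable-value")   # what lands in the metastore
assert f.decrypt(token) == b"my-variable-value"
```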


Re: [DISCUSS] a cache for Airflow Variables

Posted by Jarek Potiuk <ja...@potiuk.com>.
Two comments here as well (just technical, rather than about whether we
should do it or not) - to clarify what the proposal really is.

Hussein:

Yeah, I had exactly the same thoughts, and that was even my original
response in the PR:
https://github.com/apache/airflow/pull/30259#issuecomment-1481610945.
Using redis/memcache is possible, and it can help a lot with decreasing
pressure on the DB/Secrets backend, especially in a multi-node setup.
Perfect alignment here.

However, I think this idea is aiming for something different and solves
a different problem. It's not only about decreasing the pressure on an
external DB/Secrets backend; more importantly, it does it all
in the memory of the instance which handles all the
multi-processing cases (and it builds on the fact that, thanks to Ash's
fork implementation a few years ago, we already handle a lot of
processing in the DFP with multiprocessing/forking). In a way I see it
as a further optimisation built on top of that change, which allowed us
to use memory and CPU much more effectively there by replacing
the launch of a separate interpreter for each processing run with
forking (which saves a lot of CPU on initialisation and has the nice
benefit of reusing a lot of memory that is shared following the
copy-on-write model). I see this one as a similar optimization (yes,
optimising patterns that we currently classify as "bad", but we could
reclassify them as "ok").

It just makes it much more effective to handle the case where the same
variables are used over and over on the same machine which handles
parsing (or execution - I will come back to that in a moment). It not
only saves on DB/Secrets calls, but also on network communication
altogether in those cases. In a way, it is like comparing the L1/L2
caches in the processor with an L3 cache on the motherboard. L1/L2 (the
proposal from Raphael) is super-fast and small, whereas L3
(memcache/redis) is much slower (orders of magnitude difference) - and
the end result is that there is a lot less communication involved.
Also, the idea of the proposal here is that you do not have to
complicate the deployment by adding another component (or assigning
additional functions to existing components). The "in-memory" solution
will work for everyone, without a single change to their deployment, no
matter whether they already use redis/memcache in it (for example,
users of the K8S executor likely have no redis, as they don't need it).
The redis/memcache approach is a more complete solution for sure, as it
solves a different problem (the multi-node case), but I think comparing
the two is a bit like comparing apples to pears.

Daniel:

Just to be precise and to clarify - it **might** work for Celery, but it
is a bit more risky because we are not fully aware of how Celery uses
multiprocessing/billiard under the hood. It will likely work as long
as we do this:

if celery:
    import billiard as mp  # Celery's drop-in fork of multiprocessing
else:
    import multiprocessing as mp

And then use mp everywhere.

This seems to solve all the known problems with
multiprocessing/Celery. See for example these issues
in our repo: https://github.com/apache/airflow/issues?q=is%3Aissue+billiard+is%3Aclosed+.
Mixing multiprocessing and Celery is simply not a good idea -
there is a reason why they forked it and created billiard
instead (and optimized the in-memory usage there). Billiard is
something we already use with Celery (it is used under the hood, and
some modifications there are implemented to allow async processing in
Celery, which makes the two libraries incompatible:
https://github.com/celery/celery/issues/5362#issuecomment-466862879).
So we **could** (provided sufficient testing) enable it also for
Celery. However, it is quite a bit more risky, because it could clash
with the way Celery uses billiard under the hood. (I am guessing
here - mostly out of caution, having read many cases in the past where
people had problems with it.)

J



Re: [DISCUSS] a cache for Airflow Variables

Posted by Daniel Standish <da...@astronomer.io.INVALID>.
Surprised to hear that it doesn't work with Celery. Is that right? I
assumed that this was the main target.

If it's really only a benefit in the DAG processor, it's surprising that it
provides much benefit, because there it should be one call per var-file-parse; in
the worker it will be once per TI, and I assumed this would be where the heavy
calls come from. Maybe I'm missing something.

Re: [DISCUSS] a cache for Airflow Variables

Posted by Hussein Awala <hu...@awala.fr>.
Since users can implement their own secrets backend, I don't see the need to add a cache layer and maintain it in Airflow, regardless of its type - whether local (e.g. RocksDB) or shared (Redis) - especially since we provide hooks to communicate with most DB types, which makes it easier for the user. Also, caching secrets in a secure way is not an easy task.

But I like Ash's idea of providing a way to tell Airflow not to continuously rescan a file, by adding a new Airflow configuration to tell the DagFileProcessor how to process the files:
```
dag_file_processing_conf = {
    "<file name pattern>": {
        "process_on_change": true
    },
    "some_domain/*_static.py": {
        "process_on_change_with_ttl": 3600 // process the files when there is a change or after an hour if there is not
    },
    "some_domain/*_dynamic.py": {
        "process_each": 600 // similar to min_file_process_interval but it can have a different value
    }
}
```
Then the DagFileProcessor compares the filename and its path against the provided patterns to determine whether there is a custom configuration or whether it should use the default setting.
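
A minimal sketch of that matching step, under assumptions (fnmatch-style globs, first match wins, and a made-up default policy mirroring min_file_process_interval):

```python
import fnmatch

DEFAULT_POLICY = {"process_each": 30}  # assumed default, like min_file_process_interval

def policy_for(rel_path, conf):
    """Return the first configured policy whose pattern matches the file path."""
    for pattern, policy in conf.items():
        if fnmatch.fnmatch(rel_path, pattern):
            return policy
    return DEFAULT_POLICY

conf = {
    "some_domain/*_static.py": {"process_on_change_with_ttl": 3600},
    "some_domain/*_dynamic.py": {"process_each": 600},
}
print(policy_for("some_domain/foo_static.py", conf))  # {'process_on_change_with_ttl': 3600}
```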

Hussein



Re: [DISCUSS] a cache for Airflow Variables

Posted by Jarek Potiuk <ja...@potiuk.com>.
Two points:

> I like Daniel's idea: implement this _in_ the provider itself (means you can't use an in-memory cache, but can use the DB or Redis etc.), then it doesn't even require users to upgrade to 2.6.
> Also it means that a problem with a single provider doesn't make all of core significantly more complex.

I think this is not a good idea. It's not an AWS secrets provider
problem. It's the same story for all secrets backends (including custom
ones). People complained a lot about HashiCorp Vault being heavily
pressured by Airflow, for example. The change would help there - it's a
general optimisation for our secrets backend architecture in this
context, and not a single-provider issue (at all).

> And it doesn't tie us in to the existing multi-processing model. But to connect this discussion to other discussions going on right now: why is parsing even a special process? I can easily see a change where "parse this file please" becomes a Workload, and all workloads run on the workers/runners.

I think this is a separate discussion, and this change **could** be
implemented now (2.6), whereas the discussion about workloads and how
we implement future logic is pretty independent. Maybe 2.7, maybe
2.8, who knows.

I think this is really a case of a tactical solution improving our
user experience with very limited effort (which, BTW, IMHO does not
"heavily complicate" core - it's very small and localized, and even if
we change the multiprocessing model we could easily adapt it) vs. a
strategic long-term improvement in the architecture.

As I said - I am cool with either leaving it as is or implementing
this one in the current DFP (with a preference for the latter). And I
think we should consider that seriously without dismissing it on
the grounds of "we have bigger changes that might or might not come",
because we have a chance to improve the experience of our users now.

J,



Re: [DISCUSS] a cache for Airflow Variables

Posted by Ash Berlin-Taylor <as...@apache.org>.
I like Daniel's idea: implement this _in_ the provider itself (means you can't use an in-memory cache, but can use the DB or Redis etc.), then it doesn't even require users to upgrade to 2.6.

Also it means that a problem with a single provider doesn't make all of core significantly more complex.
And it doesn't tie us in to the existing multi-processing model. But to connect this discussion to other discussions going on right now: why is parsing even a special process? I can easily see a change where "parse this file please" becomes a Workload, and all workloads run on the workers/runners.
-ash


Re: [DISCUSS] a cache for Airflow Variables

Posted by Elad Kalif <el...@apache.org>.
Jarek, I can argue that by doing that the problem is not gone but just
reduced to a level where it's even harder to spot and address by the
users. This is the worst kind of problem to tackle!

From my perspective, if we don't address the issue directly (a general
solution to top-level code issues), any attempt to minimize it must
come with a clear way of letting users know what is going on.

On Fri, Mar 24, 2023, 12:37 Jarek Potiuk <ja...@potiuk.com> wrote:

> TL;DR: I am in favour of the change (limiting it to the DagFileProcessor
> only) and turning the use of Variables at top level into good practice
> (2.6+).
>
> Here is why.
>
> I partially agree with Elad/Ash, but I am not 0/1 on that subject any
> more. I think there is a more nuanced approach that we could take
> here. I had a very similar view yesterday but I slept on it and
> changed my mind - to an extent.
>
> Putting aside (for a moment, I will come back to it further on) the
> fact that this one is addressing "bad practices" of users, I want to
> focus on the technicalities and risks first.
>
> I initially contested the idea after seeing the PR and raised my doubts
> (actually the discussion here is because I asked Raphael to bring it
> here): https://github.com/apache/airflow/pull/30259#issuecomment-1481610945.
> The doubts I raised were based on earlier discussions and attempts to
> do similar things (including by myself) when I hit some problems
> related to multiprocessing and especially Celery. After looking at the
> code and hearing from Raphael, I think the solution he came up with
> **might** work in a stable and good way for the DagFileProcessor,
> whereas it is risky to implement for Celery (because of the weird
> billiard <> multiprocessing relation - Celery has its own
> multiprocessing drop-in replacement, billiard, that at times clashes
> with multiprocessing). The risk that this caching approach will cause
> problems when enabled for Celery is high, and while I see how it can
> work for our DagFileProcessor, it will be a strong "no" for Celery.
> And as Daniel rightfully noticed, in the K8S executor it is simply not
> effective.
>
> However, I think (again putting aside the bad practices angle) - I
> **think** if we only do that for DagFileProcessor (I think caching in
> workers is generally out of the question), from the technical point of
> view it has only benefits.
>
> * the amount of traffic it generates is WAY smaller in general - I
> think there is a nice angle to it in that it can decrease the
> amount of traffic generated even across multiple DAGs - imagine 100
> DAGs using the same variable - with a single Scheduler/DagFileProcessor
> that might be a really nice resource saving (not only Secrets but also
> DB traffic, and in some cases even the number of connections to
> establish to the DB).
> * this will also - interestingly - help in the case of AIP-44 and the
> db-less dag file processor (in terms of a lower number of requests to
> the internal API).
> * I think in the case of the DagFileProcessor, my initial worry about TTL
> (and sometimes seeing an outdated version of the variable) is far, far
> less than in the case of workers. Unlike for tasks/workers, DAGs are
> parsed at essentially random intervals and there is never a
> "dependency" between DAGs, so whether the variable is fresh or a bit
> stale does not matter (because the same DAG can be parsed now or 5
> minutes ago, and we have essentially no control over it and cannot
> rely on it).
>
> So - technically speaking, to summarize, as long as we test it
> thoroughly - also at scale, I see no problem with getting it in - as
> long as we only limit it to DagFileProcessor.
>
> Now - coming back to the question "should we make life easier for our
> users who violate best practices" (essentially helping them to violate
> those).
>
> I think in general we should not make life easier for users who
> violate best practices. And I agree with Elad's points. And as a side
> comment I love Ash's idea about not re-parsing - this should be a
> discussion started separately and I would be 100% supportive for that
> one. It's the first time I've heard it from Ash :) and I love it - there
> are other things to consider there - like when to trigger reparsing,
> etc.
>
> However, I think with this change we have the opportunity of doing
> something different - i.e. change our best practices.
>
> Why not change our best practices to (essentially):
>
> "Do not use long running things like API calls. You can use Airflow
> Variables though - they are implemented in the way that workaround the
> limitations, at the expense of propagation time it takes when you
> change the variable" (or something like that).
>
> Philosophical rant here (sorry for those who do not like such analogies):
>
> The nice thing is that many of our users (who currently violate
> them) will suddenly, with **just** an Airflow upgrade, move from the
> "violating best practices" camp to the "following best practices" camp. I
> think that could be a major win for everyone. Yes, we can preach the
> best practices, but seeing how often they are violated (for good
> reason - the alternatives are not as "easy" to use), another good approach
> is to embrace the behaviours and let them become good practices. That
> worked for major religions, for example (we are close to it - so think of
> Easter being absorbed from pagan spring equinox celebrations
> originally). It's a very effective strategy to bring in new followers and
> strengthen the bond with those who struggle with their practices being
> touted as "bad". And all this without making a single change to their
> DAG files/structure (which would be necessary if we followed the
> "no-reparsing" idea by Ash).
>
> It would also make migration to 2.6 way more attractive for many users
> (if it becomes part of 2.6). Think of our message to our users: "yes, you
> pay a lot for Secrets access, but by migrating to 2.6 your bills will
> drop dramatically" - with just an upgrade and no changes to your
> DAGs. In the current economic climate, that alone might convince some
> decision makers to invest in the migration to 2.6.
>
> This is not a super-strong opinion, I would also be ok if we leave
> things as they are, but I think that there are no technical reasons
> why this would not work. Those are more "communication" issues - how
> we communicate to the users, how to be VERY CLEAR that "using
> variables at top level" is only good for 2.6+ and that it is a very
> bad idea for pre-2.6.
>
> I think (despite many years of arguing and advocating - including my
> own statements/words/rants - differently) we should keep an open mind
> to that option.
>
> Thoughts?
>
> J
>
> On Fri, Mar 24, 2023 at 2:42 AM Daniel Standish
> <da...@astronomer.io.invalid> wrote:
> >
> > It would not help with the Kubernetes executor. Did you try with the local
> > executor?
> >
> > Another option to consider is to implement this specifically on the AWS
> > secrets manager secrets backend itself.
>

Re: [DISCUSS] a cache for Airflow Variables

Posted by Jarek Potiuk <ja...@potiuk.com>.
TL;DR: I am in favour of the change (limiting it to the DagFileProcessor
only) and turning the use of Variables at top level into good practice
(2.6+).

Here is why.

I partially agree with Elad/Ash, but I am not 0/1 on that subject any
more. I think there is a more nuanced approach that we could take
here. I had a very similar view yesterday but I slept on it and
changed my mind - to an extent.

Putting aside (for a moment, I will come back to it further on) the
fact that this one is addressing "bad practices" of users, I want to
focus on the technicalities and risks first.

I initially contested the idea after seeing the PR and raised my doubts
(actually the discussion here is because I asked Raphael to bring it
here): https://github.com/apache/airflow/pull/30259#issuecomment-1481610945.
The doubts I raised were based on earlier discussions and attempts to
do similar things (including by myself) when I hit some problems
related to multiprocessing and especially Celery. After looking at the
code and hearing from Raphael, I think the solution he came up with
**might** work in a stable and good way for the DagFileProcessor,
whereas it is risky to implement for Celery (because of the weird
billiard <> multiprocessing relation - Celery has its own
multiprocessing drop-in replacement, billiard, that at times clashes
with multiprocessing). The risk that this caching approach will cause
problems when enabled for Celery is high, and while I see how it can
work for our DagFileProcessor, it will be a strong "no" for Celery.
And as Daniel rightfully noticed, in the K8S executor it is simply not
effective.

However, I think (again putting aside the bad practices angle) that if
we only do that for the DagFileProcessor (caching in workers is
generally out of the question), then from the technical point of view
it has only benefits:

* the amount of traffic it generates is WAY smaller in general - and
there is a nice angle to it: it can decrease the amount of traffic
generated even across multiple DAGs. Imagine 100 DAGs using the same
variable - with a single Scheduler/DagFileProcessor that might be a
really nice resource saving (not only Secrets traffic but also DB
traffic, and in some cases even the number of connections established
to the DB).
* this will also - interestingly - help in the case of AIP-44 and the
db-less DAG file processor (in terms of a lower number of requests to
the internal API).
* I think that in the case of the DagFileProcessor, my initial worry
about the TTL (and sometimes seeing an outdated version of the
variable) matters far, far less than in the case of workers. Unlike
for tasks/workers, DAGs are parsed at essentially random intervals and
there is never a "dependency" between DAGs, so whether the variable is
fresh or slightly stale does not matter (the same DAG could have been
parsed now or 5 minutes ago - we have essentially no control over that
and cannot rely on it).

So - technically speaking, to summarize: as long as we test it
thoroughly (also at scale) and limit it to the DagFileProcessor only,
I see no problem with getting it in.
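
To make the idea concrete, here is a minimal sketch of what such a
process-local cache could look like (purely illustrative - the names
`_VARIABLE_CACHE` and `cached_variable_get` are made up here, and the
actual PR may structure this very differently):

```
import time
from typing import Any, Dict, Tuple

from airflow.models import Variable

# Hypothetical per-process TTL cache: each DagFileProcessor process
# would hold its own copy, so staleness is bounded by the TTL.
_VARIABLE_CACHE: Dict[str, Tuple[float, Any]] = {}
_TTL_SECONDS = 300  # example of an acceptable propagation delay


def cached_variable_get(key: str, default_var: Any = None) -> Any:
    """Return a Variable, hitting the secrets backend at most once per TTL."""
    now = time.monotonic()
    hit = _VARIABLE_CACHE.get(key)
    if hit is not None and now - hit[0] < _TTL_SECONDS:
        return hit[1]
    value = Variable.get(key, default_var=default_var)
    _VARIABLE_CACHE[key] = (now, value)
    return value
```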

Now - coming back to the question of whether "we should make life
easier for our users who violate best practices" (essentially helping
them to violate those).

I think in general we should not make life easier for users who
violate best practices, and I agree with Elad's points. As a side
comment, I love Ash's idea about not re-parsing - that should be a
discussion started separately, and I would be 100% supportive of that
one. It's the first time I have heard it from Ash :) and I love it -
there are other things to consider there, like when to trigger
reparsing, etc.

However, I think with this change we have the opportunity of doing
something different - i.e. changing our best practices.

Why not change our best practices to (essentially):

"Do not use long-running things like API calls at top level. You can
use Airflow Variables though - they are implemented in a way that
works around the limitations, at the expense of the propagation time
it takes when you change the variable" (or something like that).

Philosophical rant here (sorry for those who do not like such analogies):

The nice thing is that many of our users (who currently violate them)
will suddenly, with **just** an Airflow upgrade, move from the
"violating best practices" camp to the "following best practices"
camp. I think that could be a major win for everyone. Yes, we can
preach the best practices, but seeing how often they are violated (for
good reason - the alternatives are not as "easy" to use), another good
approach is to embrace the behaviours and let them become good
practices. That worked for major religions, for example (we are close
to it now - think of Easter originally being absorbed from the pagan
spring equinox celebrations). It's a very effective strategy to bring
in new followers and strengthen bonds with those who struggle with
their practices being touted as "bad". And all of this without making
a single change to their DAG files/structure (which would be necessary
if we followed the "no-reparsing" idea by Ash).

It would also make migration to 2.6 way more attractive for many users
(if it becomes part of 2.6). Think of our message to our users: "yes,
you pay a lot for Secret access, but by migrating to 2.6 your bills
will drop dramatically" - with just an upgrade and no changes to your
DAGs. In the current economic climate, that alone might convince some
decision makers to invest in the migration to 2.6.

This is not a super-strong opinion - I would also be OK if we left
things as they are - but I think that there are no technical reasons
why this would not work. The remaining issues are more about
"communication": how we communicate this to the users, and how to be
VERY CLEAR that "using Variables at top level" is only good for 2.6+
and is a very bad idea for pre-2.6.

I think (despite many years of arguing and advocating differently -
including in my own statements/words/rants) we should keep an open
mind to that option.

Thoughts?

J

On Fri, Mar 24, 2023 at 2:42 AM Daniel Standish
<da...@astronomer.io.invalid> wrote:
>
> It would not help with the Kubernetes executor. Did you try with the
> local executor?
>
> Another option to consider is to implement this specifically on the
> AWS Secrets Manager secrets backend itself.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@airflow.apache.org
For additional commands, e-mail: dev-help@airflow.apache.org


Re: [DISCUSS] a cache for Airflow Variables

Posted by Daniel Standish <da...@astronomer.io.INVALID>.
It would not help with the Kubernetes executor. Did you try with the
local executor?

Another option to consider is to implement this specifically on the
AWS Secrets Manager secrets backend itself.
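
For illustration, that could be a thin caching layer over the provider
backend - a rough sketch, assuming `get_variable` is the right method
to intercept on `SecretsManagerBackend` (check the provider source
before relying on this; the class name and TTL handling here are just
one way to do it):

```
import time
from typing import Dict, Optional, Tuple

from airflow.providers.amazon.aws.secrets.secrets_manager import (
    SecretsManagerBackend,
)


class CachingSecretsManagerBackend(SecretsManagerBackend):
    """Sketch: memoize variable lookups for a TTL to cut API calls/cost."""

    def __init__(self, ttl_seconds: int = 300, **kwargs):
        super().__init__(**kwargs)
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[float, Optional[str]]] = {}

    def get_variable(self, key: str) -> Optional[str]:
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = super().get_variable(key)  # the actual Secrets Manager call
        self._cache[key] = (now, value)
        return value
```

Such a subclass could then be wired up through the existing [secrets]
backend / backend_kwargs options in airflow.cfg, with no core changes.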

Re: [DISCUSS] a cache for Airflow Variables

Posted by "Mehta, Shubham" <sh...@amazon.com.INVALID>.
Considering the current framing of the problem/solution - as a means to
address poorly written DAGs - I agree with Elad's perspective.
Nevertheless, I believe that caching for the secrets backend has the
potential to enhance Airflow performance even for DAGs adhering to best
practices, particularly when applied to connections and configurations
in addition to variables. IMO the proposed solution can be easily
extended to include connections and configurations - @Vandon, Raphael,
please feel free to correct me if I am mistaken.

Many Airflow users aim to manage hundreds or even thousands of tasks
concurrently. Dynamic Task Mapping greatly simplifies the process of
creating scalable tasks, and if each task requires a connection (or a
variable), a secrets backend cache could prove beneficial - see the
sketch below for one such scenario. This is just one example. I am
eager to hear opinions on why we should not implement this - are there
any drawbacks to this approach? Ultimately, this solution gives users
the option to utilize caching if they prefer.
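
As an illustrative sketch of that scenario (the connection id and table
names are made up; each mapped task instance resolves its connection
through the secrets backend independently):

```
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def mapped_example():
    @task
    def analyze(table: str) -> None:
        # With N mapped task instances, this lookup can mean N secrets
        # backend calls - the case where a cache would pay off.
        hook = PostgresHook(postgres_conn_id="my_db")
        hook.run(f"ANALYZE {table}")

    analyze.expand(table=["orders", "users", "events"])


mapped_example()
```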

> I would also propose an alternative fix: let's come up with a way to tell Airflow to not continuously reparse a file!
I wholeheartedly agree with this. @Vandon, Raphael, this could be the next optimization you consider.  

Shubham

On 2023-03-23, 4:34 PM, "Ash Berlin-Taylor" <ash@apache.org> wrote:


I second Elad's view here.

I would also propose an alternative fix: let's come up with a way to
tell Airflow to not continuously reparse a file!

A strawman example:

```
from airflow import ReparseMode, DAG

AIRFLOW_REPARSE = ReparseMode.ON_FILE_CHANGED

with DAG(...):
    ....
```

We could change things so that if that top-level value is seen, Airflow
stops parsing that file unless a change is detected.

We could also add a configuration option to set a default mode for all
DAG files that don't have an explicit mode set.

Not only does this put less load/cost on the secrets backend, it also
puts less load on the DAG parser itself! Win win.

-ash

On 23 March 2023 22:56:11 GMT, Elad Kalif <eladkal@apache.org> wrote:
>To me Airflow is not "plug and play". It is not a closed system that
>guides you on how to develop. Workflow as code requires more skills.
>There is stuff to learn before authoring DAGs.
>This discussion is about Variables, but I might ask a similar question
>about users who query an API as part of top-level code. That is also
>very bad, and not covered by the current discussion.
>
>From my perspective, users who hit this issue are likely to hit many
>other issues as well. It's a good sign that they should try to learn
>Airflow better before continuing development. I don't think a solution
>that involves hiding the issue is a good one.
>
>For now I'm -1 on making complicated changes to mitigate this. However,
>I am in favor of finding a way to alert users that we detected a
>possible bad practice and advise them to recheck the code (that also
>requires some thought... I don't know if showing many warnings in the
>main UI is the right way...)
>
>On Fri, Mar 24, 2023 at 00:25, Vandon, Raphael
><vandonr@amazon.com.invalid> wrote:
>
>> Hello,
>> I’d like to submit to discussion the idea of having a cache on Airflow
>> Variables.
>> The aim is to reduce DAG parsing time and Secret Manager API bill for
>> users who make a liberal use of Variable.get in their DAG files.
>> The drawback, of course, is that caches introduce a delay in how fast
>> changes are propagated.
>>
>> For improved readability, I kindly invite you to read and comment in the
>> github discussion : https://github.com/apache/airflow/discussions/30265
>>
>> Cheers
>>
>>




Re: [DISCUSS] a cache for Airflow Variables

Posted by Ash Berlin-Taylor <as...@apache.org>.
I second Elad's view here.

I would also propose an alternative fix: let's come up with a way to tell Airflow to not continuously reparse a file!

A strawman example:

```
from airflow import ReparseMode, DAG

AIRFLOW_REPARSE = ReparseMode.ON_FILE_CHANGED

with DAG(...):
    ....
```

We could change things so that if that top-level value is seen, Airflow stops parsing that file unless a change is detected.

We could also add a configuration option to set a default mode for all DAG files that don't have an explicit mode set.
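
Such a default might look like this in airflow.cfg - continuing the
strawman, so both the section and the option name below are
hypothetical, not an existing Airflow setting:

```
[dag_processor]
# Hypothetical strawman option mirroring the ReparseMode enum above.
default_reparse_mode = on_file_changed
```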

Not only does this put less load/cost on the secrets backend, it also puts less load on the DAG parser itself! Win win.

-ash 

On 23 March 2023 22:56:11 GMT, Elad Kalif <el...@apache.org> wrote:
>To me Airflow is not "plug and play". It is not a closed system that
>guides you on how to develop. Workflow as code requires more skills.
>There is stuff to learn before authoring DAGs.
>This discussion is about Variables, but I might ask a similar question
>about users who query an API as part of top-level code. That is also
>very bad, and not covered by the current discussion.
>
>From my perspective, users who hit this issue are likely to hit many
>other issues as well. It's a good sign that they should try to learn
>Airflow better before continuing development. I don't think a solution
>that involves hiding the issue is a good one.
>
>For now I'm -1 on making complicated changes to mitigate this. However,
>I am in favor of finding a way to alert users that we detected a
>possible bad practice and advise them to recheck the code (that also
>requires some thought... I don't know if showing many warnings in the
>main UI is the right way...)
>
>On Fri, Mar 24, 2023 at 00:25, Vandon, Raphael
><va...@amazon.com.invalid> wrote:
>
>> Hello,
>> I’d like to submit to discussion the idea of having a cache on Airflow
>> Variables.
>> The aim is to reduce DAG parsing time and Secret Manager API bill for
>> users who make a liberal use of Variable.get in their DAG files.
>> The drawback, of course, is that caches introduce a delay in how fast
>> changes are propagated.
>>
>> For improved readability, I kindly invite you to read and comment in the
>> github discussion : https://github.com/apache/airflow/discussions/30265
>>
>> Cheers
>>
>>

Re: [DISCUSS] a cache for Airflow Variables

Posted by Elad Kalif <el...@apache.org>.
To me Airflow is not "plug and play". It is not a closed system that
guides you on how to develop. Workflow as code requires more skills.
There is stuff to learn before authoring DAGs.
This discussion is about Variables, but I might ask a similar question
about users who query an API as part of top-level code. That is also
very bad, and not covered by the current discussion.

From my perspective, users who hit this issue are likely to hit many
other issues as well. It's a good sign that they should try to learn
Airflow better before continuing development. I don't think a solution
that involves hiding the issue is a good one.

For now I'm -1 on making complicated changes to mitigate this. However,
I am in favor of finding a way to alert users that we detected a
possible bad practice and advise them to recheck the code (that also
requires some thought... I don't know if showing many warnings in the
main UI is the right way...)

On Fri, Mar 24, 2023 at 00:25, Vandon, Raphael
<va...@amazon.com.invalid> wrote:

> Hello,
> I’d like to submit to discussion the idea of having a cache on Airflow
> Variables.
> The aim is to reduce DAG parsing time and Secret Manager API bill for
> users who make a liberal use of Variable.get in their DAG files.
> The drawback, of course, is that caches introduce a delay in how fast
> changes are propagated.
>
> For improved readability, I kindly invite you to read and comment in the
> github discussion : https://github.com/apache/airflow/discussions/30265
>
> Cheers
>
>