Posted to dev@airflow.apache.org by Kaxil Naik <ka...@gmail.com> on 2023/02/02 01:41:09 UTC

Re: [Discussion] Deprecate auto cleanup RenderedTaskInstanceFields and decouple k8s_pod_yaml

> > Users often asked (Slack, issues, discussions) for the ability to
> auto-maintain tables/logs and received the usual answer: "No, we don't give
> you this opportunity, please use something from the Airflow Ecosystem
> page". But on the other hand we have auto-maintenance only for a single table.


That is not true; we added the "airflow db clean" command to give this control
to users.
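
For example, trimming the log and job tables would look something like this
(a sketch - check "airflow db clean --help" on your version for the exact flags):

    airflow db clean --tables log,job --clean-before-timestamp 2023-01-01 --dry-run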

> What about these tables: session, log, job? I expected the answer would be
> "They are not so specific."

The RTIF can vary in size based on what is stored and is really useless
beyond the last X entries for each task. RTIF also has a foreign key constraint
<https://github.com/apache/airflow/blob/d80b583db07197c8c3d0549a805e83ceaaf10d52/airflow/models/renderedtifields.py#L63-L73>
on TI, which is the most common table affected during migrations
and can affect scheduling decisions. The Log and Session tables aren't affected
by it.

I will have a detailed reply if I manage to find the time; it has just been
too difficult lately.

On Tue, 31 Jan 2023 at 21:46, Jarek Potiuk <ja...@potiuk.com> wrote:

> BTW, I would really love to hear what the original authors have to say
> here. I am merely trying to put myself in their shoes and guess what
> the reasoning is :).
>
> I think this is really a question of: do we want to keep all
> arbitrary-size rendered task instance fields in our database forever
> by default, the same as other fields?
> I believe the original authors answered that question with "no". And
> the "num_row" limit was a way to achieve it.
>
> And I really do not want to "win" this argument, I just want to
> protect our users (and environment).
>
> There is (IMHO) currently a big difference between
> session/log/task_instance fields and rendered task_instance fields
> that justifies different behaviour.
>
> The former generally have a fixed maximum row size (this is one of the
> reasons we have limited string sizes in our DB) - to keep
> them from growing uncontrollably large. We simply do not keep arbitrary
> size data in those tables.
> On the other hand, the rendered task instance is arbitrary in size
> (JSONField) and the need for deletion accounts for the "worst" case
> scenario.
> Until we get rid of that "property" of the rendered task instance
> table, I think "just" skipping deletion of those fields without fixing
> the "worst" case scenario is not a good idea.
>
> Maybe in your test cases (and many others) those tables are not
> bigger, but I think the protection here is implemented to account for
> the cases where the rendered task instance fields are really "big".
>
> But very interestingly - if the rendered task instance is "big" then
> likely it is next to useless to be displayed in the Web UI in its
> entirety.
>
> So maybe you are actually right, Andrey. Maybe we can skip deleting
> those, solve it differently, and apply the same rules
> as other tables?
>
> Let me then - constructively - propose another idea which actually
> might solve both yours and my own concerns. Maybe we can fix the
> "worst case" scenario differently? We do not have to limit the number
> of rows, we can limit the size of the row instead.
>
> Why don't we simply limit the size of the rendered task instance JSON
> and if they are too big (we can configure the maximum size), we will
> render something like (probably a bit more unique and not
> "accidentally triggerable"):
>
> {
>    "error": "This task instance has too large rendered task instances
> to display it"
> }
>
> And implement an escape hatch in the web server to handle it properly
> when displaying such "truncated" rendered task instance field.
>
> We should be able to come up with a sensible max size that we think
> makes sense when rendering it in the web UI. And we could make the
> max size configurable by the user if they have a lot of GB to spare.
> We could even apply it automatically: if there is a max_num_row limit
> - we allow any size; if we have no limit on the number of rows, we
> limit the maximum row size.
>
> If we have such "upper-bounded" reasonable size of each row in that
> table - then I am perfectly happy with not deleting the rows
> automatically.
> But only if we limit that and handle the "worst" case properly.
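>
> For illustration, a rough sketch of the write-time guard I have in mind
> (plain SQL, Postgres syntax; the parameter names, the size limit and the
> sentinel payload are all made up for the example):
>
> INSERT INTO rendered_task_instance_fields
>   (dag_id, task_id, run_id, map_index, rendered_fields)
> SELECT
>   %(dag_id)s, %(task_id)s, %(run_id)s, %(map_index)s,
>   CASE
>     -- octet_length() measures the serialized payload in bytes
>     WHEN octet_length(%(payload)s) > %(max_rendered_size)s
>       THEN '{"error": "__rendered_fields_too_large__"}'::json
>     ELSE %(payload)s::json
>   END;
>
> The webserver escape hatch would then just check for that sentinel key
> before rendering the field.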
>
> That would be my proposal for how we can handle it so that both views are
> taken into account.
>
> What do you Andrey (and others) think? Does it make sense? Or do we
> think we should not have any such protections in place?
>
> J.
>
>
> On Tue, Jan 31, 2023 at 9:43 PM Andrey Anshin <an...@taragol.is>
> wrote:
> >
> > > I think that is all something to be tested with explain plans.
> >
> > I would be really careful with these results. A DELETE in a transaction with
> a rollback usually looks more optimistic than when actually executed.
> >
> >
> > > I think we would not know before we try - and possibly there are other
> > optimisation approaches. The optimisation I proposed was only the first
> > that came to my mind to avoid the "not in" query. The problem with a
> > "not in" query is that there is no way for the DB to optimise it.
> > Effectively you have to get every record (or index entry) and test it.
> > Maybe it can be done better :). And yes, locking the index with
> > anti-insert locks and the need to rebalance trees during the delete is
> > a concern.
> >
> > My point is still the same: I would rather remove it in the future or
> make the maintenance policy more consistent: all or nothing. Right now we
> are closer to nothing than to all.
> >
> >
> > > It's not about index size or JSON access. It is about the size of the
> > actual rows and storage it takes - i.e. general size of the database.
> >
> > I'm tired, but I'm not sure that I understand the actual point.
> > It is not really a matter of the size of the table if you always access it by
> this pattern:
> > 1 request which returns exactly 1 record, accessed by a unique index.
> > Basically the query traverses the index to find a reference to a single
> CTID/rowid (or whatever name is used in other RDBMSes).
> > So at this point what really matters is how fast your index grows, rather
> than the table size.
> >
> >
> > > The problem with it is that (especially with dynamic task mapping), it
> > might grow really, really fast. Basically you have NUM_DAGS *
> > NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS * NUM_RUNS number
> > of records there.
> > A back-of-the-envelope calculation: assuming you have a
> > DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
> > fields, each field evaluating to a 1K string, you have 10 tasks *
> > 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
> > per one(!) run of one(1) DAG. Run it every 10 minutes and
> > your database grows by a whopping 1.44 GB of data
> > every single day (from a single DAG).
> >
> > Depends on the DB. If we talk about Postgres, your estimate could easily be
> off by 3-4 times (thanks to inline compression before/instead of TOAST).
> > I have a couple of questions:
> > 1. Do you know how big the indexes on rendered_task_instance_fields would be
> after one day? (Spoiler alert: I could give an estimation in the morning.)
> > 2. In this scenario, the default settings would always keep up to 30,000
> TIs for this DAG.
> > Could someone who wants to optimize the query make it more optimal
> than accessing the table by index (unique index -> ctid/rowid -> record),
> and where is the crossover point?
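> >
> > For Postgres, heap vs index growth can be measured directly with the
> > standard size functions (only the table name here is ours):
> >
> > SELECT pg_size_pretty(pg_relation_size('rendered_task_instance_fields'))       AS heap,
> >        pg_size_pretty(pg_indexes_size('rendered_task_instance_fields'))        AS indexes,
> >        pg_size_pretty(pg_total_relation_size('rendered_task_instance_fields')) AS total; -- incl. TOAST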
> >
> > > This is of course an estimation that assumes a lot, but it's not at
> all unrealistic.
> >
> > 144,000 TIs per day in a single DAG (initially I wanted to put some
> sarcastic message here).
> > How would Airflow itself feel with 144k Task Instances per day?
> > I ask because, so far, I have always had a task_instance table bigger
> than rendered_task_instance_fields.
> >
> >
> > > This table is very specific compared with the other tables. The only
> reason for it being here is to be able
> > to show the rendered fields in the UI if you go to the specific run of a
> task. If you clean-up other tables you basically lose the history of
> execution of the tasks and you cannot really know if the data has been
> > processed, you cannot do backfills effectively, you lose all the
> > context. Cleaning this table is merely about the values that have been
> > rendered for a specific run and the assumption there is that the older
> > it gets, the less interesting it is.
> >
> > What about these tables: session, log, job? I expected the answer would
> be "They are not so specific."
> > For me, every table is specific to its purpose.
> > Users often asked (Slack, issues, discussions) for the ability to
> auto-maintain tables/logs and received the usual answer: "No, we don't give
> you this opportunity, please use something from the Airflow Ecosystem
> page". But on the other hand we have auto-maintenance only for a single table.
> >
> > >> It is the opposite of what we have right now: we scan tables (maybe
> multiple times) and read all record tuples, which contain JSON.
> > > Not sure if I get the point here :). Yes - in my proposal I think the
> records will not be touched - only indexes. So the cleanup should be way
> faster and contention less of a problem; due to the way the delete
> uses < ordering, deadlocks will not be possible at all (as opposed to
> the current "not in", where there is a very easy way to get into deadlocks when
> two parallel deletes are trying to delete the same rows in a different
> sequence). I think my proposal improves all the characteristics of the
> "cleanup" with very little extra penalty on record creation.
> >
> > I was talking about the current solution and why it is also slow (and if
> users happen to use some DBaaS where they also pay for IOPS, then it is
> expensive too). Let's talk about the benefits of optimised queries for the 4
> different DB backends (3 if excluding SQLite) when we have them.
> >
> > We pay for table/index size linearly: more records, more size. But other
> operations vary and depend on the B-Tree implementation, and usually cost
> grows logarithmically. Or do we worry only about table/toast/index size on
> disk?
> > >> Yep. I (and I believe the original author had the same worry) am
> worried a lot about the size of the table and the fact that this table will
> be by far the biggest table in our DB while most of the old records will
> never be touched. And by the fact that this is the only table that our
> users will have to know about to clean up separately from all others pretty
> much always.
> >
> > Same as previous.
> >
> > > If not even worrying about money spent by our users, and performance
> degradation that comes with databases that are bigger - that's a lot of
> environmental effects that we might incur. Airflow is heavily used, if
> suddenly all our users
> will start having 10x bigger databases than they have now because we will
> deprecate the values and keep all the history, then we have a big number of
> extra disks that will have to be used. I'd strongly prefer a solution where
> we keep the data usage lower in this case.
> >
> > Am I right that this is all about "let's not delete by default, as we do
> for other tables" rather than about the current default implementation?
> > Because I get the opposite result to what you said:
> rendered_task_instance_fields doesn't grow faster than other tables - that is
> what I observed.
> > I would like to compare it with other findings and some reproducible
> metrics rather than with hypothetical things.
> >
> > ----
> > Best Wishes
> > Andrey Anshin
> >
> >
> >
> > On Tue, 31 Jan 2023 at 11:12, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >> COMMENT: While writing the answer here, I think I found a deeper
> >> problem (and optimisation needed) - i.e. I think the delete should be
> >> even more fine-grained than it is today and include map_index -
> >> please take a look at the end (also maybe TP might comment on that
> >> one).
> >>
> >> > 1. Additional indexes add additional performance degradation on
> insert, but gain potential improvements on delete (and the effect on update is
> unknown); the RDBMS still has to rebalance the index and keep it consistent
> with the table.
> >> > 2. LIMIT x OFFSET y could easily become a full seq scan, especially if
> the user sets a huge number for the offset (which? unknown).
> >> > 3. Mixing two indexes could improve performance in a single query, but
> in concurrent execution it might lead to degradation, because the DB needs to
> create a bitmap table for comparison between these two indexes; as a result
> it might lead to different issues, such as OOM on the DB backend, swapping, or
> the optimiser deciding it is better not to use these indexes.
> >>
> >> I think that is all something to be tested with explain plans. I think
> >> we would not know before we try - and possibly there are other
> >> optimisation approaches. The optimisation I proposed was only the first
> >> that came to my mind to avoid the "not in" query. The problem with a
> >> "not in" query is that there is no way for the DB to optimise it.
> >> Effectively you have to get every record (or index entry) and test it.
> >> Maybe it can be done better :). And yes, locking the index with
> >> anti-insert locks and the need to rebalance trees during the delete is
> >> a concern.
> >>
> >> > Is it a real problem? As long as we access only by indexes, which don't
> include this JSON, it really doesn't matter. I guess we should almost always
> get a UNIQUE INDEX SCAN for a SELECT or DELETE (UPDATE) of a single
> record.
> >>
> >> Yes, I think so - and while I was not the author of this "cleanup"
> >> code, I believe I know the intention.
> >>
> >> It's not about index size or JSON access. It is about the size of the
> >> actual rows and storage it takes - i.e. general size of the database.
> >> The problem with it is that (especially with dynamic task mapping), it
> >> might grow really, really fast. Basically you have NUM_DAGS *
> >> NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS * NUM_RUNS number
> >> of records there. A back-of-the-envelope calculation: assuming you have a
> >> DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
> >> fields, each field evaluating to a 1K string, you have 10 tasks *
> >> 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
> >> per one(!) run of one(1) DAG. Run it every 10 minutes and
> >> your database grows by a whopping 1.44 GB of data
> >> every single day (from a single DAG). This is of course an estimation
> >> that assumes a lot, but it's not at all unrealistic. That's a lot. And
> >> if we want the user to do the cleanup then a) they need to know it b)
> >> they need to specifically clean up this table only, because all the
> >> other data is relatively small. This table is very specific compared
> >> with the other tables. The only reason for it being here is to be able
> >> to show the rendered fields in the UI if you go to the specific run of
> >> a task. If you clean up other tables you basically lose the history of
> >> execution of the tasks and you cannot really know if the data has been
> >> processed, you cannot do backfills effectively, you lose all the
> >> context. Cleaning this table is merely about the values that have been
> >> rendered for a specific run, and the assumption there is that the older
> >> it gets, the less interesting it is.
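> >>
> >> The arithmetic, spelled out (plain SQL so the estimate is reproducible;
> >> the workload numbers are the assumptions above):
> >>
> >> SELECT 10 * 100 * 10                AS kb_per_dag_run, -- tasks * map indexes * fields * 1KB = ~10MB
> >>        6 * 24                       AS runs_per_day,   -- one run every 10 minutes
> >>        10 * 100 * 10 * 6 * 24 / 1e6 AS gb_per_day;     -- ~1.44 GB per day from a single DAG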
> >>
> >> > It is the opposite of what we have right now: we scan tables (maybe
> multiple times) and read all record tuples, which contain JSON.
> >>
> >> Not sure if I get the point here :). Yes - in my proposal I think the
> >> records will not be touched - only indexes. So the cleanup should be
> >> way faster and contention less of a problem; due to the way the delete
> >> uses < ordering, deadlocks will not be possible at all (as opposed to
> >> the current "not in", where there is a very easy way to get into deadlocks
> >> when two parallel deletes are trying to delete the same rows in a
> >> different sequence). I think my proposal improves all the
> >> characteristics of the "cleanup" with very little extra penalty on
> >> record creation.
> >>
> >> > We pay for table/index size linearly: more records, more size. But other
> operations vary and depend on the B-Tree implementation, and usually cost
> grows logarithmically. Or do we worry only about table/toast/index size on
> disk?
> >>
> >> Yep. I (and I believe the original author had the same worry) am
> >> worried a lot about the size of the table and the fact that this table
> >> will be by far the biggest table in our DB while most of the old
> >> records will never be touched. And by the fact that this is the only
> >> table that our users will have to know about to clean up separately
> >> from all others pretty much always. If not even worrying about money
> >> spent by our users, and performance degradation that comes with
> >> databases that are bigger - that's a lot of environmental effects that
> >> we might incur. Airflow is heavily used, if suddenly all our users
> >> will start having 10x bigger databases than they have now because we
> >> will deprecate the values and keep all the history, then we have a big
> >> number of extra disks that will have to be used. I'd strongly prefer a
> >> solution where we keep the data usage lower in this case.
> >>
> >> > If we do not want to grant users the ability to clean up the rendered
> templates table, there could be another option:
> >> > - Do not delete records on every task instance run.
> >> > - Delete once per defined period (hourly, daily, weekly, monthly). In
> this case you really would not need to care about locks.
> >>
> >> Yes, we could come up with a different strategy as to "when" to run the
> >> cleanup. This is also a viable option. If you can propose one that
> >> will be equally adaptive as the current solution, I am all ears.
> >> Basically my goal is to keep the usage of the table low, possibly
> >> controlled by the same parameter we had. How we do it - this is a
> >> different story. If we - for example - add a thread in the scheduler
> >> that performs such cleanup effectively in parallel and
> >> scales, I am happy with that.
> >>
> >> But I am trying to get into the head of the author to
> >> understand why the original implementation was done this way. I
> >> believe (and maybe those who remember it better could confirm it) that
> >> distributing the deletion to tasks, each cleaning up after itself, is a
> >> better idea than centralising the cleanup. This makes each cleanup
> >> smaller, locks are held for a shorter time (at least that was the
> >> assumption where no full table scan was used), it is more "immediate",
> >> and you do not have to decide upfront what the cleanup
> >> frequency should be. It seems this is the best logical approach to keep the
> >> "MAX_NUM_RENDERED_TI_FIELDS_PER_TASK" promise. Simply, after a task is
> >> complete, you can be sure that there are no more than this number of
> >> fields per task in the DB. With a scheduled run, that would be a much
> >> more "eventual" consistency, and it would potentially fluctuate
> >> much more.
> >>
> >> But there are risks involved in having a single thread doing the
> >> cleanup. I think it has a huge risk of being a "stop-the-world" and
> >> "deadlock-prone" kind of event - if in big instances there are a lot
> >> of rows to clean up in a single pass. When you delete entries from a
> >> table, anti-insert locks are created for the existing index entries,
> >> which makes it possible to roll back the whole DELETE transaction.
> >> This means that when you try to insert a row with the same index key,
> >> the insert will be held. And this means that when you run a single huge
> >> DELETE for multiple rows affected with multiple (all?) index keys
> >> matching the select query, it will effectively block new rows with the
> >> same indexes that match the SELECT. It would mean that if you
> >> have some tasks running while deleting existing run_id rendered
> >> fields, then you could REALLY start having deadlocks on those tasks
> >> trying to insert rendered task instance rows. That's why I think the
> >> only viable strategy for a single "cleanup" thread is to do such cleanup
> >> as a separate DELETE for each of the "dag/task/map_index/run" - in order
> >> to avoid such deadlocks. Which effectively will turn into what we have
> >> currently - only that currently those transactions are done by tasks,
> >> not by a single cleanup thread.
> >>
> >> Also using tasks to delete old rows is more "effective" when you have
> >> vast differences in the frequency of DAGs. Naturally - when you do it in the
> >> task, you will only do it "when needed" for a given DAG + task. If you
> >> try to centralize the cleanup, unless you somehow include the schedule and
> >> frequency of each dag, you are going to check every DAG every time
> >> you run the cleanup - no matter if that DAG is run daily or every
> >> minute, you will have to run the cleanup frequently enough to match
> >> your most frequent dags. If you have 1000 dags that run hourly and one
> >> DAG that runs every minute, then you have to run a cleanup job that
> >> scans all DAGs every few minutes. That's a big waste.
> >>
> >> So I am not sure if we gain anything by centralizing the cleanup.
> >> Decentralising it to the task seems to be a well-thought-out and sound
> >> decision (but I think the problem we have now is that we need to
> >> optimize it after Dynamic Task Mapping has been added).
> >>
> >> ANOTHER FINDING:
> >>
> >> While looking at the code and discussing it more closely, I
> >> **think** there is another problem that we have to fix regardless of the
> >> solution. I THINK a problem we might have now is that we do not
> >> include map_index in this DELETE. We currently delete all
> >> the rendered task fields without including map_index - and for big
> >> dynamic tasks, it means that exactly the same DELETE query is run by
> >> every single mapped instance of that task, and that is where a lot of
> >> contention and locking might happen (basically when a single task
> >> instance does the delete, anti-insert locks hold the other mapped
> >> instances of the same task back from inserting rendered fields).
> >>
> >> It does not change much in my optimisation proposal, other
> >> than that we should include map_index in those queries. But I think this
> >> might cause a lot of delays in the current implementation.
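> >>
> >> Concretely, the per-mapped-instance cleanup would be scoped like this
> >> (a sketch in the current "not in" shape, Postgres syntax, just to show
> >> the extra predicate; parameter names are made up):
> >>
> >> DELETE FROM rendered_task_instance_fields
> >> WHERE dag_id = %(dag_id)s
> >>   AND task_id = %(task_id)s
> >>   AND map_index = %(map_index)s      -- the predicate that is missing today
> >>   AND run_id NOT IN (
> >>     SELECT dag_run.run_id
> >>     FROM dag_run
> >>     JOIN rendered_task_instance_fields r
> >>       ON r.dag_id = dag_run.dag_id AND r.run_id = dag_run.run_id
> >>     WHERE r.dag_id = %(dag_id)s
> >>       AND r.task_id = %(task_id)s
> >>       AND r.map_index = %(map_index)s
> >>     ORDER BY dag_run.execution_date DESC
> >>     LIMIT %(num_to_keep)s
> >>   );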
> >>
> >> J.
> >>
> >> > ----
> >> > Best Wishes
> >> > Andrey Anshin
> >> >
> >> >
> >> >
> >> > On Mon, 30 Jan 2023 at 23:51, Jarek Potiuk <ja...@potiuk.com> wrote:
> >> >>
> >> >> I think there is a good reason to clean those up automatically.
> >> >> rendered task instance fields are almost arbitrary in size. If we try
> >> >> to keep all historical values there by default, there are numerous
> >> >> cases where it will grow very fast - far, far too quickly.
> >> >>
> >> >> And I am not worried at all about locks on this table if we do it the
> >> >> way I described it and it uses the indexes. The contention this way
> >> >> might only be between two deleting tasks, and with the query I
> >> >> proposed, the locks will only last for a short time - the index will be
> >> >> locked while the two DELETEs or the SELECT DISTINCT run - which should
> >> >> both be fast.
> >> >>
> >> >>
> >> >> On Mon, Jan 30, 2023 at 8:37 PM Andrey Anshin <
> andrey.anshin@taragol.is> wrote:
> >> >> >
> >> >> > I guess two things are involved in reducing the performance of this
> query over time: Dynamic Task Mapping, and run_id instead of execution date.
> >> >> >
> >> >> > I still personally think that changing the default value from 30
> to 0 might improve performance of multiple concurrent tasks, just because
> this query does not run and there are no locks on multiple records/pages.
> >> >> >
> >> >> > I do not have any proof (yet?) other than simple DAGs. I think
> that there is some crossover point where letting this table grow becomes worse
> than cleaning up on each TI run. But users have the ability to clean up the
> table by executing airflow db clean, which should improve performance again.
> >> >> >
> >> >> > And also there is interesting behavior with this query: if a user
> already has more rows than the value specified by
> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK and tries to run a backfill,
> then rendered templates are not written to the table (or may be inserted and
> immediately deleted afterwards); the same is valid for cleaning up old tasks.
> >> >> >
> >> >> > ----
> >> >> > Best Wishes
> >> >> > Andrey Anshin
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Sun, 29 Jan 2023 at 14:16, Jarek Potiuk <ja...@potiuk.com>
> wrote:
> >> >> >>
> >> >> >> Yep. Agree this is not an efficient query, and dynamic task mapping
> >> >> >> makes the effect much worse. Generally speaking, selecting "what
> >> >> >> should be left" and then deleting stuff where the key is "not in" is
> >> >> >> never an efficient way of running an SQL query. And the query not
> >> >> >> using an index at all makes it rather terrible.
> >> >> >>
> >> >> >> I think we should not deprecate it though, but find a more
> efficient
> >> >> >> way of deleting the old keys. I think we could slightly
> denormalize
> >> >> >> RenderedTaskInstance + DagRun tables, and add
> DAG_RUN_EXECUTION_DATE
> >> >> >> to the RenderedTaskInstance table and that will be enough to
> optimise
> >> >> >> it.
> >> >> >>
> >> >> >> Then we could have either:
> >> >> >>
> >> >> >> * a composite (non-unique) B-TREE index on DAG_ID, TASK_ID,
> >> >> >> RUN_ID_EXECUTION_DATE
> >> >> >> * or maybe even a regular HASH index on DAG_ID, TASK_ID and a separate
> >> >> >> (non-unique) B-TREE index on just RUN_ID_EXECUTION_DATE
> >> >> >>
> >> >> >> Probably the latter is better, as I am not sure how <, > comparisons
> >> >> >> work for composite B-TREE indexes when char + date columns are
> >> >> >> mixed. Also we could hit the infamous MySQL index key length
> >> >> >> limit.
> >> >> >>
> >> >> >> Then deletion process would look roughly like:
> >> >> >>
> >> >> >> 1) dag_run_execution_date = SELECT RUN_ID_EXECUTION_DATE FROM
> >> >> >> RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID = <DAG_ID>
> >> >> >> AND TASK_ID = <TASK_ID> GROUP BY RUN_ID_EXECUTION_DATE
> >> >> >> ORDER BY RUN_ID_EXECUTION_DATE DESC
> >> >> >> LIMIT 1 OFFSET <MAX_NUM_RENDERED_TI_FIELDS_PER_TASK>
> >> >> >> 2) DELETE FROM RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID = <DAG_ID>
> >> >> >> AND TASK_ID = <TASK_ID> AND RUN_ID_EXECUTION_DATE < dag_run_execution_date
> >> >> >>
> >> >> >> I believe that would be fast, and it would use the B-TREE index
> >> >> >> features nicely (ordering support)
> >> >> >>
> >> >> >> J
> >> >> >>
> >> >> >> On Sun, Jan 29, 2023 at 2:09 AM Andrey Anshin <
> andrey.anshin@taragol.is> wrote:
> >> >> >> >
> >> >> >> > First of all I want to highlight that this approach, I guess,
> worked well until Dynamic Task Mapping was introduced.
> >> >> >> >
> >> >> >> > > The main reason for adding that cleanup was -- if you don't
> do that, you will have many rows, similar to the TaskInstance table
> >> >> >> >
> >> >> >> > The problem itself is not how big your table/indexes are, but
> rather what kind of operations you run.
> >> >> >> >
> >> >> >> > > Do you have any data for locks or performance degradation?
> >> >> >> >
> >> >> >> > In this case, if we try to clean up the
> rendered_task_instance_fields table when a new TI is created/cleared, we
> make almost two full/sequential scans (note: need to check) against the
> table without any index usage, so we pay here a couple of times:
> >> >> >> > 1. We scan without indexes - not all parts of the composite key
> are included in the query, plus we need to filter out everything except 30
> records, with ordering and distinct
> >> >> >> > 2. After that we make another full scan to find 1 record or
> map_size records
> >> >> >> >
> >> >> >> > And I guess the situation becomes worse if you have a lot of
> tasks; even with a small table, we have to do inefficient operations.
> >> >> >> >
> >> >> >> > This is what the query plan looks like (please note that without
> committing the transaction, the DELETE operation doesn't show all information):
> https://gist.github.com/Taragolis/3ca7621c51b00f077aa1646401ddf31b
> >> >> >> >
> >> >> >> > In case we do not clean up the table, we only use these
> operations:
> >> >> >> > 1. SELECT single record by index
> >> >> >> > 2. INSERT new record
> >> >> >> > 3. DELETE old record(s), which were found by index.
> >> >> >> >
> >> >> >> > I have not done any real tests yet, only synthetic DAGs (so we
> should not treat any findings as the absolute truth):
> https://gist.github.com/Taragolis/6eec9f81efdf360c4239fc6ea385a480
> >> >> >> > DAG with parallel tasks: degradation up to 2-3 times
> >> >> >> > DAG with single map tasks: degradation up to 7-10 times
> >> >> >> >
> >> >> >> > I have a plan for more complex and closer-to-real use cases
> with a database whose network latency is not almost 0, as it is on my
> local machine.
> >> >> >> > But I will not refuse if someone also does their tests with
> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK=0 vs default value.
> >> >> >> >
> >> >> >> > About deadlocks: we know they exist at least in MySQL:
> https://github.com/apache/airflow/pull/18616
> >> >> >> >
> >> >> >> > > And the larger tables create problems during database
> migrations.
> >> >> >> >
> >> >> >> > That is a very good point, so if we find that the problem is only
> related to migrations, we could:
> >> >> >> > 1. Clean up this table in the migration
> >> >> >> > 2. Add a CLI command to airflow db which could clean up only
> rendered fields, so it would be the user's choice to clean up (or not) before a
> migration, and to do periodic maintenance or not
> >> >> >> >
> >> >> >> >
> >> >> >> > ----
> >> >> >> > Best Wishes
> >> >> >> > Andrey Anshin
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > On Sat, 28 Jan 2023 at 23:41, Kaxil Naik <ka...@gmail.com>
> wrote:
> >> >> >> >>>
> >> >> >> >>> Potentially it is a good idea to deprecate this option and
> recommend that users set it to 0? WDYT? Maybe someone has already tried
> or investigated this?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> The main reason for adding that cleanup was -- if you don't do
> that, you will have many rows, similar to the TaskInstance table. And the
> RenderedTIFields were mainly added for checking rendered TI fields on the
> Webserver only because after DAG Serialization, the webserver won't have
> access to DAG files.
> >> >> >> >>
> >> >> >> >> And the larger tables create problems during database
> migrations.
> >> >> >> >>
> >> >> >> >> Do you have any data for locks or performance degradation?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Sat, 28 Jan 2023 at 13:06, Andrey Anshin <
> andrey.anshin@taragol.is> wrote:
> >> >> >> >>>
> >> >> >> >>> Greetings!
> >> >> >> >>>
> >> >> >> >>> While migrating our ORM syntax to be compatible with SQLAlchemy
> 2.0, I probably found some skeletons in the closet.
> >> >> >> >>>
> >> >> >> >>> Let's start from the beginning: initially I got this warning
> >> >> >> >>>
> >> >> >> >>> airflow/models/renderedtifields.py:245
> RemovedIn20Warning('ORDER BY columns added implicitly due to DISTINCT is
> deprecated and will be removed in SQLAlchemy 2.0.  SELECT statements with
> DISTINCT should be written to explicitly include the appropriate columns in
> the columns clause (Background on SQLAlchemy 2.0 at:
> https://sqlalche.me/e/b8d9)')
> >> >> >> >>>
> >> >> >> >>> "OK let's fix it!", I thought at first and started to
> investigate RenderedTaskInstanceFields model
> >> >> >> >>>
> >> >> >> >>> Skeleton #1:
> >> >> >> >>>
> >> >> >> >>> When I first looked at the code and comments, it got me
> thinking that the part which keeps only the latest N rendered task fields
> could potentially lead to various kinds of performance degradation (locks,
> deadlocks, data bloat): see the code
> https://github.com/apache/airflow/blob/6c479437b1aedf74d029463bda56b42950278287/airflow/models/renderedtifields.py#L185-L245
> >> >> >> >>>
> >> >> >> >>> Also, this historical part (from Airflow 1.10.10) generates
> this SQL statement (pg backend):
> >> >> >> >>>
> >> >> >> >>> DELETE FROM rendered_task_instance_fields
> >> >> >> >>> WHERE rendered_task_instance_fields.dag_id = %(dag_id_1) s
> >> >> >> >>>   AND rendered_task_instance_fields.task_id = %(task_id_1) s
> >> >> >> >>>   AND (
> >> >> >> >>>     (
> >> >> >> >>>       rendered_task_instance_fields.dag_id,
> >> >> >> >>>       rendered_task_instance_fields.task_id,
> >> >> >> >>>       rendered_task_instance_fields.run_id
> >> >> >> >>>     ) NOT IN (
> >> >> >> >>>       SELECT
> >> >> >> >>>         anon_1.dag_id,
> >> >> >> >>>         anon_1.task_id,
> >> >> >> >>>         anon_1.run_id
> >> >> >> >>>       FROM
> >> >> >> >>>         (
> >> >> >> >>>           SELECT DISTINCT
> >> >> >> >>>             rendered_task_instance_fields.dag_id AS dag_id,
> >> >> >> >>>             rendered_task_instance_fields.task_id AS task_id,
> >> >> >> >>>             rendered_task_instance_fields.run_id AS run_id,
> >> >> >> >>>             dag_run.execution_date AS execution_date
> >> >> >> >>>           FROM rendered_task_instance_fields
> >> >> >> >>>             JOIN dag_run ON
> rendered_task_instance_fields.dag_id = dag_run.dag_id
> >> >> >> >>>             AND rendered_task_instance_fields.run_id =
> dag_run.run_id
> >> >> >> >>>           WHERE
> >> >> >> >>>             rendered_task_instance_fields.dag_id =
> %(dag_id_2) s
> >> >> >> >>>             AND rendered_task_instance_fields.task_id =
> %(task_id_2) s
> >> >> >> >>>           ORDER BY
> >> >> >> >>>             dag_run.execution_date DESC
> >> >> >> >>>           limit %(param_1) s
> >> >> >> >>>         ) AS anon_1
> >> >> >> >>>     )
> >> >> >> >>>   )
> >> >> >> >>>
> >> >> >> >>> This is especially not effective in PostgreSQL: an IN
> SUBQUERY can easily be transformed internally into a SEMI-JOIN (aka EXISTS
> clause), but that does not work for a NOT IN SUBQUERY, because it is not
> transformed into an ANTI JOIN (aka NOT EXISTS clause) even when possible,
> see: https://commitfest.postgresql.org/27/2023/
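> >> >> >> >>>
> >> >> >> >>> For comparison, a hand-written ANTI-JOIN form of the same cleanup
> >> >> >> >>> (a sketch only - Postgres can plan NOT EXISTS as an anti join,
> >> >> >> >>> which it will not do for NOT IN):
> >> >> >> >>>
> >> >> >> >>> DELETE FROM rendered_task_instance_fields rtif
> >> >> >> >>> WHERE rtif.dag_id = %(dag_id)s
> >> >> >> >>>   AND rtif.task_id = %(task_id)s
> >> >> >> >>>   AND NOT EXISTS (
> >> >> >> >>>     SELECT 1
> >> >> >> >>>     FROM (
> >> >> >> >>>       -- the N most recent runs to keep for this task
> >> >> >> >>>       SELECT r.run_id
> >> >> >> >>>       FROM rendered_task_instance_fields r
> >> >> >> >>>       JOIN dag_run ON r.dag_id = dag_run.dag_id
> >> >> >> >>>         AND r.run_id = dag_run.run_id
> >> >> >> >>>       WHERE r.dag_id = %(dag_id)s
> >> >> >> >>>         AND r.task_id = %(task_id)s
> >> >> >> >>>       GROUP BY r.run_id, dag_run.execution_date
> >> >> >> >>>       ORDER BY dag_run.execution_date DESC
> >> >> >> >>>       LIMIT %(param_1)s
> >> >> >> >>>     ) keep
> >> >> >> >>>     WHERE keep.run_id = rtif.run_id
> >> >> >> >>>   );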
> >> >> >> >>>
> >> >> >> >>> I didn't do any performance benchmarks yet, but I guess that if
> users set AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK to 0 rather
> than the default 30, it could improve performance and reduce the number of
> deadlocks; the table size will increase, but we don't do any
> maintenance job for the other tables anyway.
> >> >> >> >>>
> >> >> >> >>> Potentially it is a good idea to deprecate this option and
> recommend that users set it to 0? WDYT? Maybe someone has already tried
> or investigated this?
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> Skeleton #2:
> >> >> >> >>>
> >> >> >> >>> We have a k8s_pod_yaml field which is exclusively used by K8S
> executors.
> >> >> >> >>>
> >> >> >> >>> Should we also decouple this field as part of AIP-51?
> >> >> >> >>>
> >> >> >> >>> ----
> >> >> >> >>> Best Wishes
> >> >> >> >>> Andrey Anshin
> >> >> >> >>>
>

Re: [Discussion] Deprecate auto cleanup RenderedTaskInstanceFields and decouple k8s_pod_yaml

Posted by Andrey Anshin <an...@taragol.is>.
Just for the record, I am still trying to create representative experiments.
Last time, when I ran a DAG with pretty big rendered templates (about 10KB per
TI) and 10k TIs in total, I found after a couple of days of execution that
Postgres compressed my dummy rendered templates very efficiently, and they
never moved into the TOAST table (facepalm).
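
One way to avoid that in the next run (an assumption on my side: md5 of
random() gives text with no repeats, which pglz cannot compress, so ~10KB
payloads should get pushed out to TOAST):

SELECT string_agg(md5(random()::text), '') AS payload
FROM generate_series(1, 320); -- 320 * 32 hex chars ~ 10KB of incompressible text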

----
Best Wishes
*Andrey Anshin*



On Thu, 2 Feb 2023 at 05:58, Kaxil Naik <ka...@gmail.com> wrote:

> Maybe the “auto-maintenance” is true. Hopefully for not long though :)
>
> On Thu, 2 Feb 2023 at 01:41, Kaxil Naik <ka...@gmail.com> wrote:
>
>>
>> >Users often asked (slack, issues, discussions) to give the ability to
>>> auto-maintain tables/logs and receive the usual answer: "No, we don't give
>>> you this opportunity, please use something for Airflow Ecosystem Page". But
>>> on the other hand we have auto-maintenance only for a single table.
>>
>>
>> That is not true, we added "airflow db clean" command to give this
>> control to users.
>>
>> What about these tables: session, log, job? I expected the answer would
>>> be "They are not so specific."
>>
>> The RTIF can vary in size based on what is stored and is really useless
>> beyond last X for each task. RTIF also has a foreign key constraints
>> <https://github.com/apache/airflow/blob/d80b583db07197c8c3d0549a805e83ceaaf10d52/airflow/models/renderedtifields.py#L63-L73> on
>> TI and that is the most common table that is affected during the migrations
>> and can affect scheduling decisions. Log and Session tables aren't affected
>> by it.
>>
>> I will have a detailed reply if I manage to find the time, it has just
>> been too difficult
>>
>> On Tue, 31 Jan 2023 at 21:46, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> BTW, I would really love to hear what the original authors have to say
>>> here. I am merely trying to put myself in their shoes and guess what
>>> the reasoning is :).
>>>
>>> I think this is really a question of: Do we want to keep all rendered
>>> arbitrary size rendered task instance fields in our database forever
>>> by default, same as other fields.
>>> I believe the original authors answered the question to be "no". And
>>> the "num_row" was a way to limit it.
>>>
>>> And I really do not want to "win" this argument, I just want to
>>> protect our users (and environment).
>>>
>>> There is (IMHO) currently a big difference between
>>> session/logg/task_instance fields and rendered task_instance fields
>>> that justify different behaviour.
>>>
>>> The former are generally fixed in max size of rows (this is one of the
>>> reasons we have limited string sizes in our DB) - to be able to limit
>>> them growing uncontrollably large. We simply do not keep arbitrary
>>> size data in those tables.
>>> On the other hand, the rendered task instance is arbitrary in size
>>> (JSONField) and the need for deletion accounts for the "worst" case
>>> scenario.
>>> Until we get rid of that "property" of the rendered task instance
>>> table, I think "just" skipping deletion of those fields without fixing
>>> the "worst" case scenario is not a good idea.
>>>
>>> Maybe in your test cases (and many others) those tables are not
>>> bigger, but I think the protection here is implemented to account for
>>> the case where the rendered task instance field is "big". I think the
>>> protection here is done for the cases where the rendered task instance
>>> fields are really "big".
>>>
>>> But very interestingly - if the rendered task instance is "big" then
>>> likely it is next to useless to be displayed in the Web UI in its
>>> entirety.
>>>
>>> So maybe you are actually right Andrey, Maybe we can skip deleting
>>> those and maybe we could solve it differently and apply the same rules
>>> as other tables?
>>>
>>> Let me then - constructively - propose another idea which actually
>>> might solve both yours and my own concerns. Maybe we can fix the
>>> "worst case" scenario differently? We do not have to limit the number
>>> of rows, we can limit the size of the row instead.
>>>
>>> Why don't we simply limit the size of the rendered task instance JSON
>>> and if they are too big (we can configure the maximum size), we will
>>> render something like (probably a bit more unique and not
>>> "accidentally triggerable"):
>>>
>>> {
>>>    "error": "This task instance has too large rendered task instances
>>> to display it"
>>> }
>>>
>>> And implement an escape hatch in the web server to handle it properly
>>> when displaying such "truncated" rendered task instance field.
>>>
>>> We should be able come up with a sensible max size that we might think
>>> makes sense when rendering it in the web UI.  And we could make the
>>> max size configurable by the user if they have a lot of GB to spare.
>>> We could even apply it automatically. If there is a max_num_row_limit
>>> - we allow any size, if we have no limit on the number of rows, we
>>> limit the maximum row size.
>>>
>>> If we have such "upper-bounded" reasonable size of each row in that
>>> table - then I am perfectly happy with not deleting the rows
>>> automatically.
>>> But only if we limit that and handle the "worst" case properly.
>>>
>>> That would be my proposal how we can handle it to get both views taken
>>> into account.
>>>
>>> What do you Andrey (and others) think? Does it make sense? Or do we
>>> think we should not have any such protections in place ?
>>>
>>> J.
>>>
>>>
>>> On Tue, Jan 31, 2023 at 9:43 PM Andrey Anshin <an...@taragol.is>
>>> wrote:
>>> >
>>> > > I think that is all something to be tested with explaining plans.
>>> >
>>> > I would be really careful with these results. DELETE in Transaction
>>> with Rollback usually shows more optimistic than actually executed.
>>> >
>>> >
>>> > > I think we would not know before we try - and possibly there are
>>> other
>>> > optimisation approaches. The optimisation I proposed was only first
>>> > that came to my mind to avoid the "not in" query. The problem  with
>>> > "not in query" is that there is no way to optimise it by the DB.
>>> > Effectively you have to get every record (or index entry) and test it.
>>> > Maybe it can be done better :). And yes locking the index with
>>> > anti-insert locks and the need to rebalance trees during the delete is
>>> > a concern.
>>> >
>>> > My point is still the same, I would rather remove it in the future or
>>> make policy about maintenance more consistent: all or nothing. Right now we
>>> are close to nothing rather than all.
>>> >
>>> >
>>> > > It's not about index size or JSON access. It is about the size of the
>>> > actual rows and storage it takes - i.e. general size of the database.
>>> >
>>> > I'm tired, but I'm not sure that I understand the actual point.
>>> > Is it not really a matter of size of the table if you always access by
>>> pattern:
>>> > 1 request which returns exactly 1 record accessed by a unique index.
>>> > Basically query travercial by index find reference to single
>>> CTID/rowid (or whatever name used in other RDBMS).
>>> > So at this point it really matters how fast your index grows rather
>>> than table size.
>>> >
>>> >
>>> > > The problem with it is that (especially with dynamic task mapping),
>>> it
>>> > might grow really, really fast. Basically you have NUM_DAGS x
>>> > NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS  * NUM_RUNS number
>>> > of records there.
>>> > Back-of-the envelope calculation Assuming you have a
>>> > DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
>>> > fields, each field evaluating to 1K string.  Then you have 10 tasks *
>>> > 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
>>> > per one(!) run of one(1) DAG. Run it every 10 minutes and every day
>>> > your database from a single DAG grows by whooping 1.44 GB of data
>>> > every single day (from single DAG).
>>> >
>>> > Depends on DB. If we talk about Postgres you could easily miss up to
>>> 3-4 times (thanks for inline compression before/instead of TOASTs).
>>> > I have a couple questions:
>>> > 1. Do you know how big would be indexes in
>>> rendered_task_instance_fields after one day? (Spoiler alert I could give
>>> estimation in the morning)
>>> > 2. In this scenario with default settings always would keep up to 30
>>> 000 TI for this DAG.
>>> > Could someone who wants to optimize the query make it more optimal
>>> rather than access the table by index (Unique Index -> ctid/rowid - record)
>>> and where this crosspoint?
>>> >
>>> > > This is of course an estimation that assumes a lot, but it's not at
>>> all unrealistic.
>>> >
>>> > 144,000 TI per/day on single DAG (initially I want to put here some
>>> sarcastic message).
>>> > How would Airflow feel with 144k Task Instances per day? How
>>> > I asked because right now I've always had a task_instance table bigger
>>> than rendered_task_instance_fields.
>>> >
>>> >
>>> > > This table is very specific compared with the other tables. The only
>>> reason for it being here is to be able
>>> > to show the rendered fields in the UI if you go to the specific run of
>>> a task. If you clean-up other tables you basically lose the history of
>>> execution of the tasks and you cannot really know if the data has been
>>> > processed, you cannot do backfills effectively, you lose all the
>>> > context. Cleaning this table is merely about the values that have been
>>> > rendered for a specific run and the assumption there is that the older
>>> > it gets, the less interesting it is.
>>> >
>>> > What about these tables: session, log, job? I expected the answer
>>> would be "They are not so specific."
>>> > For me every table is specific for their purpose.
>>> > Users often asked (slack, issues, discussions) to give the ability to
>>> auto-maintain tables/logs and receive the usual answer: "No, we don't give
>>> you this opportunity, please use something for Airflow Ecosystem Page". But
>>> on the other hand we have auto-maintenance only for a single table.
>>> >
>>> > >> It is opposite of what we have right now, we scan tables (maybe
>>> multiple times), read all records tuples which contain JSON.
>>> > > Not sure if I get the point here :). Yes -in my proposal I think the
>>> records will not be touched - only indexes. So the cleanup should be way
>>> faster, contentions less of problem, due to the way the delete
>>> > uses < ordering, deadlocks will not be possible at all (as opposed to
>>> the current "not in" - there is a very easy way to get into deadlocks when
>>> two parallel deletes are trying to delete same rows in a different
>>> sequence. I think my proposal improves all the characteristics of the
>>> "cleanup" with very little extra penalty on record creation.
>>> >
>>> > I was talking about the current solution and why it is also slow (and
>>> if abstract users use some DBaaS where they also pay for IOPs then it is
>>> expensive). Let's talk about the benefits of optimised queries for 4
>>> different DB backends (3 if excluding SQLite) when we have it.
>>> >
>>> > > We pay for table/index size linary more records, more size. But
>>> other operations vary and depend on B-Tree implementation and usually it
>>> has logarithmic growth. Or do we worry only about table/toast/index size on
>>> disk?
>>> > >> Yep. I (and I believe the original author had the same worry) am
>>> worried a lot about the size of the table and the fact that this table will
>>> be by far the biggest table in our DB while most of the old records will
>>> never be touched. And by the fact that this is the only table that our
>>> users will have to know about to clean up separately from all others pretty
>>> much always.
>>> >
>>> > Same as previous.
>>> >
>>> > > If not even worrying about money spent by our users, and performance
>>> degradation that comes with databases that are bigger - that's a lot of
>>> environmental effects that we might incur. Airflow is heavily used, if
>>> suddenly all our users
>>> > will start having 10 bigger databases that they have now because we
>>> will deprecate the values and keep all the history, then we have a big
>>> number of extra disks that will have to be used. I'd strongly prefer a
>>> solution where we keep the data usage lower in this case.
>>> >
>>> > Am I right that this is all about "lets don't delete by default as we
>>> do for other tables" rather than the current default implementation?
>>> > Because I get the result which is opposite what you said. And
>>> rendered_task_instance_fields don't grow faster than other tables that what
>>> I got.
>>> > I would like to compare it with other findings and some reproducible
>>> metrics rather than with hypothetical things.
>>> >
>>> > ----
>>> > Best Wishes
>>> > Andrey Anshin
>>> >
>>> >
>>> >
>>> > On Tue, 31 Jan 2023 at 11:12, Jarek Potiuk <ja...@potiuk.com> wrote:
>>> >>
>>> >> COMMENT: While writing the answer here, I think I found a deeper
>>> >> problem (and optimisation needed)  - i.e I think the delete should be
>>> >> even more fine-grained than it is today and include map_index) -
>>> >> please take a look at the end (Also maybe TP might comment on that
>>> >> one).
>>> >>
>>> >> > 1. Additional indexes add additional performance degradation on
>>> Insert but gain potential improvements on delete and unknown on update,
>>> RDBMS still require rebalance index and make it consistent to the table.
>>> >> > 2. LIMIT x OFFSET y could easily become full seq scan, especially
>>> if the user set a huge number for offset (which? unknown).
>>> >> > 3. Mixing two indexes could improve performance in a single query
>>> but in concurrent execution might lead to degradation because it needs to
>>> create a bitmap table for comparison between these two indexes, as result
>>> it might lead different issues, such as OOM on DB backend, use swaps or
>>> optimiser decided that better not to use this indexes.
>>> >>
>>> >> I think that is all something to be tested with explain plans. I think
>>> >> we would not know before we try - and possibly there are other
>>> >> optimisation approaches. The optimisation I proposed was only first
>>> >> that came to my mind to avoid the "not in" query. The problem  with
>>> >> "not in query" is that there is no way to optimise it by the DB.
>>> >> Effectively you have to get every record (or index entry) and test it.
>>> >> Maybe it can be done better :). And yes locking the index with
>>> >> anti-insert locks and the need to rebalance trees during the delete is
>>> >> a concern.
>>> >>
>>> >> > Is it a real problem? Until we access only by indexes, which
>>> doesn't include this JSON, it really doesn't matter. I guess we almost
>>> always should make a UNIQUE INDEX SCAN for SELECT or DELETE (UPDATE) a
>>> single record.
>>> >>
>>> >> Yes I think so, and while. I was not the author of this "cleanup"
>>> >> code, I believe I know the intention.
>>> >>
>>> >> It's not about index size or JSON access. It is about the size of the
>>> >> actual rows and storage it takes - i.e. general size of the database.
>>> >> The problem with it is that (especially with dynamic task mapping), it
>>> >> might grow really, really fast. Basically you have NUM_DAGS x
>>> >> NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS  * NUM_RUNS number
>>> >> of records there. Back-of-the envelope calculation Assuming you have a
>>> >> DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
>>> >> fields, each field evaluating to 1K string.  Then you have 10 tasks *
>>> >> 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
>>> >> per one(!) run of one(1) DAG. Run it every 10 minutes and every day
>>> >> your database from a single DAG grows by whooping 1.44 GB of data
>>> >> every single day (from single DAG).This is of course an estimation
>>> >> that assumes a lot, but it's not at all unrealistic. That's a lot. And
>>> >> if we want the user to do the cleanup then a) they need to know it b)
>>> >> they need to specifically clean up this table only because all the
>>> >> other data is relatively small. This table is very specific compared
>>> >> with the other tables. The only reason for it being here is to be able
>>> >> to show the rendered fields in the UI if you go to the specific run of
>>> >> a task. If you clean-up other tables you basically lose the history of
>>> >> execution of the tasks and you cannot really know if the data has been
>>> >> processed, you cannot do backfills effectively, you lose all the
>>> >> context. Cleaning this table is merely about the values that have been
>>> >> rendered for a specific run and the assumption there is that the older
>>> >> it gets, the less interesting it is.
>>> >>
>>> >> > It is opposite of what we have right now, we scan tables (maybe
>>> multiple times), read all records tuples which contain JSON.
>>> >>
>>> >> Not sure if I get the point here :). Yes -in my proposal I think the
>>> >> records will not be touched - only indexes. So the cleanup should be
>>> >> way faster, contentions less of problem, due to the way the delete
>>> >> uses < ordering, deadlocks will not be possible at all (as opposed to
>>> >> the current "not in" - there is a very easy way to get into deadlocks
>>> >> when two parallel deletes are trying to delete same rows in a
>>> >> different sequence. I think my proposal improves all the
>>> >> characteristics of the "cleanup" with very little extra penalty on
>>> >> record creation.
>>> >>
>>> >> > We pay for table/index size linary more records, more size. But
>>> other operations vary and depend on B-Tree implementation and usually it
>>> has logarithmic growth. Or do we worry only about table/toast/index size on
>>> disk?
>>> >>
>>> >> Yep. I (and I believe the original author had the same worry) am
>>> >> worried a lot about the size of the table and the fact that this table
>>> >> will be by far the biggest table in our DB while most of the old
>>> >> records will never be touched. And by the fact that this is the only
>>> >> table that our users will have to know about to clean up separately
>>> >> from all others pretty much always. If not even worrying about money
>>> >> spent by our users, and performance degradation that comes with
>>> >> databases that are bigger - that's a lot of environmental effects that
>>> >> we might incur. Airflow is heavily used, if suddenly all our users
>>> >> will start having 10 bigger databases that they have now because we
>>> >> will deprecate the values and keep all the history, then we have a big
>>> >> number of extra disks that will have to be used. I'd strongly prefer a
>>> >> solution where we keep the data usage lower in this case.
>>> >>
>>> >> > If we do not want to grant users the ability to clean up rendered
>>> templates tables, there could be another option:
>>> >> > - Do not delete records on every task instance run.
>>> >> > - Delete once per defined period (hourly, daily, weekly, monthly).
>>> In this case you really could not care about locks.
>>> >>
>>> >> Yes we could come up with a different strategy as to "when" run the
>>> >> cleanup. This is also a viable option. If you can propose one that
>>> >> will be equally adaptive as the current solution, I am all ears.
>>> >> Basically my goal is to keep the usage of the table low, possibly
>>> >> controlled by the same parameter we had. How we do it - this is a
>>> >> different story. If we - for example add a thread in the scheduler
>>> >> (for example) that performs such cleanup effectively in parallel and
>>> >> scales, I am happy with that.
>>> >>
>>> >> But I am trying to get into the head of the author trying to
>>> >> understand why the original implementation was done this way. I
>>> >> believe (and maybe those who remember it better could confirm it) that
>>> >> distributing the deletion to tasks to clean up after itself is a
>>> >> better idea than centralising the cleanup. This makes each cleanup
>>> >> smaller, locks are held for a shorter time (at least that was the
>>> >> assumption where no full table scan was used), it is more "immediate"
>>> >> and you do not have to decide upfront what should be the cleanup
>>> >> frequency. It seems this is the best logical approach to keep the
>>> >> "MAX_NUM_RENDERED_TI_FIELDS_PER_TASK" promise. Simply after task is
>>> >> complete, you can be sure that there are no more than this number of
>>> >> fields per task in the DB. With a scheduled run, that would be a much
>>> >> more "eventual" consistency and it will be potentially fluctuating
>>> >> much more.
>>> >>
>>> >> But there are risks involved in having a single thread doing the
>>> >> cleanup. I think it has a huge risk of being a "stop-the-world" and
>>> >> "deadlock-prone" kind of event - if in big instances there are a lot
>>> >> of rows to clean up in a single pass. When you delete entries from a
>>> >> table, anti-insert locks are created for the affected index entries
>>> >> (this is what makes it possible to roll back the whole DELETE
>>> >> transaction). That means an attempt to insert a row with the same
>>> >> index key will block. And this means that when you run a single huge
>>> >> DELETE affecting many (all?) index keys matching the select query, it
>>> >> will effectively prevent inserts of new rows whose indexes match the
>>> >> SELECT. So if you have some tasks running while existing run_id
>>> >> rendered fields are being deleted, you could REALLY start having
>>> >> deadlocks on those tasks trying to insert rendered task instance
>>> >> rows. That's why I think the only viable strategy for a single
>>> >> "cleanup" thread is to do such cleanup as a separate DELETE for each
>>> >> "dag/task/map_index/run" - in order to avoid such deadlocks. Which
>>> >> effectively turns into what we have currently - only that currently
>>> >> those transactions are done by tasks,
>>> >> not by a single cleanup thread.
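>>> >>
>>> >> As a rough illustration only (a hypothetical SQL sketch, not the
>>> >> actual implementation - the parameter names are made up), such a
>>> >> fine-grained cleanup would issue one small DELETE per key instead of
>>> >> one big one:
>>> >>
>>> >> -- one short transaction per stale dag/task/map_index/run key,
>>> >> -- so each DELETE only touches index entries for a single key
>>> >> DELETE FROM rendered_task_instance_fields
>>> >> WHERE dag_id = %(dag_id)s
>>> >>   AND task_id = %(task_id)s
>>> >>   AND map_index = %(map_index)s
>>> >>   AND run_id = %(run_id)s;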
>>> >>
>>> >> Also, using tasks to delete old rows is more "effective" when you
>>> >> have vast differences in DAG frequency. Naturally - when you do it in
>>> >> a task, you will only do it "when needed" for the given DAG + Task. If
>>> >> you try to centralize the cleanup, unless you somehow include the
>>> >> schedule and frequency of each dag, you are going to check every DAG
>>> >> every time you run the cleanup - no matter if that DAG runs daily or
>>> >> every minute, you will have to run the cleanup frequently enough to
>>> >> match your most frequent dags. If you have 1000 dags that run hourly
>>> >> and one DAG that runs every minute, then you have to run a cleanup job
>>> >> that scans all DAGs every few minutes. That's a big waste.
>>> >>
>>> >> So I am not sure if we gain anything by centralizing the cleanup.
>>> >> Decentralising it to the Task seems to be a well-thought-out and
>>> >> sound decision (but I think the problem we have now is that we need to
>>> >> optimize it after the addition of Dynamic Task Mapping).
>>> >>
>>> >> ANOTHER FINDING:
>>> >>
>>> >> While looking at the code, discussing it, and looking more closely, I
>>> >> **think** there is another problem that we have to fix regardless of
>>> >> the solution. I THINK a problem we might have now is that we do not
>>> >> include map_index in this DELETE. We currently delete all
>>> >> the rendered task fields without including map_index - and for big
>>> >> dynamic tasks, it means that exactly the same DELETE query is run by
>>> >> every single mapped instance of that task, and that is where a lot of
>>> >> contention and locking might happen (basically, when a single task
>>> >> instance does the delete, anti-insert locks hold the other mapped
>>> >> instances of the same task back from inserting rendered fields).
>>> >>
>>> >> It does not change much in my optimisation proposal, other
>>> >> than that we should include map_index in those queries. But I think
>>> >> this might cause a lot of delays in the current implementation.
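>>> >>
>>> >> For illustration only - a sketch (not the current code) of the cleanup
>>> >> query scoped to a single mapped instance; map_index is already part of
>>> >> the RTIF primary key, the predicate is just missing from the DELETE
>>> >> today:
>>> >>
>>> >> DELETE FROM rendered_task_instance_fields
>>> >> WHERE dag_id = %(dag_id)s
>>> >>   AND task_id = %(task_id)s
>>> >>   AND map_index = %(map_index)s  -- added, so mapped instances do not block each other
>>> >>   AND run_id NOT IN (
>>> >>     SELECT r.run_id
>>> >>     FROM rendered_task_instance_fields r
>>> >>       JOIN dag_run dr
>>> >>         ON r.dag_id = dr.dag_id AND r.run_id = dr.run_id
>>> >>     WHERE r.dag_id = %(dag_id)s
>>> >>       AND r.task_id = %(task_id)s
>>> >>       AND r.map_index = %(map_index)s
>>> >>     ORDER BY dr.execution_date DESC
>>> >>     LIMIT %(num_to_keep)s
>>> >>   );
>>> >>
>>> >> (Postgres-flavoured; MySQL cannot select from the table being deleted
>>> >> in a subquery, so it would need a different form there.)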
>>> >>
>>> >> J.
>>> >>
>>> >> > ----
>>> >> > Best Wishes
>>> >> > Andrey Anshin
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, 30 Jan 2023 at 23:51, Jarek Potiuk <ja...@potiuk.com>
>>> wrote:
>>> >> >>
>>> >> >> I think there is a good reason to clean those up automatically.
>>> >> >> Rendered task instance fields are almost arbitrary in size. If we try
>>> >> >> to keep all historical values there by default, there are numerous
>>> >> >> cases where it will grow very fast - far, far too quickly.
>>> >> >>
>>> >> >> And I am not worried at all about locks on this table if we do it the
>>> >> >> way I described and it uses the indexes. The contention this way
>>> >> >> might only be between two deleting tasks. And with the query I
>>> >> >> proposed, the locks will only last for a short time - the index will
>>> >> >> be locked during the two DELETEs or the SELECT DISTINCT - which
>>> >> >> should both be fast.
>>> >> >>
>>> >> >>
>>> >> >> On Mon, Jan 30, 2023 at 8:37 PM Andrey Anshin <
>>> andrey.anshin@taragol.is> wrote:
>>> >> >> >
>>> >> >> > I guess two things are involved in reducing the performance of this
>>> >> >> > query over time: Dynamic Task Mapping and run_id instead of the
>>> >> >> > execution date.
>>> >> >> >
>>> >> >> > I still personally think that changing the default value from 30 to
>>> >> >> > 0 might improve the performance of multiple concurrent tasks, just
>>> >> >> > because this query does not run and there are no locks on multiple
>>> >> >> > records/pages.
>>> >> >> >
>>> >> >> > I do not have any proof (yet?) other than simple DAGs. I think that
>>> >> >> > there is some crossover point where keeping this table growing
>>> >> >> > becomes worse than cleaning up on each TI run. But users have the
>>> >> >> > ability to clean up the table by executing airflow db clean, which
>>> >> >> > should improve performance again.
>>> >> >> >
>>> >> >> > And also there is interesting behavior with this query: if a user
>>> >> >> > already has more records than the value specified by
>>> >> >> > AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK and tries to run
>>> >> >> > a backfill, then rendered templates are not written to the table
>>> >> >> > (or may be inserted and immediately deleted afterwards); the same is
>>> >> >> > valid for clearing old tasks.
>>> >> >> >
>>> >> >> > ----
>>> >> >> > Best Wishes
>>> >> >> > Andrey Anshin
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Sun, 29 Jan 2023 at 14:16, Jarek Potiuk <ja...@potiuk.com>
>>> wrote:
>>> >> >> >>
>>> >> >> >> Yep. Agree this is not an efficient query, and dynamic task mapping
>>> >> >> >> makes the effect much worse. Generally speaking, selecting "what
>>> >> >> >> should be left" and then deleting stuff where the key is "not in" is
>>> >> >> >> never an efficient way of running an SQL query. And the query not
>>> >> >> >> using an index at all makes it rather terrible.
>>> >> >> >>
>>> >> >> >> I think we should not deprecate it though, but find a more efficient
>>> >> >> >> way of deleting the old keys. I think we could slightly denormalize
>>> >> >> >> the RenderedTaskInstance + DagRun tables and add DAG_RUN_EXECUTION_DATE
>>> >> >> >> to the RenderedTaskInstance table, and that will be enough to
>>> >> >> >> optimise it.
>>> >> >> >>
>>> >> >> >> Then we could have either:
>>> >> >> >>
>>> >> >> >> * a composite B-TREE (non-unique) index on DAG_ID, TASK_ID,
>>> >> >> >> RUN_ID_EXECUTION_DATE
>>> >> >> >> * or maybe even a regular HASH index on DAG_ID, TASK_ID and a
>>> >> >> >> separate B-TREE (non-unique) index on just RUN_ID_EXECUTION_DATE
>>> >> >> >>
>>> >> >> >> Probably the latter is better, as I am not sure how <, > comparisons
>>> >> >> >> behave for composite B-TREE indexes when char + date columns are
>>> >> >> >> mixed. Also, we could hit the infamous MySQL index key length
>>> >> >> >> limit.
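>>> >> >> >>
>>> >> >> >> As a sketch (Postgres DDL, assuming the new denormalized column is
>>> >> >> >> called run_id_execution_date - both the index name and the column
>>> >> >> >> name here are hypothetical), the composite variant could look like:
>>> >> >> >>
>>> >> >> >> CREATE INDEX idx_rtif_dag_task_exec_date
>>> >> >> >>     ON rendered_task_instance_fields (dag_id, task_id, run_id_execution_date);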
>>> >> >> >>
>>> >> >> >> Then the deletion process would look roughly like:
>>> >> >> >>
>>> >> >> >> 1) dag_run_execution_date = SELECT RUN_ID_EXECUTION_DATE FROM
>>> >> >> >> RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID = <DAG_ID>
>>> >> >> >> AND TASK_ID = <TASK_ID> GROUP BY RUN_ID_EXECUTION_DATE
>>> >> >> >> ORDER BY RUN_ID_EXECUTION_DATE DESC LIMIT 1 OFFSET
>>> >> >> >> <MAX_NUM_RENDERED_TI_FIELDS_PER_TASK>
>>> >> >> >> 2) DELETE FROM RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID = <DAG_ID>
>>> >> >> >> AND TASK_ID = <TASK_ID> AND RUN_ID_EXECUTION_DATE < dag_run_execution_date
>>> >> >> >>
>>> >> >> >> I believe that would be fast, and it would use the B-TREE index
>>> >> >> >> features nicely (ordering support)
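>>> >> >> >>
>>> >> >> >> Combined into a single statement, the same idea could look like the
>>> >> >> >> sketch below (Postgres only - MySQL cannot select from the table
>>> >> >> >> being deleted, so it would need the two-step form above; the column
>>> >> >> >> and parameter names are again hypothetical):
>>> >> >> >>
>>> >> >> >> DELETE FROM rendered_task_instance_fields
>>> >> >> >> WHERE dag_id = %(dag_id)s
>>> >> >> >>   AND task_id = %(task_id)s
>>> >> >> >>   AND run_id_execution_date < (
>>> >> >> >>     SELECT run_id_execution_date
>>> >> >> >>     FROM rendered_task_instance_fields
>>> >> >> >>     WHERE dag_id = %(dag_id)s AND task_id = %(task_id)s
>>> >> >> >>     GROUP BY run_id_execution_date
>>> >> >> >>     ORDER BY run_id_execution_date DESC
>>> >> >> >>     LIMIT 1 OFFSET %(num_to_keep)s
>>> >> >> >>   );
>>> >> >> >> -- the scalar subquery returns NULL when there is nothing to trim,
>>> >> >> >> -- so in that case no rows are deleted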
>>> >> >> >>
>>> >> >> >> J
>>> >> >> >>
>>> >> >> >> On Sun, Jan 29, 2023 at 2:09 AM Andrey Anshin <
>>> andrey.anshin@taragol.is> wrote:
>>> >> >> >> >
>>> >> >> >> > First of all, I want to highlight that this approach, I guess,
>>> >> >> >> > worked well until Dynamic Task Mapping was introduced.
>>> >> >> >> >
>>> >> >> >> > > The main reason for adding that cleanup was -- if you don't
>>> do that, you will have many rows, similar to the TaskInstance table
>>> >> >> >> >
>>> >> >> >> > The problem itself is not how big your table/indexes are, but
>>> >> >> >> > rather what kind of operations you run.
>>> >> >> >> >
>>> >> >> >> > > Do you have any data for locks or performance degradation?
>>> >> >> >> >
>>> >> >> >> > In this case, if we try to clean up the
>>> >> >> >> > rendered_task_instance_fields table when a new TI is
>>> >> >> >> > created/cleared, we make almost two full/sequential scans (note:
>>> >> >> >> > need to check) against the table without any index usage, so we
>>> >> >> >> > pay here a couple of times:
>>> >> >> >> > 1. We scan without indexes - not all parts of the composite key
>>> >> >> >> > are included in the query, plus we need to filter out everything
>>> >> >> >> > except 30 records with ordering and distinct
>>> >> >> >> > 2. After that we make another full scan to find 1 record or
>>> >> >> >> > map_size records
>>> >> >> >> >
>>> >> >> >> > And I guess the situation becomes worse if you have a lot of
>>> >> >> >> > tasks; even with a small table, we need to do ineffective
>>> >> >> >> > operations.
>>> >> >> >> >
>>> >> >> >> > That is how the query plan looks (please note that without
>>> >> >> >> > committing the transaction, the DELETE operation doesn't show all
>>> >> >> >> > information):
>>> >> >> >> > https://gist.github.com/Taragolis/3ca7621c51b00f077aa1646401ddf31b
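>>> >> >> >> >
>>> >> >> >> > (A minimal sketch of how to capture an executed plan for such a
>>> >> >> >> > DELETE in Postgres without keeping its effects - the dag/task ids
>>> >> >> >> > below are placeholders; EXPLAIN ANALYZE really executes the
>>> >> >> >> > statement, the ROLLBACK then undoes it:)
>>> >> >> >> >
>>> >> >> >> > BEGIN;
>>> >> >> >> > EXPLAIN (ANALYZE, BUFFERS)
>>> >> >> >> >   DELETE FROM rendered_task_instance_fields
>>> >> >> >> >   WHERE dag_id = 'example_dag' AND task_id = 'example_task';
>>> >> >> >> > ROLLBACK;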
>>> >> >> >> >
>>> >> >> >> > If we do not clean up the table, we only use these operations:
>>> >> >> >> > 1. SELECT single record by index
>>> >> >> >> > 2. INSERT new record
>>> >> >> >> > 3. DELETE old record(s), which were found by index.
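>>> >> >> >> >
>>> >> >> >> > For example, the read path would be a single point lookup on the
>>> >> >> >> > composite primary key (a sketch; rendered_fields is the JSON
>>> >> >> >> > column of the RTIF table):
>>> >> >> >> >
>>> >> >> >> > SELECT rendered_fields
>>> >> >> >> > FROM rendered_task_instance_fields
>>> >> >> >> > WHERE dag_id = %(dag_id)s AND task_id = %(task_id)s
>>> >> >> >> >   AND run_id = %(run_id)s AND map_index = %(map_index)s;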
>>> >> >> >> >
>>> >> >> >> > I have not done any real tests yet, only synthetic DAGs (so we
>>> >> >> >> > should not treat any findings as the absolute truth):
>>> >> >> >> > https://gist.github.com/Taragolis/6eec9f81efdf360c4239fc6ea385a480
>>> >> >> >> > DAG with parallel tasks: degradation up to 2-3 times
>>> >> >> >> > DAG with a single mapped task: degradation up to 7-10 times
>>> >> >> >> >
>>> >> >> >> > I have a plan for more complex tests, closer to real use cases,
>>> >> >> >> > against a database whose network latency is not almost 0 as it is
>>> >> >> >> > on my local machine.
>>> >> >> >> > But I will not object if someone else also does their own tests with
>>> >> >> >> > AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK=0 vs the default value.
>>> >> >> >> >
>>> >> >> >> > About the deadlock: we know that it exists at least in MySQL:
>>> >> >> >> > https://github.com/apache/airflow/pull/18616
>>> >> >> >> >
>>> >> >> >> > > And the larger tables create problems during database
>>> migrations.
>>> >> >> >> >
>>> >> >> >> > That is a very good point, so if we find that the problem is only
>>> >> >> >> > related to migrations, we could:
>>> >> >> >> > 1. Clean up this table in the migration
>>> >> >> >> > 2. Add a CLI command to airflow db which could clean up only the
>>> >> >> >> > rendered fields, so it would be the user's choice whether to clean
>>> >> >> >> > up before a migration and whether to do periodic maintenance
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > ----
>>> >> >> >> > Best Wishes
>>> >> >> >> > Andrey Anshin
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > On Sat, 28 Jan 2023 at 23:41, Kaxil Naik <ka...@gmail.com>
>>> wrote:
>>> >> >> >> >>>
>>> >> >> >> >>> Potentially it is a good idea to deprecate this option and
>>> >> >> >> >>> recommend that users set it to 0? WDYT? Maybe someone has
>>> >> >> >> >>> already tried or investigated this?
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> The main reason for adding that cleanup was -- if you don't
>>> do that, you will have many rows, similar to the TaskInstance table. And
>>> the RenderedTIFields were mainly added for checking rendered TI fields on
>>> the Webserver only because after DAG Serialization, the webserver won't
>>> have access to DAG files.
>>> >> >> >> >>
>>> >> >> >> >> And the larger tables create problems during database
>>> migrations.
>>> >> >> >> >>
>>> >> >> >> >> Do you have any data for locks or performance degradation?
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Sat, 28 Jan 2023 at 13:06, Andrey Anshin <
>>> andrey.anshin@taragol.is> wrote:
>>> >> >> >> >>>
>>> >> >> >> >>> Greetings!
>>> >> >> >> >>>
>>> >> >> >> >>> While migrating our ORM syntax to be compatible with SQLAlchemy
>>> >> >> >> >>> 2.0, I probably found skeletons in the closet.
>>> >> >> >> >>>
>>> >> >> >> >>> Let's start from the beginning: initially, I got this warning
>>> >> >> >> >>>
>>> >> >> >> >>> airflow/models/renderedtifields.py:245
>>> >> >> >> >>> RemovedIn20Warning('ORDER BY columns added implicitly due to
>>> >> >> >> >>> DISTINCT is deprecated and will be removed in SQLAlchemy 2.0.
>>> >> >> >> >>> SELECT statements with DISTINCT should be written to explicitly
>>> >> >> >> >>> include the appropriate columns in the columns clause
>>> >> >> >> >>> (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)')
>>> >> >> >> >>>
>>> >> >> >> >>> "OK let's fix it!", I thought at first and started to
>>> investigate RenderedTaskInstanceFields model
>>> >> >> >> >>>
>>> >> >> >> >>> Skeleton #1:
>>> >> >> >> >>>
>>> >> >> >> >>> When I first looked at the code and comments, it got me thinking
>>> >> >> >> >>> that the part which keeps only the latest N Rendered Task Fields
>>> >> >> >> >>> could potentially lead to various kinds of performance
>>> >> >> >> >>> degradation (locks, deadlocks, data bloating): see the code
>>> >> >> >> >>> https://github.com/apache/airflow/blob/6c479437b1aedf74d029463bda56b42950278287/airflow/models/renderedtifields.py#L185-L245
>>> >> >> >> >>>
>>> >> >> >> >>> Also, this historical part (dating back to Airflow 1.10.10)
>>> >> >> >> >>> generates this SQL statement (pg backend):
>>> >> >> >> >>>
>>> >> >> >> >>> DELETE FROM rendered_task_instance_fields
>>> >> >> >> >>> WHERE rendered_task_instance_fields.dag_id = %(dag_id_1)s
>>> >> >> >> >>>   AND rendered_task_instance_fields.task_id = %(task_id_1)s
>>> >> >> >> >>>   AND (
>>> >> >> >> >>>     (
>>> >> >> >> >>>       rendered_task_instance_fields.dag_id,
>>> >> >> >> >>>       rendered_task_instance_fields.task_id,
>>> >> >> >> >>>       rendered_task_instance_fields.run_id
>>> >> >> >> >>>     ) NOT IN (
>>> >> >> >> >>>       SELECT
>>> >> >> >> >>>         anon_1.dag_id,
>>> >> >> >> >>>         anon_1.task_id,
>>> >> >> >> >>>         anon_1.run_id
>>> >> >> >> >>>       FROM
>>> >> >> >> >>>         (
>>> >> >> >> >>>           SELECT DISTINCT
>>> >> >> >> >>>             rendered_task_instance_fields.dag_id AS dag_id,
>>> >> >> >> >>>             rendered_task_instance_fields.task_id AS task_id,
>>> >> >> >> >>>             rendered_task_instance_fields.run_id AS run_id,
>>> >> >> >> >>>             dag_run.execution_date AS execution_date
>>> >> >> >> >>>           FROM rendered_task_instance_fields
>>> >> >> >> >>>             JOIN dag_run
>>> >> >> >> >>>               ON rendered_task_instance_fields.dag_id = dag_run.dag_id
>>> >> >> >> >>>               AND rendered_task_instance_fields.run_id = dag_run.run_id
>>> >> >> >> >>>           WHERE
>>> >> >> >> >>>             rendered_task_instance_fields.dag_id = %(dag_id_2)s
>>> >> >> >> >>>             AND rendered_task_instance_fields.task_id = %(task_id_2)s
>>> >> >> >> >>>           ORDER BY
>>> >> >> >> >>>             dag_run.execution_date DESC
>>> >> >> >> >>>           LIMIT %(param_1)s
>>> >> >> >> >>>         ) AS anon_1
>>> >> >> >> >>>     )
>>> >> >> >> >>>   )
>>> >> >> >> >>>
>>> >> >> >> >>> This is especially ineffective in PostgreSQL: an IN subquery
>>> >> >> >> >>> can easily be transformed internally into a SEMI-JOIN (aka an
>>> >> >> >> >>> EXISTS clause), but that does not work for a NOT IN subquery,
>>> >> >> >> >>> because it is not transformed into an ANTI-JOIN (aka a NOT
>>> >> >> >> >>> EXISTS clause) even when it would be possible, see:
>>> >> >> >> >>> https://commitfest.postgresql.org/27/2023/
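>>> >> >> >> >>>
>>> >> >> >> >>> For comparison, a hand-written ANTI-JOIN form of the same
>>> >> >> >> >>> cleanup would look roughly like the sketch below (this is not
>>> >> >> >> >>> what SQLAlchemy emits, and even this form is not guaranteed to
>>> >> >> >> >>> be planned as a true anti-join; it just avoids the NOT IN
>>> >> >> >> >>> semantics):
>>> >> >> >> >>>
>>> >> >> >> >>> DELETE FROM rendered_task_instance_fields rtif
>>> >> >> >> >>> WHERE rtif.dag_id = %(dag_id_1)s
>>> >> >> >> >>>   AND rtif.task_id = %(task_id_1)s
>>> >> >> >> >>>   AND NOT EXISTS (
>>> >> >> >> >>>     SELECT 1
>>> >> >> >> >>>     FROM (
>>> >> >> >> >>>       SELECT DISTINCT r.run_id, dag_run.execution_date
>>> >> >> >> >>>       FROM rendered_task_instance_fields r
>>> >> >> >> >>>         JOIN dag_run
>>> >> >> >> >>>           ON r.dag_id = dag_run.dag_id AND r.run_id = dag_run.run_id
>>> >> >> >> >>>       WHERE r.dag_id = %(dag_id_1)s AND r.task_id = %(task_id_1)s
>>> >> >> >> >>>       ORDER BY dag_run.execution_date DESC
>>> >> >> >> >>>       LIMIT %(param_1)s
>>> >> >> >> >>>     ) keep
>>> >> >> >> >>>     WHERE keep.run_id = rtif.run_id
>>> >> >> >> >>>   );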
>>> >> >> >> >>>
>>> >> >> >> >>> I didn't do any performance benchmarks yet, but I guess that if
>>> >> >> >> >>> users set AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK to
>>> >> >> >> >>> 0 rather than the default 30, it could improve performance and
>>> >> >> >> >>> reduce the number of deadlocks. The table size will increase,
>>> >> >> >> >>> but then we don't do any maintenance job for the other tables
>>> >> >> >> >>> either.
>>> >> >> >> >>>
>>> >> >> >> >>> Potentially it is a good idea to deprecate this option and
>>> >> >> >> >>> recommend that users set it to 0? WDYT? Maybe someone has
>>> >> >> >> >>> already tried or investigated this?
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>> Skeleton #2:
>>> >> >> >> >>>
>>> >> >> >> >>> We have a k8s_pod_yaml field which is used exclusively by K8S
>>> >> >> >> >>> executors.
>>> >> >> >> >>>
>>> >> >> >> >>> Should we also decouple this field as part of AIP-51?
>>> >> >> >> >>>
>>> >> >> >> >>> ----
>>> >> >> >> >>> Best Wishes
>>> >> >> >> >>> Andrey Anshin
>>> >> >> >> >>>
>>>
>>

Re: [Discussion] Deprecate auto cleanup RenderedTaskInstanceFields and decouple k8s_pod_yaml

Posted by Kaxil Naik <ka...@gmail.com>.
Maybe the “auto-maintenance” is true. Hopefully for not long though :)
