You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Stefan Seelmann <ma...@stefan-seelmann.de> on 2018/09/17 18:56:40 UTC

Database referral integrity

Hi,

looking into the DB schema there is almost no referral integrity
enforced at the database level. Many foreign key constraints between
dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make sense
IMO.

Is there a particular reason why that's not implemented?

Introducing it now will be hard, probably any real-world setup has some
violations. But I'm still in favor of this additional safety net.

Kind Regards,
Stefan

Re: Database referral integrity

Posted by Kevin Yang <yr...@gmail.com>.
Ya I've benchmarked query performance and ended up with this PR
<https://github.com/apache/airflow/pull/4235> to reduce the DB load.

About the DB load, I do see the DB being the bottler neck for us to scale
further( currently at ~3000 DAGs and ~20k TI at peak hours, we have 5 mins
SLA so scheduler was not an issue yet, nor will be in the near future IMO).
In happy days DB does remain calm(CPU, connections, IO and etc.) but the
two major incidents we had on Airflow this year were both from the DB( one
caused by this query
<https://github.com/apache/airflow/blob/master/airflow/api/common/experimental/delete_dag.py#L57>
which we disabled later). Some tables are quite busy with small
updates/insertions like task_instance and job and adding FKs IMO will make
these fragile tables more instable.

I agree to park this for now and when we next time looking at this we can
maybe first do all the benchmarking we were talking about and even try
sharding/revisit our DB isolation level to get more confidence on
performance of the DB. For now some errors caused by missing related rows
can probly be handled at app level.

My 2 cents.

Cheers,
Kevin Y

On Sun, Apr 14, 2019 at 1:57 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Thanks all for the input.
>
> After some testing, I found out that the possibilities of having FK's are
> also quite limited, for SQLite at least. For example, xcom.task_id =
> task_instance.task_id isn't possible because task_id on the task_instance
> isn't a key right now. This would involve adding another key to the field.
>
> The DB access won't be much slower, but the insertion and updating will be
> much slower because it will check the referential integrity on the commit
> step. On the other hand, I expect the database to be much cleaner because
> we make sure that old rows are being cleaned up, by cascading deletions.
> I'm curious which queries would be slow. I'll park this, for now, as I
> don't have too much time to work on this anyway :'(
>
> Cheers, Fokko
>
>
> Op wo 10 apr. 2019 om 23:50 schreef Alex Guziel
> <al...@airbnb.com.invalid>:
>
> > Kevin Yang has done some benchmarking as well.
> >
> > On Wed, Apr 10, 2019 at 2:32 PM Gabriel Silk <gs...@dropbox.com.invalid>
> > wrote:
> >
> > > The main concern I have with FK's is the potential performance impact.
> > >
> > > When evaluating Airflow for use at Dropbox, I ran benchmarks that
> showed
> > > several bottlenecks in the query patterns being used by the scheduler,
> > > which led me to open
> https://issues.apache.org/jira/browse/AIRFLOW-2430
> > > and
> > > submit a PR.
> > >
> > > If we move forward with FKs, we need to do a robust set of benchmarks
> to
> > > understand how this will affect users who are running large Airflow
> > > clusters.
> > >
> > > On Wed, Apr 10, 2019 at 1:05 PM Alex Guziel <alex.guziel@airbnb.com
> > > .invalid>
> > > wrote:
> > >
> > > > I'm not a huge fan of having foreign keys. I know Airbnb has and
> > > definitely
> > > > still has problems with DB load. I don't see any real convincing
> > > arguments
> > > > for how adding referential integrity will improve Airflow
> meaningfully
> > > > (yet).
> > > >
> > > > On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
> > > > basharenslak@godatadriven.com> wrote:
> > > >
> > > > > In my experience it could go either way; in some cases FKs could
> > impair
> > > > > the performance and in some cases FKs can help the query optimizer
> > > > improve
> > > > > query performance. Each case is different and without testing it’s
> > just
> > > > > guessing.
> > > > >
> > > > > I’m in favour of adding FKs and value referential integrity over
> > > > > performance. If you’re sacrificing integrity for performance you’re
> > > doing
> > > > > either advanced funky stuff or the wrong thing. I haven’t seen the
> > > > database
> > > > > being a bottleneck in Airflow, even with large setups (+-5000
> DAGs).
> > So
> > > > why
> > > > > not add FKs and performance test some Airflow queries, to know for
> > sure
> > > > :-)
> > > > >
> > > > > Bas
> > > > >
> > > > > On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com
> <mailto:
> > > > > bdbruin@gmail.com>> wrote:
> > > > >
> > > > > Please not that Fks can be quite slow...
> > > > >
> > > > > Verstuurd vanaf mijn iPad
> > > > >
> > > > > Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org
> > > <mailto:
> > > > > ash@apache.org>> het volgende geschreven:
> > > > >
> > > > > I am all for FKs.
> > > > >
> > > > > How do you think we should handle the case of "Xcom but missing
> TIs"
> > > (or
> > > > > similar) that we might run into on installs when a user runs
> `airflow
> > > > > upgradedb`?
> > > > >
> > > > > -a
> > > > >
> > > > > On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl
> > > > <mailto:
> > > > > fokko@driesprong.frl>> wrote:
> > > > >
> > > > > Reviving this discussion again :-)
> > > > >
> > > > > Lately, I was digging into the PR of Julian regarding adding FK's
> to
> > > the
> > > > > database: https://github.com/apache/airflow/pull/4922
> > > > >
> > > > > After digging into the details, I've noticed that the current
> > situation
> > > > > regarding referential integrity is bad. It is not uncommon the have
> > > > > DagRun's without having the DAG in the database. For example, you
> can
> > > do
> > > > a
> > > > > backfill job before the scheduler persisted the DAG in the
> database.
> > I
> > > > also
> > > > > think this is often the case in the UI, where we the nuke when some
> > of
> > > > the
> > > > > models haven't been persisted in the database. Therefore I'd like
> to
> > > > > suggest to enforce consistency by foreign keys. This will prevent
> us
> > > from
> > > > > having DagRuns without DAGs, but also removing xcom objects of
> > > > > TaskInstances that are already removed.
> > > > >
> > > > > To create an overview of the FK's, I've created subtasks the ticket
> > of
> > > > > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > >
> > > > >
> > > > > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> > > > > maximebeauchemin@gmail.com<ma...@gmail.com>>:
> > > > >
> > > > > The database migration creating the FK will/would need to have
> > > something
> > > > > that either creates dummy missing PKs first, or delete the orphaned
> > > keys
> > > > to
> > > > > insure the operation of creating the FK doesn't error out. Seems
> like
> > > > > adding dummy keys is a better approach. Then you'll have to make
> sure
> > > > that
> > > > > everywhere where FKs are created that there are no edge cases of
> > > missing
> > > > > PKs. Then some delete operations in some cases may have to
> "cascade"
> > > > > properly.
> > > > >
> > > > > The Django Admin had this nice confirm screen on delete that would
> > show
> > > > you
> > > > > clearly the scope of the cascading operation when deleting objects.
> > To
> > > my
> > > > > knowledge Flask-Admin and FAB don't have such a feature.
> Personally I
> > > > > wouldn't allow cascade unless such a feature is implemented in some
> > > way.
> > > > > Note that SQLAlchemy has builtin semantics for specifying
> how/whether
> > > > > cascading should take place.
> > > > >
> > > > > Personally I think referential integrity is overrated in some
> cases,
> > > > > especially when using meaningful "business keys" (as opposed to
> > > > > auto-increted "surrogate" keys) as PKs. It also slows down insert
> > > > > operations. For data warehousing (this clearly doesn't apply to the
> > > > Airflow
> > > > > metadata store), the best practice on most db engine is to **not**
> > > > enforce
> > > > > FK constraints as it slows down inserts in fact tables and straight
> > out
> > > > > prevents bulk loading.
> > > > >
> > > > > Another approach is to avoid deleting in general, especially
> > referenced
> > > > > fks, and setting up some activity/visibility flag to false instead.
> > > > >
> > > > > Max
> > > > >
> > > > > On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko
> > > <fokko@driesprong.frl
> > > > > <ma...@driesprong.frl>>
> > > > > wrote:
> > > > >
> > > > > I'm in favor of having referential integrity. It will add some load
> > in
> > > > > having to enforce the referential integrity, but it will also make
> > sure
> > > > > that the database stays clean. Also in Airflow we use transactions
> > > which
> > > > > will make sure that the integrity checks are not validated on every
> > > > > statement, but after the commit. I'm happy to help with this as
> well.
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <
> > bdbruin@gmail.com
> > > > > <ma...@gmail.com>>:
> > > > >
> > > > > Adding these kind of checks which work for integrity well make
> > database
> > > > > access pretty slow. In addition it isnt there because in the past
> > there
> > > > > was
> > > > > no strong connection between for example tasks and dagruns, it was
> > more
> > > > > or
> > > > > less just coincidental. There also so some bisecting tools that
> > > > > probably
> > > > > have difficulty functioning in a new regime. In other words it is
> not
> > > > > an
> > > > > easy change and it will have operational challenges.
> > > > >
> > > > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org
> <mailto:
> > > > > ash@apache.org>> wrote:
> > > > >
> > > > > Ooh good spot.
> > > > >
> > > > > Yes I would be in favour of adding these, but as you say we need to
> > > > > thing about how we might migrate old data.
> > > > >
> > > > > Doing this at 2.0.0 and providing a cleanup script (or doing it as
> > > > > part
> > > > > of the migration?) is probably the way to go.
> > > > >
> > > > > -ash-
> > > > >
> > > > > On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de
> > > > <mailto:
> > > > > mail@stefan-seelmann.de>>
> > > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > looking into the DB schema there is almost no referral integrity
> > > > > enforced at the database level. Many foreign key constraints
> between
> > > > > dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> > > > > sense
> > > > > IMO.
> > > > >
> > > > > Is there a particular reason why that's not implemented?
> > > > >
> > > > > Introducing it now will be hard, probably any real-world setup has
> > > > > some
> > > > > violations. But I'm still in favor of this additional safety net.
> > > > >
> > > > > Kind Regards,
> > > > > Stefan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Database referral integrity

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thanks all for the input.

After some testing, I found out that the possibilities of having FK's are
also quite limited, for SQLite at least. For example, xcom.task_id =
task_instance.task_id isn't possible because task_id on the task_instance
isn't a key right now. This would involve adding another key to the field.

The DB access won't be much slower, but the insertion and updating will be
much slower because it will check the referential integrity on the commit
step. On the other hand, I expect the database to be much cleaner because
we make sure that old rows are being cleaned up, by cascading deletions.
I'm curious which queries would be slow. I'll park this, for now, as I
don't have too much time to work on this anyway :'(

Cheers, Fokko


Op wo 10 apr. 2019 om 23:50 schreef Alex Guziel
<al...@airbnb.com.invalid>:

> Kevin Yang has done some benchmarking as well.
>
> On Wed, Apr 10, 2019 at 2:32 PM Gabriel Silk <gs...@dropbox.com.invalid>
> wrote:
>
> > The main concern I have with FK's is the potential performance impact.
> >
> > When evaluating Airflow for use at Dropbox, I ran benchmarks that showed
> > several bottlenecks in the query patterns being used by the scheduler,
> > which led me to open https://issues.apache.org/jira/browse/AIRFLOW-2430
> > and
> > submit a PR.
> >
> > If we move forward with FKs, we need to do a robust set of benchmarks to
> > understand how this will affect users who are running large Airflow
> > clusters.
> >
> > On Wed, Apr 10, 2019 at 1:05 PM Alex Guziel <alex.guziel@airbnb.com
> > .invalid>
> > wrote:
> >
> > > I'm not a huge fan of having foreign keys. I know Airbnb has and
> > definitely
> > > still has problems with DB load. I don't see any real convincing
> > arguments
> > > for how adding referential integrity will improve Airflow meaningfully
> > > (yet).
> > >
> > > On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
> > > basharenslak@godatadriven.com> wrote:
> > >
> > > > In my experience it could go either way; in some cases FKs could
> impair
> > > > the performance and in some cases FKs can help the query optimizer
> > > improve
> > > > query performance. Each case is different and without testing it’s
> just
> > > > guessing.
> > > >
> > > > I’m in favour of adding FKs and value referential integrity over
> > > > performance. If you’re sacrificing integrity for performance you’re
> > doing
> > > > either advanced funky stuff or the wrong thing. I haven’t seen the
> > > database
> > > > being a bottleneck in Airflow, even with large setups (+-5000 DAGs).
> So
> > > why
> > > > not add FKs and performance test some Airflow queries, to know for
> sure
> > > :-)
> > > >
> > > > Bas
> > > >
> > > > On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com<mailto:
> > > > bdbruin@gmail.com>> wrote:
> > > >
> > > > Please not that Fks can be quite slow...
> > > >
> > > > Verstuurd vanaf mijn iPad
> > > >
> > > > Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org
> > <mailto:
> > > > ash@apache.org>> het volgende geschreven:
> > > >
> > > > I am all for FKs.
> > > >
> > > > How do you think we should handle the case of "Xcom but missing TIs"
> > (or
> > > > similar) that we might run into on installs when a user runs `airflow
> > > > upgradedb`?
> > > >
> > > > -a
> > > >
> > > > On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl
> > > <mailto:
> > > > fokko@driesprong.frl>> wrote:
> > > >
> > > > Reviving this discussion again :-)
> > > >
> > > > Lately, I was digging into the PR of Julian regarding adding FK's to
> > the
> > > > database: https://github.com/apache/airflow/pull/4922
> > > >
> > > > After digging into the details, I've noticed that the current
> situation
> > > > regarding referential integrity is bad. It is not uncommon the have
> > > > DagRun's without having the DAG in the database. For example, you can
> > do
> > > a
> > > > backfill job before the scheduler persisted the DAG in the database.
> I
> > > also
> > > > think this is often the case in the UI, where we the nuke when some
> of
> > > the
> > > > models haven't been persisted in the database. Therefore I'd like to
> > > > suggest to enforce consistency by foreign keys. This will prevent us
> > from
> > > > having DagRuns without DAGs, but also removing xcom objects of
> > > > TaskInstances that are already removed.
> > > >
> > > > To create an overview of the FK's, I've created subtasks the ticket
> of
> > > > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> > > >
> > > > WDYT?
> > > >
> > > > Cheers, Fokko
> > > >
> > > >
> > > >
> > > > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> > > > maximebeauchemin@gmail.com<ma...@gmail.com>>:
> > > >
> > > > The database migration creating the FK will/would need to have
> > something
> > > > that either creates dummy missing PKs first, or delete the orphaned
> > keys
> > > to
> > > > insure the operation of creating the FK doesn't error out. Seems like
> > > > adding dummy keys is a better approach. Then you'll have to make sure
> > > that
> > > > everywhere where FKs are created that there are no edge cases of
> > missing
> > > > PKs. Then some delete operations in some cases may have to "cascade"
> > > > properly.
> > > >
> > > > The Django Admin had this nice confirm screen on delete that would
> show
> > > you
> > > > clearly the scope of the cascading operation when deleting objects.
> To
> > my
> > > > knowledge Flask-Admin and FAB don't have such a feature. Personally I
> > > > wouldn't allow cascade unless such a feature is implemented in some
> > way.
> > > > Note that SQLAlchemy has builtin semantics for specifying how/whether
> > > > cascading should take place.
> > > >
> > > > Personally I think referential integrity is overrated in some cases,
> > > > especially when using meaningful "business keys" (as opposed to
> > > > auto-increted "surrogate" keys) as PKs. It also slows down insert
> > > > operations. For data warehousing (this clearly doesn't apply to the
> > > Airflow
> > > > metadata store), the best practice on most db engine is to **not**
> > > enforce
> > > > FK constraints as it slows down inserts in fact tables and straight
> out
> > > > prevents bulk loading.
> > > >
> > > > Another approach is to avoid deleting in general, especially
> referenced
> > > > fks, and setting up some activity/visibility flag to false instead.
> > > >
> > > > Max
> > > >
> > > > On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko
> > <fokko@driesprong.frl
> > > > <ma...@driesprong.frl>>
> > > > wrote:
> > > >
> > > > I'm in favor of having referential integrity. It will add some load
> in
> > > > having to enforce the referential integrity, but it will also make
> sure
> > > > that the database stays clean. Also in Airflow we use transactions
> > which
> > > > will make sure that the integrity checks are not validated on every
> > > > statement, but after the commit. I'm happy to help with this as well.
> > > >
> > > > Cheers, Fokko
> > > >
> > > > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <
> bdbruin@gmail.com
> > > > <ma...@gmail.com>>:
> > > >
> > > > Adding these kind of checks which work for integrity well make
> database
> > > > access pretty slow. In addition it isnt there because in the past
> there
> > > > was
> > > > no strong connection between for example tasks and dagruns, it was
> more
> > > > or
> > > > less just coincidental. There also so some bisecting tools that
> > > > probably
> > > > have difficulty functioning in a new regime. In other words it is not
> > > > an
> > > > easy change and it will have operational challenges.
> > > >
> > > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org<mailto:
> > > > ash@apache.org>> wrote:
> > > >
> > > > Ooh good spot.
> > > >
> > > > Yes I would be in favour of adding these, but as you say we need to
> > > > thing about how we might migrate old data.
> > > >
> > > > Doing this at 2.0.0 and providing a cleanup script (or doing it as
> > > > part
> > > > of the migration?) is probably the way to go.
> > > >
> > > > -ash-
> > > >
> > > > On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de
> > > <mailto:
> > > > mail@stefan-seelmann.de>>
> > > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > looking into the DB schema there is almost no referral integrity
> > > > enforced at the database level. Many foreign key constraints between
> > > > dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> > > > sense
> > > > IMO.
> > > >
> > > > Is there a particular reason why that's not implemented?
> > > >
> > > > Introducing it now will be hard, probably any real-world setup has
> > > > some
> > > > violations. But I'm still in favor of this additional safety net.
> > > >
> > > > Kind Regards,
> > > > Stefan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Re: Database referral integrity

Posted by Alex Guziel <al...@airbnb.com.INVALID>.
Kevin Yang has done some benchmarking as well.

On Wed, Apr 10, 2019 at 2:32 PM Gabriel Silk <gs...@dropbox.com.invalid>
wrote:

> The main concern I have with FK's is the potential performance impact.
>
> When evaluating Airflow for use at Dropbox, I ran benchmarks that showed
> several bottlenecks in the query patterns being used by the scheduler,
> which led me to open https://issues.apache.org/jira/browse/AIRFLOW-2430
> and
> submit a PR.
>
> If we move forward with FKs, we need to do a robust set of benchmarks to
> understand how this will affect users who are running large Airflow
> clusters.
>
> On Wed, Apr 10, 2019 at 1:05 PM Alex Guziel <alex.guziel@airbnb.com
> .invalid>
> wrote:
>
> > I'm not a huge fan of having foreign keys. I know Airbnb has and
> definitely
> > still has problems with DB load. I don't see any real convincing
> arguments
> > for how adding referential integrity will improve Airflow meaningfully
> > (yet).
> >
> > On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
> > basharenslak@godatadriven.com> wrote:
> >
> > > In my experience it could go either way; in some cases FKs could impair
> > > the performance and in some cases FKs can help the query optimizer
> > improve
> > > query performance. Each case is different and without testing it’s just
> > > guessing.
> > >
> > > I’m in favour of adding FKs and value referential integrity over
> > > performance. If you’re sacrificing integrity for performance you’re
> doing
> > > either advanced funky stuff or the wrong thing. I haven’t seen the
> > database
> > > being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So
> > why
> > > not add FKs and performance test some Airflow queries, to know for sure
> > :-)
> > >
> > > Bas
> > >
> > > On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com<mailto:
> > > bdbruin@gmail.com>> wrote:
> > >
> > > Please not that Fks can be quite slow...
> > >
> > > Verstuurd vanaf mijn iPad
> > >
> > > Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org
> <mailto:
> > > ash@apache.org>> het volgende geschreven:
> > >
> > > I am all for FKs.
> > >
> > > How do you think we should handle the case of "Xcom but missing TIs"
> (or
> > > similar) that we might run into on installs when a user runs `airflow
> > > upgradedb`?
> > >
> > > -a
> > >
> > > On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl
> > <mailto:
> > > fokko@driesprong.frl>> wrote:
> > >
> > > Reviving this discussion again :-)
> > >
> > > Lately, I was digging into the PR of Julian regarding adding FK's to
> the
> > > database: https://github.com/apache/airflow/pull/4922
> > >
> > > After digging into the details, I've noticed that the current situation
> > > regarding referential integrity is bad. It is not uncommon the have
> > > DagRun's without having the DAG in the database. For example, you can
> do
> > a
> > > backfill job before the scheduler persisted the DAG in the database. I
> > also
> > > think this is often the case in the UI, where we the nuke when some of
> > the
> > > models haven't been persisted in the database. Therefore I'd like to
> > > suggest to enforce consistency by foreign keys. This will prevent us
> from
> > > having DagRuns without DAGs, but also removing xcom objects of
> > > TaskInstances that are already removed.
> > >
> > > To create an overview of the FK's, I've created subtasks the ticket of
> > > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> > >
> > > WDYT?
> > >
> > > Cheers, Fokko
> > >
> > >
> > >
> > > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> > > maximebeauchemin@gmail.com<ma...@gmail.com>>:
> > >
> > > The database migration creating the FK will/would need to have
> something
> > > that either creates dummy missing PKs first, or delete the orphaned
> keys
> > to
> > > insure the operation of creating the FK doesn't error out. Seems like
> > > adding dummy keys is a better approach. Then you'll have to make sure
> > that
> > > everywhere where FKs are created that there are no edge cases of
> missing
> > > PKs. Then some delete operations in some cases may have to "cascade"
> > > properly.
> > >
> > > The Django Admin had this nice confirm screen on delete that would show
> > you
> > > clearly the scope of the cascading operation when deleting objects. To
> my
> > > knowledge Flask-Admin and FAB don't have such a feature. Personally I
> > > wouldn't allow cascade unless such a feature is implemented in some
> way.
> > > Note that SQLAlchemy has builtin semantics for specifying how/whether
> > > cascading should take place.
> > >
> > > Personally I think referential integrity is overrated in some cases,
> > > especially when using meaningful "business keys" (as opposed to
> > > auto-increted "surrogate" keys) as PKs. It also slows down insert
> > > operations. For data warehousing (this clearly doesn't apply to the
> > Airflow
> > > metadata store), the best practice on most db engine is to **not**
> > enforce
> > > FK constraints as it slows down inserts in fact tables and straight out
> > > prevents bulk loading.
> > >
> > > Another approach is to avoid deleting in general, especially referenced
> > > fks, and setting up some activity/visibility flag to false instead.
> > >
> > > Max
> > >
> > > On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko
> <fokko@driesprong.frl
> > > <ma...@driesprong.frl>>
> > > wrote:
> > >
> > > I'm in favor of having referential integrity. It will add some load in
> > > having to enforce the referential integrity, but it will also make sure
> > > that the database stays clean. Also in Airflow we use transactions
> which
> > > will make sure that the integrity checks are not validated on every
> > > statement, but after the commit. I'm happy to help with this as well.
> > >
> > > Cheers, Fokko
> > >
> > > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbruin@gmail.com
> > > <ma...@gmail.com>>:
> > >
> > > Adding these kind of checks which work for integrity well make database
> > > access pretty slow. In addition it isnt there because in the past there
> > > was
> > > no strong connection between for example tasks and dagruns, it was more
> > > or
> > > less just coincidental. There also so some bisecting tools that
> > > probably
> > > have difficulty functioning in a new regime. In other words it is not
> > > an
> > > easy change and it will have operational challenges.
> > >
> > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org<mailto:
> > > ash@apache.org>> wrote:
> > >
> > > Ooh good spot.
> > >
> > > Yes I would be in favour of adding these, but as you say we need to
> > > thing about how we might migrate old data.
> > >
> > > Doing this at 2.0.0 and providing a cleanup script (or doing it as
> > > part
> > > of the migration?) is probably the way to go.
> > >
> > > -ash-
> > >
> > > On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de
> > <mailto:
> > > mail@stefan-seelmann.de>>
> > > wrote:
> > >
> > > Hi,
> > >
> > > looking into the DB schema there is almost no referral integrity
> > > enforced at the database level. Many foreign key constraints between
> > > dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> > > sense
> > > IMO.
> > >
> > > Is there a particular reason why that's not implemented?
> > >
> > > Introducing it now will be hard, probably any real-world setup has
> > > some
> > > violations. But I'm still in favor of this additional safety net.
> > >
> > > Kind Regards,
> > > Stefan
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
>

Re: Database referral integrity

Posted by Gabriel Silk <gs...@dropbox.com.INVALID>.
The main concern I have with FK's is the potential performance impact.

When evaluating Airflow for use at Dropbox, I ran benchmarks that showed
several bottlenecks in the query patterns being used by the scheduler,
which led me to open https://issues.apache.org/jira/browse/AIRFLOW-2430 and
submit a PR.

If we move forward with FKs, we need to do a robust set of benchmarks to
understand how this will affect users who are running large Airflow
clusters.

On Wed, Apr 10, 2019 at 1:05 PM Alex Guziel <al...@airbnb.com.invalid>
wrote:

> I'm not a huge fan of having foreign keys. I know Airbnb has and definitely
> still has problems with DB load. I don't see any real convincing arguments
> for how adding referential integrity will improve Airflow meaningfully
> (yet).
>
> On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
> basharenslak@godatadriven.com> wrote:
>
> > In my experience it could go either way; in some cases FKs could impair
> > the performance and in some cases FKs can help the query optimizer
> improve
> > query performance. Each case is different and without testing it’s just
> > guessing.
> >
> > I’m in favour of adding FKs and value referential integrity over
> > performance. If you’re sacrificing integrity for performance you’re doing
> > either advanced funky stuff or the wrong thing. I haven’t seen the
> database
> > being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So
> why
> > not add FKs and performance test some Airflow queries, to know for sure
> :-)
> >
> > Bas
> >
> > On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com<mailto:
> > bdbruin@gmail.com>> wrote:
> >
> > Please not that Fks can be quite slow...
> >
> > Verstuurd vanaf mijn iPad
> >
> > Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org<mailto:
> > ash@apache.org>> het volgende geschreven:
> >
> > I am all for FKs.
> >
> > How do you think we should handle the case of "Xcom but missing TIs" (or
> > similar) that we might run into on installs when a user runs `airflow
> > upgradedb`?
> >
> > -a
> >
> > On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl
> <mailto:
> > fokko@driesprong.frl>> wrote:
> >
> > Reviving this discussion again :-)
> >
> > Lately, I was digging into the PR of Julian regarding adding FK's to the
> > database: https://github.com/apache/airflow/pull/4922
> >
> > After digging into the details, I've noticed that the current situation
> > regarding referential integrity is bad. It is not uncommon the have
> > DagRun's without having the DAG in the database. For example, you can do
> a
> > backfill job before the scheduler persisted the DAG in the database. I
> also
> > think this is often the case in the UI, where we the nuke when some of
> the
> > models haven't been persisted in the database. Therefore I'd like to
> > suggest to enforce consistency by foreign keys. This will prevent us from
> > having DagRuns without DAGs, but also removing xcom objects of
> > TaskInstances that are already removed.
> >
> > To create an overview of the FK's, I've created subtasks the ticket of
> > Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> >
> > WDYT?
> >
> > Cheers, Fokko
> >
> >
> >
> > Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> > maximebeauchemin@gmail.com<ma...@gmail.com>>:
> >
> > The database migration creating the FK will/would need to have something
> > that either creates dummy missing PKs first, or delete the orphaned keys
> to
> > insure the operation of creating the FK doesn't error out. Seems like
> > adding dummy keys is a better approach. Then you'll have to make sure
> that
> > everywhere where FKs are created that there are no edge cases of missing
> > PKs. Then some delete operations in some cases may have to "cascade"
> > properly.
> >
> > The Django Admin had this nice confirm screen on delete that would show
> you
> > clearly the scope of the cascading operation when deleting objects. To my
> > knowledge Flask-Admin and FAB don't have such a feature. Personally I
> > wouldn't allow cascade unless such a feature is implemented in some way.
> > Note that SQLAlchemy has builtin semantics for specifying how/whether
> > cascading should take place.
> >
> > Personally I think referential integrity is overrated in some cases,
> > especially when using meaningful "business keys" (as opposed to
> > auto-increted "surrogate" keys) as PKs. It also slows down insert
> > operations. For data warehousing (this clearly doesn't apply to the
> Airflow
> > metadata store), the best practice on most db engine is to **not**
> enforce
> > FK constraints as it slows down inserts in fact tables and straight out
> > prevents bulk loading.
> >
> > Another approach is to avoid deleting in general, especially referenced
> > fks, and setting up some activity/visibility flag to false instead.
> >
> > Max
> >
> > On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fokko@driesprong.frl
> > <ma...@driesprong.frl>>
> > wrote:
> >
> > I'm in favor of having referential integrity. It will add some load in
> > having to enforce the referential integrity, but it will also make sure
> > that the database stays clean. Also in Airflow we use transactions which
> > will make sure that the integrity checks are not validated on every
> > statement, but after the commit. I'm happy to help with this as well.
> >
> > Cheers, Fokko
> >
> > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbruin@gmail.com
> > <ma...@gmail.com>>:
> >
> > Adding these kind of checks which work for integrity well make database
> > access pretty slow. In addition it isnt there because in the past there
> > was
> > no strong connection between for example tasks and dagruns, it was more
> > or
> > less just coincidental. There also so some bisecting tools that
> > probably
> > have difficulty functioning in a new regime. In other words it is not
> > an
> > easy change and it will have operational challenges.
> >
> > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org<mailto:
> > ash@apache.org>> wrote:
> >
> > Ooh good spot.
> >
> > Yes I would be in favour of adding these, but as you say we need to
> > thing about how we might migrate old data.
> >
> > Doing this at 2.0.0 and providing a cleanup script (or doing it as
> > part
> > of the migration?) is probably the way to go.
> >
> > -ash-
> >
> > On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de
> <mailto:
> > mail@stefan-seelmann.de>>
> > wrote:
> >
> > Hi,
> >
> > looking into the DB schema there is almost no referral integrity
> > enforced at the database level. Many foreign key constraints between
> > dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> > sense
> > IMO.
> >
> > Is there a particular reason why that's not implemented?
> >
> > Introducing it now will be hard, probably any real-world setup has
> > some
> > violations. But I'm still in favor of this additional safety net.
> >
> > Kind Regards,
> > Stefan
> >
> >
> >
> >
> >
> >
> >
> >
>

Re: Database referral integrity

Posted by Alex Guziel <al...@airbnb.com.INVALID>.
I'm not a huge fan of having foreign keys. I know Airbnb has and definitely
still has problems with DB load. I don't see any real convincing arguments
for how adding referential integrity will improve Airflow meaningfully
(yet).

On Wed, Apr 10, 2019 at 12:38 PM Bas Harenslak <
basharenslak@godatadriven.com> wrote:

> In my experience it could go either way; in some cases FKs could impair
> the performance and in some cases FKs can help the query optimizer improve
> query performance. Each case is different and without testing it’s just
> guessing.
>
> I’m in favour of adding FKs and value referential integrity over
> performance. If you’re sacrificing integrity for performance you’re doing
> either advanced funky stuff or the wrong thing. I haven’t seen the database
> being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So why
> not add FKs and performance test some Airflow queries, to know for sure :-)
>
> Bas
>
> On 10 Apr 2019, at 20:39, Bolke de Bruin <bdbruin@gmail.com<mailto:
> bdbruin@gmail.com>> wrote:
>
> Please not that Fks can be quite slow...
>
> Verstuurd vanaf mijn iPad
>
> Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <ash@apache.org<mailto:
> ash@apache.org>> het volgende geschreven:
>
> I am all for FKs.
>
> How do you think we should handle the case of "Xcom but missing TIs" (or
> similar) that we might run into on installs when a user runs `airflow
> upgradedb`?
>
> -a
>
> On 10 Apr 2019, at 18:44, Driesprong, Fokko <fokko@driesprong.frl<mailto:
> fokko@driesprong.frl>> wrote:
>
> Reviving this discussion again :-)
>
> Lately, I was digging into the PR of Julian regarding adding FK's to the
> database: https://github.com/apache/airflow/pull/4922
>
> After digging into the details, I've noticed that the current situation
> regarding referential integrity is bad. It is not uncommon the have
> DagRun's without having the DAG in the database. For example, you can do a
> backfill job before the scheduler persisted the DAG in the database. I also
> think this is often the case in the UI, where we the nuke when some of the
> models haven't been persisted in the database. Therefore I'd like to
> suggest to enforce consistency by foreign keys. This will prevent us from
> having DagRuns without DAGs, but also removing xcom objects of
> TaskInstances that are already removed.
>
> To create an overview of the FK's, I've created subtasks the ticket of
> Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
>
> WDYT?
>
> Cheers, Fokko
>
>
>
> Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> maximebeauchemin@gmail.com<ma...@gmail.com>>:
>
> The database migration creating the FK will/would need to have something
> that either creates dummy missing PKs first, or delete the orphaned keys to
> insure the operation of creating the FK doesn't error out. Seems like
> adding dummy keys is a better approach. Then you'll have to make sure that
> everywhere where FKs are created that there are no edge cases of missing
> PKs. Then some delete operations in some cases may have to "cascade"
> properly.
>
> The Django Admin had this nice confirm screen on delete that would show you
> clearly the scope of the cascading operation when deleting objects. To my
> knowledge Flask-Admin and FAB don't have such a feature. Personally I
> wouldn't allow cascade unless such a feature is implemented in some way.
> Note that SQLAlchemy has builtin semantics for specifying how/whether
> cascading should take place.
>
> Personally I think referential integrity is overrated in some cases,
> especially when using meaningful "business keys" (as opposed to
> auto-increted "surrogate" keys) as PKs. It also slows down insert
> operations. For data warehousing (this clearly doesn't apply to the Airflow
> metadata store), the best practice on most db engine is to **not** enforce
> FK constraints as it slows down inserts in fact tables and straight out
> prevents bulk loading.
>
> Another approach is to avoid deleting in general, especially referenced
> fks, and setting up some activity/visibility flag to false instead.
>
> Max
>
> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fokko@driesprong.frl
> <ma...@driesprong.frl>>
> wrote:
>
> I'm in favor of having referential integrity. It will add some load in
> having to enforce the referential integrity, but it will also make sure
> that the database stays clean. Also in Airflow we use transactions which
> will make sure that the integrity checks are not validated on every
> statement, but after the commit. I'm happy to help with this as well.
>
> Cheers, Fokko
>
> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bdbruin@gmail.com
> <ma...@gmail.com>>:
>
> Adding these kind of checks which work for integrity well make database
> access pretty slow. In addition it isnt there because in the past there
> was
> no strong connection between for example tasks and dagruns, it was more
> or
> less just coincidental. There also so some bisecting tools that
> probably
> have difficulty functioning in a new regime. In other words it is not
> an
> easy change and it will have operational challenges.
>
> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <ash@apache.org<mailto:
> ash@apache.org>> wrote:
>
> Ooh good spot.
>
> Yes I would be in favour of adding these, but as you say we need to
> thing about how we might migrate old data.
>
> Doing this at 2.0.0 and providing a cleanup script (or doing it as
> part
> of the migration?) is probably the way to go.
>
> -ash-
>
> On 17 Sep 2018, at 19:56, Stefan Seelmann <mail@stefan-seelmann.de<mailto:
> mail@stefan-seelmann.de>>
> wrote:
>
> Hi,
>
> looking into the DB schema there is almost no referral integrity
> enforced at the database level. Many foreign key constraints between
> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> sense
> IMO.
>
> Is there a particular reason why that's not implemented?
>
> Introducing it now will be hard, probably any real-world setup has
> some
> violations. But I'm still in favor of this additional safety net.
>
> Kind Regards,
> Stefan
>
>
>
>
>
>
>
>

Re: Database referral integrity

Posted by Bas Harenslak <ba...@godatadriven.com>.
In my experience it could go either way; in some cases FKs could impair the performance and in some cases FKs can help the query optimizer improve query performance. Each case is different and without testing it’s just guessing.

I’m in favour of adding FKs and value referential integrity over performance. If you’re sacrificing integrity for performance you’re doing either advanced funky stuff or the wrong thing. I haven’t seen the database being a bottleneck in Airflow, even with large setups (+-5000 DAGs). So why not add FKs and performance test some Airflow queries, to know for sure :-)

Bas

On 10 Apr 2019, at 20:39, Bolke de Bruin <bd...@gmail.com>> wrote:

Please not that Fks can be quite slow...

Verstuurd vanaf mijn iPad

Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <as...@apache.org>> het volgende geschreven:

I am all for FKs.

How do you think we should handle the case of "Xcom but missing TIs" (or similar) that we might run into on installs when a user runs `airflow upgradedb`?

-a

On 10 Apr 2019, at 18:44, Driesprong, Fokko <fo...@driesprong.frl>> wrote:

Reviving this discussion again :-)

Lately, I was digging into the PR of Julian regarding adding FK's to the
database: https://github.com/apache/airflow/pull/4922

After digging into the details, I've noticed that the current situation
regarding referential integrity is bad. It is not uncommon the have
DagRun's without having the DAG in the database. For example, you can do a
backfill job before the scheduler persisted the DAG in the database. I also
think this is often the case in the UI, where we the nuke when some of the
models haven't been persisted in the database. Therefore I'd like to
suggest to enforce consistency by foreign keys. This will prevent us from
having DagRuns without DAGs, but also removing xcom objects of
TaskInstances that are already removed.

To create an overview of the FK's, I've created subtasks the ticket of
Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904

WDYT?

Cheers, Fokko



Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
maximebeauchemin@gmail.com<ma...@gmail.com>>:

The database migration creating the FK will/would need to have something
that either creates dummy missing PKs first, or delete the orphaned keys to
insure the operation of creating the FK doesn't error out. Seems like
adding dummy keys is a better approach. Then you'll have to make sure that
everywhere where FKs are created that there are no edge cases of missing
PKs. Then some delete operations in some cases may have to "cascade"
properly.

The Django Admin had this nice confirm screen on delete that would show you
clearly the scope of the cascading operation when deleting objects. To my
knowledge Flask-Admin and FAB don't have such a feature. Personally I
wouldn't allow cascade unless such a feature is implemented in some way.
Note that SQLAlchemy has builtin semantics for specifying how/whether
cascading should take place.

Personally I think referential integrity is overrated in some cases,
especially when using meaningful "business keys" (as opposed to
auto-increted "surrogate" keys) as PKs. It also slows down insert
operations. For data warehousing (this clearly doesn't apply to the Airflow
metadata store), the best practice on most db engine is to **not** enforce
FK constraints as it slows down inserts in fact tables and straight out
prevents bulk loading.

Another approach is to avoid deleting in general, especially referenced
fks, and setting up some activity/visibility flag to false instead.

Max

On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fo...@driesprong.frl>>
wrote:

I'm in favor of having referential integrity. It will add some load in
having to enforce the referential integrity, but it will also make sure
that the database stays clean. Also in Airflow we use transactions which
will make sure that the integrity checks are not validated on every
statement, but after the commit. I'm happy to help with this as well.

Cheers, Fokko

Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bd...@gmail.com>>:

Adding these kind of checks which work for integrity well make database
access pretty slow. In addition it isnt there because in the past there
was
no strong connection between for example tasks and dagruns, it was more
or
less just coincidental. There also so some bisecting tools that
probably
have difficulty functioning in a new regime. In other words it is not
an
easy change and it will have operational challenges.

On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org>> wrote:

Ooh good spot.

Yes I would be in favour of adding these, but as you say we need to
thing about how we might migrate old data.

Doing this at 2.0.0 and providing a cleanup script (or doing it as
part
of the migration?) is probably the way to go.

-ash-

On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de>>
wrote:

Hi,

looking into the DB schema there is almost no referral integrity
enforced at the database level. Many foreign key constraints between
dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
sense
IMO.

Is there a particular reason why that's not implemented?

Introducing it now will be hard, probably any real-world setup has
some
violations. But I'm still in favor of this additional safety net.

Kind Regards,
Stefan








Re: Database referral integrity

Posted by Bolke de Bruin <bd...@gmail.com>.
Please not that Fks can be quite slow...

Verstuurd vanaf mijn iPad

> Op 10 apr. 2019 om 19:55 heeft Ash Berlin-Taylor <as...@apache.org> het volgende geschreven:
> 
> I am all for FKs.
> 
> How do you think we should handle the case of "Xcom but missing TIs" (or similar) that we might run into on installs when a user runs `airflow upgradedb`?
> 
> -a
> 
>> On 10 Apr 2019, at 18:44, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>> 
>> Reviving this discussion again :-)
>> 
>> Lately, I was digging into the PR of Julian regarding adding FK's to the
>> database: https://github.com/apache/airflow/pull/4922
>> 
>> After digging into the details, I've noticed that the current situation
>> regarding referential integrity is bad. It is not uncommon the have
>> DagRun's without having the DAG in the database. For example, you can do a
>> backfill job before the scheduler persisted the DAG in the database. I also
>> think this is often the case in the UI, where we the nuke when some of the
>> models haven't been persisted in the database. Therefore I'd like to
>> suggest to enforce consistency by foreign keys. This will prevent us from
>> having DagRuns without DAGs, but also removing xcom objects of
>> TaskInstances that are already removed.
>> 
>> To create an overview of the FK's, I've created subtasks the ticket of
>> Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
>> 
>> WDYT?
>> 
>> Cheers, Fokko
>> 
>> 
>> 
>> Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
>> maximebeauchemin@gmail.com>:
>> 
>>> The database migration creating the FK will/would need to have something
>>> that either creates dummy missing PKs first, or delete the orphaned keys to
>>> insure the operation of creating the FK doesn't error out. Seems like
>>> adding dummy keys is a better approach. Then you'll have to make sure that
>>> everywhere where FKs are created that there are no edge cases of missing
>>> PKs. Then some delete operations in some cases may have to "cascade"
>>> properly.
>>> 
>>> The Django Admin had this nice confirm screen on delete that would show you
>>> clearly the scope of the cascading operation when deleting objects. To my
>>> knowledge Flask-Admin and FAB don't have such a feature. Personally I
>>> wouldn't allow cascade unless such a feature is implemented in some way.
>>> Note that SQLAlchemy has builtin semantics for specifying how/whether
>>> cascading should take place.
>>> 
>>> Personally I think referential integrity is overrated in some cases,
>>> especially when using meaningful "business keys" (as opposed to
>>> auto-increted "surrogate" keys) as PKs. It also slows down insert
>>> operations. For data warehousing (this clearly doesn't apply to the Airflow
>>> metadata store), the best practice on most db engine is to **not** enforce
>>> FK constraints as it slows down inserts in fact tables and straight out
>>> prevents bulk loading.
>>> 
>>> Another approach is to avoid deleting in general, especially referenced
>>> fks, and setting up some activity/visibility flag to false instead.
>>> 
>>> Max
>>> 
>>> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fo...@driesprong.frl>
>>> wrote:
>>> 
>>>> I'm in favor of having referential integrity. It will add some load in
>>>> having to enforce the referential integrity, but it will also make sure
>>>> that the database stays clean. Also in Airflow we use transactions which
>>>> will make sure that the integrity checks are not validated on every
>>>> statement, but after the commit. I'm happy to help with this as well.
>>>> 
>>>> Cheers, Fokko
>>>> 
>>>> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bd...@gmail.com>:
>>>> 
>>>>> Adding these kind of checks which work for integrity well make database
>>>>> access pretty slow. In addition it isnt there because in the past there
>>>> was
>>>>> no strong connection between for example tasks and dagruns, it was more
>>>> or
>>>>> less just coincidental. There also so some bisecting tools that
>>> probably
>>>>> have difficulty functioning in a new regime. In other words it is not
>>> an
>>>>> easy change and it will have operational challenges.
>>>>> 
>>>>>> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org> wrote:
>>>>>> 
>>>>>> Ooh good spot.
>>>>>> 
>>>>>> Yes I would be in favour of adding these, but as you say we need to
>>>>> thing about how we might migrate old data.
>>>>>> 
>>>>>> Doing this at 2.0.0 and providing a cleanup script (or doing it as
>>> part
>>>>> of the migration?) is probably the way to go.
>>>>>> 
>>>>>> -ash-
>>>>>> 
>>>>>>> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de>
>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> looking into the DB schema there is almost no referral integrity
>>>>>>> enforced at the database level. Many foreign key constraints between
>>>>>>> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
>>>> sense
>>>>>>> IMO.
>>>>>>> 
>>>>>>> Is there a particular reason why that's not implemented?
>>>>>>> 
>>>>>>> Introducing it now will be hard, probably any real-world setup has
>>>> some
>>>>>>> violations. But I'm still in favor of this additional safety net.
>>>>>>> 
>>>>>>> Kind Regards,
>>>>>>> Stefan
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
> 

Re: Database referral integrity

Posted by Ash Berlin-Taylor <as...@apache.org>.
I am all for FKs.

How do you think we should handle the case of "Xcom but missing TIs" (or similar) that we might run into on installs when a user runs `airflow upgradedb`?

-a

> On 10 Apr 2019, at 18:44, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> 
> Reviving this discussion again :-)
> 
> Lately, I was digging into the PR of Julian regarding adding FK's to the
> database: https://github.com/apache/airflow/pull/4922
> 
> After digging into the details, I've noticed that the current situation
> regarding referential integrity is bad. It is not uncommon the have
> DagRun's without having the DAG in the database. For example, you can do a
> backfill job before the scheduler persisted the DAG in the database. I also
> think this is often the case in the UI, where we the nuke when some of the
> models haven't been persisted in the database. Therefore I'd like to
> suggest to enforce consistency by foreign keys. This will prevent us from
> having DagRuns without DAGs, but also removing xcom objects of
> TaskInstances that are already removed.
> 
> To create an overview of the FK's, I've created subtasks the ticket of
> Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904
> 
> WDYT?
> 
> Cheers, Fokko
> 
> 
> 
> Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
> maximebeauchemin@gmail.com>:
> 
>> The database migration creating the FK will/would need to have something
>> that either creates dummy missing PKs first, or delete the orphaned keys to
>> insure the operation of creating the FK doesn't error out. Seems like
>> adding dummy keys is a better approach. Then you'll have to make sure that
>> everywhere where FKs are created that there are no edge cases of missing
>> PKs. Then some delete operations in some cases may have to "cascade"
>> properly.
>> 
>> The Django Admin had this nice confirm screen on delete that would show you
>> clearly the scope of the cascading operation when deleting objects. To my
>> knowledge Flask-Admin and FAB don't have such a feature. Personally I
>> wouldn't allow cascade unless such a feature is implemented in some way.
>> Note that SQLAlchemy has builtin semantics for specifying how/whether
>> cascading should take place.
>> 
>> Personally I think referential integrity is overrated in some cases,
>> especially when using meaningful "business keys" (as opposed to
>> auto-increted "surrogate" keys) as PKs. It also slows down insert
>> operations. For data warehousing (this clearly doesn't apply to the Airflow
>> metadata store), the best practice on most db engine is to **not** enforce
>> FK constraints as it slows down inserts in fact tables and straight out
>> prevents bulk loading.
>> 
>> Another approach is to avoid deleting in general, especially referenced
>> fks, and setting up some activity/visibility flag to false instead.
>> 
>> Max
>> 
>> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>> 
>>> I'm in favor of having referential integrity. It will add some load in
>>> having to enforce the referential integrity, but it will also make sure
>>> that the database stays clean. Also in Airflow we use transactions which
>>> will make sure that the integrity checks are not validated on every
>>> statement, but after the commit. I'm happy to help with this as well.
>>> 
>>> Cheers, Fokko
>>> 
>>> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bd...@gmail.com>:
>>> 
>>>> Adding these kind of checks which work for integrity well make database
>>>> access pretty slow. In addition it isnt there because in the past there
>>> was
>>>> no strong connection between for example tasks and dagruns, it was more
>>> or
>>>> less just coincidental. There also so some bisecting tools that
>> probably
>>>> have difficulty functioning in a new regime. In other words it is not
>> an
>>>> easy change and it will have operational challenges.
>>>> 
>>>>> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org> wrote:
>>>>> 
>>>>> Ooh good spot.
>>>>> 
>>>>> Yes I would be in favour of adding these, but as you say we need to
>>>> thing about how we might migrate old data.
>>>>> 
>>>>> Doing this at 2.0.0 and providing a cleanup script (or doing it as
>> part
>>>> of the migration?) is probably the way to go.
>>>>> 
>>>>> -ash-
>>>>> 
>>>>>> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de>
>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> looking into the DB schema there is almost no referral integrity
>>>>>> enforced at the database level. Many foreign key constraints between
>>>>>> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
>>> sense
>>>>>> IMO.
>>>>>> 
>>>>>> Is there a particular reason why that's not implemented?
>>>>>> 
>>>>>> Introducing it now will be hard, probably any real-world setup has
>>> some
>>>>>> violations. But I'm still in favor of this additional safety net.
>>>>>> 
>>>>>> Kind Regards,
>>>>>> Stefan
>>>>> 
>>>> 
>>>> 
>>> 
>> 


Re: Database referral integrity

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Reviving this discussion again :-)

Lately, I was digging into the PR of Julian regarding adding FK's to the
database: https://github.com/apache/airflow/pull/4922

After digging into the details, I've noticed that the current situation
regarding referential integrity is bad. It is not uncommon the have
DagRun's without having the DAG in the database. For example, you can do a
backfill job before the scheduler persisted the DAG in the database. I also
think this is often the case in the UI, where we the nuke when some of the
models haven't been persisted in the database. Therefore I'd like to
suggest to enforce consistency by foreign keys. This will prevent us from
having DagRuns without DAGs, but also removing xcom objects of
TaskInstances that are already removed.

To create an overview of the FK's, I've created subtasks the ticket of
Peter: https://jira.apache.org/jira/browse/AIRFLOW-3904

WDYT?

Cheers, Fokko



Op di 18 sep. 2018 om 21:51 schreef Maxime Beauchemin <
maximebeauchemin@gmail.com>:

> The database migration creating the FK will/would need to have something
> that either creates dummy missing PKs first, or delete the orphaned keys to
> insure the operation of creating the FK doesn't error out. Seems like
> adding dummy keys is a better approach. Then you'll have to make sure that
> everywhere where FKs are created that there are no edge cases of missing
> PKs. Then some delete operations in some cases may have to "cascade"
> properly.
>
> The Django Admin had this nice confirm screen on delete that would show you
> clearly the scope of the cascading operation when deleting objects. To my
> knowledge Flask-Admin and FAB don't have such a feature. Personally I
> wouldn't allow cascade unless such a feature is implemented in some way.
> Note that SQLAlchemy has builtin semantics for specifying how/whether
> cascading should take place.
>
> Personally I think referential integrity is overrated in some cases,
> especially when using meaningful "business keys" (as opposed to
> auto-increted "surrogate" keys) as PKs. It also slows down insert
> operations. For data warehousing (this clearly doesn't apply to the Airflow
> metadata store), the best practice on most db engine is to **not** enforce
> FK constraints as it slows down inserts in fact tables and straight out
> prevents bulk loading.
>
> Another approach is to avoid deleting in general, especially referenced
> fks, and setting up some activity/visibility flag to false instead.
>
> Max
>
> On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
> > I'm in favor of having referential integrity. It will add some load in
> > having to enforce the referential integrity, but it will also make sure
> > that the database stays clean. Also in Airflow we use transactions which
> > will make sure that the integrity checks are not validated on every
> > statement, but after the commit. I'm happy to help with this as well.
> >
> > Cheers, Fokko
> >
> > Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bd...@gmail.com>:
> >
> > > Adding these kind of checks which work for integrity well make database
> > > access pretty slow. In addition it isnt there because in the past there
> > was
> > > no strong connection between for example tasks and dagruns, it was more
> > or
> > > less just coincidental. There also so some bisecting tools that
> probably
> > > have difficulty functioning in a new regime. In other words it is not
> an
> > > easy change and it will have operational challenges.
> > >
> > > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org> wrote:
> > > >
> > > > Ooh good spot.
> > > >
> > > > Yes I would be in favour of adding these, but as you say we need to
> > > thing about how we might migrate old data.
> > > >
> > > > Doing this at 2.0.0 and providing a cleanup script (or doing it as
> part
> > > of the migration?) is probably the way to go.
> > > >
> > > > -ash-
> > > >
> > > >> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de>
> > > wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> looking into the DB schema there is almost no referral integrity
> > > >> enforced at the database level. Many foreign key constraints between
> > > >> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> > sense
> > > >> IMO.
> > > >>
> > > >> Is there a particular reason why that's not implemented?
> > > >>
> > > >> Introducing it now will be hard, probably any real-world setup has
> > some
> > > >> violations. But I'm still in favor of this additional safety net.
> > > >>
> > > >> Kind Regards,
> > > >> Stefan
> > > >
> > >
> > >
> >
>

Re: Database referral integrity

Posted by Maxime Beauchemin <ma...@gmail.com>.
The database migration creating the FK will/would need to have something
that either creates dummy missing PKs first, or delete the orphaned keys to
insure the operation of creating the FK doesn't error out. Seems like
adding dummy keys is a better approach. Then you'll have to make sure that
everywhere where FKs are created that there are no edge cases of missing
PKs. Then some delete operations in some cases may have to "cascade"
properly.

The Django Admin had this nice confirm screen on delete that would show you
clearly the scope of the cascading operation when deleting objects. To my
knowledge Flask-Admin and FAB don't have such a feature. Personally I
wouldn't allow cascade unless such a feature is implemented in some way.
Note that SQLAlchemy has builtin semantics for specifying how/whether
cascading should take place.

Personally I think referential integrity is overrated in some cases,
especially when using meaningful "business keys" (as opposed to
auto-increted "surrogate" keys) as PKs. It also slows down insert
operations. For data warehousing (this clearly doesn't apply to the Airflow
metadata store), the best practice on most db engine is to **not** enforce
FK constraints as it slows down inserts in fact tables and straight out
prevents bulk loading.

Another approach is to avoid deleting in general, especially referenced
fks, and setting up some activity/visibility flag to false instead.

Max

On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> I'm in favor of having referential integrity. It will add some load in
> having to enforce the referential integrity, but it will also make sure
> that the database stays clean. Also in Airflow we use transactions which
> will make sure that the integrity checks are not validated on every
> statement, but after the commit. I'm happy to help with this as well.
>
> Cheers, Fokko
>
> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bd...@gmail.com>:
>
> > Adding these kind of checks which work for integrity well make database
> > access pretty slow. In addition it isnt there because in the past there
> was
> > no strong connection between for example tasks and dagruns, it was more
> or
> > less just coincidental. There also so some bisecting tools that probably
> > have difficulty functioning in a new regime. In other words it is not an
> > easy change and it will have operational challenges.
> >
> > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org> wrote:
> > >
> > > Ooh good spot.
> > >
> > > Yes I would be in favour of adding these, but as you say we need to
> > thing about how we might migrate old data.
> > >
> > > Doing this at 2.0.0 and providing a cleanup script (or doing it as part
> > of the migration?) is probably the way to go.
> > >
> > > -ash-
> > >
> > >> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de>
> > wrote:
> > >>
> > >> Hi,
> > >>
> > >> looking into the DB schema there is almost no referral integrity
> > >> enforced at the database level. Many foreign key constraints between
> > >> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> sense
> > >> IMO.
> > >>
> > >> Is there a particular reason why that's not implemented?
> > >>
> > >> Introducing it now will be hard, probably any real-world setup has
> some
> > >> violations. But I'm still in favor of this additional safety net.
> > >>
> > >> Kind Regards,
> > >> Stefan
> > >
> >
> >
>

Re: Database referral integrity

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
I'm in favor of having referential integrity. It will add some load in
having to enforce the referential integrity, but it will also make sure
that the database stays clean. Also in Airflow we use transactions which
will make sure that the integrity checks are not validated on every
statement, but after the commit. I'm happy to help with this as well.

Cheers, Fokko

Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin <bd...@gmail.com>:

> Adding these kind of checks which work for integrity well make database
> access pretty slow. In addition it isnt there because in the past there was
> no strong connection between for example tasks and dagruns, it was more or
> less just coincidental. There also so some bisecting tools that probably
> have difficulty functioning in a new regime. In other words it is not an
> easy change and it will have operational challenges.
>
> > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org> wrote:
> >
> > Ooh good spot.
> >
> > Yes I would be in favour of adding these, but as you say we need to
> thing about how we might migrate old data.
> >
> > Doing this at 2.0.0 and providing a cleanup script (or doing it as part
> of the migration?) is probably the way to go.
> >
> > -ash-
> >
> >> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de>
> wrote:
> >>
> >> Hi,
> >>
> >> looking into the DB schema there is almost no referral integrity
> >> enforced at the database level. Many foreign key constraints between
> >> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make sense
> >> IMO.
> >>
> >> Is there a particular reason why that's not implemented?
> >>
> >> Introducing it now will be hard, probably any real-world setup has some
> >> violations. But I'm still in favor of this additional safety net.
> >>
> >> Kind Regards,
> >> Stefan
> >
>
>

Re: Database referral integrity

Posted by Bolke de Bruin <bd...@gmail.com>.
Adding these kind of checks which work for integrity well make database access pretty slow. In addition it isnt there because in the past there was no strong connection between for example tasks and dagruns, it was more or less just coincidental. There also so some bisecting tools that probably have difficulty functioning in a new regime. In other words it is not an easy change and it will have operational challenges.

> On 18 Sep 2018, at 11:03, Ash Berlin-Taylor <as...@apache.org> wrote:
> 
> Ooh good spot.
> 
> Yes I would be in favour of adding these, but as you say we need to thing about how we might migrate old data.
> 
> Doing this at 2.0.0 and providing a cleanup script (or doing it as part of the migration?) is probably the way to go.
> 
> -ash-
> 
>> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de> wrote:
>> 
>> Hi,
>> 
>> looking into the DB schema there is almost no referral integrity
>> enforced at the database level. Many foreign key constraints between
>> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make sense
>> IMO.
>> 
>> Is there a particular reason why that's not implemented?
>> 
>> Introducing it now will be hard, probably any real-world setup has some
>> violations. But I'm still in favor of this additional safety net.
>> 
>> Kind Regards,
>> Stefan
> 


Re: Database referral integrity

Posted by Jeff Payne <jp...@bombora.com>.
Yes, and data that is in violation of the logical FKs can probably simply be deleted, since they must be orphaned etc anyway...

Get Outlook for Android<https://aka.ms/ghei36>

________________________________
From: Ash Berlin-Taylor <as...@apache.org>
Sent: Tuesday, September 18, 2018 2:03:50 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Database referral integrity

Ooh good spot.

Yes I would be in favour of adding these, but as you say we need to thing about how we might migrate old data.

Doing this at 2.0.0 and providing a cleanup script (or doing it as part of the migration?) is probably the way to go.

-ash-

> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de> wrote:
>
> Hi,
>
> looking into the DB schema there is almost no referral integrity
> enforced at the database level. Many foreign key constraints between
> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make sense
> IMO.
>
> Is there a particular reason why that's not implemented?
>
> Introducing it now will be hard, probably any real-world setup has some
> violations. But I'm still in favor of this additional safety net.
>
> Kind Regards,
> Stefan


Re: Database referral integrity

Posted by Ash Berlin-Taylor <as...@apache.org>.
Ooh good spot.

Yes I would be in favour of adding these, but as you say we need to thing about how we might migrate old data.

Doing this at 2.0.0 and providing a cleanup script (or doing it as part of the migration?) is probably the way to go.

-ash-

> On 17 Sep 2018, at 19:56, Stefan Seelmann <ma...@stefan-seelmann.de> wrote:
> 
> Hi,
> 
> looking into the DB schema there is almost no referral integrity
> enforced at the database level. Many foreign key constraints between
> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make sense
> IMO.
> 
> Is there a particular reason why that's not implemented?
> 
> Introducing it now will be hard, probably any real-world setup has some
> violations. But I'm still in favor of this additional safety net.
> 
> Kind Regards,
> Stefan