You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Malthe <mb...@gmail.com> on 2022/04/27 08:49:50 UTC

Implicit DAG registration

DAGs must be registered at the module top-level. During dagbag
processing we have:

     top_level_dags = ((o, m) for m in mods for o in
m.__dict__.values() if isinstance(o, DAG))

It makes sense of course – we can't have DAGs floating in space, they
need an anchor.

Or do they?

It would be entirely possible for the DAG constructor to register
itself with some global registry, obviating for the follow pattern:

with DAG(...) as dag:
    ...

I think most users would prefer:

with DAG(...):
    ...

Since there is generally speaking no need for the alias.

Downsides? You have less control of what DAGs are made available – but
really, does it make sense to create a DAG object only to drop it
immediately after?

I volunteer to implement this if there's a positive feedback.

Cheers

Re: Implicit DAG registration

Posted by Kaxil Naik <ka...@gmail.com>.
Yeah agree with Ash. I think I got bitten by it a couple of times at the
very least where the dag didn't show up as I just had "with DAG(...): "

On Wed, 27 Apr 2022 at 21:28, Jarek Potiuk <ja...@potiuk.com> wrote:

> That looks like a nice improvement.
>
> J.
>
> On Wed, Apr 27, 2022 at 10:04 PM Malthe <mb...@gmail.com> wrote:
> >
> > On Wed, 27 Apr 2022 at 15:27, Constance Martineau
> > <co...@astronomer.io.invalid> wrote:
> > > Am intrigued. Curious about dynamic dag pattern, where you create the
> DAG object in a create_dag function and adding the DAG to globals. Would
> this new way prevent someone from modifying the dag object within the
> function, or returning it?
> >
> > It would not change the current way of programming DAGs. The only
> > change is that if you call the DAG constructor (or a function that
> > does so including one using the @dag decorator), then that DAG will be
> > made available regardless of whether you assign it a top-level
> > variable or not.
> >
> > Cheers
>

Re: Implicit DAG registration

Posted by Ash Berlin-Taylor <as...@apache.org>.
Check out <https://github.com/apache/airflow/pull/23592>

On Fri, Apr 29 2022 at 09:18:46 +0000, Malthe <mb...@gmail.com> wrote:
> On Wed, 27 Apr 2022 at 20:28, Jarek Potiuk <jarek@potiuk.com 
> <ma...@potiuk.com>> wrote:
>>  That looks like a nice improvement.
> 
> I did some hacking and realized that for the time being, there's no
> good way to implement this as originally proposed.
> 
> However, we can do something about the situation nonetheless.
> 
> What we can do is to detect and report when a DAG is dropped during
> loading. That is, if a DAG instance is created, but not used (either
> exposed at the top-level or used indirectly, for example in a
> SubDagOperator), we can log a message as well as report it as an
> import error (i.e. `DagBag.import_errors`).
> 
> I think it is fair to call this an import error because if you didn't
> want to expose the DAG, then why create it?
> 
> That said, there are two possible ways to implement this:
> 
> 1. The easiest is simply to record the DAGs that have been created
> during loading (or rather, the DAG ids) and subtract them from the
> top-level ones during processing to see if there are some that are not
> accounted for. We need to add a special provision in SubDagOperator to
> avoid having it reported as an error here.
> 
> 2. The more elegant way is to untangle the reference cycle between DAG
> and BaseOperator such that dropped DAGs are actually dropped
> (finalized).
> 
> I have experimented with both and the only issue besides additional
> weakref-based code and logic in (2) is that we have quite a few tests
> which use the following pattern:
> 
> with DAG(...):
>    op1 = ...
>    op2 = ...
> 
> The DAG instance here isn't assigned to the local scope, but each of
> the tasks directly reference it. The changes required in (2) mean that
> tasks (and task groups) would now weakly reference the DAG and so
> actually, the DAG would be finalized immediately after leaving the
> with construct.
> 
> The fix is simply to add "as dag" in these tests (it's about a dozen
> of them all in all).
> 
> I think reference cycles are objectively bad, but there are some
> non-trivial changes to the code in order to make it work. For example,
> there is some deep-copying action happening where we need to handle
> the weakrefs carefully and the pickling logic also requires some
> updating.
> 
> Option (1) is much simpler, but less elegant.
> 
> Cheers


Re: Implicit DAG registration

Posted by Malthe <mb...@gmail.com>.
On Wed, 27 Apr 2022 at 20:28, Jarek Potiuk <ja...@potiuk.com> wrote:
> That looks like a nice improvement.

I did some hacking and realized that for the time being, there's no
good way to implement this as originally proposed.

However, we can do something about the situation nonetheless.

What we can do is to detect and report when a DAG is dropped during
loading. That is, if a DAG instance is created, but not used (either
exposed at the top-level or used indirectly, for example in a
SubDagOperator), we can log a message as well as report it as an
import error (i.e. `DagBag.import_errors`).

I think it is fair to call this an import error because if you didn't
want to expose the DAG, then why create it?

That said, there are two possible ways to implement this:

1. The easiest is simply to record the DAGs that have been created
during loading (or rather, the DAG ids) and subtract them from the
top-level ones during processing to see if there are some that are not
accounted for. We need to add a special provision in SubDagOperator to
avoid having it reported as an error here.

2. The more elegant way is to untangle the reference cycle between DAG
and BaseOperator such that dropped DAGs are actually dropped
(finalized).

I have experimented with both and the only issue besides additional
weakref-based code and logic in (2) is that we have quite a few tests
which use the following pattern:

with DAG(...):
   op1 = ...
   op2 = ...

The DAG instance here isn't assigned to the local scope, but each of
the tasks directly reference it. The changes required in (2) mean that
tasks (and task groups) would now weakly reference the DAG and so
actually, the DAG would be finalized immediately after leaving the
with construct.

The fix is simply to add "as dag" in these tests (it's about a dozen
of them all in all).

I think reference cycles are objectively bad, but there are some
non-trivial changes to the code in order to make it work. For example,
there is some deep-copying action happening where we need to handle
the weakrefs carefully and the pickling logic also requires some
updating.

Option (1) is much simpler, but less elegant.

Cheers

Re: Implicit DAG registration

Posted by Jarek Potiuk <ja...@potiuk.com>.
That looks like a nice improvement.

J.

On Wed, Apr 27, 2022 at 10:04 PM Malthe <mb...@gmail.com> wrote:
>
> On Wed, 27 Apr 2022 at 15:27, Constance Martineau
> <co...@astronomer.io.invalid> wrote:
> > Am intrigued. Curious about dynamic dag pattern, where you create the DAG object in a create_dag function and adding the DAG to globals. Would this new way prevent someone from modifying the dag object within the function, or returning it?
>
> It would not change the current way of programming DAGs. The only
> change is that if you call the DAG constructor (or a function that
> does so including one using the @dag decorator), then that DAG will be
> made available regardless of whether you assign it a top-level
> variable or not.
>
> Cheers

Re: Implicit DAG registration

Posted by Malthe <mb...@gmail.com>.
On Wed, 27 Apr 2022 at 15:27, Constance Martineau
<co...@astronomer.io.invalid> wrote:
> Am intrigued. Curious about dynamic dag pattern, where you create the DAG object in a create_dag function and adding the DAG to globals. Would this new way prevent someone from modifying the dag object within the function, or returning it?

It would not change the current way of programming DAGs. The only
change is that if you call the DAG constructor (or a function that
does so including one using the @dag decorator), then that DAG will be
made available regardless of whether you assign it a top-level
variable or not.

Cheers

Re: Implicit DAG registration

Posted by Constance Martineau <co...@astronomer.io.INVALID>.
Am intrigued. Curious about dynamic dag pattern, where you create the DAG object in a a create_dag function and adding the DAG to globals. Would this new way prevent someone from modifying the dag object within the function, or returning it?

> On Apr 27, 2022, at 11:20 AM, Ferruzzi, Dennis <fe...@amazon.com.INVALID> wrote:
> 
> I don't know what it would take under the hood, but I'm intrigued.  From a user perspective, anything to make the DAG DRYer is a win IMHO.
> 
> ________________________________________
> From: Malthe <mb...@gmail.com>
> Sent: Wednesday, April 27, 2022 1:49 AM
> To: dev@airflow.apache.org
> Subject: [EXTERNAL] Implicit DAG registration
> 
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> DAGs must be registered at the module top-level. During dagbag
> processing we have:
> 
>     top_level_dags = ((o, m) for m in mods for o in
> m.__dict__.values() if isinstance(o, DAG))
> 
> It makes sense of course – we can't have DAGs floating in space, they
> need an anchor.
> 
> Or do they?
> 
> It would be entirely possible for the DAG constructor to register
> itself with some global registry, obviating for the follow pattern:
> 
> with DAG(...) as dag:
>    ...
> 
> I think most users would prefer:
> 
> with DAG(...):
>    ...
> 
> Since there is generally speaking no need for the alias.
> 
> Downsides? You have less control of what DAGs are made available – but
> really, does it make sense to create a DAG object only to drop it
> immediately after?
> 
> I volunteer to implement this if there's a positive feedback.
> 
> Cheers


Re: Implicit DAG registration

Posted by "Ferruzzi, Dennis" <fe...@amazon.com.INVALID>.
I don't know what it would take under the hood, but I'm intrigued.  From a user perspective, anything to make the DAG DRYer is a win IMHO.

________________________________________
From: Malthe <mb...@gmail.com>
Sent: Wednesday, April 27, 2022 1:49 AM
To: dev@airflow.apache.org
Subject: [EXTERNAL] Implicit DAG registration

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



DAGs must be registered at the module top-level. During dagbag
processing we have:

     top_level_dags = ((o, m) for m in mods for o in
m.__dict__.values() if isinstance(o, DAG))

It makes sense of course – we can't have DAGs floating in space, they
need an anchor.

Or do they?

It would be entirely possible for the DAG constructor to register
itself with some global registry, obviating for the follow pattern:

with DAG(...) as dag:
    ...

I think most users would prefer:

with DAG(...):
    ...

Since there is generally speaking no need for the alias.

Downsides? You have less control of what DAGs are made available – but
really, does it make sense to create a DAG object only to drop it
immediately after?

I volunteer to implement this if there's a positive feedback.

Cheers

Re: Implicit DAG registration

Posted by Ash Berlin-Taylor <as...@apache.org>.
This has often caught me out. And I think so long as there's an option to disable the auto-registration (for power users) this sounds like a great feature.

Hash

On 27 April 2022 09:49:50 BST, Malthe <mb...@gmail.com> wrote:
>DAGs must be registered at the module top-level. During dagbag
>processing we have:
>
>     top_level_dags = ((o, m) for m in mods for o in
>m.__dict__.values() if isinstance(o, DAG))
>
>It makes sense of course – we can't have DAGs floating in space, they
>need an anchor.
>
>Or do they?
>
>It would be entirely possible for the DAG constructor to register
>itself with some global registry, obviating for the follow pattern:
>
>with DAG(...) as dag:
>    ...
>
>I think most users would prefer:
>
>with DAG(...):
>    ...
>
>Since there is generally speaking no need for the alias.
>
>Downsides? You have less control of what DAGs are made available – but
>really, does it make sense to create a DAG object only to drop it
>immediately after?
>
>I volunteer to implement this if there's a positive feedback.
>
>Cheers