You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Jacob Ferriero <jf...@google.com.INVALID> on 2020/09/16 01:48:34 UTC

[DISCUSS] Refactoring Import Errors

Hello Airflow Dev List,

I'm considering proposing a refactor to import errors in order to support
sending alert emails when the scheduler finds an import error (but not
every time the scheduler finds the same import error). This is currently
not possible because the import errors are cleared during each scheduler
loop.

I'd like to poll the community for perspectives on other short commings of
the import error model before proposing a refactor or other use cases folks
might have for such a refactor (e.g. supporting an arbitrary callback
function similar to SLA miss).

My current thought is to just add an import_errors_history table to the
database that is not cleared on each scheduler loop and does keep track of
if an email was sent in a boolean field. The primary key could be
constructed from a file hash and exception classname.

Does this one use case warrant a new table? Should we just replace the
import_errors table in place?

If I can get a sense of high-level direction I can put together an AIP / PR.

Cheers,
Jake

-- 

*Jacob Ferriero*

Strategic Cloud Engineer: Data Engineering

jferriero@google.com

617-714-2509

Re: [DISCUSS] Refactoring Import Errors

Posted by Jacob Ferriero <jf...@google.com.INVALID>.
> One of the most important changes is to ensure a stable ID

I think  file path + file hash + stacktrace hash might be a good candidate
(as this seems to uniquely qualify a new error).

> Why do we need a new table - import_errors_history? Can't we use the
current table?

Good question! I think we can use the existing table (longer explanation of
thought process below).

I was thinking with ClusterPolicyViolation you can have a task rule that
fails non-deterministically for a given file name / hash (e.g. jinja
renders to something your rule deems illegal like for a DAG must have
non-default owners rule and a task that sets owner = '{{
random.choice({'airflow', 'not_airflow'}) }}' or if you're asserting
something at DAG level (e.g. too many tasks) this could fail only sometimes
in the case of DAGs with dynamic number of tasks.

The naive email notifications implementation would be to send an email
every time we encounter an error we didn't encounter in the previous
scheduler loop.
However, this still could be spammy for file that violate a policy ~50% of
the time they are parsed (resulting in likely rapid consecutive add /
deletes).
The benefit of having a history table is that we could have logic to
provide a "digest" per file path / hash / stacktrace.
I think we can achieve this more elegantly with the current table by adding
a "last_seen" and / or "last_emailed" timestamps and a "times_seen"
counter. This way we could email the first time we see a new error or every
(configurable) N minutes the "times_seen" for that error.


On Thu, Sep 17, 2020 at 2:53 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> True Kamil.  We can do it now and the file vs. dag is an important
> difference indeed!
>
> J
>
>
> On Thu, Sep 17, 2020 at 10:44 AM Kamil Breguła <ka...@polidea.com>
> wrote:
>
> > Only in some cases, we do not have a DAG ID, but only have the path to
> the
> > file.
> >
> > In my opinion, we can do it now. One of the most important changes is to
> > ensure a stable ID. Now we delete and add the errors again, and we should
> > check which errors should be deleted and which should be added. If we
> have
> > a stable ID then we can add new metadata to the row.
> >
> > Why do we need a new table - import_errors_history? Can't we use the
> > current table?
> >
> >
> >
> > On Thu, Sep 17, 2020 at 10:01 AM Jarek Potiuk <Ja...@polidea.com>
> > wrote:
> >
> > > I am all for it.
> > >
> > > This should be - likely - connected with the future versioning of DAGs
> > > (currently deferred to 2.1).
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
> > > - possibly, rather than being a separate AIP, it should be
> > > incorporated there.
> > >
> > > I believe in the versioning implementation we will already have a
> > > table where we will keep information about DAGs together with their
> > > hash, so it seems natural that such "errors" should be connected to
> > > such "DAG_ID" "HASH" combination.
> > >
> > > And I would love to change the name of it to "parse errors". "Import
> > > errors" suggests that those are errors that come from a wrong "import"
> > > statement. But we are really talking about any kind of parsing error.
> > >
> > > J.
> > >
> > > On Wed, Sep 16, 2020 at 3:48 AM Jacob Ferriero
> > > <jf...@google.com.invalid> wrote:
> > > >
> > > > Hello Airflow Dev List,
> > > >
> > > > I'm considering proposing a refactor to import errors in order to
> > support
> > > > sending alert emails when the scheduler finds an import error (but
> not
> > > > every time the scheduler finds the same import error). This is
> > currently
> > > > not possible because the import errors are cleared during each
> > scheduler
> > > > loop.
> > > >
> > > > I'd like to poll the community for perspectives on other short
> commings
> > > of
> > > > the import error model before proposing a refactor or other use cases
> > > folks
> > > > might have for such a refactor (e.g. supporting an arbitrary callback
> > > > function similar to SLA miss).
> > > >
> > > > My current thought is to just add an import_errors_history table to
> the
> > > > database that is not cleared on each scheduler loop and does keep
> track
> > > of
> > > > if an email was sent in a boolean field. The primary key could be
> > > > constructed from a file hash and exception classname.
> > > >
> > > > Does this one use case warrant a new table? Should we just replace
> the
> > > > import_errors table in place?
> > > >
> > > > If I can get a sense of high-level direction I can put together an
> AIP
> > /
> > > PR.
> > > >
> > > > Cheers,
> > > > Jake
> > > >
> > > > --
> > > >
> > > > *Jacob Ferriero*
> > > >
> > > > Strategic Cloud Engineer: Data Engineering
> > > >
> > > > jferriero@google.com
> > > >
> > > > 617-714-2509 <(617)%20714-2509>
> > >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48%20660%20796%20129>
> > >
> >
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48%20660%20796%20129> <+48660796129
> <+48%20660%20796%20129>>
> [image: Polidea] <https://www.polidea.com/>
>


-- 

*Jacob Ferriero*

Strategic Cloud Engineer: Data Engineering

jferriero@google.com

617-714-2509

Re: [DISCUSS] Refactoring Import Errors

Posted by Jarek Potiuk <Ja...@polidea.com>.
True Kamil.  We can do it now and the file vs. dag is an important
difference indeed!

J


On Thu, Sep 17, 2020 at 10:44 AM Kamil Breguła <ka...@polidea.com>
wrote:

> Only in some cases, we do not have a DAG ID, but only have the path to the
> file.
>
> In my opinion, we can do it now. One of the most important changes is to
> ensure a stable ID. Now we delete and add the errors again, and we should
> check which errors should be deleted and which should be added. If we have
> a stable ID then we can add new metadata to the row.
>
> Why do we need a new table - import_errors_history? Can't we use the
> current table?
>
>
>
> On Thu, Sep 17, 2020 at 10:01 AM Jarek Potiuk <Ja...@polidea.com>
> wrote:
>
> > I am all for it.
> >
> > This should be - likely - connected with the future versioning of DAGs
> > (currently deferred to 2.1).
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
> > - possibly, rather than being a separate AIP, it should be
> > incorporated there.
> >
> > I believe in the versioning implementation we will already have a
> > table where we will keep information about DAGs together with their
> > hash, so it seems natural that such "errors" should be connected to
> > such "DAG_ID" "HASH" combination.
> >
> > And I would love to change the name of it to "parse errors". "Import
> > errors" suggests that those are errors that come from a wrong "import"
> > statement. But we are really talking about any kind of parsing error.
> >
> > J.
> >
> > On Wed, Sep 16, 2020 at 3:48 AM Jacob Ferriero
> > <jf...@google.com.invalid> wrote:
> > >
> > > Hello Airflow Dev List,
> > >
> > > I'm considering proposing a refactor to import errors in order to
> support
> > > sending alert emails when the scheduler finds an import error (but not
> > > every time the scheduler finds the same import error). This is
> currently
> > > not possible because the import errors are cleared during each
> scheduler
> > > loop.
> > >
> > > I'd like to poll the community for perspectives on other short commings
> > of
> > > the import error model before proposing a refactor or other use cases
> > folks
> > > might have for such a refactor (e.g. supporting an arbitrary callback
> > > function similar to SLA miss).
> > >
> > > My current thought is to just add an import_errors_history table to the
> > > database that is not cleared on each scheduler loop and does keep track
> > of
> > > if an email was sent in a boolean field. The primary key could be
> > > constructed from a file hash and exception classname.
> > >
> > > Does this one use case warrant a new table? Should we just replace the
> > > import_errors table in place?
> > >
> > > If I can get a sense of high-level direction I can put together an AIP
> /
> > PR.
> > >
> > > Cheers,
> > > Jake
> > >
> > > --
> > >
> > > *Jacob Ferriero*
> > >
> > > Strategic Cloud Engineer: Data Engineering
> > >
> > > jferriero@google.com
> > >
> > > 617-714-2509
> >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea | Principal Software Engineer
> >
> > M: +48 660 796 129
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: [DISCUSS] Refactoring Import Errors

Posted by Kamil Breguła <ka...@polidea.com>.
Only in some cases, we do not have a DAG ID, but only have the path to the
file.

In my opinion, we can do it now. One of the most important changes is to
ensure a stable ID. Now we delete and add the errors again, and we should
check which errors should be deleted and which should be added. If we have
a stable ID then we can add new metadata to the row.

Why do we need a new table - import_errors_history? Can't we use the
current table?



On Thu, Sep 17, 2020 at 10:01 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> I am all for it.
>
> This should be - likely - connected with the future versioning of DAGs
> (currently deferred to 2.1).
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
> - possibly, rather than being a separate AIP, it should be
> incorporated there.
>
> I believe in the versioning implementation we will already have a
> table where we will keep information about DAGs together with their
> hash, so it seems natural that such "errors" should be connected to
> such "DAG_ID" "HASH" combination.
>
> And I would love to change the name of it to "parse errors". "Import
> errors" suggests that those are errors that come from a wrong "import"
> statement. But we are really talking about any kind of parsing error.
>
> J.
>
> On Wed, Sep 16, 2020 at 3:48 AM Jacob Ferriero
> <jf...@google.com.invalid> wrote:
> >
> > Hello Airflow Dev List,
> >
> > I'm considering proposing a refactor to import errors in order to support
> > sending alert emails when the scheduler finds an import error (but not
> > every time the scheduler finds the same import error). This is currently
> > not possible because the import errors are cleared during each scheduler
> > loop.
> >
> > I'd like to poll the community for perspectives on other short commings
> of
> > the import error model before proposing a refactor or other use cases
> folks
> > might have for such a refactor (e.g. supporting an arbitrary callback
> > function similar to SLA miss).
> >
> > My current thought is to just add an import_errors_history table to the
> > database that is not cleared on each scheduler loop and does keep track
> of
> > if an email was sent in a boolean field. The primary key could be
> > constructed from a file hash and exception classname.
> >
> > Does this one use case warrant a new table? Should we just replace the
> > import_errors table in place?
> >
> > If I can get a sense of high-level direction I can put together an AIP /
> PR.
> >
> > Cheers,
> > Jake
> >
> > --
> >
> > *Jacob Ferriero*
> >
> > Strategic Cloud Engineer: Data Engineering
> >
> > jferriero@google.com
> >
> > 617-714-2509
>
>
>
> --
>
> Jarek Potiuk
> Polidea | Principal Software Engineer
>
> M: +48 660 796 129
>

Re: [DISCUSS] Refactoring Import Errors

Posted by Jarek Potiuk <Ja...@polidea.com>.
I am all for it.

This should be - likely - connected with the future versioning of DAGs
(currently deferred to 2.1).
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
- possibly, rather than being a separate AIP, it should be
incorporated there.

I believe in the versioning implementation we will already have a
table where we will keep information about DAGs together with their
hash, so it seems natural that such "errors" should be connected to
such "DAG_ID" "HASH" combination.

And I would love to change the name of it to "parse errors". "Import
errors" suggests that those are errors that come from a wrong "import"
statement. But we are really talking about any kind of parsing error.

J.

On Wed, Sep 16, 2020 at 3:48 AM Jacob Ferriero
<jf...@google.com.invalid> wrote:
>
> Hello Airflow Dev List,
>
> I'm considering proposing a refactor to import errors in order to support
> sending alert emails when the scheduler finds an import error (but not
> every time the scheduler finds the same import error). This is currently
> not possible because the import errors are cleared during each scheduler
> loop.
>
> I'd like to poll the community for perspectives on other short commings of
> the import error model before proposing a refactor or other use cases folks
> might have for such a refactor (e.g. supporting an arbitrary callback
> function similar to SLA miss).
>
> My current thought is to just add an import_errors_history table to the
> database that is not cleared on each scheduler loop and does keep track of
> if an email was sent in a boolean field. The primary key could be
> constructed from a file hash and exception classname.
>
> Does this one use case warrant a new table? Should we just replace the
> import_errors table in place?
>
> If I can get a sense of high-level direction I can put together an AIP / PR.
>
> Cheers,
> Jake
>
> --
>
> *Jacob Ferriero*
>
> Strategic Cloud Engineer: Data Engineering
>
> jferriero@google.com
>
> 617-714-2509



-- 

Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129