Posted to dev@airflow.apache.org by Chao-Han Tsai <mi...@gmail.com> on 2019/03/31 02:57:08 UTC

[DISCUSS] Utility DAGs

Hi all,

I have been thinking about adding some DAGs for Airflow cluster
operation, for instance DAG schedule delay instrumentation and log
retention. We currently have example_dags; should we add another
directory, utility_dags, to the repo? We could have a flag in
airflow.cfg that lets users decide whether to load the utility_dags (just
as we do for example_dags).
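
As a rough illustration of the flag idea, here is a minimal Python sketch of how such a switch might be read. It assumes a hypothetical `load_utility_dags` option in the `[core]` section, next to the existing `load_examples` option. The option name, the `dag_folders_to_load` helper and the folder paths are made up for this example and are not part of Airflow.

```python
import configparser

# Sample airflow.cfg fragment. `load_examples` is an existing [core] option;
# `load_utility_dags` is the flag proposed in this thread and is hypothetical.
SAMPLE_CFG = """
[core]
load_examples = False
load_utility_dags = True
"""


def dag_folders_to_load(cfg_text, dags_folder="/opt/airflow/dags"):
    """Return the DAG folders a loader would scan for the given config text.

    Hypothetical helper: it only models the on/off decision, not how
    Airflow actually discovers DAG files.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)

    folders = [dags_folder]
    # Fallbacks are assumptions: examples on by default, utility DAGs off.
    if parser.getboolean("core", "load_examples", fallback=True):
        folders.append("example_dags")
    if parser.getboolean("core", "load_utility_dags", fallback=False):
        folders.append("utility_dags")
    return folders
```

With the sample config above, only the user's DAG folder and `utility_dags` would be scanned. Leaving the flag off by default would keep existing deployments unchanged.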

What do you think?

-- 
Chao-Han Tsai

Re: [DISCUSS] Utility DAGs

Posted by Ash Berlin-Taylor <as...@apache.org>.
> For example, say we want to support retention in some Airflow tables such
> as task_instance, dag_run and log, it seems reasonable to me to create a
> DAG to periodically clean up the tables

I guess you mean something like https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/db-cleanup but just shipped with Airflow?

The reason I don't think this is best done as a DAG in Airflow is that we can do it better/cleaner if it is core to Airflow:

- We don't need to speculatively run a DAG that does nothing
- We don't need to "waste" an executor slot
- It could automatically be done before/after running another task in the dag
- We don't create extra task instance rows/logs that we then have to clean up too.

That is why I don't think this sort of built-in functionality should be a DAG if it is shipped _with_ Airflow.

-ash


> On 31 Mar 2019, at 20:42, Chao-Han Tsai <mi...@gmail.com> wrote:
> 
> Thanks Ash and Kevin for the feedback.
> 
> I think there are some utilities that can be solved easily with a DAG
> without introducing more logic to complicate the scheduler code. Also,
> these utilities may run periodically and can be abstracted out with a DAG.
> For example, say we want to support retention in some Airflow tables such
> as task_instance, dag_run and log, it seems reasonable to me to create a
> DAG to periodically clean up the tables.
> 
> Would like to learn more about the concerns about introducing these utility
> DAGs.
> 
> On Sun, Mar 31, 2019 at 1:17 AM Kevin Yang <yr...@gmail.com> wrote:
> 
>> Agree on having core Airflow-related stuff built into Airflow (like
>> schedule delay instrumentation) and leaving the rest to cluster maintainers
>> to set up (like log retention). How people handle log retention might
>> differ quite a bit depending on the logging backend. E.g. we use Elasticsearch
>> and we don't even manage log retention ourselves. Same for stuff like
>> metrics/alert submission.
>> 
>> Just my $0.02
>> 
>> Cheers,
>> Kevin Y
>> 
>> On Sun, Mar 31, 2019 at 12:48 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>> 
>>> Do these need to be DAGs if they are built in to Airflow, or could/should
>>> they just be handled internally by the scheduler?
>>> 
>>> -a
>>> 
>>> On 31 March 2019 03:57:08 BST, Chao-Han Tsai <mi...@gmail.com>
>> wrote:
>>>> Hi all,
>>>> 
>>>> I have been thinking about adding some DAGs that are for the purpose of
>>>> Airflow cluster operation, DAG schedule delay instrumentation and log
>>>> retention for instance. Currently we have example_dags, should we add
>>>> another directory utility_dags in the repo? We can have a flag in
>>>> airflow.cfg to let user decide whether to load the utility_dags (just
>>>> like
>>>> what we did for example_dags).
>>>> 
>>>> What do you think?
>>>> 
>>>> --
>>>> Chao-Han Tsai
>>> 
>> 
> 
> 
> -- 
> 
> Chao-Han Tsai


Re: [DISCUSS] Utility DAGs

Posted by Chao-Han Tsai <mi...@gmail.com>.
Thanks Ash and Kevin for the feedback.

I think there are some utilities that can be implemented easily as a DAG
without adding more logic that complicates the scheduler code. Also,
these utilities may run periodically, which a DAG abstracts well.
For example, if we want to support retention in some Airflow tables such
as task_instance, dag_run and log, it seems reasonable to me to create a
DAG to periodically clean up the tables.

I would like to learn more about the concerns with introducing these utility
DAGs.
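
To make the retention idea concrete, here is a minimal sketch of the cleanup step that such a DAG's task might run. It uses an in-memory SQLite database so that it is self-contained. The `purge_old_rows` and `demo` names, the 30-day window, and the timestamp column chosen for each table are assumptions for illustration and should be checked against the real Airflow schema; this is not an existing Airflow feature.

```python
import sqlite3
from datetime import datetime, timedelta

# Tables named in this thread. The timestamp column picked for each one is
# an assumption for this sketch, not a statement about the Airflow schema.
RETENTION_TARGETS = {
    "task_instance": "execution_date",
    "dag_run": "execution_date",
    "log": "dttm",
}


def purge_old_rows(conn, now, max_age_days=30):
    """Delete rows older than max_age_days; return deleted counts per table."""
    # ISO-8601 strings in one fixed format compare correctly as text.
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    deleted = {}
    for table, column in RETENTION_TARGETS.items():
        cur = conn.execute(f"DELETE FROM {table} WHERE {column} < ?", (cutoff,))
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


def demo():
    """Build toy tables with rows aged 1, 10, 45 and 90 days, then purge."""
    conn = sqlite3.connect(":memory:")
    now = datetime(2019, 3, 31)
    for table, column in RETENTION_TARGETS.items():
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {column} TEXT)")
        conn.executemany(
            f"INSERT INTO {table} ({column}) VALUES (?)",
            [((now - timedelta(days=d)).isoformat(),) for d in (1, 10, 45, 90)],
        )
    return purge_old_rows(conn, now, max_age_days=30)
```

In a real DAG this function would presumably be wrapped in a single periodic task and pointed at the metadata database. The same logic could equally be called from inside the scheduler, which is the alternative raised in this thread.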

On Sun, Mar 31, 2019 at 1:17 AM Kevin Yang <yr...@gmail.com> wrote:

> Agree on having core Airflow-related stuff built into Airflow (like
> schedule delay instrumentation) and leaving the rest to cluster maintainers
> to set up (like log retention). How people handle log retention might
> differ quite a bit depending on the logging backend. E.g. we use Elasticsearch
> and we don't even manage log retention ourselves. Same for stuff like
> metrics/alert submission.
>
> Just my $0.02
>
> Cheers,
> Kevin Y
>
> On Sun, Mar 31, 2019 at 12:48 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> > Do these need to be DAGs if they are built in to Airflow, or could/should
> > they just be handled internally by the scheduler?
> >
> > -a
> >
> > On 31 March 2019 03:57:08 BST, Chao-Han Tsai <mi...@gmail.com>
> wrote:
> > >Hi all,
> > >
> > >I have been thinking about adding some DAGs that are for the purpose of
> > >Airflow cluster operation, DAG schedule delay instrumentation and log
> > >retention for instance. Currently we have example_dags, should we add
> > >another directory utility_dags in the repo? We can have a flag in
> > >airflow.cfg to let user decide whether to load the utility_dags (just
> > >like
> > >what we did for example_dags).
> > >
> > >What do you think?
> > >
> > >--
> > >Chao-Han Tsai
> >
>


-- 

Chao-Han Tsai

Re: [DISCUSS] Utility DAGs

Posted by Kevin Yang <yr...@gmail.com>.
Agree on having core Airflow-related stuff built into Airflow (like
schedule delay instrumentation) and leaving the rest to cluster maintainers
to set up (like log retention). How people handle log retention might
differ quite a bit depending on the logging backend. E.g. we use Elasticsearch
and we don't even manage log retention ourselves. Same for stuff like
metrics/alert submission.

Just my $0.02

Cheers,
Kevin Y

On Sun, Mar 31, 2019 at 12:48 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> Do these need to be DAGs if they are built in to Airflow, or could/should
> they just be handled internally by the scheduler?
>
> -a
>
> On 31 March 2019 03:57:08 BST, Chao-Han Tsai <mi...@gmail.com> wrote:
> >Hi all,
> >
> >I have been thinking about adding some DAGs that are for the purpose of
> >Airflow cluster operation, DAG schedule delay instrumentation and log
> >retention for instance. Currently we have example_dags, should we add
> >another directory utility_dags in the repo? We can have a flag in
> >airflow.cfg to let user decide whether to load the utility_dags (just
> >like
> >what we did for example_dags).
> >
> >What do you think?
> >
> >--
> >Chao-Han Tsai
>

Re: [DISCUSS] Utility DAGs

Posted by Ash Berlin-Taylor <as...@apache.org>.
Do these need to be DAGs if they are built in to Airflow, or could/should they just be handled internally by the scheduler?

-a

On 31 March 2019 03:57:08 BST, Chao-Han Tsai <mi...@gmail.com> wrote:
>Hi all,
>
>I have been thinking about adding some DAGs that are for the purpose of
>Airflow cluster operation, DAG schedule delay instrumentation and log
>retention for instance. Currently we have example_dags, should we add
>another directory utility_dags in the repo? We can have a flag in
>airflow.cfg to let user decide whether to load the utility_dags (just
>like
>what we did for example_dags).
>
>What do you think?
>
>-- 
>Chao-Han Tsai