Posted to dev@airflow.apache.org by Axovision Team <da...@axovision.com> on 2020/09/03 09:10:19 UTC

[DISCUSS] Airflow concept of full stack scheduling system

Although Apache Airflow is strongly focused on ETL processes, there is rising
interest in using it as a full-stack scheduling system for all types of
processes. At AXOVISION we use Apache Airflow as a scheduling system not
only for ETL use cases, but also as a full replacement for cron jobs.
However, some functionality is still missing for a full-stack scheduling
system. We think that adding these features is not too complex, because the
fundamentals are already there; a few simple additions would let Airflow
serve as a full-stack scheduling system alongside ETL process management.
The most important features to add would be: *Dynamic schedule intervals*
and *Unpause/Pause without any catchup (no run of last recent)*

Description
There are several questions on Stack Overflow asking for a dynamic
schedule interval, that is, the ability to change the schedule interval
programmatically via API or CLI after the DAG has been created.
Some of these questions also ask for a more detailed discussion of the
topic, which we could not find.
With the ability to change DAG schedule intervals dynamically, Airflow could
increase user satisfaction and fully replace other custom cron-like
scheduling systems.
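
One workaround that is sometimes used today is to compute the schedule while the DAG file is parsed. The following is only a minimal sketch of that idea, assuming the cron expression is kept in an environment variable; the variable name `DB_UPDATE_SCHEDULE`, the default value, and the helper `resolve_schedule` are hypothetical and not part of Airflow. The returned string would be passed as the DAG's `schedule_interval` argument.

```python
import os

# Hypothetical default: every day at 06:00.
DEFAULT_SCHEDULE = "0 6 * * *"


def resolve_schedule(env=os.environ, key="DB_UPDATE_SCHEDULE", default=DEFAULT_SCHEDULE):
    """Return the cron expression stored under `key`, or `default`.

    Only a shallow sanity check is done (five whitespace-separated fields);
    this is not a full cron validator.
    """
    candidate = env.get(key, "").strip()
    if len(candidate.split()) == 5:
        return candidate
    return default


# The result would be handed to the DAG definition as its schedule_interval.
schedule = resolve_schedule()
```

This only takes effect when the DAG file is parsed again, and it does not offer the API or CLI control asked for above, which is why it is a workaround rather than a solution.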

Reference Stack Overflow links:

   - https://stackoverflow.com/questions/63494560/airflow-schedule-interval-change
   - https://stackoverflow.com/questions/63271671/can-we-parameterize-the-airflow-schedule-interval-dynamically-reading-from-the-v
   - https://stackoverflow.com/questions/37294560/airflow-changing-the-crontab-time-for-a-dag-in-airflow

It is also mentioned on the Common Pitfalls page:

   - https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls

Use case / motivation
As a user, I want to change the schedule interval of an already defined DAG,
so that the DAG can run at different points in time depending on an
external condition.

The motivation is simple: many use cases do not fit a cron-like schedule
interval that is defined only once, when the DAG is created (e.g.
event-driven schedules).

Use Case:
In the morning a database update is announced. Due to ongoing work
in the backend, the database should always be updated 2 hours after the
announcement, which can be a different point in time each day, depending on
when the event happened. So the database update DAG needs to be rescheduled
to a new interval.
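
The timing in this use case can be sketched as follows. This is only an illustration of why a fixed cron expression does not fit: the target time is derived from the event, not from the clock. The helper name `update_time_for` and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Delay between the announcement and the database update, as described above.
UPDATE_DELAY = timedelta(hours=2)


def update_time_for(announced_at, delay=UPDATE_DELAY):
    """Return the point in time at which the update DAG should run."""
    if announced_at.tzinfo is None:
        # Naive datetimes are ambiguous across time zones, so reject them.
        raise ValueError("announced_at must be timezone-aware")
    return announced_at + delay


# Hypothetical announcement at 07:15 UTC.
announcement = datetime(2020, 9, 3, 7, 15, tzinfo=timezone.utc)
run_at = update_time_for(announcement)
```

How such a computed time would be handed to Airflow is exactly the open question of this proposal.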

Related Issues
I could not find any directly related issues.

Re: [DISCUSS] Airflow concept of full stack scheduling system

Posted by Zikun Zhu <ku...@gmail.com>.
Recently I looked into the topic of *Unpause/Pause without any catchup (no
run of last recent)* and I have written some comments on GitHub:
https://github.com/apache/airflow/issues/9914#issuecomment-680934697

Basically, the last-recent run looks like a catchup, but it is really just
scheduling with a longer delay. A more detailed explanation is in the comment
for anyone interested. I do not have a clean solution in mind yet.
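
A simplified model of this behaviour, not Airflow's actual scheduler code: it assumes a fixed `timedelta` interval and catchup turned off, so only the most recent fully elapsed interval gets a run. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def latest_complete_interval(anchor, interval, now):
    """Return (start, end) of the most recent interval that has fully elapsed.

    Intervals are [anchor + k * interval, anchor + (k + 1) * interval).
    Returns None if no interval has completed yet.
    """
    elapsed = (now - anchor) // interval  # number of fully elapsed intervals
    if elapsed < 1:
        return None
    start = anchor + (elapsed - 1) * interval
    return start, start + interval


def delay_on_unpause(anchor, interval, unpaused_at):
    """How long after the interval ended its run is actually created."""
    latest = latest_complete_interval(anchor, interval, unpaused_at)
    if latest is None:
        return None
    _, end = latest
    return unpaused_at - end


# Daily DAG anchored at 2020-09-01 00:00, unpaused on 2020-09-03 at 15:00.
anchor = datetime(2020, 9, 1)
unpaused_at = datetime(2020, 9, 3, 15, 0)
interval_run = latest_complete_interval(anchor, timedelta(days=1), unpaused_at)
late_by = delay_on_unpause(anchor, timedelta(days=1), unpaused_at)
```

In this example the run for the interval starting 2020-09-02 is created 15 hours after that interval ended. Under this model it is the same run a never-paused DAG would have created at midnight, only delayed.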

Zikun


Re: [DISCUSS] Airflow concept of full stack scheduling system

Posted by Kaxil Naik <ka...@gmail.com>.
We have an ongoing discussion at
https://lists.apache.org/thread.html/2b12ae265795ff2e655a5161c972f5c7bbe60722a12849a0e2c5c55f%40%3Cdev.airflow.apache.org%3E
if you'd like to add some thoughts over there.




Re: [DISCUSS] Airflow concept of full stack scheduling system

Posted by Jarek Potiuk <Ja...@polidea.com>.
I am starting to think more and more that we should also support this kind
of case. We are not very far from it and people are using it this way
anyway, so it would be turning a blind eye if we did not try to
accommodate it, especially since it is not that difficult to do.

I think maybe we should then have ETL (or rather Data Interval)
and non-Data Interval types of DAGs?

And this actually goes hand-in-hand with the discussion about
"schedule at the end or beginning of the interval".

Maybe introducing the two kinds of DAGs might kill two birds with
one stone?

We could have "Data Interval DAGs" with cron specs pointing to the end
of the interval, and "non-data-interval" ones triggered according to
the cron schedule.  Those two seem related.
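
A rough sketch of how the two kinds could differ, under my own assumptions: both fire at the cron tick, but only the data-interval kind is associated with the interval that ended at that tick. The function names and dictionary keys are hypothetical, and a fixed `timedelta` stands in for a real cron spec.

```python
from datetime import datetime, timedelta


def data_interval_run(tick, interval):
    """Run fired at `tick` that processes the interval which just ended."""
    start = tick - interval
    return {
        "fires_at": tick,
        "execution_date": start,  # start of the covered interval
        "covers": (start, tick),  # half-open interval [start, tick)
    }


def plain_cron_run(tick):
    """Run fired at `tick` with no data interval attached (assumption)."""
    return {"fires_at": tick, "execution_date": tick, "covers": None}


# Daily tick at midnight on 2020-09-03.
tick = datetime(2020, 9, 3)
etl_run = data_interval_run(tick, timedelta(days=1))
cron_run = plain_cron_run(tick)
```

Both runs start at the same wall-clock time; the difference is only in the date they carry, which is where the "end or beginning of the interval" discussion comes in.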

Just a wild thought :)

J.




-- 

Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129