Posted to dev@airflow.apache.org by Malthe <mb...@gmail.com> on 2022/02/23 12:44:10 UTC

Simplifying the timetable interface

Hi all,

I was going to take a stab at adding some custom timetable
functionality to address two requirements:

1. The ability to temporarily switch to an alternative timetable for
an interim period.
2. The ability to introduce relatively custom holiday scheduling which
is well outside the functionality of cron expressions.

I could add that while (1) could be done using Python at the
DAG-level, I would like to use the timetable interface to allow
accurate predictions into the future. That's for another post, but to
give some context, I have floated a proposal on Slack to show
tentative scheduled days in the calendar view using a "grey dot"
indication to denote that we expect at least one scheduled run (see
attachment).

Now, I looked at the "afterwork" example to see what's required in
order to implement a custom timetable. And I must admit that I find it
rather daunting given that it's so easy to express what that timetable
is about:

     MON-FRI, daily, run after midnight

Intuitively, that should be a couple of lines of Python code. It ends
up being quite a lot more than that and that's due to the interface
that must be implemented.

I think the correct timetable interface is:

1. Return the next execution time that's strictly (">") after a particular time.
2. Return the earliest runtime for a particular execution time,
accounting for any grace period.
3. Return context metadata for this execution.

The scheduler provides the most recent execution time as input and
creates a dagrun once the current time is at or after the returned
earliest runtime for the next execution time.
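
A minimal sketch of the three operations and the scheduler rule, reading
the rule as "create the run once the current time has reached the
earliest runtime". The names SimpleTimetable, next_execution_time,
earliest_runtime, context and should_create_run are hypothetical and are
not part of Airflow's Timetable class:

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta
from typing import Any, Dict


class SimpleTimetable(ABC):
    """Hypothetical interface, not Airflow's actual Timetable API."""

    @abstractmethod
    def next_execution_time(self, after: datetime) -> datetime:
        """(1) Return the next execution time strictly after ``after``."""

    def earliest_runtime(self, execution_time: datetime) -> datetime:
        """(2) Earliest moment the run may start; no grace period by default."""
        return execution_time

    def context(self, execution_time: datetime) -> Dict[str, Any]:
        """(3) Extra metadata for the run, e.g. a data interval."""
        return {}


def should_create_run(
    timetable: SimpleTimetable, last_execution_time: datetime, now: datetime
) -> bool:
    """Scheduler-side check: has the next run's earliest runtime been reached?"""
    next_time = timetable.next_execution_time(last_execution_time)
    return now >= timetable.earliest_runtime(next_time)


class Hourly(SimpleTimetable):
    """Toy implementation: one execution at the top of every hour."""

    def next_execution_time(self, after: datetime) -> datetime:
        top_of_hour = after.replace(minute=0, second=0, microsecond=0)
        return top_of_hour + timedelta(hours=1)
```

Only (1) has to be written for a simple schedule; (2) and (3) fall back
to defaults in this sketch.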

Considering again the "afterwork" example, with a grace period of 5
minutes, we'd expect a dagrun shortly after 5 minutes past midnight
(of Monday, Tuesday, and so forth up until midnight after Friday). The
execution time _is_ the time where a given task runs (minus the grace
period).
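
To make the example concrete, here is a rough sketch of the "afterwork"
schedule under the proposed interface. It assumes the execution time is
the midnight that closes a Monday-Friday workday (so it falls on Tuesday
through Saturday); the function names and the GRACE constant are
hypothetical:

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=5)  # assumed grace period from the example


def next_execution_time(after: datetime) -> datetime:
    """Next midnight strictly after ``after`` that closes a Mon-Fri workday."""
    midnight = after.replace(hour=0, minute=0, second=0, microsecond=0)
    candidate = midnight + timedelta(days=1)
    # A midnight closing a workday falls on Tuesday..Saturday,
    # i.e. datetime.weekday() values 1 through 5.
    while candidate.weekday() not in (1, 2, 3, 4, 5):
        candidate += timedelta(days=1)
    return candidate


def earliest_runtime(execution_time: datetime) -> datetime:
    """The run may start once the grace period has passed."""
    return execution_time + GRACE
```

For a Friday afternoon input this yields Saturday 00:00 as the execution
time and 00:05 as the earliest runtime; the following execution time is
then Tuesday 00:00.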

The reasoning behind (3) is that I consider the notion of "data
interval" to be metadata, since it is only a concern for the task
implementation. The scheduler, for example, does not need to worry
about it at all.

Other concerns:

- Backfilling is out of scope for the timetable interface.
- Time restrictions (i.e. start and end date) are likewise out of
scope. The scheduler knows when the DAG starts and ends and doesn't
need help from the timetable here.
- Manual runs are trivial because there is no (2) or (3). In fact, for
most DAGs (which care about a data interval), there should probably
not be a play button at all.

I didn't complete the exercise, but it stands to reason that, with the
interface outlined above, the "afterwork" example would be short and
simple.

Thanks

Re: Simplifying the timetable interface

Posted by Malthe <mb...@gmail.com>.
> Catchup is exposed to the Timetable interface to enable this use case
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval#AIP39Richerscheduler_interval-FutureEnhancements

That's an interesting use-case. Perhaps the bullet I have listed as
(3) could be changed into:

3. The timetable provides a method that takes a set of execution times
(whatever the meaning) and returns either a set of execution times or
a dictionary mapping execution times to additional dagrun context
(e.g. a data interval specification).

This would allow you to roll sequential executions into one.
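
A rough sketch of what such a method could look like. It assumes each
execution time closes a data interval of one fixed period, and that
back-to-back execution times are merged into a single run keyed by the
last one; roll_up and its parameters are hypothetical names:

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[datetime, datetime]


def roll_up(
    execution_times: Iterable[datetime],
    period: timedelta = timedelta(days=1),
) -> Dict[datetime, Interval]:
    """Map each kept execution time to the (start, end) interval it covers."""
    runs: Dict[datetime, Interval] = {}
    stretch: List[datetime] = []
    for current in sorted(execution_times):
        # A gap larger than one period closes the current stretch.
        if stretch and current - stretch[-1] != period:
            runs[stretch[-1]] = (stretch[0] - period, stretch[-1])
            stretch = []
        stretch.append(current)
    if stretch:
        runs[stretch[-1]] = (stretch[0] - period, stretch[-1])
    return runs
```

With three consecutive missed daily executions this returns a single
entry whose interval spans all three days.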

Malthe

Re: Simplifying the timetable interface

Posted by Ash Berlin-Taylor <as...@apache.org>.
On Wed, Feb 23 2022 at 14:13:43 +0100, Bas Harenslak 
<ba...@astronomer.io.INVALID> wrote:
> catchup must be implemented which I feel should be generic to all 
> Timetables and not left up to a developer.

Catchup is exposed to the Timetable interface to enable this use case
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval#AIP39Richerscheduler_interval-FutureEnhancements>

> *"Rollup", an alternative to backfill*
> 
> This covers a few use-cases, but once we have an explicit start and 
> end of interval defined it is desirable to be able to re-run the same 
> (for example) daily dag for a longer period.
> 
> This could be an alternative to backfill, where the Trigger interface 
> in the UI could show optional date range inputs letting the user say 
> "run this dag for a 1 month period".
> 
> This behaviour also leads to a new alternative method of doing 
> catchup: if three dag runs are missed, rather than creating three 
> individual daily runs (for example) we could instead create one DAG 
> run with a three-day data interval. This would be opt-in on a per-dag 
> basis, as it depends entirely on how the task is written

(The last paragraph there specifically)


Re: Simplifying the timetable interface

Posted by Malthe <mb...@gmail.com>.
> With that said, I’m not sure what you’re suggesting here? Is this email a conversation starter? Is it a proposal?

I would say it's a proposal outline and a conversation starter in one!

> Note: I cannot see the attachment, believe the mailing list doesn’t allow those.

I have tried uploading it to GitHub now:

https://user-images.githubusercontent.com/26405/155328599-d35b5b27-a556-46b7-8cd0-7e7e6a2f2277.png

Thanks

Re: Simplifying the timetable interface

Posted by Bas Harenslak <ba...@astronomer.io.INVALID>.
Hi Malthe,

I’m all for simplifying the timetable interface. I think Timetables currently require a deep understanding of Airflow’s scheduling semantics, which is far too much to ask of any Airflow user. On top of that, decisions around e.g. catchup must be implemented in each Timetable, which I feel should be generic to all Timetables and not left up to a developer.

With that said, I’m not sure what you’re suggesting here? Is this email a conversation starter? Is it a proposal?

Note: I cannot see the attachment; I believe the mailing list doesn’t allow those.

Bas




Re: Simplifying the timetable interface

Posted by Ash Berlin-Taylor <as...@apache.org>.
I took a stab at implementing this timetable: (I didn't write most of 
the timetable code, that was TP Chung, but I did review most/all of the 
PRs so I clearly know this code)

from typing import Optional

import pendulum
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable


class WeekdayTimetable(Timetable):
    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        earliest: Optional[pendulum.DateTime] = None
        if last_automated_data_interval:
            # Continue from where the previous scheduled run ended
            earliest = pendulum.instance(last_automated_data_interval.end)
        elif restriction:
            if not restriction.catchup:
                # Since we are daily, start of today is the earliest
                earliest = pendulum.today()
            elif restriction.earliest:
                earliest = pendulum.instance(restriction.earliest)

        if not earliest:
            return None

        interval = self._next_weekday_interval(earliest)
        return DagRunInfo(run_after=interval.end, data_interval=interval)

    def infer_manual_data_interval(self, *, run_after: pendulum.DateTime) -> DataInterval:
        return self._next_weekday_interval(run_after)

    def _next_weekday_interval(self, day: pendulum.DateTime) -> DataInterval:
        # Skip the weekend: move Saturday and Sunday forward to the next Monday
        if day.day_of_week in (pendulum.SATURDAY, pendulum.SUNDAY):
            day = day.next(pendulum.MONDAY)

        end = day.start_of('day').add(days=1)
        return DataInterval(day, end)

The output of testing it out:

Friday 21 00:00 Saturday 00:00
Monday 24 00:00 Tuesday 00:00
Tuesday 25 00:00 Wednesday 00:00
Wednesday 26 00:00 Thursday 00:00
Thursday 27 00:00 Friday 00:00
Friday 28 00:00 Saturday 00:00
Monday 31 00:00 Tuesday 00:00
Manual Wednesday 23 13:57 Thursday 00:00
No catchup Wednesday 23 February 00:00 Thursday 00:00


And my test code:

import pendulum
from airflow import DAG

dag2 = DAG(dag_id='timetable-test', timetable=WeekdayTimetable(), start_date=pendulum.DateTime(2022, 1, 21))

info = None
dag2.catchup = True
for _ in range(7):
    info = dag2.next_dagrun_info(info.data_interval if info else None)
    if info:
        print(info.data_interval.start.format('dddd DD HH:mm'), info.data_interval.end.format('dddd HH:mm'))
    else:
        break

data_interval = dag2.timetable.infer_manual_data_interval(run_after=pendulum.now())
print("Manual", data_interval.start.format('dddd DD HH:mm'), data_interval.end.format('dddd HH:mm'))


dag3 = DAG(dag_id='timetable-test', timetable=WeekdayTimetable(), start_date=pendulum.DateTime(2022, 1, 21), catchup=False)
info = dag3.next_dagrun_info(None)
if info:
    print(
        "No catchup",
        info.data_interval.start.format('dddd DD MMMM HH:mm'),
        info.data_interval.end.format('dddd HH:mm'),
    )
else:
    print("none")


On Wed, Feb 23 2022 at 13:39:24 +0000, Ash Berlin-Taylor 
<as...@apache.org> wrote:
> One comment: Please don't use the phrase "execution time" as it's not 
> clear which of the possible meanings it could be (is it the old 
> execution_date? Is it the time the dagrun actually starts?)
> 
> Backfilling is not out of scope for a timetable at all. If I run 
> `airflow dags backfill mydagid --start-date 2020-01-01 --end-date 
> 2021-06-30` how many DagRuns are created and what are logical 
> dates/intervals of them?
> 
>> The scheduler provides the most recent execution time as input and
>> creates a dagrun if the returned earliest runtime for the next
>> execution time is at or after the current time.
> 
> And in case of the very first time a Dag is enabled? I guess it could 
> pass the dag start_date here instead?
> 
> (Implementation detail: It stores the info in 4 columns in the 
> DagModel table, next_dagrun, next_dagrun_interval_start and _end, and 
> next_dagrun_create_after so that the creation of the DagRun can just 
> be done as a DB lookup. Doesn't materially change your statement)
> 
>> - Manual runs are trivial because there is no (2) or (3). In fact, 
>> for
>> most DAGs (which care about a data interval), there should probably
>> not be a play button at all.
> 
> If manually triggered dag runs are out of scope, what are the 
> data_interval_end and data_interval_start values (in the 
> context/templates) for a manually triggered run?
> 
> It's not possible from the UI currently, but `airflow dags trigger` 
> can be provided with a specific execution date. Maybe this should 
> be extended to take the data interval too, but for a user-friendly 
> CLI, inferring it when it is not provided makes it easier to use. 
> (A timetable could choose to return an error for the infer method.)
> 
> The next question is how to go about implementing and releasing this. 
> Now that it's been in a release we can't just break backcompat, so 
> either we need to make this a "Base" template that handles most of 
> this logic, or we need to introspect and tell old Timetables from new.
> 
> You are probably right about earliest/latest and we might be able to 
> do away with that part of the interface.
> 
> -ash


Re: Simplifying the timetable interface

Posted by Collin McNulty <co...@astronomer.io.INVALID>.
Jarek,

I agree fully. I think that, at a bare minimum, we should provide a version
of the CronDataIntervalTimetable that tries to schedule the run immediately
after the logical_date, and a Timetable that just takes a list of datetimes
and schedules at those times.
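
As a rough illustration of the second idea, the core lookup for a
list-of-datetimes schedule is small. This is a plain-Python sketch, not
an Airflow Timetable subclass; EventList and next_after are hypothetical
names:

```python
from bisect import bisect_right
from datetime import datetime
from typing import Iterable, List, Optional


class EventList:
    """Hypothetical helper: a schedule given as an explicit list of datetimes."""

    def __init__(self, events: Iterable[datetime]) -> None:
        # Deduplicate and sort once so lookups can use binary search.
        self._events: List[datetime] = sorted(set(events))

    def next_after(self, after: Optional[datetime]) -> Optional[datetime]:
        """Return the first event strictly after ``after`` (or the first event)."""
        if after is None:
            return self._events[0] if self._events else None
        index = bisect_right(self._events, after)
        return self._events[index] if index < len(self._events) else None
```

A real implementation would still have to map this onto data intervals,
catchup and manual runs.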

Collin McNulty

Re: Simplifying the timetable interface

Posted by Jarek Potiuk <ja...@potiuk.com>.
BTW, just to add to the discussion, this is an attempt by a "regular"
user to use and modify the timetable as defined now:
https://github.com/apache/airflow/issues/22242

To be honest, I don't even know what to answer the user. It seems that
the user followed our advice and tried to work from our Workday examples,
but there is no way the current timetable approach can be easily
tested or verified. If people start using the current timetables to do
anything, this will lead to similar problems and confusion.
I honestly think we should provide the users with a few well-tested
(automatically) configurable timetable "implementations".

J.

On Tue, Mar 1, 2022 at 9:27 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>
>> This would be great. I agree that if there were ready-to-use
>> timetables with a composable behavior then very few users will need to
>> write custom timetables which is a good thing.
>>
> Cool!
>
>>
>> > I do not think the current interface is "too complex". Not at all. But I think that it is targeted to a different audience than Malthe and Bas talk about. It is addressed for "power users" - not only because it requires deep understanding of Airflow scheduling internals and optimizations but also, because it requires "admin" rights to develop, test and install it. Regular users. who are Dag authors cannot create new Timetables. This is mostly because of security. The "regular users"  need to convince the admins to do so. And yes I am talking about the important segment of our users where you have professional admins/devops configuring Airflow and DAG authors who just write DAGs. I think this is the most interesting and biggest segment of our users to be honest. We should always think about this segment of our users first IMHO.
>>
>> Well I think they're too complex!
>>
>> And I actually elaborated quite a bit on that in this correspondence –
>> including giving a concrete proposal which you did not consider or
>> mention. But I'm happy to be proven wrong. Perhaps it is only me who
>> thinks the interface is wrong. Let us see an example implementation of
>> "-2 day of every month" or "every day after work", either using a
>> declarative specification built on top of some composable timetable,
>> or as a direct implementation.
>>
>
> Precisely. I think we should implement those and then we can see if they can be simplified. I think if we have a few customized timetables including comprehensive unit tests, we will be able to see if we can simplify the whole interface for Airflow 3.
> But having those few predefined timetables and the unit tests for those - will be of a great help when we will want to simplify it IMHO. Then for Airflow 3 (which is still months away if not years) we will be in a much better position to make even breaking changes.
>
>>
>>
>> - Composability. It should be possible to use and/or operators in a
>> nested fashion to compose any timetable (e.g. every Friday OR every
>> last weekday AND every day at 6pm).
>> - Intervals. It should be possible to define a timetable which changes
>> over time, as non-overlapping intervals (e.g. one of each year).
>> - Exclude. It should be possible to exclude certain days (i.e. holidays).
>
>
> By all means :  - most of those were already mentioned as "future work" in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval - composability, intervals. I think we should just follow up what has been planned there.
>
>>
>> I think this is a good starting point and it would allow users to meet
>> most of their scheduling needs.
>>
>> Cheers

Re: Simplifying the timetable interface

Posted by Jarek Potiuk <ja...@potiuk.com>.
>
> This would be great. I agree that if there were ready-to-use
> timetables with a composable behavior then very few users will need to
> write custom timetables which is a good thing.
>
> Cool!


> > I do not think the current interface is "too complex". Not at all. But I
> think that it is targeted to a different audience than Malthe and Bas talk
> about. It is addressed for "power users" - not only because it requires
> deep understanding of Airflow scheduling internals and optimizations but
> also, because it requires "admin" rights to develop, test and install it.
> Regular users. who are Dag authors cannot create new Timetables. This is
> mostly because of security. The "regular users"  need to convince the
> admins to do so. And yes I am talking about the important segment of our
> users where you have professional admins/devops configuring Airflow and DAG
> authors who just write DAGs. I think this is the most interesting and
> biggest segment of our users to be honest. We should always think about
> this segment of our users first IMHO.
>
> Well I think they're too complex!
>
> And I actually elaborated quite a bit on that in this correspondence –
> including giving a concrete proposal which you did not consider or
> mention. But I'm happy to be proven wrong. Perhaps it is only me who
> thinks the interface is wrong. Let us see an example implementation of
> "-2 day of every month" or "every day after work", either using a
> declarative specification built on top of some composable timetable,
> or as a direct implementation.
>
>
Precisely. I think we should implement those and then we can see if they
can be simplified. I think if we have a few customized timetables including
comprehensive unit tests, we will be able to see if we can simplify the
whole interface for Airflow 3.
But having those few predefined timetables and the unit tests for those -
will be of a great help when we will want to simplify it IMHO. Then for
Airflow 3 (which is still months away if not years) we will be in a much
better position to make even breaking changes.


>
> - Composability. It should be possible to use and/or operators in a
> nested fashion to compose any timetable (e.g. every Friday OR every
> last weekday AND every day at 6pm).
> - Intervals. It should be possible to define a timetable which changes
> over time, as non-overlapping intervals (e.g. one of each year).
> - Exclude. It should be possible to exclude certain days (i.e. holidays).
>

By all means :  - most of those were already mentioned as "future work" in
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval
- composability, intervals. I think we should just follow up what has been
planned there.


> I think this is a good starting point and it would allow users to meet
> most of their scheduling needs.
>
> Cheers
>

Re: Simplifying the timetable interface

Posted by Malthe <mb...@gmail.com>.
On Sun, 27 Feb 2022 at 14:04, Jarek Potiuk <ja...@potiuk.com> wrote:
> TL;DR; I think it is about time to complete what we were planning in AIP-39 as "Future Enhancement" and implement a few simple timetable implementations that will handle most popular use cases (using the "complex" timetable API) that will be available to regular users to use (without the need of writing new code). My proposal is that we should define what timetables to add and aim to implement them to include them in Airflow 2.3. Sounds doable and should solve the real problem of our users.

This would be great. I agree that if there were ready-to-use
timetables with a composable behavior then very few users will need to
write custom timetables which is a good thing.

> I do not think the current interface is "too complex". Not at all. But I think that it is targeted to a different audience than Malthe and Bas talk about. It is addressed for "power users" - not only because it requires deep understanding of Airflow scheduling internals and optimizations but also, because it requires "admin" rights to develop, test and install it. Regular users. who are Dag authors cannot create new Timetables. This is mostly because of security. The "regular users"  need to convince the admins to do so. And yes I am talking about the important segment of our users where you have professional admins/devops configuring Airflow and DAG authors who just write DAGs. I think this is the most interesting and biggest segment of our users to be honest. We should always think about this segment of our users first IMHO.

Well I think they're too complex!

And I actually elaborated quite a bit on that in this correspondence –
including giving a concrete proposal which you did not consider or
mention. But I'm happy to be proven wrong. Perhaps it is only me who
thinks the interface is wrong. Let us see an example implementation of
"-2 day of every month" or "every day after work", either using a
declarative specification built on top of some composable timetable,
or as a direct implementation.

> The current API is great when it comes to power users who know airflow's scheduling internals and optimizations that Ash explained.

I know the scheduling internals and I was not able to write a custom
timetable. I was looking at the included example and it did not look
like code that a power user should be able to write.

But let's then turn our attention to what we want out of predefined timetables:

- Composability. It should be possible to use and/or operators in a
nested fashion to compose any timetable (e.g. every Friday OR every
last weekday AND every day at 6pm).
- Intervals. It should be possible to define a timetable which changes
over time, as non-overlapping intervals (e.g. one for each year).
- Exclude. It should be possible to exclude certain days (e.g. holidays).
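
To sketch what these three properties could look like, one option is to
treat a schedule as a predicate over calendar days and combine
predicates. This is only an illustration at day granularity; Rule,
weekday, any_of, all_of, excluding, during and next_date are all
hypothetical names:

```python
from datetime import date, timedelta
from typing import Callable, Iterable, Optional

Rule = Callable[[date], bool]


def weekday(n: int) -> Rule:
    """Match one day of the week (Monday is 0, as in date.weekday())."""
    return lambda d: d.weekday() == n


def any_of(*rules: Rule) -> Rule:
    """Composability: logical OR of rules."""
    return lambda d: any(rule(d) for rule in rules)


def all_of(*rules: Rule) -> Rule:
    """Composability: logical AND of rules."""
    return lambda d: all(rule(d) for rule in rules)


def excluding(rule: Rule, days: Iterable[date]) -> Rule:
    """Exclude: drop specific days such as holidays."""
    blocked = frozenset(days)
    return lambda d: rule(d) and d not in blocked


def during(start: date, end: date, rule: Rule) -> Rule:
    """Intervals: apply ``rule`` only within [start, end)."""
    return lambda d: start <= d < end and rule(d)


def next_date(rule: Rule, after: date, horizon_days: int = 366) -> Optional[date]:
    """First matching day strictly after ``after``, searched within a horizon."""
    for offset in range(1, horizon_days + 1):
        candidate = after + timedelta(days=offset)
        if rule(candidate):
            return candidate
    return None
```

For example, excluding(any_of(weekday(4), weekday(0)), holidays) reads as
"every Friday or Monday, except holidays".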

I think this is a good starting point and it would allow users to meet
most of their scheduling needs.

Cheers

Re: Simplifying the timetable interface

Posted by Jarek Potiuk <ja...@potiuk.com>.
As discussed on Slack, I agree with Malthe that the current
timetable interface is complex. But my assessment of the situation and my
proposal, which include a bit more context and the plans we had for AIP-39,
are a bit different.

TL;DR; I think it is about time to complete what we were planning in AIP-39
as "Future Enhancement" and implement a few simple timetable
implementations that will handle most popular use cases (using the
"complex" timetable API) that will be available to regular users to use
(without the need of writing new code). My proposal is that we should
define what timetables to add and aim to implement them to include them in
Airflow 2.3. Sounds doable and should solve the real problem of our users.

Assessment of the situation.

I do not think the current interface is "too complex". Not at all. But I
think that it is targeted at a different audience than the one Malthe and
Bas talk about. It is aimed at "power users", not only because it requires
a deep understanding of Airflow scheduling internals and optimizations but
also because it requires "admin" rights to develop, test and install it.
Regular users, who are DAG authors, cannot create new Timetables. This is
mostly because of security. The "regular users" need to convince the
admins to do so. And yes, I am talking about the important segment of our
users where you have professional admins/devops configuring Airflow and DAG
authors who just write DAGs. I think this is the most interesting and
biggest segment of our users, to be honest. We should always think about
this segment of our users first IMHO.

But what I very strongly agree with is that we have a very limited "offering"
for the "DAG authors" to be able to harness the powers of our
non-cron-based timetables. A typical ask that cannot be easily fulfilled
(and which I have seen many cases of) comes from a Slack discussion on
Friday: *"Can someone provide me some codesamples of scheduling a job on the
second to last day of every month using timetables?"*
https://apache-airflow.slack.com/archives/CCPRP7943/p1645697960286899 .
Currently, Airflow out of the box has no way of supporting that (rather
typical) use case without actually becoming the "power user with admin
rights". You need to have "someone else" provide it as a plugin that
you install. This is what we miss currently (and it has actually already
been planned as a future enhancement in AIP-39:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval
).
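
For reference, the date arithmetic behind the "second to last day of
every month" ask is short; the hard part is wiring it into the timetable
API. A minimal sketch using only the standard library, with hypothetical
function names:

```python
import calendar
from datetime import date


def nth_from_month_end(year: int, month: int, offset: int = 2) -> date:
    """offset=1 is the last day of the month, offset=2 the second to last."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day - offset + 1)


def next_second_to_last_day(after: date) -> date:
    """First second-to-last day of a month that is strictly after ``after``."""
    candidate = nth_from_month_end(after.year, after.month)
    if candidate > after:
        return candidate
    # Otherwise move on to the following month, rolling over the year.
    if after.month == 12:
        return nth_from_month_end(after.year + 1, 1)
    return nth_from_month_end(after.year, after.month + 1)
```

calendar.monthrange accounts for leap years, so February needs no
special handling.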

What current API provides and for whom

The current API is great when it comes to power users who know airflow's
scheduling internals and optimizations that Ash explained. Looking at where
the AIP-39 came from:

* have a "versatile API" with which you can implement literally ANY
timetable
* where there are no fixed schedule intervals
* where manual runs can co-exist with scheduled runs
* where you can specify a backfill range and it will figure out how many
dagruns there will be in this range and run them
* and which allows optimizing scheduling decisions (the date of the next run
is stored in the database for easy DB queries, among others)

I think the current API fulfills those very well and is a great "low level"
API that we can build more "higher-level" implementations of Custom
Timetables on.
But the current API is terrible for casual DAG Authors who want to use
non-cron-compatible timetables - both because of complexity and security
limitations.

What can we do?

I think we should design and write a few (literally a few) higher-level
timetables meant to be used by "regular" DAG authors without
installing anything. Not many. Just a few. We could rather easily ask our
users and produce a list of several timetables that will not have "cron"
limitations but will also handle just a subset of "general timetable"
cases. For example, a Timetable that will allow the user to run on the
"-2 day of every month" (the second to last day). Those
timetables should be available in Airflow out of the box. No package
installation or admin permission necessary. We literally need two or three
such schedules, and we should be open to users expressing their
non-cron-compliant "typical" schedules and add them as needed.

I do not yet have a clear idea of the "UX/declarative configuration" for such
timetables (but something that comes to my mind is that one of those could
allow a textual description of the schedule - it would be extremely cool if
the users could create the schedule like timetable="run on the second to
last day of every month"). With NLP solutions out there, it should be
possible because the domain of "typical" scheduling is really narrow. Maybe
there are some libraries we could use for that :D. But this is just an
idea, maybe we can do it differently.

Those are my thoughts :).


J.



On Wed, Feb 23, 2022 at 4:23 PM Malthe <mb...@gmail.com> wrote:

> On Wed, 23 Feb 2022 at 15:20, Ash Berlin-Taylor <as...@apache.org> wrote:
> >
> > On Wed, Feb 23 2022 at 15:17:48 +0000, Malthe <mb...@gmail.com> wrote:
> >
> > Backfilling is not out of scope for a timetable at all. If I run
> `airflow dags backfill mydagid --start-date 2020-01-01 --end-date
> 2021-06-30` how many DagRuns are created and what are logical
> dates/intervals of them?
> >
> > If the timetable has a daily frequency, then one dagrun per day in that
> interval.
> >
> >
> > DAGs don't have a frequency. They have a timetable. They don't even have
> a schedule_interval anymore -- that gets converted to an instance of the
> CronDataIntervalTimetable
>
> Yes, if the timetable has a daily frequency internally – that is, if
> the timetable has a logic that produces dagruns spaced out daily –
> then I would expect one dagrun per day in the given interval.
>

Re: Simplifying the timetable interface

Posted by Malthe <mb...@gmail.com>.
On Wed, 23 Feb 2022 at 15:20, Ash Berlin-Taylor <as...@apache.org> wrote:
>
> On Wed, Feb 23 2022 at 15:17:48 +0000, Malthe <mb...@gmail.com> wrote:
>
> Backfilling is not out of scope for a timetable at all. If I run `airflow dags backfill mydagid --start-date 2020-01-01 --end-date 2021-06-30` how many DagRuns are created and what are logical dates/intervals of them?
>
> If the timetable has a daily frequency, then one dagrun per day in that interval.
>
>
> DAGs don't have a frequency. They have a timetable. They don't even have a schedule_interval anymore -- that gets converted to an instance of the CronDataIntervalTimetable

Yes, if the timetable has a daily frequency internally – that is, if
the timetable has a logic that produces dagruns spaced out daily –
then I would expect one dagrun per day in the given interval.
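
As a rough illustration of that expectation for the backfill range quoted
above, the sketch below enumerates one logical date per day. Treating both
the start and the end date as inclusive is an assumption made here, not a
statement about what `airflow dags backfill` does, and
`daily_logical_dates` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_logical_dates(start: date, end: date):
    """Yield one date per day from `start` to `end`, both inclusive (assumed)."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# The range from the backfill command quoted in this thread.
runs = list(daily_logical_dates(date(2020, 1, 1), date(2021, 6, 30)))
```

Under that inclusive assumption, `len(runs)` is 547 (366 days in 2020 plus
181 days in the first half of 2021).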

Re: Simplifying the timetable interface

Posted by Ash Berlin-Taylor <as...@apache.org>.
On Wed, Feb 23 2022 at 15:17:48 +0000, Malthe <mb...@gmail.com> wrote:
>>  Backfilling is not out of scope for a timetable at all. If I run 
>> `airflow dags backfill mydagid --start-date 2020-01-01 --end-date 
>> 2021-06-30` how many DagRuns are created and what are logical 
>> dates/intervals of them?
> 
> If the timetable has a daily frequency, then one dagrun per day in
> that interval.

DAGs don't have a frequency. They have a timetable. They don't even
have a schedule_interval anymore -- that gets converted to an instance
of the CronDataIntervalTimetable


Re: Simplifying the timetable interface

Posted by Malthe <mb...@gmail.com>.
> One comment: Please don't use the phrase "execution time" as it's not clear which of the possible meanings it could be (is it the old execution_date? Is it the time the dagrun actually starts?)

Agreed. I guess it's the dagrun's logical_date, its time identity.

> Backfilling is not out of scope for a timetable at all. If I run `airflow dags backfill mydagid --start-date 2020-01-01 --end-date 2021-06-30` how many DagRuns are created and what are logical dates/intervals of them?

If the timetable has a daily frequency, then one dagrun per day in
that interval.

> And in case of the very first time a Dag is enabled? I guess it could pass the dag start_date here instead?

Yes, I think the scheduler will pass in start_date, and out comes the
first logical_date which is always strictly after start_date. In the
workday example, you would put as start_date Monday (i.e., morning)
and that would give the first logical date as Monday midnight (i.e.,
evening).
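
A minimal sketch of step (1) for the workday example, assuming that a
logical date is the midnight that closes a Monday-to-Friday day. The name
`next_logical_date` is hypothetical, and naive datetimes are used for
brevity (a real timetable would need timezone-aware values); this is not
the existing Timetable API.

```python
from datetime import datetime, timedelta


def next_logical_date(after: datetime) -> datetime:
    """Return the first workday-closing midnight strictly after `after`."""
    # First midnight strictly after `after`.
    candidate = datetime.combine(after.date(), datetime.min.time())
    candidate += timedelta(days=1)
    # A midnight closes the previous calendar day; skip ahead until that
    # day is Monday (0) through Friday (4).
    while (candidate - timedelta(days=1)).weekday() > 4:
        candidate += timedelta(days=1)
    return candidate
```

With a start_date of Monday 2022-02-21 09:00 this returns Tuesday
2022-02-22 00:00, which is the "Monday midnight (evening)" described above.
Passing in Saturday 2022-02-26 00:00 (the midnight that closes Friday)
skips the weekend and returns Tuesday 2022-03-01 00:00.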

> (Implementation detail: It stores the info in 4 columns in the DagModel table, next_dagrun, next_dagrun_interval_start and _end, and next_dagrun_create_after so that the creation of the DagRun can just be done as a DB lookup. Doesn't materially change your statement)

I suppose that's an optimization to avoid querying for the latest
dagrun over and over, although I would think some caching mechanism
ought to work just as well. But I don't know the specifics about why
it was decided to materialize those on the DagModel.

> If manually triggered dag runs are out of scope, what are the data_interval_end and data_interval_start values (in the context/templates) for a manually triggered run?

That would be undefined – as in, those variables would not be in the
context at all for a manually triggered run.

> It's not possible from the UI currently, but `airflow dags trigger` can be provided with a specific execution date -- Maybe this should be extended to take the data interval too -- but for a user-friendly CLI, inferring it when not provided makes it easier to use. (A timetable could choose to return an error for the infer method)

That ought to work fine because that is just manually specifying the
result of (1) – and steps (2) and (3) can still run as normal. It
might be nice to validate that the suggested logical_date is
compatible with the timetable but perhaps that is up to the timetable
to decide.
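
A rough sketch of what that could look like for the workday example: a
manually supplied logical_date is checked against the timetable's rule, and
step (2) then adds the grace period. The helper names, the 5-minute
`GRACE` constant, the choice to raise `ValueError`, and the use of naive
datetimes are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta

GRACE = timedelta(minutes=5)  # grace period from the "afterwork" example


def is_workday_midnight(logical_date: datetime) -> bool:
    """True if `logical_date` is a midnight that closes a Mon-Fri day."""
    at_midnight = logical_date.time() == time(0, 0)
    closes_workday = (logical_date - timedelta(days=1)).weekday() <= 4
    return at_midnight and closes_workday


def earliest_run_time(logical_date: datetime, grace: timedelta = GRACE) -> datetime:
    """Step (2): earliest wall-clock time at which the run may start."""
    if not is_workday_midnight(logical_date):
        # One possible policy; a timetable could also accept any date.
        raise ValueError(f"{logical_date} is not valid for this timetable")
    return logical_date + grace
```

Here a suggested logical_date of Tuesday 2022-02-22 00:00 is accepted and
may run from 00:05, while Sunday 2022-02-27 00:00 is rejected because the
day it closes is a Saturday.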

> Next question about how to go about implementing and releasing this? Now that it's been in a release we can't just break backcompat, so either we need to make this a "Base" template that handles most of this logic, or we need to introspect and tell old Timetable from new.

That's a tricky part :-)

Malthe

Re: Simplifying the timetable interface

Posted by Ash Berlin-Taylor <as...@apache.org>.
One comment: Please don't use the phrase "execution time" as it's not
clear which of the possible meanings it could be (is it the old
execution_date? Is it the time the dagrun actually starts?)

Backfilling is not out of scope for a timetable at all. If I run 
`airflow dags backfill mydagid --start-date 2020-01-01 --end-date 
2021-06-30` how many DagRuns are created and what are logical 
dates/intervals of them?

> The scheduler provides the most recent execution time as input and
> creates a dagrun if the returned earliest runtime for the next
> execution time is at or after the current time.

And in case of the very first time a Dag is enabled? I guess it could 
pass the dag start_date here instead?

(Implementation detail: It stores the info in 4 columns in the DagModel 
table, next_dagrun, next_dagrun_interval_start and _end, and 
next_dagrun_create_after so that the creation of the DagRun can just be 
done as a DB lookup. Doesn't materially change your statement)

> - Manual runs are trivial because there is no (2) or (3). In fact, for
> most DAGs (which care about a data interval), there should probably
> not be a play button at all.

If manually triggered dag runs are out of scope, what are the
data_interval_end and data_interval_start values (in the
context/templates) for a manually triggered run?

It's not possible from the UI currently, but `airflow dags trigger` can
be provided with a specific execution date -- Maybe this should be
extended to take the data interval too -- but for a user-friendly CLI,
inferring it when not provided makes it easier to use. (A timetable
could choose to return an error for the infer method)

Next question about how to go about implementing and releasing this? Now
that it's been in a release we can't just break backcompat, so either
we need to make this a "Base" template that handles most of this logic,
or we need to introspect and tell old Timetable from new.

You are probably right about earliest/latest and we might be able to do 
away with that part of the interface.

-ash

On Wed, Feb 23 2022 at 12:44:10 +0000, Malthe <mb...@gmail.com> wrote:
> Hi all,
> 
> I was going to take a stab at adding some custom timetable
> functionality to address two requirements:
> 
> 1. The ability to temporarily switch to an alternative timetable for
> an interim period.
> 2. The ability to introduce relatively custom holiday scheduling which
> is well outside the functionality of cron expressions.
> 
> I could add that while (1) could be done using Python at the
> DAG-level, I would like to use the timetable interface to allow
> accurate predictions into the future. That's for another post, but to
> give some context, I have floated a proposal on Slack to show
> tentative scheduled days in the calendar view using a "grey dot"
> indication to denote that we expect at least one scheduled run (see
> attachment).
> 
> Now, I looked at the "afterwork" example to see what's required in
> order to implement a custom timetable. And I must admit that I find it
> rather daunting given that it's so easy to express what that timetable
> is about:
> 
>      MON-FRI, daily, run after midnight
> 
> Intuitively, that should be a couple of lines of Python code. It ends
> up being quite a lot more than that and that's due to the interface
> that must be implemented.
> 
> I think the correct timetable interface is:
> 
> 1. Return the next execution time that's strictly (">") after a 
> particular time.
> 2. Return the earliest runtime for a particular execution time,
> accounting for any grace period.
> 3. Return context metadata for this execution.
> 
> The scheduler provides the most recent execution time as input and
> creates a dagrun if the returned earliest runtime for the next
> execution time is at or after the current time.
> 
> Considering again the "afterwork" example, with a grace period of 5
> minutes, we'd expect a dagrun shortly after 5 minutes past midnight
> (of Monday, Tuesday, and so forth up until midnight after Friday). The
> execution time _is_ the time where a given task runs (minus the grace
> period).
> 
> The reasoning behind (3) is because I consider the notion of "data
> interval" to be metadata since this is only a concern for the task
> implementation. For example, the scheduler does not need to worry
> about this at all.
> 
> Other concerns:
> 
> - Backfilling is out of scope for the timetable interface.
> - Time restrictions (i.e. start and end date) are likewise out of
> scope. The scheduler knows when the DAG starts and ends and doesn't
> need help from the timetable here.
> - Manual runs are trivial because there is no (2) or (3). In fact, for
> most DAGs (which care about a data interval), there should probably
> not be a play button at all.
> 
> I didn't complete the exercise, but it stands to reason that with this
> interface, the "afterwork" example would be short and simple given the
> interface outlined above.
> 
> Thanks