Posted to dev@airflow.apache.org by Malthe <mb...@gmail.com> on 2022/05/05 12:29:39 UTC

Missing "start_date" or why must a DAG have one

There's been some prior discussion on removing the requirement for a
DAG without a schedule:

- https://issues.apache.org/jira/browse/AIRFLOW-3739
- https://github.com/apache/airflow/pull/5423

But why have the requirement at all?

The documentation isn't particularly clear on why we need "start_date"
and the whole idea seems somewhat confusing:

https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date

Consider:

     croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)

My UTC time is "2022-05-05T12:22:16.914769" and the above expression
evaluates to:

     2022-05-05T12:25:00

That is, it's nicely aligned as you would expect. I would assume from
reading the code that this carries over to `CronDataIntervalTimetable`
since it uses croniter in exactly this way.
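
The alignment described above can be reproduced with the standard library
alone. This is only a sketch: `next_aligned` is a hypothetical helper that
mimics the "*/N * * * *" case (for N dividing 60) and is not how croniter or
`CronDataIntervalTimetable` is implemented.

```python
from datetime import datetime, timedelta


def next_aligned(now: datetime, minutes: int = 5) -> datetime:
    """Return the first boundary strictly after `now` whose minute is a multiple of `minutes`.

    Hypothetical helper; assumes `minutes` divides 60.
    """
    # Drop seconds and microseconds, then step back to the previous boundary.
    floored = now.replace(second=0, microsecond=0)
    floored -= timedelta(minutes=floored.minute % minutes)
    # The next boundary is one full step after the previous one.
    return floored + timedelta(minutes=minutes)


print(next_aligned(datetime(2022, 5, 5, 12, 22, 16, 914769)))
```

With the timestamp quoted in this message, the helper lands on 12:25:00, the
same boundary croniter reports, without any reference to a start date.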

Must we require a "start_date" at all?

Re: Missing "start_date" or why must a DAG have one

Posted by Malthe <mb...@gmail.com>.
On Wed, 18 May 2022 at 17:18, Ash Berlin-Taylor <as...@apache.org> wrote:
> Start date also makes sense for a cron-based dag with catch-up too though...

True.

So,

1. A timedelta without a `start_date` is not wrong, but it'll use
midnight as the reference time (I think this is better than "date
first added" because that could be 12:34:56 or some other random
deployment time).
2. If you need another reference time, for example daily at 4pm, use a
cron-based timetable (which we're also likely to extend soonish to
allow composability, i.e. AND, OR, NOT).
3. Catch-up fills in missing DAG runs starting with `start_date` or
(if unset) the earliest scheduled/automated DAG run, i.e. the
effective `start_date`.
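
Point 1 could look roughly like the sketch below. It assumes the reference
is midnight of the day the DAG was first seen (the thread does not pin down
which midnight is meant); `next_run` and its parameters are hypothetical
names, not Airflow API.

```python
from datetime import datetime, timedelta


def next_run(first_seen: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first run strictly after `now`, on a grid anchored at midnight.

    Assumption: the reference time is midnight of the day the DAG was first
    seen, rather than the deployment timestamp itself.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    reference = first_seen.replace(hour=0, minute=0, second=0, microsecond=0)
    if now < reference:
        return reference
    # timedelta // timedelta yields the number of whole intervals elapsed.
    elapsed = (now - reference) // interval
    return reference + (elapsed + 1) * interval


deployed = datetime(2022, 5, 18, 12, 34, 56)
print(next_run(deployed, timedelta(days=1), deployed))   # next midnight
print(next_run(deployed, timedelta(hours=6), deployed))  # next 6-hour boundary
```

Anchoring at midnight keeps the run times independent of the exact second
of deployment, which is the argument made against "date first added".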

Re: Missing "start_date" or why must a DAG have one

Posted by Ash Berlin-Taylor <as...@apache.org>.
Start date also makes sense for a cron-based dag with catch-up too though...

On 18 May 2022 16:58:54 BST, Malthe <mb...@gmail.com> wrote:
>On Sat, 14 May 2022 at 11:21, Bas Harenslak <ba...@astronomer.io.invalid> wrote:
>> I think we have the following options when no start_date is given:
>>
>>   1. schedule_interval is alias e.g. “@daily” —> is a cron expression internally (0 0 * * *), so run at 00:00
>>   2. schedule_interval is cron e.g. “0 0 * * *” —> cron expression determines when to run, 00:00:00 here
>>   3. schedule_interval is timedelta e.g. “timedelta(days=1)” —> only here we have no clear start_date and need something as a cutoff time, would use first added date as start_date, e.g. 12:34:56
>>
>> So that would still result in deterministic DAG runs.
>
>I think timedelta (i.e. `DeltaDataIntervalTimetable`) and `start_date`
>go hand in hand, perhaps to the point of them fusing into one.
>
>Cheers

Re: Missing "start_date" or why must a DAG have one

Posted by Malthe <mb...@gmail.com>.
On Sat, 14 May 2022 at 11:21, Bas Harenslak <ba...@astronomer.io.invalid> wrote:
> I think we have the following options when no start_date is given:
>
>   1. schedule_interval is alias e.g. “@daily” —> is a cron expression internally (0 0 * * *), so run at 00:00
>   2. schedule_interval is cron e.g. “0 0 * * *” —> cron expression determines when to run, 00:00:00 here
>   3. schedule_interval is timedelta e.g. “timedelta(days=1)” —> only here we have no clear start_date and need something as a cutoff time, would use first added date as start_date, e.g. 12:34:56
>
> So that would still result in deterministic DAG runs.

I think timedelta (i.e. `DeltaDataIntervalTimetable`) and `start_date`
go hand in hand, perhaps to the point of them fusing into one.

Cheers

Re: Missing "start_date" or why must a DAG have one

Posted by Bas Harenslak <ba...@astronomer.io.INVALID>.
Not in favour of a special marker because that’s essentially what start_date is for. Say somebody has a schedule_interval=timedelta(days=1) and wants their DAG to run at 00:00 without having to think of a specific start date, then they’d have to set start_date="random date and time 00:00" and catchup=False.

I think we have the following options when no start_date is given:

  1. schedule_interval is an alias, e.g. “@daily” -> is a cron expression internally (0 0 * * *), so run at 00:00
  2. schedule_interval is cron, e.g. “0 0 * * *” -> cron expression determines when to run, 00:00:00 here
  3. schedule_interval is a timedelta, e.g. “timedelta(days=1)” -> only here we have no clear start_date and need something as a cutoff time, would use first added date as start_date, e.g. 12:34:56

So that would still result in deterministic DAG runs.
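
The three cases above could be dispatched as in this sketch. The names
`resolve_schedule`, `CRON_ALIASES` and `first_added` are hypothetical and
only illustrate the proposal; they are not existing Airflow code.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple, Union

# Only the alias mentioned in this message; a real table would hold more.
CRON_ALIASES = {"@daily": "0 0 * * *"}


def resolve_schedule(
    schedule: Union[str, timedelta], first_added: datetime
) -> Tuple[Optional[str], Optional[datetime]]:
    """Return (cron_expression, reference_time) for a DAG without start_date."""
    if isinstance(schedule, timedelta):
        # Case 3: nothing in the schedule fixes the time of day, so fall back
        # to the moment the DAG was first added.
        return None, first_added
    # Cases 1 and 2: the cron expression alone determines the run times.
    return CRON_ALIASES.get(schedule, schedule), None


added = datetime(2022, 5, 14, 12, 34, 56)
print(resolve_schedule("@daily", added))
print(resolve_schedule(timedelta(days=1), added))
```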

Bas

> On 13 May 2022, at 20:43, Ping Zhang <pi...@umich.edu> wrote:
> 
> "starts whenever you first deploy it", this makes dags nondeterministic. It is true that currently it is very hard to achieve this. Maybe we could use a special start_date marker to indicate this behavior so that users can be very aware of what they are doing.
> 
> There is also another case where start_date is required, if the schedule_interval is a timedelta object.
> 
> 
> Thanks,
> 
> Ping
> 
> 
> On Fri, May 13, 2022 at 5:32 PM Collin McNulty <co...@astronomer.io.invalid> wrote:
> I disagree, start_date is None and catchup=True still describes a useful behavior that’s currently difficult to achieve in Airflow: a DAG that starts whenever you first deploy it and then catches up missed runs if you pause and unpause it or have downtime. 
> 
> On Thu, May 12, 2022 at 5:49 AM Jarek Potiuk <jarek@potiuk.com <ma...@potiuk.com>> wrote:
> Yeah. Maybe simply start_date should only be required when catchup=True then?  Sounds like it might correctly reflect the intention of catchup=True, while bringing a very solid semantic for explicit start_date. 
> 
> J.
> 
> 
> On Tue, May 10, 2022 at 11:14 PM Ping Zhang <pingzh@umich.edu <ma...@umich.edu>> wrote:
> I agree that for the crontab interval with `catchup=False`, the start_date does not make sense. However, the start_date is still very useful when having catchup=True, whose default value is `True`, https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989. If the start_date defaults to None, this makes the dag not-portable, since the start_date could be different in different airflow envs.
>
> If we want to default the start_date to None, we need some rules to let users know in some cases start_date cannot be None.
> 
> 
> Thanks,
> 
> Ping
> 
> 
> On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <jarek@potiuk.com <ma...@potiuk.com>> wrote:
> Coincidentally - this discussion in GitHub Discussions started just now has a clear use case where omitting start_date makes perfect sense: https://github.com/apache/airflow/discussions/23594
> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <ba...@astronomer.io.invalid> wrote:
> I never understood the requirement for start_date — 99% of the use cases simply want to start from the time the DAG is first added and do not explicitly need to start on a certain date. There is certainly a use case for start_date, but defaulting to None would make more sense IMO, and we could internally register the “first added date” as a start date instead.
> 
> Bas
> 
>> On 9 May 2022, at 09:35, Jarek Potiuk <jarek@potiuk.com <ma...@potiuk.com>> wrote:
>> 
>> I think the only real need for start_date is the "catchup=True". 
>> I think start_date is really part of the metadata of the DAG - that is really useful in order to determine range of backfill for example. So it's more an intention of the DAG author to describe when we actually want the DAG lifecycle started.
>> As such it is nice to keep in the "records" - if we do not have it, we simply do not know when the DAG should "start". I mean - we could see it by historical DagRuns, but the problem is that if DagRuns are removed, that information is lost.
>> 
>> But it does not have to be specified in the DAG() object in Python IMHO
>> 
>> I do not think we should actually remove the "start_date" from Dag model, but also I think it should be perfectly fine to simply set start_date in Dag model to "NOW()" if it is not passed. The NOW() should not be NOW() really I think - because of the intricacies of "execution_date", "start_interval", "end_interval" it should be automatically adjusted. And here I am not sure exactly - either so that when you create a DAG without start_date, it starts immediately for the current interval, or starts for the future interval (not 100% sure how well it will play with custom timetables but I think it can be worked out rather easily).
>> 
>> J.
>> 
>> 
>> 
>> On Thu, May 5, 2022 at 2:30 PM Malthe <mborch@gmail.com <ma...@gmail.com>> wrote:
>> There's been some prior discussion on removing the requirement for a
>> DAG without a schedule:
>> 
>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>> - https://github.com/apache/airflow/pull/5423
>> 
>> But why actually have the requirement at all.
>> 
>> The documentation isn't particularly clear on why we need "start_date"
>> and the whole idea seems somewhat confusing:
>> 
>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>> 
>> Consider:
>> 
>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>> 
>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>> evaluates to:
>> 
>>      2022-05-05T12:25:00
>> 
>> That is, it's nicely aligned as you would expect. I would assume from
>> reading the code that this carries over to `CronDataIntervalTimetable`
>> since it uses croniter in exactly this way.
>> 
>> Must we require a "start_date" – ?
> 
> -- 
> 
> Collin McNulty
> Lead Airflow Engineer
> 
> Email: collin@astronomer.io <ma...@astronomer.io>
> Time zone: US Central (CST UTC-6 / CDT UTC-5)
> 
> 
>  <https://www.astronomer.io/>

Re: Missing "start_date" or why must a DAG have one

Posted by Ping Zhang <pi...@umich.edu>.
A DAG that "starts whenever you first deploy it" is nondeterministic. It
is true that this is currently very hard to achieve. Maybe we could use
a special start_date marker to indicate this behavior, so that users are
well aware of what they are doing.

There is also another case where start_date is required, if the
schedule_interval is a timedelta object.


Thanks,

Ping


On Fri, May 13, 2022 at 5:32 PM Collin McNulty <co...@astronomer.io.invalid>
wrote:

> I disagree, start_date is None and catchup=True still describes a useful
> behavior that’s currently difficult to achieve in Airflow: a DAG that
> starts whenever you first deploy it and then catches up missed runs if you
> pause and unpause it or have downtime.
>
> On Thu, May 12, 2022 at 5:49 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Yeah. Maybe simply start_date should only be required when catchup=True
>> then?  Sounds like it might correctly reflect the intention of
>> catchup=True, while bringing a very solid semantic for explicit start_date.
>>
>> J.
>>
>>
>> On Tue, May 10, 2022 at 11:14 PM Ping Zhang <pi...@umich.edu> wrote:
>>
>>> I agree that for the crontab interval with `catchup=False`, the
>>> start_date does not make sense. However, the start_date is still very
>>> useful when having catchup=True, whose default value is `True`,
>>> https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989.
>>> If the start_date defaults to None, this makes the dag not-portable, since
>>> the start_date could be different in different airflow envs.
>>>
>>> If we want to default the start_date to None, we need some rules to let
>>> users know in some cases start_date cannot be None.
>>>
>>>
>>> Thanks,
>>>
>>> Ping
>>>
>>>
>>> On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> Coincidentally - this discussion in GitHub Discussions started just now
>>>> has a clear use case where omitting start_date makes perfect sense:
>>>> https://github.com/apache/airflow/discussions/23594
>>>>
>>>> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <ba...@astronomer.io.invalid>
>>>> wrote:
>>>>
>>>>> I never understood the requirement for start_date — 99% of the use
>>>>> cases simply want to start from the time the DAG is first added and do not
>>>>> explicitly need to start on a certain date. There is certainly a use case
>>>>> for start_date, but defaulting to None would make more sense IMO, and we
>>>>> could internally register the “first added date” as a start date instead.
>>>>>
>>>>> Bas
>>>>>
>>>>> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>> I think the only real need for start_date is the "catchup=True".
>>>>> I think start_date is really part of the metadata of the DAG - that is
>>>>> really useful in order to determine range of backfill for example. So it's
>>>>> more an intention of the DAG author to describe when we actually want the
>>>>> DAG lifecycle started.
>>>>> As such it is nice to keep in the "records" - if we do not have it, we
>>>>> simply do not know when the DAG should "start". I mean - we could see it by
>>>>> historical DagRuns, but the problem is that if DagRuns are removed, that
>>>>> information is lost.
>>>>>
>>>>> But it does not have to be specified in the DAG() object in Python IMHO
>>>>>
>>>>> I do not think we should actually remove the "start_date" from Dag
>>>>> model, but also I think it should be perfectly fine to simply set
>>>>> start_date in Dag model to "NOW()" if it is not passed. The NOW()
>>>>> should not be NOW() really I think - because of the intricacies of
>>>>> "execution_date", "start_interval", "end_interval" it should be
>>>>> automatically adjusted. And here I am not sure exactly - either so that
>>>>> when you create a DAG without start_date, it starts immediately for the
>>>>> current interval, or starts for the future interval (not 100% sure how well
>>>>> it will play with custom timetables but I think it can be worked out rather
>>>>> easily).
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:
>>>>>
>>>>>> There's been some prior discussion on removing the requirement for a
>>>>>> DAG without a schedule:
>>>>>>
>>>>>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>>>>>> - https://github.com/apache/airflow/pull/5423
>>>>>>
>>>>>> But why actually have the requirement at all.
>>>>>>
>>>>>> The documentation isn't particularly clear on why we need "start_date"
>>>>>> and the whole idea seems somewhat confusing:
>>>>>>
>>>>>>
>>>>>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>>>>>>
>>>>>> Consider:
>>>>>>
>>>>>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>>>>>>
>>>>>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>>>>>> evaluates to:
>>>>>>
>>>>>>      2022-05-05T12:25:00
>>>>>>
>>>>>> That is, it's nicely aligned as you would expect. I would assume from
>>>>>> reading the code that this carries over to `CronDataIntervalTimetable`
>>>>>> since it uses croniter in exactly this way.
>>>>>>
>>>>>> Must we require a "start_date" – ?
>>>>>>
>>>>>
>>>>> --
>
> Collin McNulty
> Lead Airflow Engineer
>
> Email: collin@astronomer.io <jo...@astronomer.io>
> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>
>
> <https://www.astronomer.io/>
>

Re: Missing "start_date" or why must a DAG have one

Posted by Collin McNulty <co...@astronomer.io.INVALID>.
I disagree. start_date=None with catchup=True still describes a useful
behavior that’s currently difficult to achieve in Airflow: a DAG that
starts whenever you first deploy it and then catches up on missed runs if
you pause and unpause it or have downtime.
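
A rough sketch of that behaviour, assuming a fixed timedelta interval. The
function `runs_to_catch_up` and its parameters are hypothetical and only
illustrate the idea; this is not how the Airflow scheduler computes runs.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def runs_to_catch_up(
    first_deployed: datetime,
    last_run: Optional[datetime],
    now: datetime,
    interval: timedelta,
) -> List[datetime]:
    """List the runs that are due, starting from first deployment.

    With no previous run the DAG starts at `first_deployed`; otherwise it
    resumes one interval after the last completed run.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    due = first_deployed if last_run is None else last_run + interval
    runs = []
    while due <= now:
        runs.append(due)
        due += interval
    return runs


deployed = datetime(2022, 5, 13)
# The DAG ran once, was paused, and is unpaused three days later.
print(runs_to_catch_up(deployed, deployed, datetime(2022, 5, 16, 6), timedelta(days=1)))
```

Nothing here needs an explicit start_date: the first deployment time plays
that role, which is also why the result differs between environments.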

On Thu, May 12, 2022 at 5:49 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yeah. Maybe simply start_date should only be required when catchup=True
> then?  Sounds like it might correctly reflect the intention of
> catchup=True, while bringing a very solid semantic for explicit start_date.
>
> J.
>
>
> On Tue, May 10, 2022 at 11:14 PM Ping Zhang <pi...@umich.edu> wrote:
>
>> I agree that for the crontab interval with `catchup=False`, the
>> start_date does not make sense. However, the start_date is still very
>> useful when having catchup=True, whose default value is `True`,
>> https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989.
>> If the start_date defaults to None, this makes the dag not-portable, since
>> the start_date could be different in different airflow envs.
>>
>> If we want to default the start_date to None, we need some rules to let
>> users know in some cases start_date cannot be None.
>>
>>
>> Thanks,
>>
>> Ping
>>
>>
>> On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Coincidentally - this discussion in GitHub Discussions started just now
>>> has a clear use case where omitting start_date makes perfect sense:
>>> https://github.com/apache/airflow/discussions/23594
>>>
>>> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <ba...@astronomer.io.invalid>
>>> wrote:
>>>
>>>> I never understood the requirement for start_date — 99% of the use
>>>> cases simply want to start from the time the DAG is first added and do not
>>>> explicitly need to start on a certain date. There is certainly a use case
>>>> for start_date, but defaulting to None would make more sense IMO, and we
>>>> could internally register the “first added date” as a start date instead.
>>>>
>>>> Bas
>>>>
>>>> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>> I think the only real need for start_date is the "catchup=True".
>>>> I think start_date is really part of the metadata of the DAG - that is
>>>> really useful in order to determine range of backfill for example. So it's
>>>> more an intention of the DAG author to describe when we actually want the
>>>> DAG lifecycle started.
>>>> As such it is nice to keep in the "records" - if we do not have it, we
>>>> simply do not know when the DAG should "start". I mean - we could see it by
>>>> historical DagRuns, but the problem is that if DagRuns are removed, that
>>>> information is lost.
>>>>
>>>> But it does not have to be specified in the DAG() object in Python IMHO
>>>>
>>>> I do not think we should actually remove the "start_date" from Dag
>>>> model, but also I think it should be perfectly fine to simply set
>>>> start_date in Dag model to "NOW()" if it is not passed. The NOW()
>>>> should not be NOW() really I think - because of the intricacies of
>>>> "execution_date", "start_interval", "end_interval" it should be
>>>> automatically adjusted. And here I am not sure exactly - either so that
>>>> when you create a DAG without start_date, it starts immediately for the
>>>> current interval, or starts for the future interval (not 100% sure how well
>>>> it will play with custom timetables but I think it can be worked out rather
>>>> easily).
>>>>
>>>> J.
>>>>
>>>>
>>>>
>>>> On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:
>>>>
>>>>> There's been some prior discussion on removing the requirement for a
>>>>> DAG without a schedule:
>>>>>
>>>>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>>>>> - https://github.com/apache/airflow/pull/5423
>>>>>
>>>>> But why actually have the requirement at all.
>>>>>
>>>>> The documentation isn't particularly clear on why we need "start_date"
>>>>> and the whole idea seems somewhat confusing:
>>>>>
>>>>>
>>>>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>>>>>
>>>>> Consider:
>>>>>
>>>>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>>>>>
>>>>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>>>>> evaluates to:
>>>>>
>>>>>      2022-05-05T12:25:00
>>>>>
>>>>> That is, it's nicely aligned as you would expect. I would assume from
>>>>> reading the code that this carries over to `CronDataIntervalTimetable`
>>>>> since it uses croniter in exactly this way.
>>>>>
>>>>> Must we require a "start_date" – ?
>>>>>
>>>>
>>>> --

Collin McNulty
Lead Airflow Engineer

Email: collin@astronomer.io <jo...@astronomer.io>
Time zone: US Central (CST UTC-6 / CDT UTC-5)


<https://www.astronomer.io/>

Re: Missing "start_date" or why must a DAG have one

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yeah. Maybe start_date should simply be required only when catchup=True,
then? That sounds like it would correctly reflect the intention of
catchup=True, while giving an explicit start_date very solid semantics.
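
As a sketch, the rule suggested here amounts to a small validation step.
`check_start_date` is a hypothetical helper written for illustration; it is
not part of the DAG constructor.

```python
from datetime import datetime
from typing import Optional


def check_start_date(start_date: Optional[datetime], catchup: bool) -> None:
    """Reject the one combination the proposal would disallow."""
    if catchup and start_date is None:
        raise ValueError("start_date is required when catchup=True")


check_start_date(None, catchup=False)                  # accepted
check_start_date(datetime(2022, 5, 1), catchup=True)   # accepted
try:
    check_start_date(None, catchup=True)
except ValueError as exc:
    print(exc)
```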

J.


On Tue, May 10, 2022 at 11:14 PM Ping Zhang <pi...@umich.edu> wrote:

> I agree that for the crontab interval with `catchup=False`, the start_date
> does not make sense. However, the start_date is still very useful when
> having catchup=True, whose default value is `True`,
> https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989.
> If the start_date defaults to None, this makes the dag not-portable, since
> the start_date could be different in different airflow envs.
>
> If we want to default the start_date to None, we need some rules to let
> users know in some cases start_date cannot be None.
>
>
> Thanks,
>
> Ping
>
>
> On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Coincidentally - this discussion in GitHub Discussions started just now
>> has a clear use case where omitting start_date makes perfect sense:
>> https://github.com/apache/airflow/discussions/23594
>>
>> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <ba...@astronomer.io.invalid>
>> wrote:
>>
>>> I never understood the requirement for start_date — 99% of the use cases
>>> simply want to start from the time the DAG is first added and do not
>>> explicitly need to start on a certain date. There is certainly a use case
>>> for start_date, but defaulting to None would make more sense IMO, and we
>>> could internally register the “first added date” as a start date instead.
>>>
>>> Bas
>>>
>>> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>> I think the only real need for start_date is the "catchup=True".
>>> I think start_date is really part of the metadata of the DAG - that is
>>> really useful in order to determine range of backfill for example. So it's
>>> more an intention of the DAG author to describe when we actually want the
>>> DAG lifecycle started.
>>> As such it is nice to keep in the "records" - if we do not have it, we
>>> simply do not know when the DAG should "start". I mean - we could see it by
>>> historical DagRuns, but the problem is that if DagRuns are removed, that
>>> information is lost.
>>>
>>> But it does not have to be specified in the DAG() object in Python IMHO
>>>
>>> I do not think we should actually remove the "start_date" from Dag model,
>>> but also I think it should be perfectly fine to simply set start_date in
>>> Dag model to "NOW()" if it is not passed. The NOW() should not be NOW()
>>> really I think - because of the intricacies of "execution_date",
>>> "start_interval", "end_interval" it should be automatically adjusted. And
>>> here I am not sure exactly - either so that when you create a DAG without
>>> start_date, it starts immediately for the current interval, or starts for
>>> the future interval (not 100% sure how well it will play with custom
>>> timetables but I think it can be worked out rather easily).
>>>
>>> J.
>>>
>>>
>>>
>>> On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:
>>>
>>>> There's been some prior discussion on removing the requirement for a
>>>> DAG without a schedule:
>>>>
>>>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>>>> - https://github.com/apache/airflow/pull/5423
>>>>
>>>> But why actually have the requirement at all.
>>>>
>>>> The documentation isn't particularly clear on why we need "start_date"
>>>> and the whole idea seems somewhat confusing:
>>>>
>>>>
>>>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>>>>
>>>> Consider:
>>>>
>>>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>>>>
>>>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>>>> evaluates to:
>>>>
>>>>      2022-05-05T12:25:00
>>>>
>>>> That is, it's nicely aligned as you would expect. I would assume from
>>>> reading the code that this carries over to `CronDataIntervalTimetable`
>>>> since it uses croniter in exactly this way.
>>>>
>>>> Must we require a "start_date" – ?
>>>>
>>>
>>>

Re: Missing "start_date" or why must a DAG have one

Posted by Ping Zhang <pi...@umich.edu>.
I agree that for a crontab interval with `catchup=False`, the start_date
does not make sense. However, the start_date is still very useful with
catchup=True, which is the default:
https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989
If the start_date defaults to None, the DAG is no longer portable, since
the start_date could be different in different Airflow environments.

If we want to default the start_date to None, we need some rules to let
users know that in some cases start_date cannot be None.


Thanks,

Ping


On Mon, May 9, 2022 at 10:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Coincidentally - this discussion in GitHub Discussions started just now
> has a clear use case where omitting start_date makes perfect sense:
> https://github.com/apache/airflow/discussions/23594
>
> On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <ba...@astronomer.io.invalid>
> wrote:
>
>> I never understood the requirement for start_date — 99% of the use cases
>> simply want to start from the time the DAG is first added and do not
>> explicitly need to start on a certain date. There is certainly a use case
>> for start_date, but defaulting to None would make more sense IMO, and we
>> could internally register the “first added date” as a start date instead.
>>
>> Bas
>>
>> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> I think the only real need for start_date is the "catchup=True".
>> I think start_date is really part of the metadata of the DAG - that is
>> really useful in order to determine range of backfill for example. So it's
>> more an intention of the DAG author to describe when we actually want the
>> DAG lifecycle started.
>> As such it is nice to keep in the "records" - if we do not have it, we
>> simply do not know when the DAG should "start". I mean - we could see it by
>> historical DagRuns, but the problem is that if DagRuns are removed, that
>> information is lost.
>>
>> But it does not have to be specified in the DAG() object in Python IMHO
>>
>> I do not think we should actually remove the "start_date" from Dag model,
>> but also I think it should be perfectly fine to simply set start_date in
>> Dag model to "NOW()" if it is not passed. The NOW() should not be NOW()
>> really I think - because of the intricacies of "execution_date",
>> "start_interval", "end_interval" it should be automatically adjusted. And
>> here I am not sure exactly - either so that when you create a DAG without
>> start_date, it starts immediately for the current interval, or starts for
>> the future interval (not 100% sure how well it will play with custom
>> timetables but I think it can be worked out rather easily).
>>
>> J.
>>
>>
>>
>> On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:
>>
>>> There's been some prior discussion on removing the requirement for a
>>> DAG without a schedule:
>>>
>>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>>> - https://github.com/apache/airflow/pull/5423
>>>
>>> But why actually have the requirement at all.
>>>
>>> The documentation isn't particularly clear on why we need "start_date"
>>> and the whole idea seems somewhat confusing:
>>>
>>>
>>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>>>
>>> Consider:
>>>
>>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>>>
>>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>>> evaluates to:
>>>
>>>      2022-05-05T12:25:00
>>>
>>> That is, it's nicely aligned as you would expect. I would assume from
>>> reading the code that this carries over to `CronDataIntervalTimetable`
>>> since it uses croniter in exactly this way.
>>>
>>> Must we require a "start_date" – ?
>>>
>>
>>

Re: Missing "start_date" or why must a DAG have one

Posted by Jarek Potiuk <ja...@potiuk.com>.
Coincidentally, this discussion, started just now in GitHub Discussions, has
a clear use case where omitting start_date makes perfect sense:
https://github.com/apache/airflow/discussions/23594

On Mon, May 9, 2022 at 4:01 PM Bas Harenslak <ba...@astronomer.io.invalid>
wrote:

> I never understood the requirement for start_date — 99% of the use cases
> simply want to start from the time the DAG is first added and do not
> explicitly need to start on a certain date. There is certainly a use case
> for start_date, but defaulting to None would make more sense IMO, and we
> could internally register the “first added date” as a start date instead.
>
> Bas
>
> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I think the only real need for start_date is the "catchup=True".
> I think start_date is really part of the metadata of the DAG - that is
> really useful in order to determine range of backfill for example. So it's
> more an intention of the DAG author to describe when we actually want the
> DAG lifecycle started.
> As such it is nice to keep in the "records" - if we do not have it, we
> simply do not know when the DAG should "start". I mean - we could see it by
> historical DagRuns, but the problem is that if DagRuns are removed, that
> information is lost.
>
> But it does not have to be specified in the DAG() object in Python IMHO
>
> I do not think we should actually remove the "start_date" from Dag model,
> but also I think it should be perfectly fine to simply set start_date in
> Dag model to "NOW()" if it is not passed. The NOW() should not be NOW()
> really I think - because of the intricacies of "execution_date",
> "start_interval", "end_interval" it should be automatically adjusted. And
> here I am not sure exactly - either so that when you create a DAG without
> start_date, it starts immediately for the current interval, or starts for
> the future interval (not 100% sure how well it will play with custom
> timetables but I think it can be worked out rather easily).
>
> J.
>
>
>
> On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:
>
>> There's been some prior discussion on removing the requirement for a
>> DAG without a schedule:
>>
>> - https://issues.apache.org/jira/browse/AIRFLOW-3739
>> - https://github.com/apache/airflow/pull/5423
>>
>> But why actually have the requirement at all.
>>
>> The documentation isn't particularly clear on why we need "start_date"
>> and the whole idea seems somewhat confusing:
>>
>>
>> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>>
>> Consider:
>>
>>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>>
>> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
>> evaluates to:
>>
>>      2022-05-05T12:25:00
>>
>> That is, it's nicely aligned as you would expect. I would assume from
>> reading the code that this carries over to `CronDataIntervalTimetable`
>> since it uses croniter in exactly this way.
>>
>> Must we require a "start_date" – ?
>>
>
>

Re: Missing "start_date" or why must a DAG have one

Posted by Bas Harenslak <ba...@astronomer.io.INVALID>.
I never understood the requirement for start_date — 99% of the use cases simply want to start from the time the DAG is first added and do not explicitly need to start on a certain date. There is certainly a use case for start_date, but defaulting to None would make more sense IMO, and we could internally register the “first added date” as a start date instead.
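
A minimal sketch of that idea, assuming a hypothetical in-memory registry. The names `effective_start_date` and `_first_added` are made up for illustration and are not Airflow APIs; a real implementation would presumably persist the value in the metadata database.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical registry: DAG id -> the moment the DAG was first seen.
_first_added: Dict[str, datetime] = {}


def effective_start_date(
    dag_id: str,
    start_date: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> datetime:
    """Return the explicit start_date, or fall back to the first-added date."""
    if start_date is not None:
        # The DAG author stated an intention; always honour it.
        return start_date
    if now is None:
        now = datetime.now(timezone.utc)
    # Record "now" only the first time this DAG is seen; later calls
    # return the originally recorded value.
    return _first_added.setdefault(dag_id, now)


if __name__ == "__main__":
    first = effective_start_date("example_dag")
    again = effective_start_date("example_dag")
    print(first == again)  # the fallback is stable across calls
```

The explicit value always wins, so DAGs that already set start_date would behave as before under this assumption.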

Bas

> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> I think the only real need for start_date is "catchup=True".
> I think start_date is really part of the metadata of the DAG, and it is really useful for determining the range of a backfill, for example. So it is more the DAG author's way of stating when the DAG lifecycle should start.
> As such it is nice to keep in the "records": if we do not have it, we simply do not know when the DAG should "start". We could infer it from historical DagRuns, but if those DagRuns are removed, that information is lost.
> 
> But it does not have to be specified in the DAG() object in Python, IMHO.
> 
> I do not think we should actually remove "start_date" from the Dag model, but I also think it should be perfectly fine to simply set start_date in the Dag model to "NOW()" if it is not passed. It should not really be a literal NOW(), I think: because of the intricacies of "execution_date", "start_interval" and "end_interval", it should be automatically adjusted. Here I am not sure exactly how: either a DAG created without start_date starts immediately for the current interval, or it starts for the future interval. (I am not 100% sure how well this will play with custom timetables, but I think it can be worked out rather easily.)
> 
> J.
> 
> 
> 
> On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:
> There's been some prior discussion on removing the requirement for a
> DAG without a schedule:
> 
> - https://issues.apache.org/jira/browse/AIRFLOW-3739
> - https://github.com/apache/airflow/pull/5423
> 
> But why actually have the requirement at all.
> 
> The documentation isn't particularly clear on why we need "start_date"
> and the whole idea seems somewhat confusing:
> 
> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
> 
> Consider:
> 
>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
> 
> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
> evaluates to:
> 
>      2022-05-05T12:25:00
> 
> That is, it's nicely aligned as you would expect. I would assume from
> reading the code that this carries over to `CronDataIntervalTimetable`
> since it uses croniter in exactly this way.
> 
> Must we require a "start_date" – ?


Re: Missing "start_date" or why must a DAG have one

Posted by Jarek Potiuk <ja...@potiuk.com>.
I think the only real need for start_date is "catchup=True".
I think start_date is really part of the metadata of the DAG, and it is
really useful for determining the range of a backfill, for example. So it
is more the DAG author's way of stating when the DAG lifecycle should
start.
As such it is nice to keep in the "records": if we do not have it, we
simply do not know when the DAG should "start". We could infer it from
historical DagRuns, but if those DagRuns are removed, that information
is lost.
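
To illustrate why catch-up needs a start_date, here is a simplified sketch that lists the intervals a catch-up would have to fill. It assumes a fixed timedelta schedule and is not Airflow's actual scheduler or timetable code; `catchup_intervals` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def catchup_intervals(
    start_date: datetime, now: datetime, interval: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield every complete (start, end) interval between start_date and now."""
    begin = start_date
    # Only intervals that have fully elapsed are eligible to run.
    while begin + interval <= now:
        yield begin, begin + interval
        begin += interval


if __name__ == "__main__":
    start = datetime(2022, 5, 1, tzinfo=timezone.utc)
    now = datetime(2022, 5, 5, 12, 22, tzinfo=timezone.utc)
    for begin, end in catchup_intervals(start, now, timedelta(days=1)):
        print(begin.isoformat(), "->", end.isoformat())
```

Without a lower bound such as start_date, the loop has nowhere to begin, which is the gap described above.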

But it does not have to be specified in the DAG() object in Python, IMHO.

I do not think we should actually remove "start_date" from the Dag model,
but I also think it should be perfectly fine to simply set start_date in
the Dag model to "NOW()" if it is not passed. It should not really be a
literal NOW(), I think: because of the intricacies of "execution_date",
"start_interval" and "end_interval", it should be automatically adjusted.
Here I am not sure exactly how: either a DAG created without start_date
starts immediately for the current interval, or it starts for the future
interval. (I am not 100% sure how well this will play with custom
timetables, but I think it can be worked out rather easily.)
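
The two options can be sketched as follows, assuming a fixed-interval schedule aligned to midnight. `align_start_date` and its `mode` argument are hypothetical names; cron expressions and custom timetables would need their own alignment logic.

```python
from datetime import datetime, timedelta, timezone


def align_start_date(
    now: datetime, interval: timedelta, mode: str = "current"
) -> datetime:
    """Snap `now` to a schedule boundary instead of using it verbatim."""
    if mode not in ("current", "next"):
        raise ValueError("mode must be 'current' or 'next'")
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of whole intervals elapsed since midnight.
    whole = (now - midnight) // interval
    current_start = midnight + whole * interval
    if mode == "current":
        # Option 1: start with the interval we are currently inside.
        return current_start
    # Option 2: start with the first interval that begins after `now`.
    return current_start + interval


if __name__ == "__main__":
    now = datetime(2022, 5, 5, 12, 22, 16, 914769, tzinfo=timezone.utc)
    five_minutes = timedelta(minutes=5)
    print(align_start_date(now, five_minutes, "current").isoformat())
    # 2022-05-05T12:20:00+00:00
    print(align_start_date(now, five_minutes, "next").isoformat())
    # 2022-05-05T12:25:00+00:00
```

With the timestamp from the original post, the "next" option gives 12:25:00, the same boundary the quoted croniter call returned.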

J.



On Thu, May 5, 2022 at 2:30 PM Malthe <mb...@gmail.com> wrote:

> There's been some prior discussion on removing the requirement for a
> DAG without a schedule:
>
> - https://issues.apache.org/jira/browse/AIRFLOW-3739
> - https://github.com/apache/airflow/pull/5423
>
> But why actually have the requirement at all.
>
> The documentation isn't particularly clear on why we need "start_date"
> and the whole idea seems somewhat confusing:
>
>
> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>
> Consider:
>
>      croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>
> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
> evaluates to:
>
>      2022-05-05T12:25:00
>
> That is, it's nicely aligned as you would expect. I would assume from
> reading the code that this carries over to `CronDataIntervalTimetable`
> since it uses croniter in exactly this way.
>
> Must we require a "start_date" – ?
>