Posted to dev@airflow.apache.org by Jeremiah Lowin <jl...@gmail.com> on 2016/04/25 22:14:13 UTC

GitHub Issue #1427 takeaways

Bolke, Sid and I had a brief conversation to discuss some of the
implications of https://github.com/airbnb/airflow/issues/1427

There are two large points that need to be addressed:

1. This particular issue arises from a misalignment between start_date and
schedule_interval. It can only happen with cron-based schedule_intervals that
describe absolute points in time (like “1am”) as opposed to time deltas (like
“every hour”). Ironically, I once reported
this same issue myself (#959). In the past (and in the docs) we have simply
said that users must make sure the two params agree. We discussed the
possibility of a DAG validation method to raise an error if the start_date
and schedule_interval don’t align, but Bolke made the point (and I agreed)
that in these cases, start_date is sort of like telling the scheduler to
“start paying attention” as opposed to “this is my first execution date”.
In #1427, the scheduler was being asked to start paying attention on
4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However, it
was scheduling a first run at midnight and a second run at 1:10.

Regardless of whether we choose to validate/warn/error, Bolke is going to
change the scheduling logic so that the cron-based interval takes
precedence over a start date. Specifically, the first date on or after the
start_date that complies with the schedule_interval becomes the first
execution date.
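
As an illustration of that rule (not the actual scheduler patch), here is a
minimal sketch using the croniter library that Airflow relies on for cron
schedules: the first execution date is the first cron tick on or after
start_date.

    # Illustrative sketch only -- not the real scheduler change.
    from datetime import datetime, timedelta
    from croniter import croniter

    def first_execution_date(start_date, cron_expr):
        """First datetime on or after start_date that matches the cron expression."""
        # croniter.get_next() is strictly "after", so back off slightly so an
        # exactly-aligned start_date counts as the first execution date.
        return croniter(cron_expr, start_date - timedelta(seconds=1)).get_next(datetime)

    # The #1427 scenario: start_date 2016-04-24 00:00:00, schedule "10 1 * * *".
    print(first_execution_date(datetime(2016, 4, 24), "10 1 * * *"))
    # -> 2016-04-24 01:10:00 (no spurious midnight run)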

2. Issue #1 led to a second issue: depends_on_past checks for a successful
TI at `execution_date - schedule_interval`. This is fragile, since it is
very possible for the previous TI to have run at any time in the past, not
just one schedule_interval ago. This can happen easily with ad-hoc DAG
runs, and also if a DAG was paused for a while. Less commonly, it happens
with the situation described in point #1, where the first scheduled run is
off-schedule (the midnight run followed by the daily 1:10am runs).

The clear fix seems to be to have depends_on_past check the last TI that
ran, regardless of whether it ran `schedule_interval` ago. That's in line
with the intent of the flag. I will submit a fix.
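
As a rough sketch of the difference (SQLAlchemy-style queries against the
TaskInstance model; the helper names are made up for illustration and this is
not the actual patch):

    from airflow.models import TaskInstance as TI

    def previous_ti_current(session, ti, schedule_interval):
        # Current behavior (roughly): look for a TI exactly one interval back.
        # With a cron-based schedule_interval this subtraction is really
        # "the previous cron tick", which is exactly where it gets fragile.
        return (session.query(TI)
                .filter(TI.dag_id == ti.dag_id,
                        TI.task_id == ti.task_id,
                        TI.execution_date == ti.execution_date - schedule_interval)
                .first())

    def previous_ti_proposed(session, ti):
        # Proposed behavior: whatever TI for this task ran most recently before
        # this execution_date, however long ago that was. (The real check would
        # also require that TI to have succeeded.)
        return (session.query(TI)
                .filter(TI.dag_id == ti.dag_id,
                        TI.task_id == ti.task_id,
                        TI.execution_date < ti.execution_date)
                .order_by(TI.execution_date.desc())
                .first())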

-J

Re: GitHub Issue #1427 takeaways

Posted by Maxime Beauchemin <ma...@gmail.com>.
Quick note to say that none of this is ideal or the way it should be; I'm
just describing how it currently works and how it is documented. I'm very
open to ideas to make all this more intuitive or to make the documentation
clearer.

A random idea would be to give DagRun a boolean attribute
"disregard_depend_on_past" that could be toggled in the UI.

@aoen is coming up with a page that should bring clarity around "why isn't
this task_instance running?" which can have in-place tooltips around how to
get things going.

On Tue, Apr 26, 2016 at 8:08 AM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> >>> "The clear fix seems to be to have depends_on_past check the last TI
> that
> ran, regardless of whether it ran `schedule_interval` ago. That's in line
> with the intent of the flag. I will submit a fix."
>
> I don't think so. This would lead to skipping runs, which depends_on_past
> is used as a guarantee to run every TI, sequentially.
>
> Absolute scheduling (cron expressions) is much better than relative
> scheduling (origin + interval). Though it's easy to make relative
> scheduling behave in an absolute way. You just have to use a rounded
> start_date to your schedule_interval, and not move things around. Dynamic
> start_dates have always been a problem and should probably not be
> supported, though there's no way for us to tell.
>
> Changing the schedule interval or the "origin time" is a bit tricky and
> should be documented.
>
> If people use depend_on_past=True and change the origin or the interval,
> they basically redefine what "past" actually means and will require to
> "mark success" or defining a new "start_date" as a way to say "please
> disregard depend_on_past for this date"
>
> For those who haven't fully digested "What's the deal with start_dates",
> please take the time to read it:
> http://pythonhosted.org/airflow/faq.html
>
> Also see this part of the docs:
>
> [inline image omitted in the plain-text archive]
>
> Max
>
> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com> wrote:
>
>> Bolke, Sid and I had a brief conversation to discuss some of the
>> implications of https://github.com/airbnb/airflow/issues/1427
>>
>> There are two large points that need to be addressed:
>>
>> 1. this particular issue arises because of an alignment issue between
>> start_date and schedule_interval. This can only happen with cron-based
>> schedule_intervals that describe absolute points in time (like “1am”) as
>> opposed to time deltas (like “every hour”). Ironically, I once reported
>> this same issue myself (#959). In the past (and in the docs) we have
>> simply
>> said that users must make sure the two params agree. We discussed the
>> possibility of a DAG validation method to raise an error if the start_date
>> and schedule_interval don’t align, but Bolke made the point (and I agreed)
>> that in these cases, start_date is sort of like telling the scheduler to
>> “start paying attention” as opposed to “this is my first execution date”.
>> In #1427, the scheduler was being asked to start paying attention on
>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>> it
>> was scheduling a first run at midnight and a second run at 1:10.
>>
>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>> change the scheduling logic so that the cron-based interval takes
>> precedence over a start date. Specifically, the first date on or after the
>> start_date that complies with the schedule_interval becomes the first
>> execution date.
>>
>> 2. Issue #1 led to a second issue: depends_on_past checks for a successful
>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>> very possible for the previous TI to have run at any time in the past, not
>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>> with the situation described in point #1, where the first scheduled run is
>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>
>> The clear fix seems to be to have depends_on_past check the last TI that
>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>> with the intent of the flag. I will submit a fix.
>>
>> -J
>>
>
>

Re: GitHub Issue #1427 takeaways

Posted by Maxime Beauchemin <ma...@gmail.com>.
For this kind of use case, someone should just use a TimeDeltaSensor as a
root task in their DAG if 1:10a is the "trigger". I suspect that person could
use a better trigger than a TimeSensor though. What happens at 1:10a? If it's
"the time at which the database scrapes are usually done", that user should
have a more explicit sensor validating that the scrapes are done, perhaps
looking for the data itself or some trigger file / partition, ...
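
A minimal sketch of that pattern, assuming the 2016-era import paths
(airflow.operators.sensors, etc.); the DAG id and ETL command are placeholders:

    # Sketch of the suggestion above: keep the DAG on the day boundary and
    # gate the work behind a TimeDeltaSensor.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.sensors import TimeDeltaSensor
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="daily_etl",
        start_date=datetime(2016, 1, 1),
        schedule_interval="@daily",  # execution_date stays aligned to midnight
    )

    # TimeDeltaSensor waits for execution_date + schedule_interval + delta,
    # i.e. roughly 01:10 on the day after the data period being processed.
    wait_until_0110 = TimeDeltaSensor(
        task_id="wait_until_0110",
        delta=timedelta(hours=1, minutes=10),
        dag=dag,
    )

    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="echo processing {{ ds }}",  # placeholder for the real ETL
        dag=dag,
    )

    wait_until_0110.set_downstream(run_etl)

As noted above, a sensor on the actual data (a trigger file or partition)
would usually be a better gate than a fixed clock time.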

Max

On Tue, Apr 26, 2016 at 10:30 AM, Siddharth Anand <
siddharthanand@yahoo.com.invalid> wrote:

> I have the following concerns, which may be best illustrated with
> examples:
> Start date : 2016-01-01Schedule Interval : */20 * * * *  (run every 20
> mins)
> Expectation : Runs on
>    - 20160101T000000
>
>    - 20160101T002000
>
>    - 20160101T004000
>
>    - 20160101T014000
>
>    - ...
>
> Start date : 2016-01-01Schedule Interval : 10 1 * * *  (run at 1:10 every
> day)
> Expectation : Runs on
>    - 20160101T011000
>
>    - 20160102T011000
>
>    - 20160103T011000
>
>    - 20160104T011000
>    - ...
>
> In my mind, the second case is invalid. I suspect he wants to run a daily
> ETL job which is aligned on day boundaries but he wants data to settle
> before kicking off the job at 1:10a every day due to late arriving data.
> Though we can kick the run off at 1:10a every day, the execution date for
> each run will be 20160103T011000 every day. The problem is that the ETL
> actually needs to cover the day boundaries from midnight-to-midnight each
> day, not from 1:10a to 1:10a. So, what happens is that the dag runs and the
> task instance execution dates will reflect the scheduling interval, not the
> dates covered by the ETL process. The developer, will need to do his own
> date trimming inside his DAG to get to the day boundary (i.e.
> 20160101T011000 --> 20160101T000000) and his history will reflect
> scheduling instead of data coverage period.
> Today, because we require alignment, the instance dates and dag runs
> reflect both the data coverage period and the schedule. Hence, the
> execution date can be passed as window arguments to downstream systems.
> Does this make sense or am I concerned about a non-issue? Essentially, this
> change will be break existing functionality without people being aware.
> -s
>
>     On Tuesday, April 26, 2016 9:42 AM, Jeremiah Lowin <jl...@apache.org>
> wrote:
>
>
>  On further thought, I understand the issue you're describing where this
> could lead to out-of-order runs. In fact, Bolke alerted me to the
> possibility earlier but I didn't make the connection! That feels like a
> separate issue -- to guarantee that tasks are executed in order (and more
> importantly that their database entries are created). I think the
> depends_on_past issue is related but separate -- though clearly needs to be
> fleshed out for all cases :)
>
> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org> wrote:
>
> > I'm afraid I disagree -- though we may be talking about two different
> > issues. This issue deals specifically with how to identify the "past" TI
> > when evaluating "depends_on_past", and shouldn't be impacted by shifting
> > start_date, transparently or not.
> >
> > Here are three valid examples of depends_on_past DAGs that would fail to
> > run with the current setup:
> >
> > 1. A DAG with no schedule that is only run manually or via ad-hoc
> > backfill. Without a schedule_interval, depends_on_past will always fail
> > (since it looks back one schedule_interval).
> >
> > 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> > say a scheduled run succeeds and then an off-schedule run fails. When the
> > next scheduled run starts, it shouldn't proceed because the most recent
> > task failed -- but it will look back one schedule_interval, jumping OVER
> > the most recent run, and decide it's ok to proceed.
> >
> > 3. A DAG with a schedule that is paused for a while. This DAG could be
> > running perfectly fine, but if it is paused for a while and then resumed,
> > its depends_on_past tasks will look back one schedule_interval and see
> > nothing, and therefore refuse to run.
> >
> > So my proposal is simply that the depends_on_past logic looks back at the
> > most recent task as opposed to rigidly assuming there is a task one
> > schedule_interval ago. For a regularly scheduled DAG, this will result in
> > absolutely no behavior change. However it will robustly support a much
> > wider variety of cases like the ones I listed above.
> >
> > J
> >
> >
> > On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> > maximebeauchemin@gmail.com> wrote:
> >
> >> >>> "The clear fix seems to be to have depends_on_past check the last TI
> >> that
> >> ran, regardless of whether it ran `schedule_interval` ago. That's in
> line
> >> with the intent of the flag. I will submit a fix."
> >>
> >> I don't think so. This would lead to skipping runs, which
> depends_on_past
> >> is used as a guarantee to run every TI, sequentially.
> >>
> >> Absolute scheduling (cron expressions) is much better than relative
> >> scheduling (origin + interval). Though it's easy to make relative
> >> scheduling behave in an absolute way. You just have to use a rounded
> >> start_date to your schedule_interval, and not move things around.
> Dynamic
> >> start_dates have always been a problem and should probably not be
> >> supported, though there's no way for us to tell.
> >>
> >> Changing the schedule interval or the "origin time" is a bit tricky and
> >> should be documented.
> >>
> >> If people use depend_on_past=True and change the origin or the interval,
> >> they basically redefine what "past" actually means and will require to
> >> "mark success" or defining a new "start_date" as a way to say "please
> >> disregard depend_on_past for this date"
> >>
> >> For those who haven't fully digested "What's the deal with start_dates",
> >> please take the time to read it:
> >> http://pythonhosted.org/airflow/faq.html
> >>
> >> Also see this part of the docs:
> >>
> >> [inline image omitted in the plain-text archive]
> >>
> >> Max
> >>
> >> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
> wrote:
> >>
> >>> Bolke, Sid and I had a brief conversation to discuss some of the
> >>> implications of https://github.com/airbnb/airflow/issues/1427
> >>>
> >>> There are two large points that need to be addressed:
> >>>
> >>> 1. this particular issue arises because of an alignment issue between
> >>> start_date and schedule_interval. This can only happen with cron-based
> >>> schedule_intervals that describe absolute points in time (like “1am”)
> as
> >>> opposed to time deltas (like “every hour”). Ironically, I once reported
> >>> this same issue myself (#959). In the past (and in the docs) we have
> >>> simply
> >>> said that users must make sure the two params agree. We discussed the
> >>> possibility of a DAG validation method to raise an error if the
> >>> start_date
> >>> and schedule_interval don’t align, but Bolke made the point (and I
> >>> agreed)
> >>> that in these cases, start_date is sort of like telling the scheduler
> to
> >>> “start paying attention” as opposed to “this is my first execution
> date”.
> >>> In #1427, the scheduler was being asked to start paying attention on
> >>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
> However,
> >>> it
> >>> was scheduling a first run at midnight and a second run at 1:10.
> >>>
> >>> Regardless of whether we choose to validate/warn/error, Bolke is going
> to
> >>> change the scheduling logic so that the cron-based interval takes
> >>> precedence over a start date. Specifically, the first date on or after
> >>> the
> >>> start_date that complies with the schedule_interval becomes the first
> >>> execution date.
> >>>
> >>> 2. Issue #1 led to a second issue: depends_on_past checks for a
> >>> successful
> >>> TI at `execution_date - schedule_interval`. This is fragile, since it
> is
> >>> very possible for the previous TI to have run at any time in the past,
> >>> not
> >>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
> >>> runs, and also if a DAG was paused for a while. Less commonly, it
> happens
> >>> with the situation described in point #1, where the first scheduled run
> >>> is
> >>> off-schedule (the midnight run followed by the daily 1:10am runs).
> >>>
> >>> The clear fix seems to be to have depends_on_past check the last TI
> that
> >>> ran, regardless of whether it ran `schedule_interval` ago. That's in
> line
> >>> with the intent of the flag. I will submit a fix.
> >>>
> >>> -J
> >>>
> >>
> >>
>
>
>

Re: GitHub Issue #1427 takeaways

Posted by Siddharth Anand <si...@yahoo.com.INVALID>.
I have the following concerns, which may be best illustrated with examples:

Start date: 2016-01-01
Schedule interval: */20 * * * *  (run every 20 mins)
Expectation: runs on
   - 20160101T000000
   - 20160101T002000
   - 20160101T004000
   - 20160101T014000
   - ...

Start date: 2016-01-01
Schedule interval: 10 1 * * *  (run at 1:10 every day)
Expectation: runs on
   - 20160101T011000
   - 20160102T011000
   - 20160103T011000
   - 20160104T011000
   - ...

In my mind, the second case is invalid. I suspect he wants to run a daily ETL
job that is aligned on day boundaries, but he wants data to settle before
kicking off the job at 1:10a every day due to late-arriving data. Though we can
kick the run off at 1:10a every day, the execution date for each run will carry
the 1:10a offset (e.g. 20160103T011000). The problem is that the ETL actually
needs to cover day boundaries from midnight to midnight, not from 1:10a to
1:10a. So the dag runs and the task instance execution dates will reflect the
scheduling interval, not the dates covered by the ETL process. The developer
will need to do his own date trimming inside his DAG to get to the day boundary
(i.e. 20160101T011000 --> 20160101T000000), and his history will reflect
scheduling instead of the data coverage period.
Today, because we require alignment, the instance dates and dag runs reflect
both the data coverage period and the schedule. Hence, the execution date can
be passed as a window argument to downstream systems. Does this make sense, or
am I concerned about a non-issue? Essentially, this change will break existing
functionality without people being aware.
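
A hedged illustration of the "date trimming" described above: with a 1:10
schedule, each run's execution_date carries the 01:10 component, so the DAG
author would have to floor it back to the day boundary himself. The helper
below is purely illustrative and not part of Airflow.

    # Illustrative only: flooring a 01:10 execution_date back to the
    # midnight-to-midnight window the ETL actually covers.
    from datetime import datetime, time, timedelta

    def coverage_window(execution_date):
        """Map e.g. 20160101T011000 onto the 20160101T000000..20160102T000000 day."""
        day_start = datetime.combine(execution_date.date(), time.min)
        return day_start, day_start + timedelta(days=1)

    start, end = coverage_window(datetime(2016, 1, 1, 1, 10))
    # start = 2016-01-01 00:00:00, end = 2016-01-02 00:00:00 -- these, rather
    # than the raw execution_date, would be passed downstream as window arguments.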
-s 

    On Tuesday, April 26, 2016 9:42 AM, Jeremiah Lowin <jl...@apache.org> wrote:
 

 On further thought, I understand the issue you're describing where this
could lead to out-of-order runs. In fact, Bolke alerted me to the
possibility earlier but I didn't make the connection! That feels like a
separate issue -- to guarantee that tasks are executed in order (and more
importantly that their database entries are created). I think the
depends_on_past issue is related but separate -- though clearly needs to be
fleshed out for all cases :)

On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org> wrote:

> I'm afraid I disagree -- though we may be talking about two different
> issues. This issue deals specifically with how to identify the "past" TI
> when evaluating "depends_on_past", and shouldn't be impacted by shifting
> start_date, transparently or not.
>
> Here are three valid examples of depends_on_past DAGs that would fail to
> run with the current setup:
>
> 1. A DAG with no schedule that is only run manually or via ad-hoc
> backfill. Without a schedule_interval, depends_on_past will always fail
> (since it looks back one schedule_interval).
>
> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> say a scheduled run succeeds and then an off-schedule run fails. When the
> next scheduled run starts, it shouldn't proceed because the most recent
> task failed -- but it will look back one schedule_interval, jumping OVER
> the most recent run, and decide it's ok to proceed.
>
> 3. A DAG with a schedule that is paused for a while. This DAG could be
> running perfectly fine, but if it is paused for a while and then resumed,
> its depends_on_past tasks will look back one schedule_interval and see
> nothing, and therefore refuse to run.
>
> So my proposal is simply that the depends_on_past logic looks back at the
> most recent task as opposed to rigidly assuming there is a task one
> schedule_interval ago. For a regularly scheduled DAG, this will result in
> absolutely no behavior change. However it will robustly support a much
> wider variety of cases like the ones I listed above.
>
> J
>
>
> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
>> >>> "The clear fix seems to be to have depends_on_past check the last TI
>> that
>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>> with the intent of the flag. I will submit a fix."
>>
>> I don't think so. This would lead to skipping runs, which depends_on_past
>> is used as a guarantee to run every TI, sequentially.
>>
>> Absolute scheduling (cron expressions) is much better than relative
>> scheduling (origin + interval). Though it's easy to make relative
>> scheduling behave in an absolute way. You just have to use a rounded
>> start_date to your schedule_interval, and not move things around. Dynamic
>> start_dates have always been a problem and should probably not be
>> supported, though there's no way for us to tell.
>>
>> Changing the schedule interval or the "origin time" is a bit tricky and
>> should be documented.
>>
>> If people use depend_on_past=True and change the origin or the interval,
>> they basically redefine what "past" actually means and will require to
>> "mark success" or defining a new "start_date" as a way to say "please
>> disregard depend_on_past for this date"
>>
>> For those who haven't fully digested "What's the deal with start_dates",
>> please take the time to read it:
>> http://pythonhosted.org/airflow/faq.html
>>
>> Also see this part of the docs:
>>
>> [inline image omitted in the plain-text archive]
>>
>> Max
>>
>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com> wrote:
>>
>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>
>>> There are two large points that need to be addressed:
>>>
>>> 1. this particular issue arises because of an alignment issue between
>>> start_date and schedule_interval. This can only happen with cron-based
>>> schedule_intervals that describe absolute points in time (like “1am”) as
>>> opposed to time deltas (like “every hour”). Ironically, I once reported
>>> this same issue myself (#959). In the past (and in the docs) we have
>>> simply
>>> said that users must make sure the two params agree. We discussed the
>>> possibility of a DAG validation method to raise an error if the
>>> start_date
>>> and schedule_interval don’t align, but Bolke made the point (and I
>>> agreed)
>>> that in these cases, start_date is sort of like telling the scheduler to
>>> “start paying attention” as opposed to “this is my first execution date”.
>>> In #1427, the scheduler was being asked to start paying attention on
>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>>> it
>>> was scheduling a first run at midnight and a second run at 1:10.
>>>
>>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>>> change the scheduling logic so that the cron-based interval takes
>>> precedence over a start date. Specifically, the first date on or after
>>> the
>>> start_date that complies with the schedule_interval becomes the first
>>> execution date.
>>>
>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>> successful
>>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>>> very possible for the previous TI to have run at any time in the past,
>>> not
>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>>> with the situation described in point #1, where the first scheduled run
>>> is
>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>
>>> The clear fix seems to be to have depends_on_past check the last TI that
>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>> with the intent of the flag. I will submit a fix.
>>>
>>> -J
>>>
>>
>>

  

Re: GitHub Issue #1427 takeaways

Posted by Bolke de Bruin <bd...@gmail.com>.
All,

I have created AIRFLOW-20 to continue the discussion on the PR/Code. 

https://issues.apache.org/jira/browse/AIRFLOW-20

Bolke

> Op 28 apr. 2016, om 20:08 heeft Bolke de Bruin <bd...@gmail.com> het volgende geschreven:
> 
> Ok PR-1431 now passes all tests for mysql. Postgresql/sqlite have some alembic upgrade issues, that I don’t know yet how to solve. Basically, due to the fact that a unique constraint is added and dropped in the same transaction it does not work. So any ideas are welcome.
> 
> I have to revert my comment earlier on the deadlock issues not there anymore. I got confused as a depend_on_past block is reported as a deadlock by the backfill. I did update the tests to make sure there is a past to depend on (and thus fail if required).
> 
> On the first issue you mention, Jeremiah, a new state for DagRuns WAITING_FOR_PREVIOUS would be the way forward. This would allow you to walk the timeline. With the PR detecting this kind “deadlock” is already very simple.  For backfills, as a backfill job is currently immediate, the failure should also be immediate I think: not deadlocked but reported as cannot schedule as previous run not satisfied, which is easy now and we can even report which root-run has not finished yet with WAITING_FOR_PREVIOUS.
> 
> On the second issue, with the PR this is no longer an issue. If the backfill run is the first one it will know so, and just run. Hence this parameter superfluous in my opinion. I also consider the parameter a workaround currently.
> 
> Bolke
> 
>> Op 28 apr. 2016, om 14:11 heeft Jeremiah Lowin <jl...@gmail.com> het volgende geschreven:
>> 
>> The depends on past deadlock is the worst kind of deadlock (to catch in
>> code). Basically, if you're a DAG and you have "active" (runnable) tasks
>> but none of them have dependencies met, that's a deadlock. But if your
>> tasks depend_on_past we can't draw that conclusion since even though the
>> dependencies are unmet NOW they might be met in the future. However this
>> will still be an issue even with your change Bolke because let's say you
>> have 10 DagRuns active -- 9 of them will report that they can't start, but
>> we don't want to fail them because they are just waiting for past runs. So
>> I think the deadlock unit tests just need to change from looking at the
>> next run to looking 2+ runs out.
>> 
>> Ignore_first_depends_on_past is to allow arbitrary backfills of
>> depends_on_past tasks. In other words, if you want to backfill from A to B,
>> we want to make sure you can run the first one. It used to be that when you
>> ran a backfill, it overrode the start date of the task to trick the task in
>> to running on the first back filled date. Unfortunately that change didn't
>> survive into nested sub processes so it was a fragile hack and
>> ignore_first_depends_on_past replaced it. It's probably not ideal but it's
>> more robust.
>> On Thu, Apr 28, 2016 at 3:53 AM Bolke de Bruin <bd...@gmail.com> wrote:
>> 
>>> I just realized that an extra “benefit” (depending how you look at it) of
>>> this implementation is that many of the deadlock issues are gone.
>>> 
>>> * If there is no previous run, it will satisfy depend_on_past so the
>>> test_backfill_depends_on_past that checks for a deadlock (instead of maybe
>>> really checking for the dependency) will fail as there is no deadlock
>>> anymore.
>>> * test_cli_backfill_depends_on_past and test_dagrun_deadlock also fail as
>>> there is no deadlock anymore
>>> 
>>> Why does this not happen?
>>> 
>>> If a run detects it is the first run, it will start. This removes the
>>> requirement for ignore_first_depends_on_past (was this present for a real
>>> use case or a workaround for the scheduler?).
>>> 
>>> - Bolke
>>> 
>>>> Op 28 apr. 2016, om 09:08 heeft Bolke de Bruin <bd...@gmail.com> het
>>> volgende geschreven:
>>>> 
>>>> No I don’t think that covers it (although I get slightly confused by
>>> your comments "depends_on_past_schedule
>>>> “ and “depends_on_past_run”, but it seems too complex from the user’s
>>> perspective at first glance). The DagRun.previous handling is used to
>>> connect previous related DagRuns in order to make depends_on_past work with
>>> a moving interval or an automatically aligned start_date.
>>>> 
>>>> Imagine the following
>>>> 
>>>> Use Case 1
>>>> Dag1 has an assigned start_date "2016-04-24 00:00:00" with a cron
>>> schedule of “10 1 * * *”, ie run every day at 1.10am. Depends_on_past is
>>> true.
>>>> 
>>>> The current scheduler kicks off a run “2016-04-24 00:00:00” and then
>>> stops as the previous interval of “2016-04-24 01:10:00” is “2016-04-23
>>> 01:10:00”. This cannot be found as it has not taken place and thus the
>>> whole thing grinds to a halt and the TaskInstances refuse to run.
>>>> 
>>>> What the user expects here is that the first run is “2016-04-24
>>> 01:10:00”, ie start_date + interval, unless the start_date is on the
>>> interval, ie. start_date is first interval. This is what I address by
>>> start_date normalization in the PR. However, the second issue then kicks in
>>> as the “previous” run can still not be found.
>>>> 
>>>> Use Case 2
>>>> Dag2 has an assigned start_date "2016-04-24 01:10:00"  with a cron
>>> schedule of “10 1 * * *”. Depends_on_past is true.
>>>> 
>>>> The scheduler happily goes through a few runs, but then the dag is
>>> updated and the schedule adjusted. Because the previous interval cannot be
>>> found by the TaskInstance (emphasis), tasks get stuck again requiring an
>>> manual override.
>>>> 
>>>> What the user expects here is that the scheduler is smart enough to
>>> figure out that we are still running the same dag and that it needs to look
>>> up the previous run for that dag and make sure dependencies are met with
>>> that previous dagrun in mind.
>>>> 
>>>> I don’t think those two use cases are edge cases, considering the amount
>>> of questions we get on these subjects.
>>>> 
>>>> To resolve the remaining issues (aside from start_date normalization) I
>>> first made DagRun aware of its predecessors. Then I strengthened the
>>> relationship between TaskInstances and DagRuns slightly, by optionally
>>> including a dagrun_id in the TaskInstance. Now a TaskInstance can lookup
>>> its predecessors in the previous DagRun and know for sure that it is either
>>> the first run or it has a predecessor somewhere in time instead of guessing.
>>>> 
>>>> What I am unsure of is what to consider is the unit of work: 1) is a
>>> TaskInstance dependent on its past predecessor ignoring the outcome of the
>>> DagRun? (previous TaskInstance can be successful, but previous DagRun as a
>>> whole can fail) or 2) is it dependent on the outcome of the DagRun? 3) Can
>>> it be both? In case of 1 and 3 my logic needs to be updated slightly, but
>>> that should not be too much of a big deal. However I have difficulty
>>> imagining why you want to do that.
>>>> 
>>>> - Bolke
>>>> 
>>>> 
>>>>> Op 27 apr. 2016, om 18:17 heeft Maxime Beauchemin <
>>> maximebeauchemin@gmail.com> het volgende geschreven:
>>>>> 
>>>>> Added a few comments on the PR:
>>>>> 
>>>>> ------------------------------------------
>>>>> 
>>>>> So if I understand right the previous DagRun.id handling is to account
>>> for
>>>>> DAGs that have a mixed workload of scheduled and triggered DagRuns
>>> right?
>>>>> 
>>>>> Do we have real world use cases for those? I always though a DAG would
>>>>> either be scheduled, or externally triggered, not a mix of both, but we
>>>>> never made that a real restriction, so I'm sure some people mix both.
>>>>> 
>>>>> If that is the case that we want to support both, the notion of
>>>>> depends_on_past become mushy, where maybe people may expect or desire
>>>>> either a depends_on_past_schedule or depends_on_past_run.
>>>>> 
>>>>> Since it's pretty edge-case (combo of both mixed scheduled AND
>>>>> depend_on_past=True) and unclear what the desired behavior might be, I'd
>>>>> vote to do something simple in terms of logic and document it clearly.
>>> It
>>>>> could be something like "if the DAG is scheduled, then we use the latest
>>>>> DagRun that is >= previous schedule, if not scheduled then we take the
>>>>> previous DagRun, whatever it is at that time.", this would not require
>>>>> handling the previous_dag_run_id
>>>>> 
>>>>> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <bd...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> I have also given it some thought and some coding. I hope we agree that
>>>>>> from a user perspective
>>>>>> there currently is an issue. The need to align the start_date with the
>>>>>> interval is counter intuitive
>>>>>> and leads to a lot of questions and issue creation, although it is in
>>> the
>>>>>> documentation. If we are
>>>>>> able to fix this with none or little consequences for current setups
>>> that
>>>>>> should be preferred, I think.
>>>>>> The dependency explainer is really great work, but it doesn’t address
>>> the
>>>>>> core issue.
>>>>>> 
>>>>>> I do recognize the issue mentioned by Max and I agree with Jeremiah
>>> that
>>>>>> it is a separate issue,
>>>>>> however if we are to tackle the above it needs to be done in one go.
>>>>>> 
>>>>>> If you consider a DAG a description of cohesion between work items (in
>>> OOP
>>>>>> java terms
>>>>>> a class), then a DagRun is the instantiation of a DAG in time (in OOP
>>> java
>>>>>> terms an instance).
>>>>>> Tasks are then the description of a work item and a TaskInstance the
>>>>>> instantiation of the Task in time.
>>>>>> 
>>>>>> In my opinion issues pop up due to the current paradigm of considering
>>> the
>>>>>> TaskInstance
>>>>>> the smallest unit of work and asking it to maintain its own state in
>>>>>> relation to other TaskInstances
>>>>>> in a DagRun and in a previous DagRun of which it has no (real)
>>> perception.
>>>>>> Tasks are instantiated
>>>>>> by a cartesian product with the dates of DagRun instead of the DagRuns
>>>>>> itself.
>>>>>> 
>>>>>> The very loose coupling between DagRuns and TaskInstances can be
>>> improved
>>>>>> while maintaining
>>>>>> flexibility to run tasks without a DagRun. This would help with a
>>> couple
>>>>>> of things:
>>>>>> 
>>>>>> 1. start_date can be used as a ‘execution_date’ or a point in time
>>> when to
>>>>>> start looking
>>>>>> 2. a new interval for a dag will maintain depends_on_past
>>>>>> 3. paused dags do not give trouble
>>>>>> 4. tasks will be executed in order
>>>>>> 5. the ignore_first_depend_on_past could be removed as a task will now
>>>>>> know if it is really the first
>>>>>> 
>>>>>> In PR-1431 a lot of this work has been done by:
>>>>>> 
>>>>>> 1. Adding a “previous” field to a DagRun allowing it to connect to its
>>>>>> predecessor
>>>>>> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about
>>> the
>>>>>> DagRun if needed
>>>>>> 3. Using start_date + interval as the first run date unless start_date
>>> is
>>>>>> on the interval then start_date is the first run date
>>>>>> 
>>>>>> The downside of the PR would be that TaskInstances without a DagRun
>>> will
>>>>>> not honor
>>>>>> depend_on_past, so the backfill would need to be adjusted to create a
>>>>>> DagRun
>>>>>> and properly connect it to a previous DagRun if that exists. Not too
>>> much
>>>>>> of a big
>>>>>> deal, I think - will work on that a bit as well in the PR.
>>>>>> 
>>>>>> So this should give at least an idea where things could head to. Any
>>>>>> thoughts?
>>>>>> 
>>>>>> Cheers
>>>>>> Bolke
>>>>>> 
>>>>>>> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org>
>>> het
>>>>>> volgende geschreven:
>>>>>>> 
>>>>>>> On further thought, I understand the issue you're describing where
>>> this
>>>>>>> could lead to out-of-order runs. In fact, Bolke alerted me to the
>>>>>>> possibility earlier but I didn't make the connection! That feels like
>>> a
>>>>>>> separate issue -- to guarantee that tasks are executed in order (and
>>> more
>>>>>>> importantly that their database entries are created). I think the
>>>>>>> depends_on_past issue is related but separate -- though clearly needs
>>> to
>>>>>> be
>>>>>>> fleshed out for all cases :)
>>>>>>> 
>>>>>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> I'm afraid I disagree -- though we may be talking about two different
>>>>>>>> issues. This issue deals specifically with how to identify the
>>> "past" TI
>>>>>>>> when evaluating "depends_on_past", and shouldn't be impacted by
>>> shifting
>>>>>>>> start_date, transparently or not.
>>>>>>>> 
>>>>>>>> Here are three valid examples of depends_on_past DAGs that would
>>> fail to
>>>>>>>> run with the current setup:
>>>>>>>> 
>>>>>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc
>>>>>>>> backfill. Without a schedule_interval, depends_on_past will always
>>> fail
>>>>>>>> (since it looks back one schedule_interval).
>>>>>>>> 
>>>>>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule.
>>> Let's
>>>>>>>> say a scheduled run succeeds and then an off-schedule run fails. When
>>>>>> the
>>>>>>>> next scheduled run starts, it shouldn't proceed because the most
>>> recent
>>>>>>>> task failed -- but it will look back one schedule_interval, jumping
>>> OVER
>>>>>>>> the most recent run, and decide it's ok to proceed.
>>>>>>>> 
>>>>>>>> 3. A DAG with a schedule that is paused for a while. This DAG could
>>> be
>>>>>>>> running perfectly fine, but if it is paused for a while and then
>>>>>> resumed,
>>>>>>>> its depends_on_past tasks will look back one schedule_interval and
>>> see
>>>>>>>> nothing, and therefore refuse to run.
>>>>>>>> 
>>>>>>>> So my proposal is simply that the depends_on_past logic looks back at
>>>>>> the
>>>>>>>> most recent task as opposed to rigidly assuming there is a task one
>>>>>>>> schedule_interval ago. For a regularly scheduled DAG, this will
>>> result
>>>>>> in
>>>>>>>> absolutely no behavior change. However it will robustly support a
>>> much
>>>>>>>> wider variety of cases like the ones I listed above.
>>>>>>>> 
>>>>>>>> J
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
>>>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>>>>> "The clear fix seems to be to have depends_on_past check the
>>> last TI
>>>>>>>>> that
>>>>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
>>>>>> line
>>>>>>>>> with the intent of the flag. I will submit a fix."
>>>>>>>>> 
>>>>>>>>> I don't think so. This would lead to skipping runs, which
>>>>>> depends_on_past
>>>>>>>>> is used as a guarantee to run every TI, sequentially.
>>>>>>>>> 
>>>>>>>>> Absolute scheduling (cron expressions) is much better than relative
>>>>>>>>> scheduling (origin + interval). Though it's easy to make relative
>>>>>>>>> scheduling behave in an absolute way. You just have to use a rounded
>>>>>>>>> start_date to your schedule_interval, and not move things around.
>>>>>> Dynamic
>>>>>>>>> start_dates have always been a problem and should probably not be
>>>>>>>>> supported, though there's no way for us to tell.
>>>>>>>>> 
>>>>>>>>> Changing the schedule interval or the "origin time" is a bit tricky
>>> and
>>>>>>>>> should be documented.
>>>>>>>>> 
>>>>>>>>> If people use depend_on_past=True and change the origin or the
>>>>>> interval,
>>>>>>>>> they basically redefine what "past" actually means and will require
>>> to
>>>>>>>>> "mark success" or defining a new "start_date" as a way to say
>>> "please
>>>>>>>>> disregard depend_on_past for this date"
>>>>>>>>> 
>>>>>>>>> For those who haven't fully digested "What's the deal with
>>>>>> start_dates",
>>>>>>>>> please take the time to read it:
>>>>>>>>> http://pythonhosted.org/airflow/faq.html
>>>>>>>>> 
>>>>>>>>> Also see this part of the docs:
>>>>>>>>> 
>>>>>>>>> [inline image omitted in the plain-text archive]
>>>>>>>>> 
>>>>>>>>> Max
>>>>>>>>> 
>>>>>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>>>>>>>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>>>>>>>> 
>>>>>>>>>> There are two large points that need to be addressed:
>>>>>>>>>> 
>>>>>>>>>> 1. this particular issue arises because of an alignment issue
>>> between
>>>>>>>>>> start_date and schedule_interval. This can only happen with
>>> cron-based
>>>>>>>>>> schedule_intervals that describe absolute points in time (like
>>> “1am”)
>>>>>> as
>>>>>>>>>> opposed to time deltas (like “every hour”). Ironically, I once
>>>>>> reported
>>>>>>>>>> this same issue myself (#959). In the past (and in the docs) we
>>> have
>>>>>>>>>> simply
>>>>>>>>>> said that users must make sure the two params agree. We discussed
>>> the
>>>>>>>>>> possibility of a DAG validation method to raise an error if the
>>>>>>>>>> start_date
>>>>>>>>>> and schedule_interval don’t align, but Bolke made the point (and I
>>>>>>>>>> agreed)
>>>>>>>>>> that in these cases, start_date is sort of like telling the
>>> scheduler
>>>>>> to
>>>>>>>>>> “start paying attention” as opposed to “this is my first execution
>>>>>> date”.
>>>>>>>>>> In #1427, the scheduler was being asked to start paying attention
>>> on
>>>>>>>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
>>>>>> However,
>>>>>>>>>> it
>>>>>>>>>> was scheduling a first run at midnight and a second run at 1:10.
>>>>>>>>>> 
>>>>>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is
>>>>>> going to
>>>>>>>>>> change the scheduling logic so that the cron-based interval takes
>>>>>>>>>> precedence over a start date. Specifically, the first date on or
>>> after
>>>>>>>>>> the
>>>>>>>>>> start_date that complies with the schedule_interval becomes the
>>> first
>>>>>>>>>> execution date.
>>>>>>>>>> 
>>>>>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>>>>>>>>> successful
>>>>>>>>>> TI at `execution_date - schedule_interval`. This is fragile, since
>>> it
>>>>>> is
>>>>>>>>>> very possible for the previous TI to have run at any time in the
>>> past,
>>>>>>>>>> not
>>>>>>>>>> just one schedule_interval ago. This can happen easily with ad-hoc
>>> DAG
>>>>>>>>>> runs, and also if a DAG was paused for a while. Less commonly, it
>>>>>> happens
>>>>>>>>>> with the situation described in point #1, where the first scheduled
>>>>>> run
>>>>>>>>>> is
>>>>>>>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>>>>>>>> 
>>>>>>>>>> The clear fix seems to be to have depends_on_past check the last TI
>>>>>> that
>>>>>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's
>>> in
>>>>>> line
>>>>>>>>>> with the intent of the flag. I will submit a fix.
>>>>>>>>>> 
>>>>>>>>>> -J
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
> 


Re: GitHub Issue #1427 takeaways

Posted by Bolke de Bruin <bd...@gmail.com>.
Ok, PR-1431 now passes all tests for mysql. Postgresql/sqlite have some alembic upgrade issues that I don’t yet know how to solve. Basically, because a unique constraint is added and dropped in the same transaction, it does not work. So any ideas are welcome.

I have to retract my earlier comment about the deadlock issues no longer being there. I got confused because a depend_on_past block is reported as a deadlock by the backfill. I did update the tests to make sure there is a past to depend on (and thus fail if required).

On the first issue you mention, Jeremiah, a new state for DagRuns, WAITING_FOR_PREVIOUS, would be the way forward. This would allow you to walk the timeline. With the PR, detecting this kind of “deadlock” is already very simple. For backfills, as a backfill job is currently immediate, the failure should also be immediate I think: not deadlocked, but reported as “cannot schedule because the previous run is not satisfied”, which is easy now, and we can even report which root run has not finished yet with WAITING_FOR_PREVIOUS.
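
A purely hypothetical sketch of how such a state might be used together with
the "previous DagRun" link the PR introduces (none of these identifiers are
real Airflow names, and the PR may do this differently):

    # Hypothetical only: WAITING_FOR_PREVIOUS does not exist in Airflow today.
    SUCCESS = "success"
    RUNNING = "running"
    WAITING_FOR_PREVIOUS = "waiting_for_previous"  # proposed new DagRun state

    def next_state(dag_run, get_previous):
        """Decide whether a depends_on_past DagRun may start."""
        previous = get_previous(dag_run)      # follows the PR's 'previous' link
        if previous is None:                  # first run ever: nothing to wait for
            return RUNNING
        if previous.state == SUCCESS:
            return RUNNING
        return WAITING_FOR_PREVIOUS           # blocked, not deadlocked

    def unfinished_root(dag_run, get_previous):
        """Walk the 'previous' links back to the earliest unfinished run,
        so the scheduler/backfill can report exactly what it is waiting on."""
        root = None
        current = get_previous(dag_run)
        while current is not None and current.state != SUCCESS:
            root = current
            current = get_previous(current)
        return root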

On the second issue, with the PR this is no longer an issue. If the backfill run is the first one, it will know so and just run. Hence this parameter is superfluous in my opinion. I also consider the parameter a workaround currently.

Bolke

> Op 28 apr. 2016, om 14:11 heeft Jeremiah Lowin <jl...@gmail.com> het volgende geschreven:
> 
> The depends on past deadlock is the worst kind of deadlock (to catch in
> code). Basically, if you're a DAG and you have "active" (runnable) tasks
> but none of them have dependencies met, that's a deadlock. But if your
> tasks depend_on_past we can't draw that conclusion since even though the
> dependencies are unmet NOW they might be met in the future. However this
> will still be an issue even with your change Bolke because let's say you
> have 10 DagRuns active -- 9 of them will report that they can't start, but
> we don't want to fail them because they are just waiting for past runs. So
> I think the deadlock unit tests just need to change from looking at the
> next run to looking 2+ runs out.
> 
> Ignore_first_depends_on_past is to allow arbitrary backfills of
> depends_on_past tasks. In other words, if you want to backfill from A to B,
> we want to make sure you can run the first one. It used to be that when you
> ran a backfill, it overrode the start date of the task to trick the task in
> to running on the first back filled date. Unfortunately that change didn't
> survive into nested sub processes so it was a fragile hack and
> ignore_first_depends_on_past replaced it. It's probably not ideal but it's
> more robust.
> On Thu, Apr 28, 2016 at 3:53 AM Bolke de Bruin <bd...@gmail.com> wrote:
> 
>> I just realized that an extra “benefit” (depending how you look at it) of
>> this implementation is that many of the deadlock issues are gone.
>> 
>> * If there is no previous run, it will satisfy depend_on_past so the
>> test_backfill_depends_on_past that checks for a deadlock (instead of maybe
>> really checking for the dependency) will fail as there is no deadlock
>> anymore.
>> * test_cli_backfill_depends_on_past and test_dagrun_deadlock also fail as
>> there is no deadlock anymore
>> 
>> Why does this not happen?
>> 
>> If a run detects it is the first run, it will start. This removes the
>> requirement for ignore_first_depends_on_past (was this present for a real
>> use case or a workaround for the scheduler?).
>> 
>> - Bolke
>> 
>>> Op 28 apr. 2016, om 09:08 heeft Bolke de Bruin <bd...@gmail.com> het
>> volgende geschreven:
>>> 
>>> No I don’t think that covers it (although I get slightly confused by
>> your comments "depends_on_past_schedule
>>> “ and “depends_on_past_run”, but it seems too complex from the user’s
>> perspective at first glance). The DagRun.previous handling is used to
>> connect previous related DagRuns in order to make depends_on_past work with
>> a moving interval or an automatically aligned start_date.
>>> 
>>> Imagine the following
>>> 
>>> Use Case 1
>>> Dag1 has an assigned start_date "2016-04-24 00:00:00" with a cron
>> schedule of “10 1 * * *”, ie run every day at 1.10am. Depends_on_past is
>> true.
>>> 
>>> The current scheduler kicks off a run “2016-04-24 00:00:00” and then
>> stops as the previous interval of “2016-04-24 01:10:00” is “2016-04-23
>> 01:10:00”. This cannot be found as it has not taken place and thus the
>> whole thing grinds to a halt and the TaskInstances refuse to run.
>>> 
>>> What the user expects here is that the first run is “2016-04-24
>> 01:10:00”, ie start_date + interval, unless the start_date is on the
>> interval, ie. start_date is first interval. This is what I address by
>> start_date normalization in the PR. However, the second issue then kicks in
>> as the “previous” run can still not be found.
>>> 
>>> Use Case 2
>>> Dag2 has an assigned start_date "2016-04-24 01:10:00"  with a cron
>> schedule of “10 1 * * *”. Depends_on_past is true.
>>> 
>>> The scheduler happily goes through a few runs, but then the dag is
>> updated and the schedule adjusted. Because the previous interval cannot be
>> found by the TaskInstance (emphasis), tasks get stuck again requiring an
>> manual override.
>>> 
>>> What the user expects here is that the scheduler is smart enough to
>> figure out that we are still running the same dag and that it needs to look
>> up the previous run for that dag and make sure dependencies are met with
>> that previous dagrun in mind.
>>> 
>>> I don’t think those two use cases are edge cases, considering the amount
>> of questions we get on these subjects.
>>> 
>>> To resolve the remaining issues (aside from start_date normalization) I
>> first made DagRun aware of its predecessors. Then I strengthened the
>> relationship between TaskInstances and DagRuns slightly, by optionally
>> including a dagrun_id in the TaskInstance. Now a TaskInstance can lookup
>> its predecessors in the previous DagRun and know for sure that it is either
>> the first run or it has a predecessor somewhere in time instead of guessing.
>>> 
>>> What I am unsure of is what to consider is the unit of work: 1) is a
>> TaskInstance dependent on its past predecessor ignoring the outcome of the
>> DagRun? (previous TaskInstance can be successful, but previous DagRun as a
>> whole can fail) or 2) is it dependent on the outcome of the DagRun? 3) Can
>> it be both? In case of 1 and 3 my logic needs to be updated slightly, but
>> that should not be too much of a big deal. However I have difficulty
>> imagining why you want to do that.
>>> 
>>> - Bolke
>>> 
>>> 
>>>> Op 27 apr. 2016, om 18:17 heeft Maxime Beauchemin <
>> maximebeauchemin@gmail.com> het volgende geschreven:
>>>> 
>>>> Added a few comments on the PR:
>>>> 
>>>> ------------------------------------------
>>>> 
>>>> So if I understand right the previous DagRun.id handling is to account
>> for
>>>> DAGs that have a mixed workload of scheduled and triggered DagRuns
>> right?
>>>> 
>>>> Do we have real world use cases for those? I always though a DAG would
>>>> either be scheduled, or externally triggered, not a mix of both, but we
>>>> never made that a real restriction, so I'm sure some people mix both.
>>>> 
>>>> If that is the case that we want to support both, the notion of
>>>> depends_on_past become mushy, where maybe people may expect or desire
>>>> either a depends_on_past_schedule or depends_on_past_run.
>>>> 
>>>> Since it's pretty edge-case (combo of both mixed scheduled AND
>>>> depend_on_past=True) and unclear what the desired behavior might be, I'd
>>>> vote to do something simple in terms of logic and document it clearly.
>> It
>>>> could be something like "if the DAG is scheduled, then we use the latest
>>>> DagRun that is >= previous schedule, if not scheduled then we take the
>>>> previous DagRun, whatever it is at that time.", this would not require
>>>> handling the previous_dag_run_id
>>>> 
>>>> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <bd...@gmail.com>
>> wrote:
>>>> 
>>>>> I have also given it some thought and some coding. I hope we agree that
>>>>> from a user perspective
>>>>> there currently is an issue. The need to align the start_date with the
>>>>> interval is counter intuitive
>>>>> and leads to a lot of questions and issue creation, although it is in
>> the
>>>>> documentation. If we are
>>>>> able to fix this with none or little consequences for current setups
>> that
>>>>> should be preferred, I think.
>>>>> The dependency explainer is really great work, but it doesn’t address
>> the
>>>>> core issue.
>>>>> 
>>>>> I do recognize the issue mentioned by Max and I agree with Jeremiah
>> that
>>>>> it is a separate issue,
>>>>> however if we are to tackle the above it needs to be done in one go.
>>>>> 
>>>>> If you consider a DAG a description of cohesion between work items (in
>> OOP
>>>>> java terms
>>>>> a class), then a DagRun is the instantiation of a DAG in time (in OOP
>> java
>>>>> terms an instance).
>>>>> Tasks are then the description of a work item and a TaskInstance the
>>>>> instantiation of the Task in time.
>>>>> 
>>>>> In my opinion issues pop up due to the current paradigm of considering
>> the
>>>>> TaskInstance
>>>>> the smallest unit of work and asking it to maintain its own state in
>>>>> relation to other TaskInstances
>>>>> in a DagRun and in a previous DagRun of which it has no (real)
>> perception.
>>>>> Tasks are instantiated
>>>>> by a cartesian product with the dates of DagRun instead of the DagRuns
>>>>> itself.
>>>>> 
>>>>> The very loose coupling between DagRuns and TaskInstances can be
>> improved
>>>>> while maintaining
>>>>> flexibility to run tasks without a DagRun. This would help with a
>> couple
>>>>> of things:
>>>>> 
>>>>> 1. start_date can be used as a ‘execution_date’ or a point in time
>> when to
>>>>> start looking
>>>>> 2. a new interval for a dag will maintain depends_on_past
>>>>> 3. paused dags do not give trouble
>>>>> 4. tasks will be executed in order
>>>>> 5. the ignore_first_depend_on_past could be removed as a task will now
>>>>> know if it is really the first
>>>>> 
>>>>> In PR-1431 a lot of this work has been done by:
>>>>> 
>>>>> 1. Adding a “previous” field to a DagRun allowing it to connect to its
>>>>> predecessor
>>>>> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about
>> the
>>>>> DagRun if needed
>>>>> 3. Using start_date + interval as the first run date unless start_date
>> is
>>>>> on the interval then start_date is the first run date
>>>>> 
>>>>> The downside of the PR would be that TaskInstances without a DagRun
>> will
>>>>> not honor
>>>>> depend_on_past, so the backfill would need to be adjusted to create a
>>>>> DagRun
>>>>> and properly connect it to a previous DagRun if that exists. Not too
>> much
>>>>> of a big
>>>>> deal, I think - will work on that a bit as well in the PR.
>>>>> 
>>>>> So this should give at least an idea where things could head to. Any
>>>>> thoughts?
>>>>> 
>>>>> Cheers
>>>>> Bolke
>>>>> 
>>>>>> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org>
>> het
>>>>> volgende geschreven:
>>>>>> 
>>>>>> On further thought, I understand the issue you're describing where
>> this
>>>>>> could lead to out-of-order runs. In fact, Bolke alerted me to the
>>>>>> possibility earlier but I didn't make the connection! That feels like
>> a
>>>>>> separate issue -- to guarantee that tasks are executed in order (and
>> more
>>>>>> importantly that their database entries are created). I think the
>>>>>> depends_on_past issue is related but separate -- though clearly needs
>> to
>>>>> be
>>>>>> fleshed out for all cases :)
>>>>>> 
>>>>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org>
>>>>> wrote:
>>>>>> 
>>>>>>> I'm afraid I disagree -- though we may be talking about two different
>>>>>>> issues. This issue deals specifically with how to identify the
>> "past" TI
>>>>>>> when evaluating "depends_on_past", and shouldn't be impacted by
>> shifting
>>>>>>> start_date, transparently or not.
>>>>>>> 
>>>>>>> Here are three valid examples of depends_on_past DAGs that would
>> fail to
>>>>>>> run with the current setup:
>>>>>>> 
>>>>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc
>>>>>>> backfill. Without a schedule_interval, depends_on_past will always
>> fail
>>>>>>> (since it looks back one schedule_interval).
>>>>>>> 
>>>>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule.
>> Let's
>>>>>>> say a scheduled run succeeds and then an off-schedule run fails. When
>>>>> the
>>>>>>> next scheduled run starts, it shouldn't proceed because the most
>> recent
>>>>>>> task failed -- but it will look back one schedule_interval, jumping
>> OVER
>>>>>>> the most recent run, and decide it's ok to proceed.
>>>>>>> 
>>>>>>> 3. A DAG with a schedule that is paused for a while. This DAG could
>> be
>>>>>>> running perfectly fine, but if it is paused for a while and then
>>>>> resumed,
>>>>>>> its depends_on_past tasks will look back one schedule_interval and
>> see
>>>>>>> nothing, and therefore refuse to run.
>>>>>>> 
>>>>>>> So my proposal is simply that the depends_on_past logic looks back at
>>>>> the
>>>>>>> most recent task as opposed to rigidly assuming there is a task one
>>>>>>> schedule_interval ago. For a regularly scheduled DAG, this will
>> result
>>>>> in
>>>>>>> absolutely no behavior change. However it will robustly support a
>> much
>>>>>>> wider variety of cases like the ones I listed above.
>>>>>>> 
>>>>>>> J
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
>>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>>> 
>>>>>>>>>>> "The clear fix seems to be to have depends_on_past check the
>> last TI
>>>>>>>> that
>>>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
>>>>> line
>>>>>>>> with the intent of the flag. I will submit a fix."
>>>>>>>> 
>>>>>>>> I don't think so. This would lead to skipping runs, which
>>>>> depends_on_past
>>>>>>>> is used as a guarantee to run every TI, sequentially.
>>>>>>>> 
>>>>>>>> Absolute scheduling (cron expressions) is much better than relative
>>>>>>>> scheduling (origin + interval). Though it's easy to make relative
>>>>>>>> scheduling behave in an absolute way. You just have to use a rounded
>>>>>>>> start_date to your schedule_interval, and not move things around.
>>>>> Dynamic
>>>>>>>> start_dates have always been a problem and should probably not be
>>>>>>>> supported, though there's no way for us to tell.
>>>>>>>> 
>>>>>>>> Changing the schedule interval or the "origin time" is a bit tricky
>> and
>>>>>>>> should be documented.
>>>>>>>> 
>>>>>>>> If people use depend_on_past=True and change the origin or the
>>>>> interval,
>>>>>>>> they basically redefine what "past" actually means and will require
>> to
>>>>>>>> "mark success" or defining a new "start_date" as a way to say
>> "please
>>>>>>>> disregard depend_on_past for this date"
>>>>>>>> 
>>>>>>>> For those who haven't fully digested "What's the deal with
>>>>> start_dates",
>>>>>>>> please take the time to read it:
>>>>>>>> http://pythonhosted.org/airflow/faq.html
>>>>>>>> 
>>>>>>>> Also see this part of the docs:
>>>>>>>> 
>>>>>>>> ​
>>>>>>>> 
>>>>>>>> Max
>>>>>>>> 
>>>>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>>>>>>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>>>>>>> 
>>>>>>>>> There are two large points that need to be addressed:
>>>>>>>>> 
>>>>>>>>> 1. this particular issue arises because of an alignment issue
>> between
>>>>>>>>> start_date and schedule_interval. This can only happen with
>> cron-based
>>>>>>>>> schedule_intervals that describe absolute points in time (like
>> “1am”)
>>>>> as
>>>>>>>>> opposed to time deltas (like “every hour”). Ironically, I once
>>>>> reported
>>>>>>>>> this same issue myself (#959). In the past (and in the docs) we
>> have
>>>>>>>>> simply
>>>>>>>>> said that users must make sure the two params agree. We discussed
>> the
>>>>>>>>> possibility of a DAG validation method to raise an error if the
>>>>>>>>> start_date
>>>>>>>>> and schedule_interval don’t align, but Bolke made the point (and I
>>>>>>>>> agreed)
>>>>>>>>> that in these cases, start_date is sort of like telling the
>> scheduler
>>>>> to
>>>>>>>>> “start paying attention” as opposed to “this is my first execution
>>>>> date”.
>>>>>>>>> In #1427, the scheduler was being asked to start paying attention
>> on
>>>>>>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
>>>>> However,
>>>>>>>>> it
>>>>>>>>> was scheduling a first run at midnight and a second run at 1:10.
>>>>>>>>> 
>>>>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is
>>>>> going to
>>>>>>>>> change the scheduling logic so that the cron-based interval takes
>>>>>>>>> precedence over a start date. Specifically, the first date on or
>> after
>>>>>>>>> the
>>>>>>>>> start_date that complies with the schedule_interval becomes the
>> first
>>>>>>>>> execution date.
>>>>>>>>> 
>>>>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>>>>>>>> successful
>>>>>>>>> TI at `execution_date - schedule_interval`. This is fragile, since
>> it
>>>>> is
>>>>>>>>> very possible for the previous TI to have run at any time in the
>> past,
>>>>>>>>> not
>>>>>>>>> just one schedule_interval ago. This can happen easily with ad-hoc
>> DAG
>>>>>>>>> runs, and also if a DAG was paused for a while. Less commonly, it
>>>>> happens
>>>>>>>>> with the situation described in point #1, where the first scheduled
>>>>> run
>>>>>>>>> is
>>>>>>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>>>>>>> 
>>>>>>>>> The clear fix seems to be to have depends_on_past check the last TI
>>>>> that
>>>>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's
>> in
>>>>> line
>>>>>>>>> with the intent of the flag. I will submit a fix.
>>>>>>>>> 
>>>>>>>>> -J
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 


Re: GitHub Issue #1427 takeaways

Posted by Jeremiah Lowin <jl...@gmail.com>.
The depends_on_past deadlock is the worst kind of deadlock (to catch in
code). Basically, if you're a DAG and you have "active" (runnable) tasks
but none of them have dependencies met, that's a deadlock. But if your
tasks depend_on_past, we can't draw that conclusion, since even though the
dependencies are unmet NOW they might be met in the future. However, this
will still be an issue even with your change, Bolke, because let's say you
have 10 DagRuns active -- 9 of them will report that they can't start, but
we don't want to fail them because they are just waiting for past runs. So
I think the deadlock unit tests just need to change from looking at the
next run to looking 2+ runs out.
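
To make that concrete, here is a minimal sketch of the check I have in
mind -- plain Python over illustrative dicts, not Airflow's actual API:

    def dagrun_deadlocked(unfinished_tasks):
        # A DagRun only counts as deadlocked when it still has unfinished
        # tasks, none of them can run right now, and none of them are
        # merely waiting on a depends_on_past dependency that could still
        # be met once an earlier run finishes.
        if not unfinished_tasks:
            return False
        runnable = [t for t in unfinished_tasks if t["deps_met"]]
        waiting_on_past = [t for t in unfinished_tasks
                           if t["depends_on_past"] and not t["past_dep_met"]]
        return not runnable and not waiting_on_past

    # The "9 of 10 DagRuns waiting" case above: not a deadlock.
    waiting = [{"deps_met": False, "depends_on_past": True, "past_dep_met": False}]
    assert dagrun_deadlocked(waiting) is False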

Ignore_first_depends_on_past is to allow arbitrary backfills of
depends_on_past tasks. In other words, if you want to backfill from A to B,
we want to make sure you can run the first one. It used to be that when you
ran a backfill, it overrode the start date of the task to trick the task
into running on the first backfilled date. Unfortunately that change didn't
survive into nested subprocesses, so it was a fragile hack and
ignore_first_depends_on_past replaced it. It's probably not ideal but it's
more robust.
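
As a rough illustration of what the flag buys a backfill -- the names here
are made up for the example, this is not the backfill job's real interface:

    def backfill_plan(execution_dates, depends_on_past):
        # Only the very first backfilled date gets to skip the
        # depends_on_past check, so an arbitrary A-to-B range can start
        # even though A's predecessor was never run. Every later date
        # still honors depends_on_past as usual.
        for i, execution_date in enumerate(sorted(execution_dates)):
            ignore_first = depends_on_past and i == 0
            yield execution_date, ignore_first
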
On Thu, Apr 28, 2016 at 3:53 AM Bolke de Bruin <bd...@gmail.com> wrote:

> I just realized that an extra “benefit” (depending how you look at it) of
> this implementation is that many of the deadlock issues are gone.
>
> * If there is no previous run, it will satisfy depend_on_past so the
> test_backfill_depends_on_past that checks for a deadlock (instead of maybe
> really checking for the dependency) will fail as there is no deadlock
> anymore.
> * test_cli_backfill_depends_on_past and test_dagrun_deadlock also fail as
> there is no deadlock anymore
>
> Why does this not happen?
>
> If a run detects it is the first run, it will start. This removes the
> requirement for ignore_first_depends_on_past (was this present for a real
> use case or a workaround for the scheduler?).
>
> - Bolke
>
> > Op 28 apr. 2016, om 09:08 heeft Bolke de Bruin <bd...@gmail.com> het
> volgende geschreven:
> >
> > No I don’t think that covers it (although I get slightly confused by
> your comments "depends_on_past_schedule
> > “ and “depends_on_past_run”, but it seems too complex from the user’s
> perspective at first glance). The DagRun.previous handling is used to
> connect previous related DagRuns in order to make depends_on_past work with
> a moving interval or an automatically aligned start_date.
> >
> > Imagine the following
> >
> > Use Case 1
> > Dag1 has an assigned start_date "2016-04-24 00:00:00" with a cron
> schedule of “10 1 * * *”, ie run every day at 1.10am. Depends_on_past is
> true.
> >
> > The current scheduler kicks off a run “2016-04-24 00:00:00” and then
> stops as the previous interval of “2016-04-24 01:10:00” is “2016-04-23
> 01:10:00”. This cannot be found as it has not taken place and thus the
> whole thing grinds to a halt and the TaskInstances refuse to run.
> >
> > What the user expects here is that the first run is “2016-04-24
> 01:10:00”, ie start_date + interval, unless the start_date is on the
> interval, ie. start_date is first interval. This is what I address by
> start_date normalization in the PR. However, the second issue then kicks in
> as the “previous” run can still not be found.
> >
> > Use Case 2
> > Dag2 has an assigned start_date "2016-04-24 01:10:00"  with a cron
> schedule of “10 1 * * *”. Depends_on_past is true.
> >
> > The scheduler happily goes through a few runs, but then the dag is
> updated and the schedule adjusted. Because the previous interval cannot be
> found by the TaskInstance (emphasis), tasks get stuck again requiring an
> manual override.
> >
> > What the user expects here is that the scheduler is smart enough to
> figure out that we are still running the same dag and that it needs to look
> up the previous run for that dag and make sure dependencies are met with
> that previous dagrun in mind.
> >
> > I don’t think those two use cases are edge cases, considering the amount
> of questions we get on these subjects.
> >
> > To resolve the remaining issues (aside from start_date normalization) I
> first made DagRun aware of its predecessors. Then I strengthened the
> relationship between TaskInstances and DagRuns slightly, by optionally
> including a dagrun_id in the TaskInstance. Now a TaskInstance can lookup
> its predecessors in the previous DagRun and know for sure that it is either
> the first run or it has a predecessor somewhere in time instead of guessing.
> >
> > What I am unsure of is what to consider is the unit of work: 1) is a
> TaskInstance dependent on its past predecessor ignoring the outcome of the
> DagRun? (previous TaskInstance can be successful, but previous DagRun as a
> whole can fail) or 2) is it dependent on the outcome of the DagRun? 3) Can
> it be both? In case of 1 and 3 my logic needs to be updated slightly, but
> that should not be too much of a big deal. However I have difficulty
> imagining why you want to do that.
> >
> > - Bolke
> >
> >
> >> Op 27 apr. 2016, om 18:17 heeft Maxime Beauchemin <
> maximebeauchemin@gmail.com> het volgende geschreven:
> >>
> >> Added a few comments on the PR:
> >>
> >> ------------------------------------------
> >>
> >> So if I understand right the previous DagRun.id handling is to account
> for
> >> DAGs that have a mixed workload of scheduled and triggered DagRuns
> right?
> >>
> >> Do we have real world use cases for those? I always though a DAG would
> >> either be scheduled, or externally triggered, not a mix of both, but we
> >> never made that a real restriction, so I'm sure some people mix both.
> >>
> >> If that is the case that we want to support both, the notion of
> >> depends_on_past become mushy, where maybe people may expect or desire
> >> either a depends_on_past_schedule or depends_on_past_run.
> >>
> >> Since it's pretty edge-case (combo of both mixed scheduled AND
> >> depend_on_past=True) and unclear what the desired behavior might be, I'd
> >> vote to do something simple in terms of logic and document it clearly.
> It
> >> could be something like "if the DAG is scheduled, then we use the latest
> >> DagRun that is >= previous schedule, if not scheduled then we take the
> >> previous DagRun, whatever it is at that time.", this would not require
> >> handling the previous_dag_run_id
> >>
> >> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <bd...@gmail.com>
> wrote:
> >>
> >>> I have also given it some thought and some coding. I hope we agree that
> >>> from a user perspective
> >>> there currently is an issue. The need to align the start_date with the
> >>> interval is counter intuitive
> >>> and leads to a lot of questions and issue creation, although it is in
> the
> >>> documentation. If we are
> >>> able to fix this with none or little consequences for current setups
> that
> >>> should be preferred, I think.
> >>> The dependency explainer is really great work, but it doesn’t address
> the
> >>> core issue.
> >>>
> >>> I do recognize the issue mentioned by Max and I agree with Jeremiah
> that
> >>> it is a separate issue,
> >>> however if we are to tackle the above it needs to be done in one go.
> >>>
> >>> If you consider a DAG a description of cohesion between work items (in
> OOP
> >>> java terms
> >>> a class), then a DagRun is the instantiation of a DAG in time (in OOP
> java
> >>> terms an instance).
> >>> Tasks are then the description of a work item and a TaskInstance the
> >>> instantiation of the Task in time.
> >>>
> >>> In my opinion issues pop up due to the current paradigm of considering
> the
> >>> TaskInstance
> >>> the smallest unit of work and asking it to maintain its own state in
> >>> relation to other TaskInstances
> >>> in a DagRun and in a previous DagRun of which it has no (real)
> perception.
> >>> Tasks are instantiated
> >>> by a cartesian product with the dates of DagRun instead of the DagRuns
> >>> itself.
> >>>
> >>> The very loose coupling between DagRuns and TaskInstances can be
> improved
> >>> while maintaining
> >>> flexibility to run tasks without a DagRun. This would help with a
> couple
> >>> of things:
> >>>
> >>> 1. start_date can be used as a ‘execution_date’ or a point in time
> when to
> >>> start looking
> >>> 2. a new interval for a dag will maintain depends_on_past
> >>> 3. paused dags do not give trouble
> >>> 4. tasks will be executed in order
> >>> 5. the ignore_first_depend_on_past could be removed as a task will now
> >>> know if it is really the first
> >>>
> >>> In PR-1431 a lot of this work has been done by:
> >>>
> >>> 1. Adding a “previous” field to a DagRun allowing it to connect to its
> >>> predecessor
> >>> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about
> the
> >>> DagRun if needed
> >>> 3. Using start_date + interval as the first run date unless start_date
> is
> >>> on the interval then start_date is the first run date
> >>>
> >>> The downside of the PR would be that TaskInstances without a DagRun
> will
> >>> not honor
> >>> depend_on_past, so the backfill would need to be adjusted to create a
> >>> DagRun
> >>> and properly connect it to a previous DagRun if that exists. Not too
> much
> >>> of a big
> >>> deal, I think - will work on that a bit as well in the PR.
> >>>
> >>> So this should give at least an idea where things could head to. Any
> >>> thoughts?
> >>>
> >>> Cheers
> >>> Bolke
> >>>
> >>>> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org>
> het
> >>> volgende geschreven:
> >>>>
> >>>> On further thought, I understand the issue you're describing where
> this
> >>>> could lead to out-of-order runs. In fact, Bolke alerted me to the
> >>>> possibility earlier but I didn't make the connection! That feels like
> a
> >>>> separate issue -- to guarantee that tasks are executed in order (and
> more
> >>>> importantly that their database entries are created). I think the
> >>>> depends_on_past issue is related but separate -- though clearly needs
> to
> >>> be
> >>>> fleshed out for all cases :)
> >>>>
> >>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org>
> >>> wrote:
> >>>>
> >>>>> I'm afraid I disagree -- though we may be talking about two different
> >>>>> issues. This issue deals specifically with how to identify the
> "past" TI
> >>>>> when evaluating "depends_on_past", and shouldn't be impacted by
> shifting
> >>>>> start_date, transparently or not.
> >>>>>
> >>>>> Here are three valid examples of depends_on_past DAGs that would
> fail to
> >>>>> run with the current setup:
> >>>>>
> >>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc
> >>>>> backfill. Without a schedule_interval, depends_on_past will always
> fail
> >>>>> (since it looks back one schedule_interval).
> >>>>>
> >>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule.
> Let's
> >>>>> say a scheduled run succeeds and then an off-schedule run fails. When
> >>> the
> >>>>> next scheduled run starts, it shouldn't proceed because the most
> recent
> >>>>> task failed -- but it will look back one schedule_interval, jumping
> OVER
> >>>>> the most recent run, and decide it's ok to proceed.
> >>>>>
> >>>>> 3. A DAG with a schedule that is paused for a while. This DAG could
> be
> >>>>> running perfectly fine, but if it is paused for a while and then
> >>> resumed,
> >>>>> its depends_on_past tasks will look back one schedule_interval and
> see
> >>>>> nothing, and therefore refuse to run.
> >>>>>
> >>>>> So my proposal is simply that the depends_on_past logic looks back at
> >>> the
> >>>>> most recent task as opposed to rigidly assuming there is a task one
> >>>>> schedule_interval ago. For a regularly scheduled DAG, this will
> result
> >>> in
> >>>>> absolutely no behavior change. However it will robustly support a
> much
> >>>>> wider variety of cases like the ones I listed above.
> >>>>>
> >>>>> J
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> >>>>> maximebeauchemin@gmail.com> wrote:
> >>>>>
> >>>>>>>>> "The clear fix seems to be to have depends_on_past check the
> last TI
> >>>>>> that
> >>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
> >>> line
> >>>>>> with the intent of the flag. I will submit a fix."
> >>>>>>
> >>>>>> I don't think so. This would lead to skipping runs, which
> >>> depends_on_past
> >>>>>> is used as a guarantee to run every TI, sequentially.
> >>>>>>
> >>>>>> Absolute scheduling (cron expressions) is much better than relative
> >>>>>> scheduling (origin + interval). Though it's easy to make relative
> >>>>>> scheduling behave in an absolute way. You just have to use a rounded
> >>>>>> start_date to your schedule_interval, and not move things around.
> >>> Dynamic
> >>>>>> start_dates have always been a problem and should probably not be
> >>>>>> supported, though there's no way for us to tell.
> >>>>>>
> >>>>>> Changing the schedule interval or the "origin time" is a bit tricky
> and
> >>>>>> should be documented.
> >>>>>>
> >>>>>> If people use depend_on_past=True and change the origin or the
> >>> interval,
> >>>>>> they basically redefine what "past" actually means and will require
> to
> >>>>>> "mark success" or defining a new "start_date" as a way to say
> "please
> >>>>>> disregard depend_on_past for this date"
> >>>>>>
> >>>>>> For those who haven't fully digested "What's the deal with
> >>> start_dates",
> >>>>>> please take the time to read it:
> >>>>>> http://pythonhosted.org/airflow/faq.html
> >>>>>>
> >>>>>> Also see this part of the docs:
> >>>>>>
> >>>>>> ​
> >>>>>>
> >>>>>> Max
> >>>>>>
> >>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the
> >>>>>>> implications of https://github.com/airbnb/airflow/issues/1427
> >>>>>>>
> >>>>>>> There are two large points that need to be addressed:
> >>>>>>>
> >>>>>>> 1. this particular issue arises because of an alignment issue
> between
> >>>>>>> start_date and schedule_interval. This can only happen with
> cron-based
> >>>>>>> schedule_intervals that describe absolute points in time (like
> “1am”)
> >>> as
> >>>>>>> opposed to time deltas (like “every hour”). Ironically, I once
> >>> reported
> >>>>>>> this same issue myself (#959). In the past (and in the docs) we
> have
> >>>>>>> simply
> >>>>>>> said that users must make sure the two params agree. We discussed
> the
> >>>>>>> possibility of a DAG validation method to raise an error if the
> >>>>>>> start_date
> >>>>>>> and schedule_interval don’t align, but Bolke made the point (and I
> >>>>>>> agreed)
> >>>>>>> that in these cases, start_date is sort of like telling the
> scheduler
> >>> to
> >>>>>>> “start paying attention” as opposed to “this is my first execution
> >>> date”.
> >>>>>>> In #1427, the scheduler was being asked to start paying attention
> on
> >>>>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
> >>> However,
> >>>>>>> it
> >>>>>>> was scheduling a first run at midnight and a second run at 1:10.
> >>>>>>>
> >>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is
> >>> going to
> >>>>>>> change the scheduling logic so that the cron-based interval takes
> >>>>>>> precedence over a start date. Specifically, the first date on or
> after
> >>>>>>> the
> >>>>>>> start_date that complies with the schedule_interval becomes the
> first
> >>>>>>> execution date.
> >>>>>>>
> >>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
> >>>>>>> successful
> >>>>>>> TI at `execution_date - schedule_interval`. This is fragile, since
> it
> >>> is
> >>>>>>> very possible for the previous TI to have run at any time in the
> past,
> >>>>>>> not
> >>>>>>> just one schedule_interval ago. This can happen easily with ad-hoc
> DAG
> >>>>>>> runs, and also if a DAG was paused for a while. Less commonly, it
> >>> happens
> >>>>>>> with the situation described in point #1, where the first scheduled
> >>> run
> >>>>>>> is
> >>>>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
> >>>>>>>
> >>>>>>> The clear fix seems to be to have depends_on_past check the last TI
> >>> that
> >>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's
> in
> >>> line
> >>>>>>> with the intent of the flag. I will submit a fix.
> >>>>>>>
> >>>>>>> -J
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>
> >>>
> >
>
>

Re: GitHub Issue #1427 takeaways

Posted by Bolke de Bruin <bd...@gmail.com>.
I just realized that an extra “benefit” (depending on how you look at it) of this implementation is that many of the deadlock issues are gone.

* If there is no previous run, it will satisfy depend_on_past, so test_backfill_depends_on_past, which checks for a deadlock (instead of maybe really checking for the dependency), will fail as there is no deadlock anymore.
* test_cli_backfill_depends_on_past and test_dagrun_deadlock also fail, as there is no deadlock anymore

Why does this not happen?

If a run detects it is the first run, it will start. This removes the requirement for ignore_first_depends_on_past (was this present for a real use case or a workaround for the scheduler?).

- Bolke

> Op 28 apr. 2016, om 09:08 heeft Bolke de Bruin <bd...@gmail.com> het volgende geschreven:
> 
> No I don’t think that covers it (although I get slightly confused by your comments "depends_on_past_schedule
> “ and “depends_on_past_run”, but it seems too complex from the user’s perspective at first glance). The DagRun.previous handling is used to connect previous related DagRuns in order to make depends_on_past work with a moving interval or an automatically aligned start_date.
> 
> Imagine the following
> 
> Use Case 1
> Dag1 has an assigned start_date "2016-04-24 00:00:00" with a cron schedule of “10 1 * * *”, ie run every day at 1.10am. Depends_on_past is true.
> 
> The current scheduler kicks off a run “2016-04-24 00:00:00” and then stops as the previous interval of “2016-04-24 01:10:00” is “2016-04-23 01:10:00”. This cannot be found as it has not taken place and thus the whole thing grinds to a halt and the TaskInstances refuse to run.
> 
> What the user expects here is that the first run is “2016-04-24 01:10:00”, ie start_date + interval, unless the start_date is on the interval, ie. start_date is first interval. This is what I address by start_date normalization in the PR. However, the second issue then kicks in as the “previous” run can still not be found.
> 
> Use Case 2
> Dag2 has an assigned start_date "2016-04-24 01:10:00"  with a cron schedule of “10 1 * * *”. Depends_on_past is true.
> 
> The scheduler happily goes through a few runs, but then the dag is updated and the schedule adjusted. Because the previous interval cannot be found by the TaskInstance (emphasis), tasks get stuck again requiring an manual override.
> 
> What the user expects here is that the scheduler is smart enough to figure out that we are still running the same dag and that it needs to look up the previous run for that dag and make sure dependencies are met with that previous dagrun in mind.
> 
> I don’t think those two use cases are edge cases, considering the amount of questions we get on these subjects.
> 
> To resolve the remaining issues (aside from start_date normalization) I first made DagRun aware of its predecessors. Then I strengthened the relationship between TaskInstances and DagRuns slightly, by optionally including a dagrun_id in the TaskInstance. Now a TaskInstance can lookup its predecessors in the previous DagRun and know for sure that it is either the first run or it has a predecessor somewhere in time instead of guessing. 
> 
> What I am unsure of is what to consider is the unit of work: 1) is a TaskInstance dependent on its past predecessor ignoring the outcome of the DagRun? (previous TaskInstance can be successful, but previous DagRun as a whole can fail) or 2) is it dependent on the outcome of the DagRun? 3) Can it be both? In case of 1 and 3 my logic needs to be updated slightly, but that should not be too much of a big deal. However I have difficulty imagining why you want to do that.
> 
> - Bolke
> 
> 
>> Op 27 apr. 2016, om 18:17 heeft Maxime Beauchemin <ma...@gmail.com> het volgende geschreven:
>> 
>> Added a few comments on the PR:
>> 
>> ------------------------------------------
>> 
>> So if I understand right the previous DagRun.id handling is to account for
>> DAGs that have a mixed workload of scheduled and triggered DagRuns right?
>> 
>> Do we have real world use cases for those? I always though a DAG would
>> either be scheduled, or externally triggered, not a mix of both, but we
>> never made that a real restriction, so I'm sure some people mix both.
>> 
>> If that is the case that we want to support both, the notion of
>> depends_on_past become mushy, where maybe people may expect or desire
>> either a depends_on_past_schedule or depends_on_past_run.
>> 
>> Since it's pretty edge-case (combo of both mixed scheduled AND
>> depend_on_past=True) and unclear what the desired behavior might be, I'd
>> vote to do something simple in terms of logic and document it clearly. It
>> could be something like "if the DAG is scheduled, then we use the latest
>> DagRun that is >= previous schedule, if not scheduled then we take the
>> previous DagRun, whatever it is at that time.", this would not require
>> handling the previous_dag_run_id
>> 
>> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <bd...@gmail.com> wrote:
>> 
>>> I have also given it some thought and some coding. I hope we agree that
>>> from a user perspective
>>> there currently is an issue. The need to align the start_date with the
>>> interval is counter intuitive
>>> and leads to a lot of questions and issue creation, although it is in the
>>> documentation. If we are
>>> able to fix this with none or little consequences for current setups that
>>> should be preferred, I think.
>>> The dependency explainer is really great work, but it doesn’t address the
>>> core issue.
>>> 
>>> I do recognize the issue mentioned by Max and I agree with Jeremiah that
>>> it is a separate issue,
>>> however if we are to tackle the above it needs to be done in one go.
>>> 
>>> If you consider a DAG a description of cohesion between work items (in OOP
>>> java terms
>>> a class), then a DagRun is the instantiation of a DAG in time (in OOP java
>>> terms an instance).
>>> Tasks are then the description of a work item and a TaskInstance the
>>> instantiation of the Task in time.
>>> 
>>> In my opinion issues pop up due to the current paradigm of considering the
>>> TaskInstance
>>> the smallest unit of work and asking it to maintain its own state in
>>> relation to other TaskInstances
>>> in a DagRun and in a previous DagRun of which it has no (real) perception.
>>> Tasks are instantiated
>>> by a cartesian product with the dates of DagRun instead of the DagRuns
>>> itself.
>>> 
>>> The very loose coupling between DagRuns and TaskInstances can be improved
>>> while maintaining
>>> flexibility to run tasks without a DagRun. This would help with a couple
>>> of things:
>>> 
>>> 1. start_date can be used as a ‘execution_date’ or a point in time when to
>>> start looking
>>> 2. a new interval for a dag will maintain depends_on_past
>>> 3. paused dags do not give trouble
>>> 4. tasks will be executed in order
>>> 5. the ignore_first_depend_on_past could be removed as a task will now
>>> know if it is really the first
>>> 
>>> In PR-1431 a lot of this work has been done by:
>>> 
>>> 1. Adding a “previous” field to a DagRun allowing it to connect to its
>>> predecessor
>>> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the
>>> DagRun if needed
>>> 3. Using start_date + interval as the first run date unless start_date is
>>> on the interval then start_date is the first run date
>>> 
>>> The downside of the PR would be that TaskInstances without a DagRun will
>>> not honor
>>> depend_on_past, so the backfill would need to be adjusted to create a
>>> DagRun
>>> and properly connect it to a previous DagRun if that exists. Not too much
>>> of a big
>>> deal, I think - will work on that a bit as well in the PR.
>>> 
>>> So this should give at least an idea where things could head to. Any
>>> thoughts?
>>> 
>>> Cheers
>>> Bolke
>>> 
>>>> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org> het
>>> volgende geschreven:
>>>> 
>>>> On further thought, I understand the issue you're describing where this
>>>> could lead to out-of-order runs. In fact, Bolke alerted me to the
>>>> possibility earlier but I didn't make the connection! That feels like a
>>>> separate issue -- to guarantee that tasks are executed in order (and more
>>>> importantly that their database entries are created). I think the
>>>> depends_on_past issue is related but separate -- though clearly needs to
>>> be
>>>> fleshed out for all cases :)
>>>> 
>>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org>
>>> wrote:
>>>> 
>>>>> I'm afraid I disagree -- though we may be talking about two different
>>>>> issues. This issue deals specifically with how to identify the "past" TI
>>>>> when evaluating "depends_on_past", and shouldn't be impacted by shifting
>>>>> start_date, transparently or not.
>>>>> 
>>>>> Here are three valid examples of depends_on_past DAGs that would fail to
>>>>> run with the current setup:
>>>>> 
>>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc
>>>>> backfill. Without a schedule_interval, depends_on_past will always fail
>>>>> (since it looks back one schedule_interval).
>>>>> 
>>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
>>>>> say a scheduled run succeeds and then an off-schedule run fails. When
>>> the
>>>>> next scheduled run starts, it shouldn't proceed because the most recent
>>>>> task failed -- but it will look back one schedule_interval, jumping OVER
>>>>> the most recent run, and decide it's ok to proceed.
>>>>> 
>>>>> 3. A DAG with a schedule that is paused for a while. This DAG could be
>>>>> running perfectly fine, but if it is paused for a while and then
>>> resumed,
>>>>> its depends_on_past tasks will look back one schedule_interval and see
>>>>> nothing, and therefore refuse to run.
>>>>> 
>>>>> So my proposal is simply that the depends_on_past logic looks back at
>>> the
>>>>> most recent task as opposed to rigidly assuming there is a task one
>>>>> schedule_interval ago. For a regularly scheduled DAG, this will result
>>> in
>>>>> absolutely no behavior change. However it will robustly support a much
>>>>> wider variety of cases like the ones I listed above.
>>>>> 
>>>>> J
>>>>> 
>>>>> 
>>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
>>>>> maximebeauchemin@gmail.com> wrote:
>>>>> 
>>>>>>>>> "The clear fix seems to be to have depends_on_past check the last TI
>>>>>> that
>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
>>> line
>>>>>> with the intent of the flag. I will submit a fix."
>>>>>> 
>>>>>> I don't think so. This would lead to skipping runs, which
>>> depends_on_past
>>>>>> is used as a guarantee to run every TI, sequentially.
>>>>>> 
>>>>>> Absolute scheduling (cron expressions) is much better than relative
>>>>>> scheduling (origin + interval). Though it's easy to make relative
>>>>>> scheduling behave in an absolute way. You just have to use a rounded
>>>>>> start_date to your schedule_interval, and not move things around.
>>> Dynamic
>>>>>> start_dates have always been a problem and should probably not be
>>>>>> supported, though there's no way for us to tell.
>>>>>> 
>>>>>> Changing the schedule interval or the "origin time" is a bit tricky and
>>>>>> should be documented.
>>>>>> 
>>>>>> If people use depend_on_past=True and change the origin or the
>>> interval,
>>>>>> they basically redefine what "past" actually means and will require to
>>>>>> "mark success" or defining a new "start_date" as a way to say "please
>>>>>> disregard depend_on_past for this date"
>>>>>> 
>>>>>> For those who haven't fully digested "What's the deal with
>>> start_dates",
>>>>>> please take the time to read it:
>>>>>> http://pythonhosted.org/airflow/faq.html
>>>>>> 
>>>>>> Also see this part of the docs:
>>>>>> 
>>>>>> ​
>>>>>> 
>>>>>> Max
>>>>>> 
>>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>>>>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>>>>> 
>>>>>>> There are two large points that need to be addressed:
>>>>>>> 
>>>>>>> 1. this particular issue arises because of an alignment issue between
>>>>>>> start_date and schedule_interval. This can only happen with cron-based
>>>>>>> schedule_intervals that describe absolute points in time (like “1am”)
>>> as
>>>>>>> opposed to time deltas (like “every hour”). Ironically, I once
>>> reported
>>>>>>> this same issue myself (#959). In the past (and in the docs) we have
>>>>>>> simply
>>>>>>> said that users must make sure the two params agree. We discussed the
>>>>>>> possibility of a DAG validation method to raise an error if the
>>>>>>> start_date
>>>>>>> and schedule_interval don’t align, but Bolke made the point (and I
>>>>>>> agreed)
>>>>>>> that in these cases, start_date is sort of like telling the scheduler
>>> to
>>>>>>> “start paying attention” as opposed to “this is my first execution
>>> date”.
>>>>>>> In #1427, the scheduler was being asked to start paying attention on
>>>>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
>>> However,
>>>>>>> it
>>>>>>> was scheduling a first run at midnight and a second run at 1:10.
>>>>>>> 
>>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is
>>> going to
>>>>>>> change the scheduling logic so that the cron-based interval takes
>>>>>>> precedence over a start date. Specifically, the first date on or after
>>>>>>> the
>>>>>>> start_date that complies with the schedule_interval becomes the first
>>>>>>> execution date.
>>>>>>> 
>>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>>>>>> successful
>>>>>>> TI at `execution_date - schedule_interval`. This is fragile, since it
>>> is
>>>>>>> very possible for the previous TI to have run at any time in the past,
>>>>>>> not
>>>>>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>>>>>>> runs, and also if a DAG was paused for a while. Less commonly, it
>>> happens
>>>>>>> with the situation described in point #1, where the first scheduled
>>> run
>>>>>>> is
>>>>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>>>>> 
>>>>>>> The clear fix seems to be to have depends_on_past check the last TI
>>> that
>>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
>>> line
>>>>>>> with the intent of the flag. I will submit a fix.
>>>>>>> 
>>>>>>> -J
>>>>>>> 
>>>>>> 
>>>>>> 
>>> 
>>> 
> 


Re: GitHub Issue #1427 takeaways

Posted by Bolke de Bruin <bd...@gmail.com>.
No, I don’t think that covers it (although I get slightly confused by your comments “depends_on_past_schedule” and “depends_on_past_run”, but it seems too complex from the user’s perspective at first glance). The DagRun.previous handling is used to connect related DagRuns to their predecessors in order to make depends_on_past work with a moving interval or an automatically aligned start_date.

Imagine the following

Use Case 1
Dag1 has an assigned start_date "2016-04-24 00:00:00" with a cron schedule of “10 1 * * *”, i.e. run every day at 1:10am. Depends_on_past is true.

The current scheduler kicks off a run “2016-04-24 00:00:00” and then stops, as the previous interval of “2016-04-24 01:10:00” is “2016-04-23 01:10:00”. This cannot be found, as it has never taken place, and thus the whole thing grinds to a halt and the TaskInstances refuse to run.

What the user expects here is that the first run is “2016-04-24 01:10:00”, i.e. start_date + interval, unless the start_date is on the interval, in which case start_date is the first interval. This is what I address with start_date normalization in the PR. However, the second issue then kicks in, as the “previous” run still cannot be found.
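
A minimal sketch of that normalization, using croniter directly rather than the PR's actual code (it keeps start_date if it already lies on the schedule, and otherwise uses the first on-schedule date after it):

    from datetime import datetime, timedelta
    from croniter import croniter

    def first_scheduled_execution_date(start_date, cron_expr):
        # Probe just before start_date so get_next() returns start_date
        # itself when it is exactly on the schedule.
        probe = start_date - timedelta(microseconds=1)
        return croniter(cron_expr, probe).get_next(datetime)

    # Use Case 1: a midnight start_date with "10 1 * * *" normalizes to
    # 2016-04-24 01:10:00.
    first_scheduled_execution_date(datetime(2016, 4, 24), "10 1 * * *")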

Use Case 2
Dag2 has an assigned start_date "2016-04-24 01:10:00"  with a cron schedule of “10 1 * * *”. Depends_on_past is true.

The scheduler happily goes through a few runs, but then the dag is updated and the schedule adjusted. Because the previous interval cannot be found by the TaskInstance (emphasis), tasks get stuck again, requiring a manual override.

What the user expects here is that the scheduler is smart enough to figure out that we are still running the same dag and that it needs to look up the previous run for that dag and make sure dependencies are met with that previous dagrun in mind.

I don’t think those two use cases are edge cases, considering the amount of questions we get on these subjects.

To resolve the remaining issues (aside from start_date normalization) I first made DagRun aware of its predecessors. Then I strengthened the relationship between TaskInstances and DagRuns slightly, by optionally including a dagrun_id in the TaskInstance. Now a TaskInstance can look up its predecessors in the previous DagRun and know for sure that it is either the first run or has a predecessor somewhere in time, instead of guessing.
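
Sketched with a plain class (the names only illustrate the idea, they are not the PR's actual schema), the lookup becomes:

    class DagRunSketch(object):
        # Illustrative only -- not the PR's actual model. `previous` is
        # the new link to the predecessor DagRun; `task_states` maps
        # task_id to the state of that task's instance in this run.
        def __init__(self, execution_date, previous=None, task_states=None):
            self.execution_date = execution_date
            self.previous = previous
            self.task_states = task_states or {}

    def depends_on_past_met(task_id, dag_run):
        # No predecessor DagRun means this is the first run: the
        # dependency is satisfied by definition, no guessing against the
        # schedule interval.
        if dag_run.previous is None:
            return True
        return dag_run.previous.task_states.get(task_id) == "success"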

What I am unsure of is what to consider the unit of work: 1) is a TaskInstance dependent on its past predecessor, ignoring the outcome of the DagRun (the previous TaskInstance can be successful while the previous DagRun as a whole fails)? 2) Is it dependent on the outcome of the DagRun? 3) Can it be both? In cases 1 and 3 my logic needs to be updated slightly, but that should not be too big a deal. However, I have difficulty imagining why you would want to do that.

- Bolke


> Op 27 apr. 2016, om 18:17 heeft Maxime Beauchemin <ma...@gmail.com> het volgende geschreven:
> 
> Added a few comments on the PR:
> 
> ------------------------------------------
> 
> So if I understand right the previous DagRun.id handling is to account for
> DAGs that have a mixed workload of scheduled and triggered DagRuns right?
> 
> Do we have real world use cases for those? I always though a DAG would
> either be scheduled, or externally triggered, not a mix of both, but we
> never made that a real restriction, so I'm sure some people mix both.
> 
> If that is the case that we want to support both, the notion of
> depends_on_past become mushy, where maybe people may expect or desire
> either a depends_on_past_schedule or depends_on_past_run.
> 
> Since it's pretty edge-case (combo of both mixed scheduled AND
> depend_on_past=True) and unclear what the desired behavior might be, I'd
> vote to do something simple in terms of logic and document it clearly. It
> could be something like "if the DAG is scheduled, then we use the latest
> DagRun that is >= previous schedule, if not scheduled then we take the
> previous DagRun, whatever it is at that time.", this would not require
> handling the previous_dag_run_id
> 
> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <bd...@gmail.com> wrote:
> 
>> I have also given it some thought and some coding. I hope we agree that
>> from a user perspective
>> there currently is an issue. The need to align the start_date with the
>> interval is counter intuitive
>> and leads to a lot of questions and issue creation, although it is in the
>> documentation. If we are
>> able to fix this with none or little consequences for current setups that
>> should be preferred, I think.
>> The dependency explainer is really great work, but it doesn’t address the
>> core issue.
>> 
>> I do recognize the issue mentioned by Max and I agree with Jeremiah that
>> it is a separate issue,
>> however if we are to tackle the above it needs to be done in one go.
>> 
>> If you consider a DAG a description of cohesion between work items (in OOP
>> java terms
>> a class), then a DagRun is the instantiation of a DAG in time (in OOP java
>> terms an instance).
>> Tasks are then the description of a work item and a TaskInstance the
>> instantiation of the Task in time.
>> 
>> In my opinion issues pop up due to the current paradigm of considering the
>> TaskInstance
>> the smallest unit of work and asking it to maintain its own state in
>> relation to other TaskInstances
>> in a DagRun and in a previous DagRun of which it has no (real) perception.
>> Tasks are instantiated
>> by a cartesian product with the dates of DagRun instead of the DagRuns
>> itself.
>> 
>> The very loose coupling between DagRuns and TaskInstances can be improved
>> while maintaining
>> flexibility to run tasks without a DagRun. This would help with a couple
>> of things:
>> 
>> 1. start_date can be used as a ‘execution_date’ or a point in time when to
>> start looking
>> 2. a new interval for a dag will maintain depends_on_past
>> 3. paused dags do not give trouble
>> 4. tasks will be executed in order
>> 5. the ignore_first_depend_on_past could be removed as a task will now
>> know if it is really the first
>> 
>> In PR-1431 a lot of this work has been done by:
>> 
>> 1. Adding a “previous” field to a DagRun allowing it to connect to its
>> predecessor
>> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the
>> DagRun if needed
>> 3. Using start_date + interval as the first run date unless start_date is
>> on the interval then start_date is the first run date
>> 
>> The downside of the PR would be that TaskInstances without a DagRun will
>> not honor
>> depend_on_past, so the backfill would need to be adjusted to create a
>> DagRun
>> and properly connect it to a previous DagRun if that exists. Not too much
>> of a big
>> deal, I think - will work on that a bit as well in the PR.
>> 
>> So this should give at least an idea where things could head to. Any
>> thoughts?
>> 
>> Cheers
>> Bolke
>> 
>>> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org> het
>> volgende geschreven:
>>> 
>>> On further thought, I understand the issue you're describing where this
>>> could lead to out-of-order runs. In fact, Bolke alerted me to the
>>> possibility earlier but I didn't make the connection! That feels like a
>>> separate issue -- to guarantee that tasks are executed in order (and more
>>> importantly that their database entries are created). I think the
>>> depends_on_past issue is related but separate -- though clearly needs to
>> be
>>> fleshed out for all cases :)
>>> 
>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org>
>> wrote:
>>> 
>>>> I'm afraid I disagree -- though we may be talking about two different
>>>> issues. This issue deals specifically with how to identify the "past" TI
>>>> when evaluating "depends_on_past", and shouldn't be impacted by shifting
>>>> start_date, transparently or not.
>>>> 
>>>> Here are three valid examples of depends_on_past DAGs that would fail to
>>>> run with the current setup:
>>>> 
>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc
>>>> backfill. Without a schedule_interval, depends_on_past will always fail
>>>> (since it looks back one schedule_interval).
>>>> 
>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
>>>> say a scheduled run succeeds and then an off-schedule run fails. When
>> the
>>>> next scheduled run starts, it shouldn't proceed because the most recent
>>>> task failed -- but it will look back one schedule_interval, jumping OVER
>>>> the most recent run, and decide it's ok to proceed.
>>>> 
>>>> 3. A DAG with a schedule that is paused for a while. This DAG could be
>>>> running perfectly fine, but if it is paused for a while and then
>> resumed,
>>>> its depends_on_past tasks will look back one schedule_interval and see
>>>> nothing, and therefore refuse to run.
>>>> 
>>>> So my proposal is simply that the depends_on_past logic looks back at
>> the
>>>> most recent task as opposed to rigidly assuming there is a task one
>>>> schedule_interval ago. For a regularly scheduled DAG, this will result
>> in
>>>> absolutely no behavior change. However it will robustly support a much
>>>> wider variety of cases like the ones I listed above.
>>>> 
>>>> J
>>>> 
>>>> 
>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
>>>> maximebeauchemin@gmail.com> wrote:
>>>> 
>>>>>>>> "The clear fix seems to be to have depends_on_past check the last TI
>>>>> that
>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
>> line
>>>>> with the intent of the flag. I will submit a fix."
>>>>> 
>>>>> I don't think so. This would lead to skipping runs, which
>> depends_on_past
>>>>> is used as a guarantee to run every TI, sequentially.
>>>>> 
>>>>> Absolute scheduling (cron expressions) is much better than relative
>>>>> scheduling (origin + interval). Though it's easy to make relative
>>>>> scheduling behave in an absolute way. You just have to use a rounded
>>>>> start_date to your schedule_interval, and not move things around.
>> Dynamic
>>>>> start_dates have always been a problem and should probably not be
>>>>> supported, though there's no way for us to tell.
>>>>> 
>>>>> Changing the schedule interval or the "origin time" is a bit tricky and
>>>>> should be documented.
>>>>> 
>>>>> If people use depend_on_past=True and change the origin or the
>> interval,
>>>>> they basically redefine what "past" actually means and will require to
>>>>> "mark success" or defining a new "start_date" as a way to say "please
>>>>> disregard depend_on_past for this date"
>>>>> 
>>>>> For those who haven't fully digested "What's the deal with
>> start_dates",
>>>>> please take the time to read it:
>>>>> http://pythonhosted.org/airflow/faq.html
>>>>> 
>>>>> Also see this part of the docs:
>>>>> 
>>>>> ​
>>>>> 
>>>>> Max
>>>>> 
>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
>> wrote:
>>>>> 
>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>>>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>>>> 
>>>>>> There are two large points that need to be addressed:
>>>>>> 
>>>>>> 1. this particular issue arises because of an alignment issue between
>>>>>> start_date and schedule_interval. This can only happen with cron-based
>>>>>> schedule_intervals that describe absolute points in time (like “1am”)
>> as
>>>>>> opposed to time deltas (like “every hour”). Ironically, I once
>> reported
>>>>>> this same issue myself (#959). In the past (and in the docs) we have
>>>>>> simply
>>>>>> said that users must make sure the two params agree. We discussed the
>>>>>> possibility of a DAG validation method to raise an error if the
>>>>>> start_date
>>>>>> and schedule_interval don’t align, but Bolke made the point (and I
>>>>>> agreed)
>>>>>> that in these cases, start_date is sort of like telling the scheduler
>> to
>>>>>> “start paying attention” as opposed to “this is my first execution
>> date”.
>>>>>> In #1427, the scheduler was being asked to start paying attention on
>>>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
>> However,
>>>>>> it
>>>>>> was scheduling a first run at midnight and a second run at 1:10.
>>>>>> 
>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is
>> going to
>>>>>> change the scheduling logic so that the cron-based interval takes
>>>>>> precedence over a start date. Specifically, the first date on or after
>>>>>> the
>>>>>> start_date that complies with the schedule_interval becomes the first
>>>>>> execution date.
>>>>>> 
>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>>>>> successful
>>>>>> TI at `execution_date - schedule_interval`. This is fragile, since it
>> is
>>>>>> very possible for the previous TI to have run at any time in the past,
>>>>>> not
>>>>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>>>>>> runs, and also if a DAG was paused for a while. Less commonly, it
>> happens
>>>>>> with the situation described in point #1, where the first scheduled
>> run
>>>>>> is
>>>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>>>> 
>>>>>> The clear fix seems to be to have depends_on_past check the last TI
>> that
>>>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
>> line
>>>>>> with the intent of the flag. I will submit a fix.
>>>>>> 
>>>>>> -J
>>>>>> 
>>>>> 
>>>>> 
>> 
>> 


Re: GitHub Issue #1427 takeaways

Posted by Maxime Beauchemin <ma...@gmail.com>.
Added a few comments on the PR:

------------------------------------------

So if I understand right the previous DagRun.id handling is to account for
DAGs that have a mixed workload of scheduled and triggered DagRuns right?

Do we have real world use cases for those? I always thought a DAG would
either be scheduled or externally triggered, not a mix of both, but we
never made that a real restriction, so I'm sure some people mix both.

If it is the case that we want to support both, the notion of
depends_on_past becomes mushy, where people may expect or desire
either a depends_on_past_schedule or a depends_on_past_run.

Since it's a pretty edge case (the combo of mixed scheduled/triggered runs
AND depend_on_past=True) and it is unclear what the desired behavior might
be, I'd vote to do something simple in terms of logic and document it
clearly. It could be something like: "if the DAG is scheduled, then we use
the latest DagRun that is >= the previous schedule; if not scheduled, then
we take the previous DagRun, whatever it is at that time." This would not
require handling the previous_dag_run_id.
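
Roughly, over an in-memory list of (execution_date, state) runs -- just to
illustrate the rule, not actual model code, and previous_schedule here
stands for whatever we compute as the prior on-schedule point:

    def pick_previous_run(runs, execution_date, previous_schedule=None):
        # Scheduled DAG: the latest run at or after the previous schedule
        # point. Unscheduled DAG (previous_schedule is None): simply the
        # latest run before this one, whatever it is.
        earlier = sorted(r for r in runs if r[0] < execution_date)
        if previous_schedule is not None:
            earlier = [r for r in earlier if r[0] >= previous_schedule]
        return earlier[-1] if earlier else None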

On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <bd...@gmail.com> wrote:

> I have also given it some thought and some coding. I hope we agree that
> from a user perspective
> there currently is an issue. The need to align the start_date with the
> interval is counter intuitive
> and leads to a lot of questions and issue creation, although it is in the
> documentation. If we are
> able to fix this with none or little consequences for current setups that
> should be preferred, I think.
> The dependency explainer is really great work, but it doesn’t address the
> core issue.
>
> I do recognize the issue mentioned by Max and I agree with Jeremiah that
> it is a separate issue,
> however if we are to tackle the above it needs to be done in one go.
>
> If you consider a DAG a description of cohesion between work items (in OOP
> java terms
> a class), then a DagRun is the instantiation of a DAG in time (in OOP java
> terms an instance).
> Tasks are then the description of a work item and a TaskInstance the
> instantiation of the Task in time.
>
> In my opinion issues pop up due to the current paradigm of considering the
> TaskInstance
> the smallest unit of work and asking it to maintain its own state in
> relation to other TaskInstances
> in a DagRun and in a previous DagRun of which it has no (real) perception.
> Tasks are instantiated
> by a cartesian product with the dates of DagRun instead of the DagRuns
> itself.
>
> The very loose coupling between DagRuns and TaskInstances can be improved
> while maintaining
> flexibility to run tasks without a DagRun. This would help with a couple
> of things:
>
> 1. start_date can be used as a ‘execution_date’ or a point in time when to
> start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed as a task will now
> know if it is really the first
>
> In PR-1431 a lot of this work has been done by:
>
> 1. Adding a “previous” field to a DagRun allowing it to connect to its
> predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the
> DagRun if needed
> 3. Using start_date + interval as the first run date unless start_date is
> on the interval then start_date is the first run date
>
> The downside of the PR would be that TaskInstances without a DagRun will
> not honor
> depend_on_past, so the backfill would need to be adjusted to create a
> DagRun
> and properly connect it to a previous DagRun if that exists. Not too much
> of a big
> deal, I think - will work on that a bit as well in the PR.
>
> So this should give at least an idea where things could head to. Any
> thoughts?
>
> Cheers
> Bolke
>
> > Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org> het
> volgende geschreven:
> >
> > On further thought, I understand the issue you're describing where this
> > could lead to out-of-order runs. In fact, Bolke alerted me to the
> > possibility earlier but I didn't make the connection! That feels like a
> > separate issue -- to guarantee that tasks are executed in order (and more
> > importantly that their database entries are created). I think the
> > depends_on_past issue is related but separate -- though clearly needs to
> be
> > fleshed out for all cases :)
> >
> > On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org>
> wrote:
> >
> >> I'm afraid I disagree -- though we may be talking about two different
> >> issues. This issue deals specifically with how to identify the "past" TI
> >> when evaluating "depends_on_past", and shouldn't be impacted by shifting
> >> start_date, transparently or not.
> >>
> >> Here are three valid examples of depends_on_past DAGs that would fail to
> >> run with the current setup:
> >>
> >> 1. A DAG with no schedule that is only run manually or via ad-hoc
> >> backfill. Without a schedule_interval, depends_on_past will always fail
> >> (since it looks back one schedule_interval).
> >>
> >> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> >> say a scheduled run succeeds and then an off-schedule run fails. When
> the
> >> next scheduled run starts, it shouldn't proceed because the most recent
> >> task failed -- but it will look back one schedule_interval, jumping OVER
> >> the most recent run, and decide it's ok to proceed.
> >>
> >> 3. A DAG with a schedule that is paused for a while. This DAG could be
> >> running perfectly fine, but if it is paused for a while and then
> resumed,
> >> its depends_on_past tasks will look back one schedule_interval and see
> >> nothing, and therefore refuse to run.
> >>
> >> So my proposal is simply that the depends_on_past logic looks back at
> the
> >> most recent task as opposed to rigidly assuming there is a task one
> >> schedule_interval ago. For a regularly scheduled DAG, this will result
> in
> >> absolutely no behavior change. However it will robustly support a much
> >> wider variety of cases like the ones I listed above.
> >>
> >> J
> >>
> >>
> >> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> >> maximebeauchemin@gmail.com> wrote:
> >>
> >>>>>> "The clear fix seems to be to have depends_on_past check the last TI
> >>> that
> >>> ran, regardless of whether it ran `schedule_interval` ago. That's in
> line
> >>> with the intent of the flag. I will submit a fix."
> >>>
> >>> I don't think so. This would lead to skipping runs, which
> depends_on_past
> >>> is used as a guarantee to run every TI, sequentially.
> >>>
> >>> Absolute scheduling (cron expressions) is much better than relative
> >>> scheduling (origin + interval). Though it's easy to make relative
> >>> scheduling behave in an absolute way. You just have to use a rounded
> >>> start_date to your schedule_interval, and not move things around.
> Dynamic
> >>> start_dates have always been a problem and should probably not be
> >>> supported, though there's no way for us to tell.
> >>>
> >>> Changing the schedule interval or the "origin time" is a bit tricky and
> >>> should be documented.
> >>>
> >>> If people use depend_on_past=True and change the origin or the
> interval,
> >>> they basically redefine what "past" actually means and will require to
> >>> "mark success" or defining a new "start_date" as a way to say "please
> >>> disregard depend_on_past for this date"
> >>>
> >>> For those who haven't fully digested "What's the deal with
> start_dates",
> >>> please take the time to read it:
> >>> http://pythonhosted.org/airflow/faq.html
> >>>
> >>> Also see this part of the docs:
> >>>
> >>> ​
> >>>
> >>> Max
> >>>
> >>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com>
> wrote:
> >>>
> >>>> Bolke, Sid and I had a brief conversation to discuss some of the
> >>>> implications of https://github.com/airbnb/airflow/issues/1427
> >>>>
> >>>> There are two large points that need to be addressed:
> >>>>
> >>>> 1. this particular issue arises because of an alignment issue between
> >>>> start_date and schedule_interval. This can only happen with cron-based
> >>>> schedule_intervals that describe absolute points in time (like “1am”)
> as
> >>>> opposed to time deltas (like “every hour”). Ironically, I once
> reported
> >>>> this same issue myself (#959). In the past (and in the docs) we have
> >>>> simply
> >>>> said that users must make sure the two params agree. We discussed the
> >>>> possibility of a DAG validation method to raise an error if the
> >>>> start_date
> >>>> and schedule_interval don’t align, but Bolke made the point (and I
> >>>> agreed)
> >>>> that in these cases, start_date is sort of like telling the scheduler
> to
> >>>> “start paying attention” as opposed to “this is my first execution
> date”.
> >>>> In #1427, the scheduler was being asked to start paying attention on
> >>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
> However,
> >>>> it
> >>>> was scheduling a first run at midnight and a second run at 1:10.
> >>>>
> >>>> Regardless of whether we choose to validate/warn/error, Bolke is
> going to
> >>>> change the scheduling logic so that the cron-based interval takes
> >>>> precedence over a start date. Specifically, the first date on or after
> >>>> the
> >>>> start_date that complies with the schedule_interval becomes the first
> >>>> execution date.
> >>>>
> >>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
> >>>> successful
> >>>> TI at `execution_date - schedule_interval`. This is fragile, since it
> is
> >>>> very possible for the previous TI to have run at any time in the past,
> >>>> not
> >>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
> >>>> runs, and also if a DAG was paused for a while. Less commonly, it
> happens
> >>>> with the situation described in point #1, where the first scheduled
> run
> >>>> is
> >>>> off-schedule (the midnight run followed by the daily 1:10am runs).
> >>>>
> >>>> The clear fix seems to be to have depends_on_past check the last TI
> that
> >>>> ran, regardless of whether it ran `schedule_interval` ago. That's in
> line
> >>>> with the intent of the flag. I will submit a fix.
> >>>>
> >>>> -J
> >>>>
> >>>
> >>>
>
>

Re: GitHub Issue #1427 takeaways

Posted by Bolke de Bruin <bd...@gmail.com>.
I have also given it some thought and some coding. I hope we agree that, from a user perspective,
there currently is an issue. The need to align the start_date with the interval is counterintuitive
and leads to a lot of questions and issue creation, even though it is in the documentation. If we are
able to fix this with little or no consequence for current setups, that should be preferred, I think.
The dependency explainer is really great work, but it doesn’t address the core issue.

I do recognize the issue mentioned by Max, and I agree with Jeremiah that it is a separate issue;
however, if we are to tackle the above, it needs to be done in one go.

If you consider a DAG a description of the cohesion between work items (in OOP/Java terms,
a class), then a DagRun is the instantiation of a DAG in time (in OOP/Java terms, an instance).
Tasks are then descriptions of individual work items, and a TaskInstance is the instantiation of a Task in time.

In my opinion, issues pop up due to the current paradigm of considering the TaskInstance
the smallest unit of work and asking it to maintain its own state in relation to other TaskInstances,
both in its own DagRun and in a previous DagRun of which it has no (real) perception. TaskInstances are
created by a Cartesian product of tasks with the dates of DagRuns instead of with the DagRuns themselves.

The very loose coupling between DagRuns and TaskInstances can be improved while maintaining
flexibility to run tasks without a DagRun. This would help with a couple of things:

1. start_date can be used either as an ‘execution_date’ or as a point in time from which to start looking
2. a new interval for a DAG will maintain depends_on_past
3. paused DAGs no longer cause trouble
4. tasks will be executed in order
5. the ignore_first_depend_on_past flag could be removed, as a task will now know whether it is really the first

In PR-1431 a lot of this work has been done by:

1. Adding a “previous” field to a DagRun, allowing it to connect to its predecessor
2. Adding a dag_run_id to TaskInstances, so a TaskInstance knows about its DagRun if needed
3. Using start_date + interval as the first run date, unless start_date falls exactly on the interval, in which case start_date itself is the first run date (see the sketch below)
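
To make the intent concrete, here is a rough sketch of that direction. This is not the actual PR-1431 code; the class shapes, the field names, and the croniter usage are illustrative assumptions only:

    from datetime import datetime, timedelta
    from croniter import croniter

    class DagRun(object):
        """Sketch only: a DagRun that knows its predecessor (point 1)."""
        def __init__(self, execution_date, previous=None, state=None):
            self.execution_date = execution_date
            self.previous = previous      # link to the preceding DagRun, if any
            self.state = state

    class TaskInstance(object):
        """Sketch only: a TaskInstance that carries a dag_run_id (point 2)."""
        def __init__(self, task_id, execution_date, dag_run_id=None):
            self.task_id = task_id
            self.execution_date = execution_date
            self.dag_run_id = dag_run_id  # lets the TI find its DagRun directly

    def first_run_date(start_date, cron_expr):
        """Point 3: start_date itself if it lies on the cron schedule,
        otherwise the first schedule point after start_date."""
        # Back up one second so get_next() returns start_date when it matches.
        return croniter(cron_expr, start_date - timedelta(seconds=1)).get_next(datetime)

    # The case from #1427: start "paying attention" at midnight, run daily at 01:10.
    print(first_run_date(datetime(2016, 4, 24, 0, 0), "10 1 * * *"))
    # 2016-04-24 01:10:00 -- no extra midnight run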

The downside of the PR would be that TaskInstances without a DagRun will not honor
depends_on_past, so the backfill would need to be adjusted to create a DagRun
and properly connect it to a previous DagRun if one exists. Not too big a deal,
I think; I will work on that a bit as well in the PR.
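
With the “previous” link in place, the depends_on_past check can walk the DagRun chain instead of computing execution_date - schedule_interval, which also shows why a TaskInstance created without any DagRun cannot honor it. Again a hedged sketch building on the classes above; the ti_state lookup is a made-up stand-in for a database query:

    def depends_on_past_met(ti, dag_run, ti_state):
        """Sketch: is the same task successful in the previous DagRun?"""
        if dag_run is None:
            # A TaskInstance without any DagRun (e.g. an old-style backfill)
            # has no chain to walk -- the downside mentioned above.
            return False
        if dag_run.previous is None:
            return True   # genuinely the first run, nothing to depend on
        prev = dag_run.previous
        return ti_state(prev.execution_date, ti.task_id) == "success"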

So this should give at least an idea of where things could be heading. Any thoughts?

Cheers
Bolke

> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <jl...@apache.org> het volgende geschreven:
> 
> On further thought, I understand the issue you're describing where this
> could lead to out-of-order runs. In fact, Bolke alerted me to the
> possibility earlier but I didn't make the connection! That feels like a
> separate issue -- to guarantee that tasks are executed in order (and more
> importantly that their database entries are created). I think the
> depends_on_past issue is related but separate -- though clearly needs to be
> fleshed out for all cases :)
> 
> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org> wrote:
> 
>> I'm afraid I disagree -- though we may be talking about two different
>> issues. This issue deals specifically with how to identify the "past" TI
>> when evaluating "depends_on_past", and shouldn't be impacted by shifting
>> start_date, transparently or not.
>> 
>> Here are three valid examples of depends_on_past DAGs that would fail to
>> run with the current setup:
>> 
>> 1. A DAG with no schedule that is only run manually or via ad-hoc
>> backfill. Without a schedule_interval, depends_on_past will always fail
>> (since it looks back one schedule_interval).
>> 
>> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
>> say a scheduled run succeeds and then an off-schedule run fails. When the
>> next scheduled run starts, it shouldn't proceed because the most recent
>> task failed -- but it will look back one schedule_interval, jumping OVER
>> the most recent run, and decide it's ok to proceed.
>> 
>> 3. A DAG with a schedule that is paused for a while. This DAG could be
>> running perfectly fine, but if it is paused for a while and then resumed,
>> its depends_on_past tasks will look back one schedule_interval and see
>> nothing, and therefore refuse to run.
>> 
>> So my proposal is simply that the depends_on_past logic looks back at the
>> most recent task as opposed to rigidly assuming there is a task one
>> schedule_interval ago. For a regularly scheduled DAG, this will result in
>> absolutely no behavior change. However it will robustly support a much
>> wider variety of cases like the ones I listed above.
>> 
>> J
>> 
>> 
>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
>> maximebeauchemin@gmail.com> wrote:
>> 
>>>>>> "The clear fix seems to be to have depends_on_past check the last TI
>>> that
>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>> with the intent of the flag. I will submit a fix."
>>> 
>>> I don't think so. This would lead to skipping runs, which depends_on_past
>>> is used as a guarantee to run every TI, sequentially.
>>> 
>>> Absolute scheduling (cron expressions) is much better than relative
>>> scheduling (origin + interval). Though it's easy to make relative
>>> scheduling behave in an absolute way. You just have to use a rounded
>>> start_date to your schedule_interval, and not move things around. Dynamic
>>> start_dates have always been a problem and should probably not be
>>> supported, though there's no way for us to tell.
>>> 
>>> Changing the schedule interval or the "origin time" is a bit tricky and
>>> should be documented.
>>> 
>>> If people use depend_on_past=True and change the origin or the interval,
>>> they basically redefine what "past" actually means and will require to
>>> "mark success" or defining a new "start_date" as a way to say "please
>>> disregard depend_on_past for this date"
>>> 
>>> For those who haven't fully digested "What's the deal with start_dates",
>>> please take the time to read it:
>>> http://pythonhosted.org/airflow/faq.html
>>> 
>>> Also see this part of the docs:
>>> 
>>> ​
>>> 
>>> Max
>>> 
>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com> wrote:
>>> 
>>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>> 
>>>> There are two large points that need to be addressed:
>>>> 
>>>> 1. this particular issue arises because of an alignment issue between
>>>> start_date and schedule_interval. This can only happen with cron-based
>>>> schedule_intervals that describe absolute points in time (like “1am”) as
>>>> opposed to time deltas (like “every hour”). Ironically, I once reported
>>>> this same issue myself (#959). In the past (and in the docs) we have
>>>> simply
>>>> said that users must make sure the two params agree. We discussed the
>>>> possibility of a DAG validation method to raise an error if the
>>>> start_date
>>>> and schedule_interval don’t align, but Bolke made the point (and I
>>>> agreed)
>>>> that in these cases, start_date is sort of like telling the scheduler to
>>>> “start paying attention” as opposed to “this is my first execution date”.
>>>> In #1427, the scheduler was being asked to start paying attention on
>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>>>> it
>>>> was scheduling a first run at midnight and a second run at 1:10.
>>>> 
>>>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>>>> change the scheduling logic so that the cron-based interval takes
>>>> precedence over a start date. Specifically, the first date on or after
>>>> the
>>>> start_date that complies with the schedule_interval becomes the first
>>>> execution date.
>>>> 
>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>>> successful
>>>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>>>> very possible for the previous TI to have run at any time in the past,
>>>> not
>>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>>>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>>>> with the situation described in point #1, where the first scheduled run
>>>> is
>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>> 
>>>> The clear fix seems to be to have depends_on_past check the last TI that
>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>>> with the intent of the flag. I will submit a fix.
>>>> 
>>>> -J
>>>> 
>>> 
>>> 


Re: GitHub Issue #1427 takeaways

Posted by Jeremiah Lowin <jl...@apache.org>.
On further thought, I understand the issue you're describing where this
could lead to out-of-order runs. In fact, Bolke alerted me to the
possibility earlier but I didn't make the connection! That feels like a
separate issue -- to guarantee that tasks are executed in order (and more
importantly that their database entries are created). I think the
depends_on_past issue is related but separate -- though clearly needs to be
fleshed out for all cases :)

On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <jl...@apache.org> wrote:

> I'm afraid I disagree -- though we may be talking about two different
> issues. This issue deals specifically with how to identify the "past" TI
> when evaluating "depends_on_past", and shouldn't be impacted by shifting
> start_date, transparently or not.
>
> Here are three valid examples of depends_on_past DAGs that would fail to
> run with the current setup:
>
> 1. A DAG with no schedule that is only run manually or via ad-hoc
> backfill. Without a schedule_interval, depends_on_past will always fail
> (since it looks back one schedule_interval).
>
> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> say a scheduled run succeeds and then an off-schedule run fails. When the
> next scheduled run starts, it shouldn't proceed because the most recent
> task failed -- but it will look back one schedule_interval, jumping OVER
> the most recent run, and decide it's ok to proceed.
>
> 3. A DAG with a schedule that is paused for a while. This DAG could be
> running perfectly fine, but if it is paused for a while and then resumed,
> its depends_on_past tasks will look back one schedule_interval and see
> nothing, and therefore refuse to run.
>
> So my proposal is simply that the depends_on_past logic looks back at the
> most recent task as opposed to rigidly assuming there is a task one
> schedule_interval ago. For a regularly scheduled DAG, this will result in
> absolutely no behavior change. However it will robustly support a much
> wider variety of cases like the ones I listed above.
>
> J
>
>
> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
>> >>> "The clear fix seems to be to have depends_on_past check the last TI
>> that
>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>> with the intent of the flag. I will submit a fix."
>>
>> I don't think so. This would lead to skipping runs, which depends_on_past
>> is used as a guarantee to run every TI, sequentially.
>>
>> Absolute scheduling (cron expressions) is much better than relative
>> scheduling (origin + interval). Though it's easy to make relative
>> scheduling behave in an absolute way. You just have to use a rounded
>> start_date to your schedule_interval, and not move things around. Dynamic
>> start_dates have always been a problem and should probably not be
>> supported, though there's no way for us to tell.
>>
>> Changing the schedule interval or the "origin time" is a bit tricky and
>> should be documented.
>>
>> If people use depend_on_past=True and change the origin or the interval,
>> they basically redefine what "past" actually means and will require to
>> "mark success" or defining a new "start_date" as a way to say "please
>> disregard depend_on_past for this date"
>>
>> For those who haven't fully digested "What's the deal with start_dates",
>> please take the time to read it:
>> http://pythonhosted.org/airflow/faq.html
>>
>> Also see this part of the docs:
>>
>>
>> Max
>>
>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com> wrote:
>>
>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>
>>> There are two large points that need to be addressed:
>>>
>>> 1. this particular issue arises because of an alignment issue between
>>> start_date and schedule_interval. This can only happen with cron-based
>>> schedule_intervals that describe absolute points in time (like “1am”) as
>>> opposed to time deltas (like “every hour”). Ironically, I once reported
>>> this same issue myself (#959). In the past (and in the docs) we have
>>> simply
>>> said that users must make sure the two params agree. We discussed the
>>> possibility of a DAG validation method to raise an error if the
>>> start_date
>>> and schedule_interval don’t align, but Bolke made the point (and I
>>> agreed)
>>> that in these cases, start_date is sort of like telling the scheduler to
>>> “start paying attention” as opposed to “this is my first execution date”.
>>> In #1427, the scheduler was being asked to start paying attention on
>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>>> it
>>> was scheduling a first run at midnight and a second run at 1:10.
>>>
>>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>>> change the scheduling logic so that the cron-based interval takes
>>> precedence over a start date. Specifically, the first date on or after
>>> the
>>> start_date that complies with the schedule_interval becomes the first
>>> execution date.
>>>
>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>> successful
>>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>>> very possible for the previous TI to have run at any time in the past,
>>> not
>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>>> with the situation described in point #1, where the first scheduled run
>>> is
>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>
>>> The clear fix seems to be to have depends_on_past check the last TI that
>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>> with the intent of the flag. I will submit a fix.
>>>
>>> -J
>>>
>>
>>

Re: GitHub Issue #1427 takeaways

Posted by Jeremiah Lowin <jl...@apache.org>.
I'm afraid I disagree -- though we may be talking about two different
issues. This issue deals specifically with how to identify the "past" TI
when evaluating "depends_on_past", and shouldn't be impacted by shifting
start_date, transparently or not.

Here are three valid examples of depends_on_past DAGs that would fail to
run with the current setup:

1. A DAG with no schedule that is only run manually or via ad-hoc backfill.
Without a schedule_interval, depends_on_past will always fail (since it
looks back one schedule_interval).

2. A DAG with a schedule, but that is sometimes run off-schedule. Let's say
a scheduled run succeeds and then an off-schedule run fails. When the next
scheduled run starts, it shouldn't proceed because the most recent task
failed -- but it will look back one schedule_interval, jumping OVER the
most recent run, and decide it's ok to proceed.

3. A DAG with a schedule that is paused for a while. This DAG could be
running perfectly fine, but if it is paused for a while and then resumed,
its depends_on_past tasks will look back one schedule_interval and see
nothing, and therefore refuse to run.

So my proposal is simply that the depends_on_past logic looks back at the
most recent task as opposed to rigidly assuming there is a task one
schedule_interval ago. For a regularly scheduled DAG, this will result in
absolutely no behavior change. However it will robustly support a much
wider variety of cases like the ones I listed above.
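
To spell that out, here is a toy sketch (not the actual Airflow query; the in-memory list of (execution_date, state) tuples stands in for the task_instance table):

    from datetime import datetime, timedelta

    def previous_ti_current(history, execution_date, schedule_interval):
        # Current behavior: only a TI at exactly execution_date - schedule_interval counts.
        target = execution_date - schedule_interval
        return next((ti for ti in history if ti[0] == target), None)

    def previous_ti_proposed(history, execution_date):
        # Proposed behavior: the most recent TI before this one, whenever it ran.
        earlier = [ti for ti in history if ti[0] < execution_date]
        return max(earlier) if earlier else None

    # Example 2 above: a scheduled success, then a failed ad-hoc run in between.
    history = [(datetime(2016, 4, 24, 1, 10), "success"),
               (datetime(2016, 4, 24, 13, 0), "failed")]

    print(previous_ti_current(history, datetime(2016, 4, 25, 1, 10), timedelta(days=1)))
    # finds the 01:10 success, jumping over the failed ad-hoc run
    print(previous_ti_proposed(history, datetime(2016, 4, 25, 1, 10)))
    # finds the 13:00 failure, so the new run would correctly wait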

J

On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> >>> "The clear fix seems to be to have depends_on_past check the last TI
> that
> ran, regardless of whether it ran `schedule_interval` ago. That's in line
> with the intent of the flag. I will submit a fix."
>
> I don't think so. This would lead to skipping runs, which depends_on_past
> is used as a guarantee to run every TI, sequentially.
>
> Absolute scheduling (cron expressions) is much better than relative
> scheduling (origin + interval). Though it's easy to make relative
> scheduling behave in an absolute way. You just have to use a rounded
> start_date to your schedule_interval, and not move things around. Dynamic
> start_dates have always been a problem and should probably not be
> supported, though there's no way for us to tell.
>
> Changing the schedule interval or the "origin time" is a bit tricky and
> should be documented.
>
> If people use depend_on_past=True and change the origin or the interval,
> they basically redefine what "past" actually means and will require to
> "mark success" or defining a new "start_date" as a way to say "please
> disregard depend_on_past for this date"
>
> For those who haven't fully digested "What's the deal with start_dates",
> please take the time to read it:
> http://pythonhosted.org/airflow/faq.html
>
> Also see this part of the docs:
>
>
> Max
>
> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com> wrote:
>
>> Bolke, Sid and I had a brief conversation to discuss some of the
>> implications of https://github.com/airbnb/airflow/issues/1427
>>
>> There are two large points that need to be addressed:
>>
>> 1. this particular issue arises because of an alignment issue between
>> start_date and schedule_interval. This can only happen with cron-based
>> schedule_intervals that describe absolute points in time (like “1am”) as
>> opposed to time deltas (like “every hour”). Ironically, I once reported
>> this same issue myself (#959). In the past (and in the docs) we have
>> simply
>> said that users must make sure the two params agree. We discussed the
>> possibility of a DAG validation method to raise an error if the start_date
>> and schedule_interval don’t align, but Bolke made the point (and I agreed)
>> that in these cases, start_date is sort of like telling the scheduler to
>> “start paying attention” as opposed to “this is my first execution date”.
>> In #1427, the scheduler was being asked to start paying attention on
>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>> it
>> was scheduling a first run at midnight and a second run at 1:10.
>>
>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>> change the scheduling logic so that the cron-based interval takes
>> precedence over a start date. Specifically, the first date on or after the
>> start_date that complies with the schedule_interval becomes the first
>> execution date.
>>
>> 2. Issue #1 led to a second issue: depends_on_past checks for a successful
>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>> very possible for the previous TI to have run at any time in the past, not
>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>> with the situation described in point #1, where the first scheduled run is
>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>
>> The clear fix seems to be to have depends_on_past check the last TI that
>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>> with the intent of the flag. I will submit a fix.
>>
>> -J
>>
>
>

Re: GitHub Issue #1427 takeaways

Posted by Maxime Beauchemin <ma...@gmail.com>.
>>> "The clear fix seems to be to have depends_on_past check the last TI
that
ran, regardless of whether it ran `schedule_interval` ago. That's in line
with the intent of the flag. I will submit a fix."

I don't think so. This would lead to skipping runs, whereas depends_on_past
is used as a guarantee that every TI runs, sequentially.

Absolute scheduling (cron expressions) is much better than relative
scheduling (origin + interval), though it's easy to make relative
scheduling behave in an absolute way: you just have to use a start_date
rounded to your schedule_interval, and not move things around. Dynamic
start_dates have always been a problem and should probably not be
supported, though there's no way for us to tell.
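
For example, a DAG defined like this (a minimal, hypothetical snippet, not from the issue) has its start_date rounded to the cron expression, so the schedule and the origin agree:

    from datetime import datetime
    from airflow import DAG

    # start_date falls exactly on the "10 1 * * *" (daily at 01:10) schedule,
    # so the first execution_date is unambiguous.
    dag = DAG(
        dag_id="example_aligned",
        schedule_interval="10 1 * * *",
        start_date=datetime(2016, 4, 24, 1, 10),
    )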

Changing the schedule interval or the "origin time" is a bit tricky and
should be documented.

If people use depends_on_past=True and change the origin or the interval,
they basically redefine what "past" actually means, and they will need to
"mark success" or define a new "start_date" as a way to say "please
disregard depends_on_past for this date".

For those who haven't fully digested "What's the deal with start_dates",
please take the time to read it:
http://pythonhosted.org/airflow/faq.html

Also see this part of the docs:

Max

On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <jl...@gmail.com> wrote:

> Bolke, Sid and I had a brief conversation to discuss some of the
> implications of https://github.com/airbnb/airflow/issues/1427
>
> There are two large points that need to be addressed:
>
> 1. this particular issue arises because of an alignment issue between
> start_date and schedule_interval. This can only happen with cron-based
> schedule_intervals that describe absolute points in time (like “1am”) as
> opposed to time deltas (like “every hour”). Ironically, I once reported
> this same issue myself (#959). In the past (and in the docs) we have simply
> said that users must make sure the two params agree. We discussed the
> possibility of a DAG validation method to raise an error if the start_date
> and schedule_interval don’t align, but Bolke made the point (and I agreed)
> that in these cases, start_date is sort of like telling the scheduler to
> “start paying attention” as opposed to “this is my first execution date”.
> In #1427, the scheduler was being asked to start paying attention on
> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However, it
> was scheduling a first run at midnight and a second run at 1:10.
>
> Regardless of whether we choose to validate/warn/error, Bolke is going to
> change the scheduling logic so that the cron-based interval takes
> precedence over a start date. Specifically, the first date on or after the
> start_date that complies with the schedule_interval becomes the first
> execution date.
>
> 2. Issue #1 led to a second issue: depends_on_past checks for a successful
> TI at `execution_date - schedule_interval`. This is fragile, since it is
> very possible for the previous TI to have run at any time in the past, not
> just one schedule_interval ago. This can happen easily with ad-hoc DAG
> runs, and also if a DAG was paused for a while. Less commonly, it happens
> with the situation described in point #1, where the first scheduled run is
> off-schedule (the midnight run followed by the daily 1:10am runs).
>
> The clear fix seems to be to have depends_on_past check the last TI that
> ran, regardless of whether it ran `schedule_interval` ago. That's in line
> with the intent of the flag. I will submit a fix.
>
> -J
>