Posted to dev@airflow.apache.org by Daniel Standish <dp...@gmail.com> on 2019/09/04 03:14:03 UTC

Re: Setting to add choice of schedule at end or schedule at start of interval

What if we merely add a property "run_date" to DagRun?  At present
this would be essentially the same as "next_execution_date".

Then no change to the scheduler would be required, and no new dag parameter or
config.  Perhaps you could add a toggle to the DAGs UI view that lets you
choose whether to display "last run" by "run_date" or "execution_date".

If you want your dags to be parameterized by the date when they are meant to
be run -- as opposed to their implicit interval-of-interest -- then you can
reference "run_date".

One potential source of confusion with this is backfilling: what does
"run_date" mean in the context of a backfill?  You could say it means
essentially "initial run date", i.e. "do not run before date", i.e. "run
after date" or "run-at date".  So, for a daily, job the 2019-01-02
"run_date" corresponds to a 2019-01-01 execution_date.  This makes sense
right?

Perhaps in the future, the relationship between "run_date" and
"execution_date" can be more dynamic.  Perhaps in the future we rename
"execution_date" for clarity, or to be more generic.  But it makes sense
that a dag run will always have a run date, so it doesn't seem like a
terrible idea to add a property representing this.

Would this meet the goals of the PR?




On Wed, Aug 28, 2019 at 11:50 AM James Meickle
<jm...@quantopian.com.invalid> wrote:

> Totally agree with Daniel here. I think that if we implement this feature
> as proposed, it will actively discourage us from implementing a better
> data-aware feature that would remain invisible to most users while neatly
> addressing a lot of edge cases that currently require really ugly hacks. I
> believe that having more data awareness features in Airflow (like the data
> lineage work, or other metadata integrations) is worth investing in if we
> can do it without too much required user-facing complexity. The Airflow
> project isn't a full data warehouse suite but it's also not just "cron with
> a UI", so we should try to be pragmatic and fit in power-user features
> where we can do so without compromising the project's overall goals.
>
> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dp...@gmail.com>
> wrote:
>
> > I am just thinking there is the potential for a more comprehensive
> > enhancement here, and I worry that this is a band-aid that, like all new
> > features, has the potential to constrain future options.  It does not help
> > us to do anything we cannot already do.
> >
> > The source of this problem is that scheduling and interval-of-interest
> are
> > mixed together.
> >
> > My thought is there may be a way to separate scheduling and
> > interval-of-interest to uniformly resolve "execution_date" vs "run_date"
> > confusion.  We could make *explicit* instead of *implicit* the
> relationship
> > between run_date *(not currently a concept in airflow)* and
> > "interval-of-interest" *(currently represented by execution_date)*.
> >
> > I also see in this the potential to unlock some other improvements:
> > * support a greater diversity of incremental processes
> > * allow more flexible backfilling
> > * provide better views of data you have vs data you don't.
> >
> > The canonical airflow job is date-partitioned idempotent data pull.  Your
> > interval of interest is from execution_date to execution_date + 1
> > interval.  Schedule_interval is not just the scheduling cadence but it is
> > also your interval-of-interest partition function.   If that doesn't work
> > for your job, you set catchup=False and roll your own.
> >
> > What if there was a way to generalize?  E.g. could we allow for a more
> > flexible partition function that deviates from scheduler cadence?  E.g.
> > what if your interval-of-interest partitions could be governed by "min 1
> > day, max 30 days".  Then on an on-going basis, your daily loads would be a
> > range of 1 day, but if the server is down for a couple of days, this could be
> > caught up in one task and if you backfill it could be up to 30-day
> batches.
> >
> > Perhaps there is an abstraction that could be used by a greater diversity
> > of incremental processes.  Such a thing could support a nice "data
> > contiguity view". I imagine a horizontal bar that is solid where we have
> > the data and empty where we don't.  Then you click on a "missing" section
> > and you can  trigger a backfill task with that date interval according to
> > your partitioning rules.
> >
> > I can imagine using this for an incremental job where each time we pull
> the
> > new data since last time; in the `execute` method the operator could set
> > `self.high_watermark` with the max datetime processed.  Or maybe a
> callback
> > function could be used to gather this value.  This value could be used in
> > > the next run, and could be depicted in a view.
> >
> > Default intervals of interest could be status quo -- i.e. partitions
> equal
> > to schedule interval -- but could be overwritten using templating or
> > callbacks or setting it during `execute`.
> >
> > So anyway, I don't have a master plan all figured out.  But I think there
> > is opportunity in this area for more comprehensive enhancement that goes
> > more directly at the root of the problem.
> >
> >
> >
> >
> > On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
> > maximebeauchemin@gmail.com> wrote:
> >
> > > How about an alternative approach that would introduce 2 new keyword
> > > arguments that are clear (something like, but maybe better than
> > > `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> > > unchanged, but plan its deprecation. As a first step `execution_date`
> > > would be inferred from the new args, and warn about deprecation when
> > used.
> > >
> > > Max
> > >
> > > On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bd...@gmail.com>
> > wrote:
> > >
> > > > Execution date is execution date for a dag run no matter what. There
> is
> > > no
> > > > end interval or start interval for a dag run. The only time this is
> > > > relevant is when we calculate the next or previous dagrun.
> > > >
> > > > So I don't think Daniel's rationale makes sense (?)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 27 Aug 2019, at 17:40, Philippe Gagnon <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > I agree with Daniel's rationale but I am also worried about
> backwards
> > > > > compatibility as this would perhaps be the most disruptive breaking
> > > > change
> > > > > possible. I think maybe we should write down the different options
> > > > > available to us (AIP?) and call for a vote. What does everyone
> think?
> > > > >
> > > > >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jc...@gmail.com>
> > > wrote:
> > > > >>
> > > > >> Can't execution date already mean different things depending
> on
> > if
> > > > the
> > > > >> dag run was initiated via the scheduler or manually via command
> > > > line/API?
> > > > >> I agree that making it consistent might make it easier to explain
> to
> > > new
> > > > >> users, but should we exchange that for breaking pretty much every
> > > > existing
> > > > >> dag by re-defining what execution date is?
> > > > >> -James
> > > > >>
> > > > >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
> > > dpstandish@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>>>
> > > > >>>> To Daniel’s concerns, I would argue this is not a change to
> what a
> > > dag
> > > > >>> run
> > > > >>>> is, it is rather a change to WHEN that dag run will be
> scheduled.
> > > > >>>
> > > > >>>
> > > > >>> Execution date is part of the definition of a dag_run; it is
> > uniquely
> > > > >>> identified by an execution_date and dag_id.
> > > > >>>
> > > > >>> When someone asks what is a dag_run, we should be able to provide
> > an
> > > > >>> answer.
> > > > >>>
> > > > >>> Imagine trying to explain what a dag run is, when execution_date
> > can
> > > > mean
> > > > >>> different things.
> > > > >>>    Admin: "A dag run is an execution_date and a dag_id".
> > > > >>>    New user: "Ok. Clear as a bell. What's an execution_date?"
> > > > >>>    Admin: "Well, it can be one of two things.  It *could* be when
> > the
> > > > >> dag
> > > > >>> will be run... but it could *also* be 'the time when dag should
> be
> > > run
> > > > >>> minus one schedule interval".  It depends on whether you choose
> > 'end'
> > > > or
> > > > >>> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> > > > >>> execution_date means 'when dag will be run'.  If you choose 'end'
> > > then
> > > > >>> execution_date means 'when dag will be run minus one interval.'
> If
> > > you
> > > > >>> change the parameter after some time, then we don't necessarily
> > know
> > > > what
> > > > >>> it means at all times".
> > > > >>>
> > > > >>> Why would we do this to ourselves?
> > > > >>>
> > > > >>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> > > > >>> * dag_run is dag_id + execution_date
> > > > >>> * execution_date is when dag will be run (notwithstanding
> scheduler
> > > > >> delay,
> > > > >>> queuing)
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Execution_date is defined as "run-at date minus 1 interval".  The
> > > > >>> assumption in this is that your tasks care about this particular
> > date.
> > > > >>> Obviously this makes sense for some tasks but not for others.
> > > > >>>
> > > > >>> I would prop
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcoder01@gmail.com
> >
> > > > wrote:
> > > > >>>>
> > > > >>>> I think this is a great improvement and should be merged. To
> > > Daniel’s
> > > > >>>> concerns, I would argue this is not a change to what a dag run
> is,
> > > it
> > > > >> is
> > > > >>>> rather a change to WHEN that dag run will be scheduled.
> > > > >>>> I had implemented a similar change in my own version but
> > ultimately
> > > > >>> backed it out
> > > > >>>> so I didn’t have to patch after each new release. In my opinion
> > the
> > > > >> main
> > > > >>>> flaw in the current scheduler, and I have brought this up
> before,
> > is
> > > > >> when
> > > > >>>> you don’t have a consistent schedule interval (e.g. only run
> M-F).
> > > > >> After
> > > > >>>> backing out the “schedule at interval start” I had to switch to
> a
> > > > daily
> > > > >>>> schedule and go through and put a short circuit operator in each
> > of
> > > my
> > > > >>> M-F
> > > > >>>> dags to get the behavior that I wanted. This results in putting
> > > > >>> scheduling
> > > > >>>> logic inside the dag, when scheduling logic should be in the
> > > > scheduler.
> > > > >>>>
> > > > >>>> -James
> > > > >>>>
> > > > >>>>
> > > > >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <
> > dpstandish@gmail.com
> > > >
> > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>> Re
> > > > >>>>>
> > > > >>>>>> What are people's feelings on changing the default execution
> to
> > > > >>> schedule
> > > > >>>>>> interval start
> > > > >>>>>
> > > > >>>>> and
> > > > >>>>>
> > > > >>>>>> I'm in favor of doing that, but then exposing new variables of
> > > > >>>>>> "interval_start" and "interval_end", etc. so that people write
> > > > >>>>>> clearer-looking at-a-glance DAGs
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> While I am def on board with the spirit of this PR, I would
> vote
> > we
> > > > >> do
> > > > >>>> not
> > > > >>>>> accept this PR as is, because it cements a confusing option.
> > > > >>>>>
> > > > >>>>> *What is the right representation of a dag run?*
> > > > >>>>>
> > > > >>>>> Right now the representation is "dag run-at date minus 1
> > interval".
> > > > >> It
> > > > >>>>> should just be "dag run-at date".
> > > > >>>>>
> > > > >>>>> We don't need to address the question of whether execution date
> > is
> > > > >> the
> > > > >>>>> start or the end of an interval; it doesn't matter.
> > > > >>>>>
> > > > >>>>> In all cases, a given dag run will be targeted for *some*
> initial
> > > > >>> "run-at
> > > > >>>>> time"; so *that* should be the time that is part of the PK of a
> > dag
> > > > >>> run,
> > > > >>>>> and *that *is the time that should be exposed as the dag run
> > > > >> "execution
> > > > >>>>> date"
> > > > >>>>>
> > > > >>>>> *Interval of interest is not a dag_run attribute*
> > > > >>>>>
> > > > >>>>> We also mix in this question of the date interval that the
> > *tasks*
> > > > >> are
> > > > >>>>> interested in.  But the *dag run* need not concern itself with
> > this
> > > > >> in
> > > > >>>> any
> > > > >>>>> way.  That is for the tasks to figure out: if they happen to
> need
> > > > >> "dag
> > > > >>>>> run-at date," then they can reference that; if they want the
> > prior
> > > > >> one,
> > > > >>>> ask
> > > > >>>>> for the prior one.
> > > > >>>>>
> > > > >>>>> Previously, I was in the camp that thought it was a great idea
> to
> > > > >>> rename
> > > > >>>>> "execution_date" to "period_start" or "interval_start".  But I
> > now
> > > > >>> think
> > > > >>>>> this is folly.  It invokes this question of the "interval of
> > > > >> interest"
> > > > >>> or
> > > > >>>>> "period of interest".  But the dag doesn't need to know
> anything
> > > > >> about
> > > > >>>>> that.
> > > > >>>>>
> > > > >>>>> Within the same dag you may have tasks with different intervals
> > of
> > > > >>>>> interest.  So why make assumptions in the dag; just give the
> > facts:
> > > > >>> this
> > > > >>>> is
> > > > >>>>> my run date; this is the prior run date, etc.  It would be a
> > > > >> regression
> > > > >>>>> from the perspective of providing accurate names.
> > > > >>>>>
> > > > >>>>> *Proposal*
> > > > >>>>>
> > > > >>>>> So, I would propose we change "execution_date" to mean "dag
> > run-at
> > > > >>> date"
> > > > >>>> as
> > > > >>>>> opposed to "dag run-at date minus 1".  But we should do so
> > without
> > > > >>>>> reference to interval end or interval start.
> > > > >>>>>
> > > > >>>>> *Configurability*
> > > > >>>>>
> > > > >>>>> The more configuration options we have, the more noise there is
> > as
> > > a
> > > > >>> user
> > > > >>>>> trying to understand how to use airflow, so I'd rather us not
> > make
> > > > >> this
> > > > >>>>> configurable at all.
> > > > >>>>>
> > > > >>>>> That said, perhaps a more clear and more explicit means of making
> > this
> > > > >>>>> configurable would be to define an integer param
> > > > >>>>> "dag_run_execution_date_interval_offset", which would control
> how
> > > > >> many
> > > > >>>>> intervals back from actual "dag run-at date" the "execution
> date"
> > > > >>> should
> > > > >>>>> be.  (current behavior = 1, new behavior = 0).
> > > > >>>>>
> > > > >>>>> *Side note*
> > > > >>>>>
> > > > >>>>> Hopefully not to derail discussion: I think there are
> additional,
> > > > >>> related
> > > > >>>>> task attributes that may want to come into being: namely,
> > > > >> low_watermark
> > > > >>>> and
> > > > >>>>> high_watermark.  There is the potential, with attributes like
> > this,
> > > > >> for
> > > > >>>>> adding better out-of-the-box support for common data workflows
> > that
> > > > >> we
> > > > >>>> now
> > > > >>>>> need to use xcom for, namely incremental loads.  But I want to
> > give
> > > > >> it
> > > > >>>> more
> > > > >>>>> thought before proposing anything specific.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
> > > > >> Jarek.Potiuk@polidea.com
> > > > >>>>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Good one Damian. I will have a list of issues that can be
> > possible
> > > > >> to
> > > > >>>>>> handle at the workshop, so that one goes there.
> > > > >>>>>>
> > > > >>>>>> J.
> > > > >>>>>>
> > > > >>>>>> Principal Software Engineer
> > > > >>>>>> Phone: +48660796129
> > > > >>>>>>
> > > > >>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
> > > > >>>>>> damian.shaw.2@credit-suisse.com> napisał:
> > > > >>>>>>
> > > > >>>>>>> I can't overstate what a conceptual improvement this would
> be
> > > for
> > > > >>> the
> > > > >>>>>> end
> > > > >>>>>>> users of Airflow in our environment. I've written a lot of
> code
> > > so
> > > > >>> all
> > > > >>>>>> our
> > > > >>>>>>> configuration works like this anyway. But the UI still shows
> > the
> > > > >>>> Airflow
> > > > >>>>>>> dates which still to this day sometimes confuse me.
> > > > >>>>>>>
> > > > >>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some
> of
> > > my
> > > > >>>> first
> > > > >>>>>>> PRs could be additional test cases around edge cases to do
> with
> > > DST
> > > > >>> and
> > > > >>>>>>> cron scheduling that I have concerns about :)
> > > > >>>>>>>
> > > > >>>>>>> Damian
> > > > >>>>>>>
> > > > >>>>>>> -----Original Message-----
> > > > >>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
> > > > >>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> > > > >>>>>>> To: dev@airflow.apache.org
> > > > >>>>>>> Subject: Setting to add choice of schedule at end or schedule
> > at
> > > > >>> start
> > > > >>>> of
> > > > >>>>>>> interval
> > > > >>>>>>>
> > > > >>>>>>> This has come up a few times before, someone has now opened a
> > PR
> > > > >> that
> > > > >>>>>>> makes this a global+per-dag setting:
> > > > >>>>>>> https://github.com/apache/airflow/pull/5787 and it also
> > includes
> > > > >>> docs
> > > > >>>>>>> that I think does a good job of illustrating the two modes.
> > > > >>>>>>>
> > > > >>>>>>> Does anyone object to this being merged? If no one says
> > anything
> > > by
> > > > >>>>>> midday
> > > > >>>>>>> on Tuesday I will take that as assent and will merge it.
> > > > >>>>>>>
> > > > >>>>>>> The docs from the PR included below.
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> Ash
> > > > >>>>>>>
> > > > >>>>>>> Scheduled Time vs Execution Time
> > > > >>>>>>> ''''''''''''''''''''''''''''''''
> > > > >>>>>>>
> > > > >>>>>>> A DAG with a ``schedule_interval`` will execute once per
> > > interval.
> > > > >> By
> > > > >>>>>>> default, the execution of a DAG will occur at the **end** of
> > the
> > > > >>>>>>> schedule interval.
> > > > >>>>>>>
> > > > >>>>>>> A few examples:
> > > > >>>>>>>
> > > > >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run
> that
> > > > >>>> processes
> > > > >>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
> > > 17:59:59,
> > > > >>>>>>> i.e. once that hour is over.
> > > > >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > > > >>> processes
> > > > >>>>>>> 2019-08-16 will start running shortly after 2019-08-17 00:00.
> > > > >>>>>>>
> > > > >>>>>>> The reasoning behind this execution vs scheduling behaviour
> is
> > > that
> > > > >>>>>>> data for the interval to be processed won't be fully
> available
> > > > >> until
> > > > >>>>>>> the interval has elapsed.
> > > > >>>>>>>
> > > > >>>>>>> In cases where you wish the DAG to be executed at the
> **start**
> > > of
> > > > >>> the
> > > > >>>>>>> interval, specify ``schedule_at_interval_end=False``, either
> in
> > > > >>>>>>> ``airflow.cfg``, or on a per-DAG basis.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
>

Re: Setting to add choice of schedule at end or schedule at start of interval

Posted by Ash Berlin-Taylor <as...@apache.org>.
Oh clever! But...

I'd like us to not make this the "official" way just yet, as I'm considering changing how the scheduler works when it comes to executing the code (part of the larger DAG serialisation effort, where I'd like to stop the scheduler executing python code on every loop) - and making this official would close off avenues that I'm actively looking at going down.

-ash

> On 6 Sep 2019, at 06:40, Maxime Beauchemin <ma...@gmail.com> wrote:
> 
> Just had a thought and looked a tiny bit at the source code to assess
> feasibility, but it seems like you could just derive the DAG class and
> override `previous_schedule` and `following_schedule` methods. The
> signature of both is that you get a `datetime.datetime` and have to return
> another. It's pretty easy to put your arbitrarily complex logic in there.
> 
> There may be a few hiccups to sort out things like
> `airflow.utils.dates.date_range` (where duplicated time-step logic exists)
> to make sure that all time-step logic aligns with these two methods I just
> mentioned, but from that point it could become the official way to
> incorporate funky date-step logic.
> 
> Max
> 
> On Wed, Sep 4, 2019 at 12:54 PM Daniel Standish <dp...@gmail.com>
> wrote:
> 
>> Re:
>> 
>>> For example, if I need to run a DAG every 20 minutes between 8 AM and 4
>>> PM...
>> 
>> 
>> This makes a lot of sense!  Thank you for providing this example.  My
>> initial thought of course is "well can't you just set it to run */20
>> between 7:40am and 3:40pm," but I don't think that is possible in cron.
>> Which is why you have to do hacky shit as you've said and it indeed sounds
>> terrible.  I never had to achieve a schedule like this, and yeah -- it
>> should not be this hard.
>> 
>> Re:
>> 
>>> I can’t see how adding a property to Dagrun that is essentially
>>> identical to next_execution_date would add any benefit.
>> 
>> That's why I was like what the hell is the point of this thing!  I thought
>> it was just purely cosmetic, so that in effect "execution_date" would
>> optionally mean "run_date".
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Sep 4, 2019 at 12:10 PM James Coder <jc...@gmail.com> wrote:
>> 
>>> I can’t see how adding a property to Dagrun that is essentially
>>> identical to next_execution_date would add any benefit. The way I see
>>> it the issue at hand here is not the availability of dates. There are
>>> plenty of options in the template context for dates before and after
>>> execution date. My view is that the problem this is trying to solve
>>> is that waiting until the right edge of an interval has passed to
>>> schedule a dag run has some shortcomings. Mainly that if your
>>> intervals vary in length you are forced to put scheduling logic that
>>> should reside in the scheduler in your DAGs. For example, if I need to
>>> run a DAG every 20 minutes between 8 AM and 4 PM, in its current
>>> form, the scheduler won't schedule that 4PM run until 8 AM the next
>>> day. "Just use next_execution_date" you say, well that's all well and
>>> good between 8AM and 3:40 PM, but when 4:01 PM rolls around and you
>>> don't have the results because they won't be available until after 8
>>> the next day, that doesn't sound so good, does it? In order to work
>>> around this, you have to add additional runs and short circuit
>>> operators over and over. It's a hassle.  Allowing for scheduling dags
>>> at the left edge of an interval and allowing it to behave more like
>>> cron, where it runs at the time specified, not schedule + interval,
>>> would make things much less complicated for users like myself that
>>> can't always wait until the right edge of the interval.
>>> 
>>> 
>>> James Coder
>>> 
>>>> On Sep 3, 2019, at 11:14 PM, Daniel Standish <dp...@gmail.com>
>>> wrote:
>>>> 
>>>> What if we merely add a property "run_date" to DagRun?  At present
>>>> this would be essentially same as "next_execution_date".
>>>> 
>>>> Then no change to scheduler would be required, and no new dag parameter
>>> or
>>>> config.  Perhaps you could add a toggle to the DAGs UI view that lets
>> you
>>>> choose whether to display "last run" by "run_date" or "execution_date".
>>>> 
>>>> If you want your dags to be parameterized by the date when they meant
>> to
>>> be
>>>> run -- as opposed to their implicit interval-of-interest -- then you
>> can
>>>> reference "run_date".
>>>> 
>>>> One potential source of confusion with this is backfilling: what does
>>>> "run_date" mean in the context of a backfill?  You could say it means
>>>> essentially "initial run date", i.e. "do not run before date", i.e.
>> "run
>>>> after date" or "run-at date".  So, for a daily, job the 2019-01-02
>>>> "run_date" corresponds to a 2019-01-01 execution_date.  This makes
>> sense
>>>> right?
>>>> 
>>>> Perhaps in the future, the relationship between "run_date" and
>>>> "execution_date" can be more dynamic.  Perhaps in the future we rename
>>>> "execution_date" for clarity, or to be more generic.  But it makes
>> sense
>>>> that a dag run will always have a run date, so it doesn't seem like a
>>>> terrible idea to add a property representing this.
>>>> 
>>>> Would this meet the goals of the PR?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Aug 28, 2019 at 11:50 AM James Meickle
>>>> <jm...@quantopian.com.invalid> wrote:
>>>> 
>>>>> Totally agree with Daniel here. I think that if we implement this
>>> feature
>>>>> as proposed, it will actively discourage us from implementing a better
>>>>> data-aware feature that would remain invisible to most users while
>>> neatly
>>>>> addressing a lot of edge cases that currently require really ugly
>>> hacks. I
>>>>> believe that having more data awareness features in Airflow (like the
>>> data
>>>>> lineage work, or other metadata integrations) is worth investing in if
>>> we
>>>>> can do it without too much required user-facing complexity. The
>> Airflow
>>>>> project isn't a full data warehouse suite but it's also not just "cron
>>> with
>>>>> a UI", so we should try to be pragmatic and fit in power-user features
>>>>> where we can do so without compromising the project's overall goals.
>>>>> 
>>>>> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dpstandish@gmail.com
>>> 
>>>>> wrote:
>>>>> 
>>>>>> I am just thinking there is the potential for a more comprehensive
>>>>>> enhancement here, and I worry that this is a band-aid that, like all
>>> new
>>>>>> features has the potential to constrain future options.  It does not
>>> help
>>>>>> us to do anything we cannot already do.
>>>>>> 
>>>>>> The source of this problem is that scheduling and
>> interval-of-interest
>>>>> are
>>>>>> mixed together.
>>>>>> 
>>>>>> My thought is there may be a way to separate scheduling and
>>>>>> interval-of-interest to uniformly resolve "execution_date" vs
>>> "run_date"
>>>>>> confusion.  We could make *explicit* instead of *implicit* the
>>>>> relationship
>>>>>> between run_date *(not currently a concept in airflow)* and
>>>>>> "interval-of-interest" *(currently represented by execution_date)*.
>>>>>> 
>>>>>> I also see in this the potential to unlock some other improvements:
>>>>>> * support a greater diversity of incremental processes
>>>>>> * allow more flexible backfilling
>>>>>> * provide better views of data you have vs data you don't.
>>>>>> 
>>>>>> The canonical airflow job is date-partitioned idempotent data pull.
>>> Your
>>>>>> interval of interest is from execution_date to execution_date + 1
>>>>>> interval.  Schedule_interval is not just the scheduling cadence but
>> it
>>> is
>>>>>> also your interval-of-interest partition function.   If that doesn't
>>> work
>>>>>> for your job, you set catchup=False and roll your own.
>>>>>> 
>>>>>> What if there was a way to generalize?  E.g. could we allow for more
>>>>>> flexible partition function that deviated from scheduler cadence?
>> E.g.
>>>>>> what if your interval-of-interest partitions could be governed by
>> "min
>>> 1
>>>>>> day, max 30 days".  Then on on-going basis, your daily loads would
>> be a
>>>>>> range of 1 day but then if server down for couple days, this could be
>>>>>> caught up in one task and if you backfill it could be up to 30-day
>>>>> batches.
>>>>>> 
>>>>>> Perhaps there is an abstraction that could be used by a greater
>>> diversity
>>>>>> of incremental processes.  Such a thing could support a nice "data
>>>>>> contiguity view". I imagine a horizontal bar that is solid where we
>>> have
>>>>>> the data and empty where we don't.  Then you click on a "missing"
>>> section
>>>>>> and you can  trigger a backfill task with that date interval
>> according
>>> to
>>>>>> your partitioning rules.
>>>>>> 
>>>>>> I can imagine using this for an incremental job where each time we
>> pull
>>>>> the
>>>>>> new data since last time; in the `execute` method the operator could
>>> set
>>>>>> `self.high_watermark` with the max datetime processed.  Or maybe a
>>>>> callback
>>>>>> function could be used to gather this value.  This value could be
>> used
>>> in
>>>>>> the next run, and could be depicted in a view.
>>>>>> 
>>>>>> Default intervals of interest could be status quo -- i.e. partitions
>>>>> equal
>>>>>> to schedule interval -- but could be overwritten using templating or
>>>>>> callbacks or setting it during `execute`.
>>>>>> 
>>>>>> So anyway, I don't have a master plan all figured out.  But I think
>>> there
>>>>>> is opportunity in this area for more comprehensive enhancement that
>>> goes
>>>>>> more directly at the root of the problem.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>> 
>>>>>>> How about an alternative approach that would introduce 2 new keyword
>>>>>>> arguments that are clear (something like, but maybe better than
>>>>>>> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
>>>>>>> unchanged, but plan its deprecation. As a first step
>> `execution_date`
>>>>>>> would be inferred from the new args, and warn about deprecation when
>>>>>> used.
>>>>>>> 
>>>>>>> Max
>>>>>>> 
>>>>>>> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bd...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Execution date is execution date for a dag run no matter what.
>> There
>>>>> is
>>>>>>> no
>>>>>>>> end interval or start interval for a dag run. The only time this is
>>>>>>>> relevant is when we calculate the next or previous dagrun.
>>>>>>>> 
>>>>>>>> So I don't think Daniel's rationale makes sense (?)
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>>> On 27 Aug 2019, at 17:40, Philippe Gagnon <ph...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> I agree with Daniel's rationale but I am also worried about
>>>>> backwards
>>>>>>>>> compatibility as this would perhaps be the most disruptive
>> breaking
>>>>>>>> change
>>>>>>>>> possible. I think maybe we should write down the different options
>>>>>>>>> available to us (AIP?) and call for a vote. What does everyone
>>>>> think?
>>>>>>>>> 
>>>>>>>>>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jc...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Can't execution date already mean different things depending
>>>>> on
>>>>>> if
>>>>>>>> the
>>>>>>>>>> dag run was initiated via the scheduler or manually via command
>>>>>>>> line/API?
>>>>>>>>>> I agree that making it consistent might make it easier to explain
>>>>> to
>>>>>>> new
>>>>>>>>>> users, but should we exchange that for breaking pretty much every
>>>>>>>> existing
>>>>>>>>>> dag by re-defining what execution date is?
>>>>>>>>>> -James
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
>>>>>>> dpstandish@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> To Daniel’s concerns, I would argue this is not a change to
>>>>> what a
>>>>>>> dag
>>>>>>>>>>> run
>>>>>>>>>>>> is, it is rather a change to WHEN that dag run will be
>>>>> scheduled.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Execution date is part of the definition of a dag_run; it is
>>>>>> uniquely
>>>>>>>>>>> identified by an execution_date and dag_id.
>>>>>>>>>>> 
>>>>>>>>>>> When someone asks what is a dag_run, we should be able to
>> provide
>>>>>> an
>>>>>>>>>>> answer.
>>>>>>>>>>> 
>>>>>>>>>>> Imagine trying to explain what a dag run is, when execution_date
>>>>>> can
>>>>>>>> mean
>>>>>>>>>>> different things.
>>>>>>>>>>>  Admin: "A dag run is an execution_date and a dag_id".
>>>>>>>>>>>  New user: "Ok. Clear as a bell. What's an execution_date?"
>>>>>>>>>>>  Admin: "Well, it can be one of two things.  It *could* be when
>>>>>> the
>>>>>>>>>> dag
>>>>>>>>>>> will be run... but it could *also* be 'the time when dag should
>>>>> be
>>>>>>> run
>>>>>>>>>>> minus one schedule interval".  It depends on whether you choose
>>>>>> 'end'
>>>>>>>> or
>>>>>>>>>>> 'start' for 'schedule_interval_edge.'  If you choose 'start'
>> then
>>>>>>>>>>> execution_date means 'when dag will be run'.  If you choose
>> 'end'
>>>>>>> then
>>>>>>>>>>> execution_date means 'when dag will be run minus one interval.'
>>>>> If
>>>>>>> you
>>>>>>>>>>> change the parameter after some time, then we don't necessarily
>>>>>> know
>>>>>>>> what
>>>>>>>>>>> it means at all times".
>>>>>>>>>>> 
>>>>>>>>>>> Why would we do this to ourselves?
>>>>>>>>>>> 
>>>>>>>>>>> Alternatively, we can give dag_run a clear, unambiguous meaning:
>>>>>>>>>>> * dag_run is dag_id + execution_date
>>>>>>>>>>> * execution_date is when dag will be run (notwithstanding
>>>>> scheduler
>>>>>>>>>> delay,
>>>>>>>>>>> queuing)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Execution_date is defined as "run-at date minus 1 interval".
>> The
>>>>>>>>>>> assumption in this is that your tasks care about this particular
>>>>>> date.
>>>>>>>>>>> Obviously this makes sense for some tasks but not for others.
>>>>>>>>>>> 
>>>>>>>>>>> I would prop
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <
>> jcoder01@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I think this is a great improvement and should be merged. To
>>>>>>> Daniel’s
>>>>>>>>>>>> concerns, I would argue this is not a change to what a dag run
>>>>> is,
>>>>>>> it
>>>>>>>>>> is
>>>>>>>>>>>> rather a change to WHEN that dag run will be scheduled.
>>>>>>>>>>>> I had implemented a similar change in my own version but
>>>>>> ultimately
>>>>>>>>>>> backed
>>>>>>>>>>>> so I didn’t have to patch after each new release. In my opinion
>>>>>> the
>>>>>>>>>> main
>>>>>>>>>>>> flaw in the current scheduler, and I have brought this up
>>>>> before,
>>>>>> is
>>>>>>>>>> when
>>>>>>>>>>>> you don’t have a consistent schedule interval (e.g. only run
>>>>> M-F).
>>>>>>>>>> After
>>>>>>>>>>>> backing out the “schedule at interval start” I had to switch to
>>>>> a
>>>>>>>> daily
>>>>>>>>>>>> schedule and go through and put a short circuit operator in
>> each
>>>>>> of
>>>>>>> my
>>>>>>>>>>> M-F
>>>>>>>>>>>> dags to get the behavior that I wanted. This results in putting
>>>>>>>>>>> scheduling
>>>>>>>>>>>> logic inside the dag, when scheduling logic should be in the
>>>>>>>> scheduler.
>>>>>>>>>>>> 
>>>>>>>>>>>> -James
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <
>>>>>> dpstandish@gmail.com
>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Re
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What are people's feelings on changing the default execution
>>>>> to
>>>>>>>>>>> schedule
>>>>>>>>>>>>>> interval start
>>>>>>>>>>>>> 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm in favor of doing that, but then exposing new variables
>> of
>>>>>>>>>>>>>> "interval_start" and "interval_end", etc. so that people
>> write
>>>>>>>>>>>>>> clearer-looking at-a-glance DAGs
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While I am def on board with the spirit of this PR, I would
>>>>> vote
>>>>>> we
>>>>>>>>>> do
>>>>>>>>>>>> not
>>>>>>>>>>>>> accept this PR as is, because it cements a confusing option.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *What is the right representation of a dag run?*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Right now the representation is "dag run-at date minus 1
>>>>>> interval".
>>>>>>>>>> It
>>>>>>>>>>>>> should just be "dag run-at date".
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We don't need to address the question of whether execution
>> date
>>>>>> is
>>>>>>>>>> the
>>>>>>>>>>>>> start or the end of an interval; it doesn't matter.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In all cases, a given dag run will be targeted for *some*
>>>>> initial
>>>>>>>>>>> "run-at
>>>>>>>>>>>>> time"; so *that* should be the time that is part of the PK of
>> a
>>>>>> dag
>>>>>>>>>>> run,
>>>>>>>>>>>>> and *that *is the time that should be exposed as the dag run
>>>>>>>>>> "execution
>>>>>>>>>>>>> date"
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Interval of interest is not a dag_run attribute*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We also mix in this question of the date interval that the
>>>>>> *tasks*
>>>>>>>>>> are
>>>>>>>>>>>>> interested in.  But the *dag run* need not concern itself with
>>>>>> this
>>>>>>>>>> in
>>>>>>>>>>>> any
>>>>>>>>>>>>> way.  That is for the tasks to figure out: if they happen to
>>>>> need
>>>>>>>>>> "dag
>>>>>>>>>>>>> run-at date," then they can reference that; if they want the
>>>>>> prior
>>>>>>>>>> one,
>>>>>>>>>>>> ask
>>>>>>>>>>>>> for the prior one.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Previously, I was in the camp that thought it was a great idea
>>>>> to
>>>>>>>>>>> rename
>>>>>>>>>>>>> "execution_date" to "period_start" or "interval_start".  But I
>>>>>> now
>>>>>>>>>>> think
>>>>>>>>>>>>> this is folly.  It invokes this question of the "interval of
>>>>>>>>>> interest"
>>>>>>>>>>> or
>>>>>>>>>>>>> "period of interest".  But the dag doesn't need to know
>>>>> anything
>>>>>>>>>> about
>>>>>>>>>>>>> that.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Within the same dag you may have tasks with different
>> intervals
>>>>>> of
>>>>>>>>>>>>> interest.  So why make assumptions in the dag; just give the
>>>>>> facts:
>>>>>>>>>>> this
>>>>>>>>>>>> is
>>>>>>>>>>>>> my run date; this is the prior run date, etc.  It would be a
>>>>>>>>>> regression
>>>>>>>>>>>>> from the perspective of providing accurate names.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Proposal*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So, I would propose we change "execution_date" to mean "dag
>>>>>> run-at
>>>>>>>>>>> date"
>>>>>>>>>>>> as
>>>>>>>>>>>>> opposed to "dag run-at date minus 1".  But we should do so
>>>>>> without
>>>>>>>>>>>>> reference to interval end or interval start.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Configurability*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The more configuration options we have, the more noise there
>> is
>>>>>> as
>>>>>>> a
>>>>>>>>>>> user
>>>>>>>>>>>>> trying to understand how to use airflow, so I'd rather us not
>>>>>> make
>>>>>>>>>> this
>>>>>>>>>>>>> configurable at all.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That said, perhaps a more clear and more explicit means making
>>>>>> this
>>>>>>>>>>>>> configurable would be to define an integer param
>>>>>>>>>>>>> "dag_run_execution_date_interval_offset", which would control
>>>>> how
>>>>>>>>>> many
>>>>>>>>>>>>> intervals back from actual "dag run-at date" the "execution
>>>>> date"
>>>>>>>>>>> should
>>>>>>>>>>>>> be.  (current behavior = 1, new behavior = 0).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Side note*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hopefully not to derail discussion: I think there are
>>>>> additional,
>>>>>>>>>>> related
>>>>>>>>>>>>> task attributes that may want to come into being: namely,
>>>>>>>>>> low_watermark
>>>>>>>>>>>> and
>>>>>>>>>>>>> high_watermark.  There is the potential, with attributes like
>>>>>> this,
>>>>>>>>>> for
>>>>>>>>>>>>> adding better out-of-the-box support for common data workflows
>>>>>> that
>>>>>>>>>> we
>>>>>>>>>>>> now
>>>>>>>>>>>>> need to use xcom for, namely incremental loads.  But I want to
>>>>>> give
>>>>>>>>>> it
>>>>>>>>>>>> more
>>>>>>>>>>>>> thought before proposing anything specific.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
>>>>>>>>>> Jarek.Potiuk@polidea.com
>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Good one Damian. I will have a list of issues that can be
>>>>>> possible
>>>>>>>>>> to
>>>>>>>>>>>>>> handle at the workshop, so that one goes there.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Principal Software Engineer
>>>>>>>>>>>>>> Phone: +48660796129
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
>>>>>>>>>>>>>> damian.shaw.2@credit-suisse.com> napisał:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I can't understate what a conceptual improvement this would
>>>>> be
>>>>>>> for
>>>>>>>>>>> the
>>>>>>>>>>>>>> end
>>>>>>>>>>>>>>> users of Airflow in our environment. I've written a lot of
>>>>> code
>>>>>>> so
>>>>>>>>>>> all
>>>>>>>>>>>>>> our
>>>>>>>>>>>>>>> configuration works like this anyway. But the UI still shows
>>>>>> the
>>>>>>>>>>>> Airflow
>>>>>>>>>>>>>>> dates which still to this day sometimes confuse me.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some
>>>>> of
>>>>>>> my
>>>>>>>>>>>> first
>>>>>>>>>>>>>>> PRs could be additional test cases around edge cases to do
>>>>> with
>>>>>>> DST
>>>>>>>>>>> and
>>>>>>>>>>>>>>> cron scheduling that I have concerns about :)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
>>>>>>>>>>>>>>> Sent: Friday, August 23, 2019 6:50 AM
>>>>>>>>>>>>>>> To: dev@airflow.apache.org
>>>>>>>>>>>>>>> Subject: Setting to add choice of schedule at end or
>> schedule
>>>>>> at
>>>>>>>>>>> start
>>>>>>>>>>>> of
>>>>>>>>>>>>>>> interval
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This has come up a few times before, someone has now opened
>> a
>>>>>> PR
>>>>>>>>>> that
>>>>>>>>>>>>>>> makes this a global+per-dag setting:
>>>>>>>>>>>>>>> https://github.com/apache/airflow/pull/5787 and it also
>>>>>> includes
>>>>>>>>>>> docs
>>>>>>>>>>>>>>> that I think does a good job of illustrating the two modes.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Does anyone object to this being merged? If no one says
>>>>>> anything
>>>>>>> by
>>>>>>>>>>>>>> midday
>>>>>>>>>>>>>>> on Tuesday I will take that as assent and will merge it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The docs from the PR included below.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ash
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Scheduled Time vs Execution Time
>>>>>>>>>>>>>>> ''''''''''''''''''''''''''''''''
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A DAG with a ``schedule_interval`` will execute once per
>>>>>>> interval.
>>>>>>>>>> By
>>>>>>>>>>>>>>> default, the execution of a DAG will occur at the **end** of
>>>>>> the
>>>>>>>>>>>>>>> schedule interval.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A few examples:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run
>>>>> that
>>>>>>>>>>>> processes
>>>>>>>>>>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
>>>>>>> 17:59:59,
>>>>>>>>>>>>>>> i.e. once that hour is over.
>>>>>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run
>> that
>>>>>>>>>>> processes
>>>>>>>>>>>>>>> 2019-08-16 will start running shortly after 2019-08-17
>> 00:00.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The reasoning behind this execution vs scheduling behaviour
>>>>> is
>>>>>>> that
>>>>>>>>>>>>>>> data for the interval to be processed won't be fully
>>>>> available
>>>>>>>>>> until
>>>>>>>>>>>>>>> the interval has elapsed.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In cases where you wish the DAG to be executed at the
>>>>> **start**
>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``, either
>>>>> in
>>>>>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 


Re: Setting to add choice of schedule at end or schedule at start of interval

Posted by James Coder <jc...@gmail.com>.
For my problem, and the one mentioned earlier for those of us in the financial world dealing with holidays, this could be a solid solution. 
For my example below, you could derive DAG and add a max_interval property that is a timedelta; if the delta between dttm and the value coming out of following/previous_schedule is greater than that property, return dttm + max_interval. 
You might actually be able to do it without adding a property and just look at the delta for other runs and derive it from that.
For the holidays, one could probably just check if the return value of a super() following/previous_schedule call is in your holiday list, and then call it again until it's not a holiday. 
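
Very roughly, something like this sketch (class and attribute names are
hypothetical, and I haven't run it against a real scheduler, so treat it as a
sketch of the idea rather than working code):

    # Sketch: derive DAG, cap the scheduling step, and skip holidays.
    from datetime import timedelta
    from airflow.models import DAG

    class BoundedHolidayAwareDAG(DAG):

        def __init__(self, *args, max_interval=timedelta(days=30),
                     holidays=None, **kwargs):
            super().__init__(*args, **kwargs)
            self.max_interval = max_interval     # cap on any single step
            self.holidays = set(holidays or [])  # set of datetime.date to skip

        def following_schedule(self, dttm):
            """Next run date, capped at max_interval and skipping holidays."""
            nxt = super().following_schedule(dttm)
            # Cap unusually long gaps (the "return dttm + max_interval" idea).
            if nxt is not None and nxt - dttm > self.max_interval:
                nxt = dttm + self.max_interval
            # Step forward again while the candidate lands on a holiday.
            while nxt is not None and nxt.date() in self.holidays:
                nxt = super().following_schedule(nxt)
            return nxt

        # previous_schedule would need the mirror-image treatment.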

While these are somewhat orthogonal to whether this PR should be merged, it is a helpful conversation for dealing with funky scheduling logic. 
Thanks for the idea Max!

-James


> On Sep 6, 2019, at 1:40 AM, Maxime Beauchemin <ma...@gmail.com> wrote:
> 
> Just had a thought and looked a tiny bit at the source code to assess
> feasibility, but it seems like you could just derive the DAG class and
> override `previous_schedule` and `following_schedule` methods. The
> signature of both is that you get a `datetime.datetime` and have to return
> another. It's pretty easy to put your arbitrarily complex logic in there.
> 
> There may be a few hiccups to sort out things like
> `airflow.utils.dates.date_range` (where duplicated time-step logic exists)
> to make sure that all time-step logic aligns with these two methods I just
> mentioned, but from that point it could become the official way to
> incorporate funky date-step logic.
> 
> Max
> 
> On Wed, Sep 4, 2019 at 12:54 PM Daniel Standish <dp...@gmail.com>
> wrote:
> 
>> Re:
>> 
>>> For example, if I need to run a DAG every 20 minutes between 8 AM and 4
>>> PM...
>> 
>> 
>> This makes a lot of sense!  Thank you for providing this example.  My
>> initial thought of course is "well can't you just set it to run */20
>> between 7:40am and 3:40pm," but I don't think that is possible in cron.
>> Which is why you have to do hacky shit as you've said and it indeed sounds
>> terrible.  I never had to achieve a schedule like this, and yeah -- it
>> should not be this hard.
>> 
>> Re:
>> 
>>> I can’t see how adding a property to Dagrun that is essentially
>>> identical to next_execution_date would add any benefit.
>> 
>> That's why I was like what the hell is the point of this thing!  I thought
>> it was just purely cosmetic, so that in effect "execution_date" would
>> optionally mean "run_date".
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On Wed, Sep 4, 2019 at 12:10 PM James Coder <jc...@gmail.com> wrote:
>>> 
>>> I can’t see how adding a property to Dagrun that is essentially
>>> identical to next_execution_date would add any benefit. The way I see
>>> it the issue at hand here is not the availability of dates. There are
>>> plenty of options in the template context for dates before and after
>>> execution date. My view is that the problem this is trying to solve
>>> is that waiting until the right edge of an interval has passed to
>>> schedule a dag run has some shortcomings. Mainly that if your
>>> intervals vary in length you are forced to put scheduling logic that
>>> should reside in the scheduler in your DAGs. For example, if I need to
>>> run a DAG every 20 minutes between 8 AM and 4 PM, in its current
>>> form, the scheduler won't schedule that 4PM run until 8 AM the next
>>> day. "Just use next_execution_date" you say, well that's all well and
>>> good between 8AM and 3:40 PM, but when 4:01 PM rolls around and you
>>> don't have the results because they won't be available until after 8
>>> the next day, that doesn't sound so good, does it? In order to work
>>> around this, you have to add additional runs and short circuit
>>> operators over and over. It's a hassle.  Allowing for scheduling dags
>>> at the left edge of an interval and allowing it to behave more like
>>> cron, where it runs at the time specified, not schedule + interval,
>>> would make things much less complicated for users like myself that
>>> can't always wait until the right edge of the interval.
>>> 
>>> 
>>> James Coder
>>> 
>>>> On Sep 3, 2019, at 11:14 PM, Daniel Standish <dp...@gmail.com>
>>> wrote:
>>>> 
>>>> What if we merely add a property "run_date" to DagRun?  At present
>>>> this would be essentially same as "next_execution_date".
>>>> 
>>>> Then no change to scheduler would be required, and no new dag parameter
>>> or
>>>> config.  Perhaps you could add a toggle to the DAGs UI view that lets
>> you
>>>> choose whether to display "last run" by "run_date" or "execution_date".
>>>> 
>>>> If you want your dags to be parameterized by the date when they meant
>> to
>>> be
>>>> run -- as opposed to their implicit interval-of-interest -- then you
>> can
>>>> reference "run_date".
>>>> 
>>>> One potential source of confusion with this is backfilling: what does
>>>> "run_date" mean in the context of a backfill?  You could say it means
>>>> essentially "initial run date", i.e. "do not run before date", i.e.
>> "run
>>>> after date" or "run-at date".  So, for a daily, job the 2019-01-02
>>>> "run_date" corresponds to a 2019-01-01 execution_date.  This makes
>> sense
>>>> right?
>>>> 
>>>> Perhaps in the future, the relationship between "run_date" and
>>>> "execution_date" can be more dynamic.  Perhaps in the future we rename
>>>> "execution_date" for clarity, or to be more generic.  But it makes
>> sense
>>>> that a dag run will always have a run date, so it doesn't seem like a
>>>> terrible idea to add a property representing this.
>>>> 
>>>> Would this meet the goals of the PR?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Aug 28, 2019 at 11:50 AM James Meickle
>>>> <jm...@quantopian.com.invalid> wrote:
>>>> 
>>>>> Totally agree with Daniel here. I think that if we implement this
>>> feature
>>>>> as proposed, it will actively discourage us from implementing a better
>>>>> data-aware feature that would remain invisible to most users while
>>> neatly
>>>>> addressing a lot of edge cases that currently require really ugly
>>> hacks. I
>>>>> believe that having more data awareness features in Airflow (like the
>>> data
>>>>> lineage work, or other metadata integrations) is worth investing in if
>>> we
>>>>> can do it without too much required user-facing complexity. The
>> Airflow
>>>>> project isn't a full data warehouse suite but it's also not just "cron
>>> with
>>>>> a UI", so we should try to be pragmatic and fit in power-user features
>>>>> where we can do so without compromising the project's overall goals.
>>>>> 
>>>>> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dpstandish@gmail.com
>>> 
>>>>> wrote:
>>>>> 
>>>>>> I am just thinking there is the potential for a more comprehensive
>>>>>> enhancement here, and I worry that this is a band-aid that, like all
>>> new
>>>>>> features has the potential to constrain future options.  It does not
>>> help
>>>>>> us to do anything we cannot already do.
>>>>>> 
>>>>>> The source of this problem is that scheduling and
>> interval-of-interest
>>>>> are
>>>>>> mixed together.
>>>>>> 
>>>>>> My thought is there may be a way to separate scheduling and
>>>>>> interval-of-interest to uniformly resolve "execution_date" vs
>>> "run_date"
>>>>>> confusion.  We could make *explicit* instead of *implicit* the
>>>>> relationship
>>>>>> between run_date *(not currently a concept in airflow)* and
>>>>>> "interval-of-interest" *(currently represented by execution_date)*.
>>>>>> 
>>>>>> I also see in this the potential to unlock some other improvements:
>>>>>> * support a greater diversity of incremental processes
>>>>>> * allow more flexible backfilling
>>>>>> * provide better views of data you have vs data you don't.
>>>>>> 
>>>>>> The canonical airflow job is date-partitioned idempotent data pull.
>>> Your
>>>>>> interval of interest is from execution_date to execution_date + 1
>>>>>> interval.  Schedule_interval is not just the scheduling cadence but
>> it
>>> is
>>>>>> also your interval-of-interest partition function.   If that doesn't
>>> work
>>>>>> for your job, you set catchup=False and roll your own.
>>>>>> 
>>>>>> What if there was a way to generalize?  E.g. could we allow for more
>>>>>> flexible partition function that deviated from scheduler cadence?
>> E.g.
>>>>>> what if your interval-of-interest partitions could be governed by
>> "min
>>> 1
>>>>>> day, max 30 days".  Then on on-going basis, your daily loads would
>> be a
>>>>>> range of 1 day but then if server down for couple days, this could be
>>>>>> caught up in one task and if you backfill it could be up to 30-day
>>>>> batches.
>>>>>> 
>>>>>> Perhaps there is an abstraction that could be used by a greater
>>> diversity
>>>>>> of incremental processes.  Such a thing could support a nice "data
>>>>>> contiguity view". I imagine a horizontal bar that is solid where we
>>> have
>>>>>> the data and empty where we don't.  Then you click on a "missing"
>>> section
>>>>>> and you can  trigger a backfill task with that date interval
>> according
>>> to
>>>>>> your partitioning rules.
>>>>>> 
>>>>>> I can imagine using this for an incremental job where each time we
>> pull
>>>>> the
>>>>>> new data since last time; in the `execute` method the operator could
>>> set
>>>>>> `self.high_watermark` with the max datetime processed.  Or maybe a
>>>>> callback
>>>>>> function could be used to gather this value.  This value could be
>> used
>>> in
>>>>>> the next run, and could be depicted in a view.
>>>>>> 
>>>>>> Default intervals of interest could be status quo -- i.e. partitions
>>>>> equal
>>>>>> to schedule interval -- but could be overwritten using templating or
>>>>>> callbacks or setting it during `execute`.
>>>>>> 
>>>>>> So anyway, I don't have a master plan all figured out.  But I think
>>> there
>>>>>> is opportunity in this area for more comprehensive enhancement that
>>> goes
>>>>>> more directly at the root of the problem.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>> 
>>>>>>> How about an alternative approach that would introduce 2 new keyword
>>>>>>> arguments that are clear (something like, but maybe better than
>>>>>>> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
>>>>>>> unchanged, but plan its deprecation. As a first step
>> `execution_date`
>>>>>>> would be inferred from the new args, and warn about deprecation when
>>>>>> used.
>>>>>>> 
>>>>>>> Max
>>>>>>> 
>>>>>>> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bd...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Execution date is execution date for a dag run no matter what.
>> There
>>>>> is
>>>>>>> no
>>>>>>>> end interval or start interval for a dag run. The only time this is
>>>>>>>> relevant is when we calculate the next or previous dagrun.
>>>>>>>> 
>>>>>>>> So I don't think Daniel's rationale makes sense (?)
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>>> On 27 Aug 2019, at 17:40, Philippe Gagnon <ph...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> I agree with Daniel's rationale but I am also worried about
>>>>> backwards
>>>>>>>>> compatibility as this would perhaps be the most disruptive
>> breaking
>>>>>>>> change
>>>>>>>>> possible. I think maybe we should write down the different options
>>>>>>>>> available to us (AIP?) and call for a vote. What does everyone
>>>>> think?
>>>>>>>>> 
>>>>>>>>>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jc...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Can't execution date already mean different things depending
>>>>> on
>>>>>> if
>>>>>>>> the
>>>>>>>>>> dag run was initiated via the scheduler or manually via command
>>>>>>>> line/API?
>>>>>>>>>> I agree that making it consistent might make it easier to explain
>>>>> to
>>>>>>> new
>>>>>>>>>> users, but should we exchange that for breaking pretty much every
>>>>>>>> existing
>>>>>>>>>> dag by re-defining what execution date is?
>>>>>>>>>> -James
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
>>>>>>> dpstandish@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> To Daniel’s concerns, I would argue this is not a change to
>>>>> what a
>>>>>>> dag
>>>>>>>>>>> run
>>>>>>>>>>>> is, it is rather a change to WHEN that dag run will be
>>>>> scheduled.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Execution date is part of the definition of a dag_run; it is
>>>>>> uniquely
>>>>>>>>>>> identified by an execution_date and dag_id.
>>>>>>>>>>> 
>>>>>>>>>>> When someone asks what is a dag_run, we should be able to
>> provide
>>>>>> an
>>>>>>>>>>> answer.
>>>>>>>>>>> 
>>>>>>>>>>> Imagine trying to explain what a dag run is, when execution_date
>>>>>> can
>>>>>>>> mean
>>>>>>>>>>> different things.
>>>>>>>>>>>  Admin: "A dag run is an execution_date and a dag_id".
>>>>>>>>>>>  New user: "Ok. Clear as a bell. What's an execution_date?"
>>>>>>>>>>>  Admin: "Well, it can be one of two things.  It *could* be when
>>>>>> the
>>>>>>>>>> dag
>>>>>>>>>>> will be run... but it could *also* be 'the time when dag should
>>>>> be
>>>>>>> run
>>>>>>>>>>> minus one schedule interval".  It depends on whether you choose
>>>>>> 'end'
>>>>>>>> or
>>>>>>>>>>> 'start' for 'schedule_interval_edge.'  If you choose 'start'
>> then
>>>>>>>>>>> execution_date means 'when dag will be run'.  If you choose
>> 'end'
>>>>>>> then
>>>>>>>>>>> execution_date means 'when dag will be run minus one interval.'
>>>>> If
>>>>>>> you
>>>>>>>>>>> change the parameter after some time, then we don't necessarily
>>>>>> know
>>>>>>>> what
>>>>>>>>>>> it means at all times".
>>>>>>>>>>> 
>>>>>>>>>>> Why would we do this to ourselves?
>>>>>>>>>>> 
>>>>>>>>>>> Alternatively, we can give dag_run a clear, unambiguous meaning:
>>>>>>>>>>> * dag_run is dag_id + execution_date
>>>>>>>>>>> * execution_date is when dag will be run (notwithstanding
>>>>> scheduler
>>>>>>>>>> delay,
>>>>>>>>>>> queuing)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Execution_date is defined as "run-at date minus 1 interval".
>> The
>>>>>>>>>>> assumption in this is that you tasks care about this particular
>>>>>> date.
>>>>>>>>>>> Obviously this makes sense for some tasks but not for others.
>>>>>>>>>>> 
>>>>>>>>>>> I would prop
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <
>> jcoder01@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I think this is a great improvement and should be merged. To
>>>>>>> Daniel’s
>>>>>>>>>>>> concerns, I would argue this is not a change to what a dag run
>>>>> is,
>>>>>>> it
>>>>>>>>>> is
>>>>>>>>>>>> rather a change to WHEN that dag run will be scheduled.
>>>>>>>>>>>> I had implemented a similar change in my own version but
>>>>>> ultimately
>>>>>>>>>>> backed
>>>>>>>>>>>> so I didn’t have to patch after each new release. In my opinion
>>>>>> the
>>>>>>>>>> main
>>>>>>>>>>>> flaw in the current scheduler, and I have brought this up
>>>>> before,
>>>>>> is
>>>>>>>>>> when
>>>>>>>>>>>> you don’t have a consistent schedule interval (e.g. only run
>>>>> M-F).
>>>>>>>>>> After
>>>>>>>>>>>> backing out the “schedule at interval start” I had to switch to
>>>>> a
>>>>>>>> daily
>>>>>>>>>>>> schedule and go through and put a short circuit operator in
>> each
>>>>>> of
>>>>>>> my
>>>>>>>>>>> M-F
>>>>>>>>>>>> dags to get the behavior that I wanted. This results in putting
>>>>>>>>>>> scheduling
>>>>>>>>>>>> logic inside the dag, when scheduling logic should be in the
>>>>>>>> scheduler.
>>>>>>>>>>>> 
>>>>>>>>>>>> -James
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <
>>>>>> dpstandish@gmail.com
>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Re
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What are people's feelings on changing the default execution
>>>>> to
>>>>>>>>>>> schedule
>>>>>>>>>>>>>> interval start
>>>>>>>>>>>>> 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm in favor of doing that, but then exposing new variables
>> of
>>>>>>>>>>>>>> "interval_start" and "interval_end", etc. so that people
>> write
>>>>>>>>>>>>>> clearer-looking at-a-glance DAGs
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While I am def on board with the spirit of this PR, I would
>>>>> vote
>>>>>> we
>>>>>>>>>> do
>>>>>>>>>>>> not
>>>>>>>>>>>>> accept this PR as is, because it cements a confusing option.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *What is the right representation of a dag run?*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Right now the representation is "dag run-at date minus 1
>>>>>> interval".
>>>>>>>>>> It
>>>>>>>>>>>>> should just be "dag run-at date".
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We don't need to address the question of whether execution
>> date
>>>>>> is
>>>>>>>>>> the
>>>>>>>>>>>>> start or the end of an interval; it doesn't matter.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In all cases, a given dag run will be targeted for *some*
>>>>> initial
>>>>>>>>>>> "run-at
>>>>>>>>>>>>> time"; so *that* should be the time that is part of the PK of
>> a
>>>>>> dag
>>>>>>>>>>> run,
>>>>>>>>>>>>> and *that *is the time that should be exposed as the dag run
>>>>>>>>>> "execution
>>>>>>>>>>>>> date"
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Interval of interest is not a dag_run attribute*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We also mix in this question of the date interval that the
>>>>>> *tasks*
>>>>>>>>>> are
>>>>>>>>>>>>> interested in.  But the *dag run* need not concern itself with
>>>>>> this
>>>>>>>>>> in
>>>>>>>>>>>> any
>>>>>>>>>>>>> way.  That is for the tasks to figure out: if they happen to
>>>>> need
>>>>>>>>>> "dag
>>>>>>>>>>>>> run-at date," then they can reference that; if they want the
>>>>>> prior
>>>>>>>>>> one,
>>>>>>>>>>>> ask
>>>>>>>>>>>>> for the prior one.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Previously, I was in the camp that thought it was a great idea
>>>>> to
>>>>>>>>>>> rename
>>>>>>>>>>>>> "execution_date" to "period_start" or "interval_start".  But I
>>>>>> now
>>>>>>>>>>> think
>>>>>>>>>>>>> this is folly.  It invokes this question of the "interval of
>>>>>>>>>> interest"
>>>>>>>>>>> or
>>>>>>>>>>>>> "period of interest".  But the dag doesn't need to know
>>>>> anything
>>>>>>>>>> about
>>>>>>>>>>>>> that.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Within the same dag you may have tasks with different
>> intervals
>>>>>> of
>>>>>>>>>>>>> interest.  So why make assumptions in the dag; just give the
>>>>>> facts:
>>>>>>>>>>> this
>>>>>>>>>>>> is
>>>>>>>>>>>>> my run date; this is the prior run date, etc.  It would be a
>>>>>>>>>> regression
>>>>>>>>>>>>> from the perspective of providing accurate names.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Proposal*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So, I would propose we change "execution_date" to mean "dag
>>>>>> run-at
>>>>>>>>>>> date"
>>>>>>>>>>>> as
>>>>>>>>>>>>> opposed to "dag run-at date minus 1".  But we should do so
>>>>>> without
>>>>>>>>>>>>> reference to interval end or interval start.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Configurability*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The more configuration options we have, the more noise there
>> is
>>>>>> as
>>>>>>> a
>>>>>>>>>>> user
>>>>>>>>>>>>> trying to understand how to use airflow, so I'd rather us not
>>>>>> make
>>>>>>>>>> this
>>>>>>>>>>>>> configurable at all.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That said, perhaps a more clear and more explicit means making
>>>>>> this
>>>>>>>>>>>>> configurable would be to define an integer param
>>>>>>>>>>>>> "dag_run_execution_date_interval_offset", which would control
>>>>> how
>>>>>>>>>> many
>>>>>>>>>>>>> intervals back from actual "dag run-at date" the "execution
>>>>> date"
>>>>>>>>>>> should
>>>>>>>>>>>>> be.  (current behavior = 1, new behavior = 0).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Side note*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hopefully not to derail discussion: I think there are
>>>>> additional,
>>>>>>>>>>> related
>>>>>>>>>>>>> task attributes that may want to come into being: namely,
>>>>>>>>>> low_watermark
>>>>>>>>>>>> and
>>>>>>>>>>>>> high_watermark.  There is the potential, with attributes like
>>>>>> this,
>>>>>>>>>> for
>>>>>>>>>>>>> adding better out-of-the-box support for common data workflows
>>>>>> that
>>>>>>>>>> we
>>>>>>>>>>>> now
>>>>>>>>>>>>> need to use xcom for, namely incremental loads.  But I want to
>>>>>> give
>>>>>>>>>> it
>>>>>>>>>>>> more
>>>>>>>>>>>>> thought before proposing anything specific.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
>>>>>>>>>> Jarek.Potiuk@polidea.com
>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Good one Damian. I will have a list of issues that can be
>>>>>> possible
>>>>>>>>>> to
>>>>>>>>>>>>>> handle at the workshop, so that one goes there.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Principal Software Engineer
>>>>>>>>>>>>>> Phone: +48660796129
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
>>>>>>>>>>>>>> damian.shaw.2@credit-suisse.com> napisał:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I can't understate what a conceptual improvement this would
>>>>> be
>>>>>>> for
>>>>>>>>>>> the
>>>>>>>>>>>>>> end
>>>>>>>>>>>>>>> users of Airflow in our environment. I've written a lot of
>>>>> code
>>>>>>> so
>>>>>>>>>>> all
>>>>>>>>>>>>>> our
>>>>>>>>>>>>>>> configuration works like this anyway. But the UI still shows
>>>>>> the
>>>>>>>>>>>> Airflow
>>>>>>>>>>>>>>> dates which still to this day sometimes confuse me.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some
>>>>> of
>>>>>>> my
>>>>>>>>>>>> first
>>>>>>>>>>>>>>> PRs could be additional test cases around edge cases to do
>>>>> with
>>>>>>> DST
>>>>>>>>>>> and
>>>>>>>>>>>>>>> cron scheduling that I have concerns about :)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
>>>>>>>>>>>>>>> Sent: Friday, August 23, 2019 6:50 AM
>>>>>>>>>>>>>>> To: dev@airflow.apache.org
>>>>>>>>>>>>>>> Subject: Setting to add choice of schedule at end or
>> schedule
>>>>>> at
>>>>>>>>>>> start
>>>>>>>>>>>> of
>>>>>>>>>>>>>>> interval
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This has come up a few times before, someone has now opened
>> a
>>>>>> PR
>>>>>>>>>> that
>>>>>>>>>>>>>>> makes this a global+per-dag setting:
>>>>>>>>>>>>>>> https://github.com/apache/airflow/pull/5787 and it also
>>>>>> includes
>>>>>>>>>>> docs
>>>>>>>>>>>>>>> that I think does a good job of illustrating the two modes.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Does anyone object to this being merged? If no one says
>>>>>> anything
>>>>>>> by
>>>>>>>>>>>>>> midday
>>>>>>>>>>>>>>> on Tuesday I will take that as assent and will merge it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The docs from the PR included below.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ash
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Scheduled Time vs Execution Time
>>>>>>>>>>>>>>> ''''''''''''''''''''''''''''''''
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A DAG with a ``schedule_interval`` will execute once per
>>>>>>> interval.
>>>>>>>>>> By
>>>>>>>>>>>>>>> default, the execution of a DAG will occur at the **end** of
>>>>>> the
>>>>>>>>>>>>>>> schedule interval.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A few examples:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run
>>>>> that
>>>>>>>>>>>> processes
>>>>>>>>>>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
>>>>>>> 17:59:59,
>>>>>>>>>>>>>>> i.e. once that hour is over.
>>>>>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run
>> that
>>>>>>>>>>> processes
>>>>>>>>>>>>>>> 2019-08-16 will start running shortly after 2019-08-17
>> 00:00.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The reasoning behind this execution vs scheduling behaviour
>>>>> is
>>>>>>> that
>>>>>>>>>>>>>>> data for the interval to be processed won't be fully
>>>>> available
>>>>>>>>>> until
>>>>>>>>>>>>>>> the interval has elapsed.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In cases where you wish the DAG to be executed at the
>>>>> **start**
>>>>>>> of
>>>>>>>>>>> the
>>>>>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``, either
>>>>> in
>>>>>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 

Re: Setting to add choice of schedule at end or schedule at start of interval

Posted by Maxime Beauchemin <ma...@gmail.com>.
Just had a thought and looked a tiny bit at the source code to assess
feasibility, but it seems like you could just derive the DAG class and
override the `previous_schedule` and `following_schedule` methods. The
signature of both is that you get a `datetime.datetime` and have to return
another, so it's pretty easy to put your arbitrarily complex logic in there.
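
Something like this rough, untested sketch, for the weekdays-only case James
mentioned earlier in the thread (the `WeekdayDAG` name and the
weekday-skipping logic are just my illustration, not anything that exists in
Airflow today):

from datetime import datetime, timedelta

from airflow.models import DAG


class WeekdayDAG(DAG):
    """Steps one day at a time but skips Saturdays and Sundays."""

    def following_schedule(self, dttm):
        nxt = dttm + timedelta(days=1)
        while nxt.weekday() >= 5:   # 5 = Saturday, 6 = Sunday
            nxt += timedelta(days=1)
        return nxt

    def previous_schedule(self, dttm):
        prev = dttm - timedelta(days=1)
        while prev.weekday() >= 5:
            prev -= timedelta(days=1)
        return prev


# Presumably the DAG still needs some schedule_interval so the scheduler
# treats it as scheduled at all; the overridden methods then drive the steps.
dag = WeekdayDAG(
    dag_id="weekday_only_example",
    start_date=datetime(2019, 9, 2),   # a Monday
    schedule_interval="@daily",
)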

There may be a few hiccups to sort out, like
`airflow.utils.dates.date_range` (where duplicated time-step logic exists),
to make sure that all time-step logic aligns with the two methods I just
mentioned, but from that point it could become the official way to
incorporate funky date-step logic.
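
To illustrate the kind of alignment I mean: a backfill range could, in
principle, be derived from the DAG's own step methods instead of a separate
delta. The helper below is purely hypothetical -- nothing like it exists in
`airflow.utils.dates` today:

def dag_date_range(dag, start_date, end_date):
    """Yield every schedule point of `dag` between start_date and end_date
    (inclusive) by asking the DAG itself, so there is a single source of
    truth for time-step logic."""
    dttm = start_date
    while dttm is not None and dttm <= end_date:
        yield dttm
        dttm = dag.following_schedule(dttm)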

Max

On Wed, Sep 4, 2019 at 12:54 PM Daniel Standish <dp...@gmail.com>
wrote:

> Re:
>
> >  For example, if I need to run a DAG every 20 minutes between 8 AM and 4
> > PM...
>
>
> This makes a lot of sense!  Thank you for providing this example.  My
> initial thought of course is "well can't you just set it to run */20
> between 7:40am and 3:40pm," but I don't think that is possible in cron.
> Which is why you have to do hacky shit as you've said and it indeed sounds
> terrible.  I never had to achieve a schedule like this, and yeah -- it
> should not be this hard.
>
> Re:
>
> > I can’t see how adding a property to Dagrun that is essentially
> > identical to next_execution_date would add any benefit.
>
> That's why i was like what the hell is the point of this thing!   I thought
> it was just purely cosmetic, so that in effect "execution_date" would
> optionally mean "run_date".
>
>
>
>
>
>
>
> On Wed, Sep 4, 2019 at 12:10 PM James Coder <jc...@gmail.com> wrote:
>
> > I can’t see how adding a property to Dagrun that is essentially
> > identical to next_execution_date would add any benefit. The way I see
> > it the issue at hand here is not the availability of dates. There are
> > plenty of options in the template context for dates before and after
> > execution date. My view point is the problem this is trying to solve
> > is that waiting until the right edge of an interval has passed to
> > schedule a dag run has some shortcomings. Mainly that if your
> > intervals vary in length you are forced to put scheduling logic that
> > should reside in the scheduler in your DAGs. For example, if I need to
> > run a DAG every 20 minutes between 8 AM and 4 PM, in it's current
> > form, the scheduler won't schedule that 4PM run until 8 AM the next
> > day. "Just use next_execution_date" you say, well that's all well and
> > good between 8AM and 3:40 PM, but when 4:01 PM rolls around and you
> > don't have the results because they won't be available until after 8
> > the next day, that doesn't sound so good, does it? In order to work
> > around this, you have to add additional runs and short circuit
> > operators over and over. It's a hassle.  Allowing for scheduling dags
> > at the left edge of an interval and allowing it to behave more like
> > cron, where it runs at the time specified, not schedule + interval,
> > would make things much less complicated for users like myself that
> > can't always wait until the right edge of the interval.
> >
> >
> > James Coder
> >
> > > On Sep 3, 2019, at 11:14 PM, Daniel Standish <dp...@gmail.com>
> > wrote:
> > >
> > > What if we merely add a property "run_date" to DagRun?  At present
> > > this would be essentially same as "next_execution_date".
> > >
> > > Then no change to scheduler would be required, and no new dag parameter
> > or
> > > config.  Perhaps you could add a toggle to the DAGs UI view that lets
> you
> > > choose whether to display "last run" by "run_date" or "execution_date".
> > >
> > > If you want your dags to be parameterized by the date when they meant
> to
> > be
> > > run -- as opposed to their implicit interval-of-interest -- then you
> can
> > > reference "run_date".
> > >
> > > One potential source of confusion with this is backfilling: what does
> > > "run_date" mean in the context of a backfill?  You could say it means
> > > essentially "initial run date", i.e. "do not run before date", i.e.
> "run
> > > after date" or "run-at date".  So, for a daily, job the 2019-01-02
> > > "run_date" corresponds to a 2019-01-01 execution_date.  This makes
> sense
> > > right?
> > >
> > > Perhaps in the future, the relationship between "run_date" and
> > > "execution_date" can be more dynamic.  Perhaps in the future we rename
> > > "execution_date" for clarity, or to be more generic.  But it makes
> sense
> > > that a dag run will always have a run date, so it doesn't seem like a
> > > terrible idea to add a property representing this.
> > >
> > > Would this meet the goals of the PR?
> > >
> > >
> > >
> > >
> > > On Wed, Aug 28, 2019 at 11:50 AM James Meickle
> > > <jm...@quantopian.com.invalid> wrote:
> > >
> > >> Totally agree with Daniel here. I think that if we implement this
> > feature
> > >> as proposed, it will actively discourage us from implementing a better
> > >> data-aware feature that would remain invisible to most users while
> > neatly
> > >> addressing a lot of edge cases that currently require really ugly
> > hacks. I
> > >> believe that having more data awareness features in Airflow (like the
> > data
> > >> lineage work, or other metadata integrations) is worth investing in if
> > we
> > >> can do it without too much required user-facing complexity. The
> Airflow
> > >> project isn't a full data warehouse suite but it's also not just "cron
> > with
> > >> a UI", so we should try to be pragmatic and fit in power-user features
> > >> where we can do so without compromising the project's overall goals.
> > >>
> > >> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dpstandish@gmail.com
> >
> > >> wrote:
> > >>
> > >>> I am just thinking there is the potential for a more comprehensive
> > >>> enhancement here, and I worry that this is a band-aid that, like all
> > new
> > >>> features has the potential to constrain future options.  It does not
> > help
> > >>> us to do anything we cannot already do.
> > >>>
> > >>> The source of this problem is that scheduling and
> interval-of-interest
> > >> are
> > >>> mixed together.
> > >>>
> > >>> My thought is there may be a way to separate scheduling and
> > >>> interval-of-interest to uniformly resolve "execution_date" vs
> > "run_date"
> > >>> confusion.  We could make *explicit* instead of *implicit* the
> > >> relationship
> > >>> between run_date *(not currently a concept in airflow)* and
> > >>> "interval-of-interest" *(currently represented by execution_date)*.
> > >>>
> > >>> I also see in this the potential to unlock some other improvements:
> > >>> * support a greater diversity of incremental processes
> > >>> * allow more flexible backfilling
> > >>> * provide better views of data you have vs data you don't.
> > >>>
> > >>> The canonical airflow job is date-partitioned idempotent data pull.
> > Your
> > >>> interval of interest is from execution_date to execution_date + 1
> > >>> interval.  Schedule_interval is not just the scheduling cadence but
> it
> > is
> > >>> also your interval-of-interest partition function.   If that doesn't
> > work
> > >>> for your job, you set catchup=False and roll your own.
> > >>>
> > >>> What if there was a way to generalize?  E.g. could we allow for more
> > >>> flexible partition function that deviated from scheduler cadence?
> E.g.
> > >>> what if your interval-of-interest partitions could be governed by
> "min
> > 1
> > >>> day, max 30 days".  Then on on-going basis, your daily loads would
> be a
> > >>> range of 1 day but then if server down for couple days, this could be
> > >>> caught up in one task and if you backfill it could be up to 30-day
> > >> batches.
> > >>>
> > >>> Perhaps there is an abstraction that could be used by a greater
> > diversity
> > >>> of incremental processes.  Such a thing could support a nice "data
> > >>> contiguity view". I imagine a horizontal bar that is solid where we
> > have
> > >>> the data and empty where we don't.  Then you click on a "missing"
> > section
> > >>> and you can  trigger a backfill task with that date interval
> according
> > to
> > >>> your partitioning rules.
> > >>>
> > >>> I can imagine using this for an incremental job where each time we
> pull
> > >> the
> > >>> new data since last time; in the `execute` method the operator could
> > set
> > >>> `self.high_watermark` with the max datetime processed.  Or maybe a
> > >> callback
> > >>> function could be used to gather this value.  This value could be
> used
> > in
> > >>> next run, and cold be depicted in a view.
> > >>>
> > >>> Default intervals of interest could be status quo -- i.e. partitions
> > >> equal
> > >>> to schedule interval -- but could be overwritten using templating or
> > >>> callbacks or setting it during `execute`.
> > >>>
> > >>> So anyway, I don't have a master plan all figured out.  But I think
> > there
> > >>> is opportunity in this area for more comprehensive enhancement that
> > goes
> > >>> more directly at the root of the problem.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
> > >>> maximebeauchemin@gmail.com> wrote:
> > >>>
> > >>>> How about an alternative approach that would introduce 2 new keyword
> > >>>> arguments that are clear (something like, but maybe better than
> > >>>> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> > >>>> unchanged, but plan it's deprecation. As a first step
> `execution_date`
> > >>>> would be inferred from the new args, and warn about deprecation when
> > >>> used.
> > >>>>
> > >>>> Max
> > >>>>
> > >>>> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bd...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> Execution date is execution date for a dag run no matter what.
> There
> > >> is
> > >>>> no
> > >>>>> end interval or start interval for a dag run. The only time this is
> > >>>>> relevant is when we calculate the next or previous dagrun.
> > >>>>>
> > >>>>> So I don't Daniels rationale makes sense (?)
> > >>>>>
> > >>>>> Sent from my iPhone
> > >>>>>
> > >>>>>> On 27 Aug 2019, at 17:40, Philippe Gagnon <ph...@gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>> I agree with Daniel's rationale but I am also worried about
> > >> backwards
> > >>>>>> compatibility as this would perhaps be the most disruptive
> breaking
> > >>>>> change
> > >>>>>> possible. I think maybe we should write down the different options
> > >>>>>> available to us (AIP?) and call for a vote. What does everyone
> > >> think?
> > >>>>>>
> > >>>>>>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jc...@gmail.com>
> > >>>> wrote:
> > >>>>>>>
> > >>>>>>> Can't execution date can already mean different things depending
> > >> on
> > >>> if
> > >>>>> the
> > >>>>>>> dag run was initiated via the scheduler or manually via command
> > >>>>> line/API?
> > >>>>>>> I agree that making it consistent might make it easier to explain
> > >> to
> > >>>> new
> > >>>>>>> users, but should we exchange that for breaking pretty much every
> > >>>>> existing
> > >>>>>>> dag by re-defining what execution date is?
> > >>>>>>> -James
> > >>>>>>>
> > >>>>>>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
> > >>>> dpstandish@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> To Daniel’s concerns, I would argue this is not a change to
> > >> what a
> > >>>> dag
> > >>>>>>>> run
> > >>>>>>>>> is, it is rather a change to WHEN that dag run will be
> > >> scheduled.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Execution date is part of the definition of a dag_run; it is
> > >>> uniquely
> > >>>>>>>> identified by an execution_date and dag_id.
> > >>>>>>>>
> > >>>>>>>> When someone asks what is a dag_run, we should be able to
> provide
> > >>> an
> > >>>>>>>> answer.
> > >>>>>>>>
> > >>>>>>>> Imagine trying to explain what a dag run is, when execution_date
> > >>> can
> > >>>>> mean
> > >>>>>>>> different things.
> > >>>>>>>>   Admin: "A dag run is an execution_date and a dag_id".
> > >>>>>>>>   New user: "Ok. Clear as a bell. What's an execution_date?"
> > >>>>>>>>   Admin: "Well, it can be one of two things.  It *could* be when
> > >>> the
> > >>>>>>> dag
> > >>>>>>>> will be run... but it could *also* be 'the time when dag should
> > >> be
> > >>>> run
> > >>>>>>>> minus one schedule interval".  It depends on whether you choose
> > >>> 'end'
> > >>>>> or
> > >>>>>>>> 'start' for 'schedule_interval_edge.'  If you choose 'start'
> then
> > >>>>>>>> execution_date means 'when dag will be run'.  If you choose
> 'end'
> > >>>> then
> > >>>>>>>> execution_date means 'when dag will be run minus one interval.'
> > >> If
> > >>>> you
> > >>>>>>>> change the parameter after some time, then we don't necessarily
> > >>> know
> > >>>>> what
> > >>>>>>>> it means at all times".
> > >>>>>>>>
> > >>>>>>>> Why would we do this to ourselves?
> > >>>>>>>>
> > >>>>>>>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> > >>>>>>>> * dag_run is dag_id + execution_date
> > >>>>>>>> * execution_date is when dag will be run (notwithstanding
> > >> scheduler
> > >>>>>>> delay,
> > >>>>>>>> queuing)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Execution_date is defined as "run-at date minus 1 interval".
> The
> > >>>>>>>> assumption in this is that you tasks care about this particular
> > >>> date.
> > >>>>>>>> Obviously this makes sense for some tasks but not for others.
> > >>>>>>>>
> > >>>>>>>> I would prop
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <
> jcoder01@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> I think this is a great improvement and should be merged. To
> > >>>> Daniel’s
> > >>>>>>>>> concerns, I would argue this is not a change to what a dag run
> > >> is,
> > >>>> it
> > >>>>>>> is
> > >>>>>>>>> rather a change to WHEN that dag run will be scheduled.
> > >>>>>>>>> I had implemented a similar change in my own version but
> > >>> ultimately
> > >>>>>>>> backed
> > >>>>>>>>> so I didn’t have to patch after each new release. In my opinion
> > >>> the
> > >>>>>>> main
> > >>>>>>>>> flaw in the current scheduler, and I have brought this up
> > >> before,
> > >>> is
> > >>>>>>> when
> > >>>>>>>>> you don’t have a consistent schedule interval (e.g. only run
> > >> M-F).
> > >>>>>>> After
> > >>>>>>>>> backing out the “schedule at interval start” I had to switch to
> > >> a
> > >>>>> daily
> > >>>>>>>>> schedule and go through and put a short circuit operator in
> each
> > >>> of
> > >>>> my
> > >>>>>>>> M-F
> > >>>>>>>>> dags to get the behavior that I wanted. This results in putting
> > >>>>>>>> scheduling
> > >>>>>>>>> logic inside the dag, when scheduling logic should be in the
> > >>>>> scheduler.
> > >>>>>>>>>
> > >>>>>>>>> -James
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <
> > >>> dpstandish@gmail.com
> > >>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Re
> > >>>>>>>>>>
> > >>>>>>>>>>> What are people's feelings on changing the default execution
> > >> to
> > >>>>>>>> schedule
> > >>>>>>>>>>> interval start
> > >>>>>>>>>>
> > >>>>>>>>>> and
> > >>>>>>>>>>
> > >>>>>>>>>>> I'm in favor of doing that, but then exposing new variables
> of
> > >>>>>>>>>>> "interval_start" and "interval_end", etc. so that people
> write
> > >>>>>>>>>>> clearer-looking at-a-glance DAGs
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> While I am def on board with the spirit of this PR, I would
> > >> vote
> > >>> we
> > >>>>>>> do
> > >>>>>>>>> not
> > >>>>>>>>>> accept this PR as is, because it cements a confusing option.
> > >>>>>>>>>>
> > >>>>>>>>>> *What is the right representation of a dag run?*
> > >>>>>>>>>>
> > >>>>>>>>>> Right now the representation is "dag run-at date minus 1
> > >>> interval".
> > >>>>>>> It
> > >>>>>>>>>> should just be "dag run-at date".
> > >>>>>>>>>>
> > >>>>>>>>>> We don't need to address the question of whether execution
> date
> > >>> is
> > >>>>>>> the
> > >>>>>>>>>> start or the end of an interval; it doesn't matter.
> > >>>>>>>>>>
> > >>>>>>>>>> In all cases, a given dag run will be targeted for *some*
> > >> initial
> > >>>>>>>> "run-at
> > >>>>>>>>>> time"; so *that* should be the time that is part of the PK of
> a
> > >>> dag
> > >>>>>>>> run,
> > >>>>>>>>>> and *that *is the time that should be exposed as the dag run
> > >>>>>>> "execution
> > >>>>>>>>>> date"
> > >>>>>>>>>>
> > >>>>>>>>>> *Interval of interest is not a dag_run attribute*
> > >>>>>>>>>>
> > >>>>>>>>>> We also mix in this question of the date interval that the
> > >>> *tasks*
> > >>>>>>> are
> > >>>>>>>>>> interested in.  But the *dag run* need not concern itself with
> > >>> this
> > >>>>>>> in
> > >>>>>>>>> any
> > >>>>>>>>>> way.  That is for the tasks to figure out: if they happen to
> > >> need
> > >>>>>>> "dag
> > >>>>>>>>>> run-at date," then they can reference that; if they want the
> > >>> prior
> > >>>>>>> one,
> > >>>>>>>>> ask
> > >>>>>>>>>> for the prior one.
> > >>>>>>>>>>
> > >>>>>>>>>> Previously, I was in the camp that thought it was a great idea
> > >> to
> > >>>>>>>> rename
> > >>>>>>>>>> "execution_date" to "period_start" or "interval_start".  But I
> > >>> now
> > >>>>>>>> think
> > >>>>>>>>>> this is folly.  It invokes this question of the "interval of
> > >>>>>>> interest"
> > >>>>>>>> or
> > >>>>>>>>>> "period of interest".  But the dag doesn't need to know
> > >> anything
> > >>>>>>> about
> > >>>>>>>>>> that.
> > >>>>>>>>>>
> > >>>>>>>>>> Within the same dag you may have tasks with different
> intervals
> > >>> of
> > >>>>>>>>>> interest.  So why make assumptions in the dag; just give the
> > >>> facts:
> > >>>>>>>> this
> > >>>>>>>>> is
> > >>>>>>>>>> my run date; this is the prior run date, etc.  It would be a
> > >>>>>>> regression
> > >>>>>>>>>> from the perspective of providing accurate names.
> > >>>>>>>>>>
> > >>>>>>>>>> *Proposal*
> > >>>>>>>>>>
> > >>>>>>>>>> So, I would propose we change "execution_date" to mean "dag
> > >>> run-at
> > >>>>>>>> date"
> > >>>>>>>>> as
> > >>>>>>>>>> opposed to "dag run-at date minus 1".  But we should do so
> > >>> without
> > >>>>>>>>>> reference to interval end or interval start.
> > >>>>>>>>>>
> > >>>>>>>>>> *Configurability*
> > >>>>>>>>>>
> > >>>>>>>>>> The more configuration options we have, the more noise there
> is
> > >>> as
> > >>>> a
> > >>>>>>>> user
> > >>>>>>>>>> trying to understand how to use airflow, so I'd rather us not
> > >>> make
> > >>>>>>> this
> > >>>>>>>>>> configurable at all.
> > >>>>>>>>>>
> > >>>>>>>>>> That said, perhaps a more clear and more explicit means making
> > >>> this
> > >>>>>>>>>> configurable would be to define an integer param
> > >>>>>>>>>> "dag_run_execution_date_interval_offset", which would control
> > >> how
> > >>>>>>> many
> > >>>>>>>>>> intervals back from actual "dag run-at date" the "execution
> > >> date"
> > >>>>>>>> should
> > >>>>>>>>>> be.  (current behavior = 1, new behavior = 0).
> > >>>>>>>>>>
> > >>>>>>>>>> *Side note*
> > >>>>>>>>>>
> > >>>>>>>>>> Hopefully not to derail discussion: I think there are
> > >> additional,
> > >>>>>>>> related
> > >>>>>>>>>> task attributes that may want to come into being: namely,
> > >>>>>>> low_watermark
> > >>>>>>>>> and
> > >>>>>>>>>> high_watermark.  There is the potential, with attributes like
> > >>> this,
> > >>>>>>> for
> > >>>>>>>>>> adding better out-of-the-box support for common data workflows
> > >>> that
> > >>>>>>> we
> > >>>>>>>>> now
> > >>>>>>>>>> need to use xcom for, namely incremental loads.  But I want to
> > >>> give
> > >>>>>>> it
> > >>>>>>>>> more
> > >>>>>>>>>> thought before proposing anything specific.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
> > >>>>>>> Jarek.Potiuk@polidea.com
> > >>>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Good one Damian. I will have a list of issues that can be
> > >>> possible
> > >>>>>>> to
> > >>>>>>>>>>> handle at the workshop, so that one goes there.
> > >>>>>>>>>>>
> > >>>>>>>>>>> J.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Principal Software Engineer
> > >>>>>>>>>>> Phone: +48660796129
> > >>>>>>>>>>>
> > >>>>>>>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
> > >>>>>>>>>>> damian.shaw.2@credit-suisse.com> napisał:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I can't understate what a conceptual improvement this would
> > >> be
> > >>>> for
> > >>>>>>>> the
> > >>>>>>>>>>> end
> > >>>>>>>>>>>> users of Airflow in our environment. I've written a lot of
> > >> code
> > >>>> so
> > >>>>>>>> all
> > >>>>>>>>>>> our
> > >>>>>>>>>>>> configuration works like this anyway. But the UI still shows
> > >>> the
> > >>>>>>>>> Airflow
> > >>>>>>>>>>>> dates which still to this day sometimes confuse me.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some
> > >> of
> > >>>> my
> > >>>>>>>>> first
> > >>>>>>>>>>>> PRs could be additional test cases around edge cases to do
> > >> with
> > >>>> DST
> > >>>>>>>> and
> > >>>>>>>>>>>> cron scheduling that I have concerns about :)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Damian
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
> > >>>>>>>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> > >>>>>>>>>>>> To: dev@airflow.apache.org
> > >>>>>>>>>>>> Subject: Setting to add choice of schedule at end or
> schedule
> > >>> at
> > >>>>>>>> start
> > >>>>>>>>> of
> > >>>>>>>>>>>> interval
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This has come up a few times before, someone has now opened
> a
> > >>> PR
> > >>>>>>> that
> > >>>>>>>>>>>> makes this a global+per-dag setting:
> > >>>>>>>>>>>> https://github.com/apache/airflow/pull/5787 and it also
> > >>> includes
> > >>>>>>>> docs
> > >>>>>>>>>>>> that I think does a good job of illustrating the two modes.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Does anyone object to this being merged? If no one says
> > >>> anything
> > >>>> by
> > >>>>>>>>>>> midday
> > >>>>>>>>>>>> on Tuesday I will take that as assent and will merge it.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The docs from the PR included below.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>> Ash
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Scheduled Time vs Execution Time
> > >>>>>>>>>>>> ''''''''''''''''''''''''''''''''
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A DAG with a ``schedule_interval`` will execute once per
> > >>>> interval.
> > >>>>>>> By
> > >>>>>>>>>>>> default, the execution of a DAG will occur at the **end** of
> > >>> the
> > >>>>>>>>>>>> schedule interval.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A few examples:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run
> > >> that
> > >>>>>>>>> processes
> > >>>>>>>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
> > >>>> 17:59:59,
> > >>>>>>>>>>>> i.e. once that hour is over.
> > >>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run
> that
> > >>>>>>>> processes
> > >>>>>>>>>>>> 2019-08-16 will start running shortly after 2019-08-17
> 00:00.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The reasoning behind this execution vs scheduling behaviour
> > >> is
> > >>>> that
> > >>>>>>>>>>>> data for the interval to be processed won't be fully
> > >> available
> > >>>>>>> until
> > >>>>>>>>>>>> the interval has elapsed.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In cases where you wish the DAG to be executed at the
> > >> **start**
> > >>>> of
> > >>>>>>>> the
> > >>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``, either
> > >> in
> > >>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>

Re: Setting to add choice of schedule at end or schedule at start of interval

Posted by Daniel Standish <dp...@gmail.com>.
Re:

>  For example, if I need to run a DAG every 20 minutes between 8 AM and 4
> PM...


This makes a lot of sense!  Thank you for providing this example.  My
initial thought, of course, was "well, can't you just set it to run */20
between 7:40am and 3:40pm?", but I don't think that is possible in cron.
That is why you have to do the hacky shit you described, and it does indeed
sound terrible.  I have never had to achieve a schedule like this, and
yeah -- it should not be this hard.
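
Spelling it out for myself: cron's minute and hour fields combine as a cross
product, so a single expression can't produce the lone 7:40 slot plus the
8:00am-3:40pm every-20-minutes slots.  And with end-of-interval scheduling,
the last slot of the day waits overnight, which I believe is the "4PM run at
8 AM" problem you described.  A quick sketch with croniter (which Airflow
already uses for cron parsing):

from datetime import datetime

from croniter import croniter

cron = "*/20 8-15 * * *"             # 08:00, 08:20, ..., 15:40 each day
last_slot = datetime(2019, 9, 4, 15, 40)

# Under end-of-interval scheduling, the 15:40 interval only runs once the
# *next* schedule point arrives -- which is 08:00 the following morning.
print(croniter(cron, last_slot).get_next(datetime))
# -> 2019-09-05 08:00:00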

Re:

> I can’t see how adding a property to Dagrun that is essentially
> identical to next_execution_date would add any benefit.

That's why I was like, what the hell is the point of this thing!  I thought
it was purely cosmetic, so that in effect "execution_date" would
optionally mean "run_date".







On Wed, Sep 4, 2019 at 12:10 PM James Coder <jc...@gmail.com> wrote:

> I can’t see how adding a property to Dagrun that is essentially
> identical to next_execution_date would add any benefit. The way I see
> it the issue at hand here is not the availability of dates. There are
> plenty of options in the template context for dates before and after
> execution date. My view point is the problem this is trying to solve
> is that waiting until the right edge of an interval has passed to
> schedule a dag run has some shortcomings. Mainly that if your
> intervals vary in length you are forced to put scheduling logic that
> should reside in the scheduler in your DAGs. For example, if I need to
> run a DAG every 20 minutes between 8 AM and 4 PM, in it's current
> form, the scheduler won't schedule that 4PM run until 8 AM the next
> day. "Just use next_execution_date" you say, well that's all well and
> good between 8AM and 3:40 PM, but when 4:01 PM rolls around and you
> don't have the results because they won't be available until after 8
> the next day, that doesn't sound so good, does it? In order to work
> around this, you have to add additional runs and short circuit
> operators over and over. It's a hassle.  Allowing for scheduling dags
> at the left edge of an interval and allowing it to behave more like
> cron, where it runs at the time specified, not schedule + interval,
> would make things much less complicated for users like myself that
> can't always wait until the right edge of the interval.
>
>
> James Coder
>
> > On Sep 3, 2019, at 11:14 PM, Daniel Standish <dp...@gmail.com>
> wrote:
> >
> > What if we merely add a property "run_date" to DagRun?  At present
> > this would be essentially same as "next_execution_date".
> >
> > Then no change to scheduler would be required, and no new dag parameter
> or
> > config.  Perhaps you could add a toggle to the DAGs UI view that lets you
> > choose whether to display "last run" by "run_date" or "execution_date".
> >
> > If you want your dags to be parameterized by the date when they meant to
> be
> > run -- as opposed to their implicit interval-of-interest -- then you can
> > reference "run_date".
> >
> > One potential source of confusion with this is backfilling: what does
> > "run_date" mean in the context of a backfill?  You could say it means
> > essentially "initial run date", i.e. "do not run before date", i.e. "run
> > after date" or "run-at date".  So, for a daily, job the 2019-01-02
> > "run_date" corresponds to a 2019-01-01 execution_date.  This makes sense
> > right?
> >
> > Perhaps in the future, the relationship between "run_date" and
> > "execution_date" can be more dynamic.  Perhaps in the future we rename
> > "execution_date" for clarity, or to be more generic.  But it makes sense
> > that a dag run will always have a run date, so it doesn't seem like a
> > terrible idea to add a property representing this.
> >
> > Would this meet the goals of the PR?
> >
> >
> >
> >
> > On Wed, Aug 28, 2019 at 11:50 AM James Meickle
> > <jm...@quantopian.com.invalid> wrote:
> >
> >> Totally agree with Daniel here. I think that if we implement this
> feature
> >> as proposed, it will actively discourage us from implementing a better
> >> data-aware feature that would remain invisible to most users while
> neatly
> >> addressing a lot of edge cases that currently require really ugly
> hacks. I
> >> believe that having more data awareness features in Airflow (like the
> data
> >> lineage work, or other metadata integrations) is worth investing in if
> we
> >> can do it without too much required user-facing complexity. The Airflow
> >> project isn't a full data warehouse suite but it's also not just "cron
> with
> >> a UI", so we should try to be pragmatic and fit in power-user features
> >> where we can do so without compromising the project's overall goals.
> >>
> >> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dp...@gmail.com>
> >> wrote:
> >>
> >>> I am just thinking there is the potential for a more comprehensive
> >>> enhancement here, and I worry that this is a band-aid that, like all
> new
> >>> features has the potential to constrain future options.  It does not
> help
> >>> us to do anything we cannot already do.
> >>>
> >>> The source of this problem is that scheduling and interval-of-interest
> >> are
> >>> mixed together.
> >>>
> >>> My thought is there may be a way to separate scheduling and
> >>> interval-of-interest to uniformly resolve "execution_date" vs
> "run_date"
> >>> confusion.  We could make *explicit* instead of *implicit* the
> >> relationship
> >>> between run_date *(not currently a concept in airflow)* and
> >>> "interval-of-interest" *(currently represented by execution_date)*.
> >>>
> >>> I also see in this the potential to unlock some other improvements:
> >>> * support a greater diversity of incremental processes
> >>> * allow more flexible backfilling
> >>> * provide better views of data you have vs data you don't.
> >>>
> >>> The canonical airflow job is date-partitioned idempotent data pull.
> Your
> >>> interval of interest is from execution_date to execution_date + 1
> >>> interval.  Schedule_interval is not just the scheduling cadence but it
> is
> >>> also your interval-of-interest partition function.   If that doesn't
> work
> >>> for your job, you set catchup=False and roll your own.
> >>>
> >>> What if there was a way to generalize?  E.g. could we allow for more
> >>> flexible partition function that deviated from scheduler cadence?  E.g.
> >>> what if your interval-of-interest partitions could be governed by "min
> 1
> >>> day, max 30 days".  Then on on-going basis, your daily loads would be a
> >>> range of 1 day but then if server down for couple days, this could be
> >>> caught up in one task and if you backfill it could be up to 30-day
> >> batches.
> >>>
> >>> Perhaps there is an abstraction that could be used by a greater
> diversity
> >>> of incremental processes.  Such a thing could support a nice "data
> >>> contiguity view". I imagine a horizontal bar that is solid where we
> have
> >>> the data and empty where we don't.  Then you click on a "missing"
> section
> >>> and you can  trigger a backfill task with that date interval according
> to
> >>> your partitioning rules.
> >>>
> >>> I can imagine using this for an incremental job where each time we pull
> >> the
> >>> new data since last time; in the `execute` method the operator could
> set
> >>> `self.high_watermark` with the max datetime processed.  Or maybe a
> >> callback
> >>> function could be used to gather this value.  This value could be used
> in
> >>> next run, and cold be depicted in a view.
> >>>
> >>> Default intervals of interest could be status quo -- i.e. partitions
> >> equal
> >>> to schedule interval -- but could be overwritten using templating or
> >>> callbacks or setting it during `execute`.
> >>>
> >>> So anyway, I don't have a master plan all figured out.  But I think
> there
> >>> is opportunity in this area for more comprehensive enhancement that
> goes
> >>> more directly at the root of the problem.
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
> >>> maximebeauchemin@gmail.com> wrote:
> >>>
> >>>> How about an alternative approach that would introduce 2 new keyword
> >>>> arguments that are clear (something like, but maybe better than
> >>>> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> >>>> unchanged, but plan it's deprecation. As a first step `execution_date`
> >>>> would be inferred from the new args, and warn about deprecation when
> >>> used.
> >>>>
> >>>> Max
> >>>>
> >>>> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bd...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> Execution date is execution date for a dag run no matter what. There
> >> is
> >>>> no
> >>>>> end interval or start interval for a dag run. The only time this is
> >>>>> relevant is when we calculate the next or previous dagrun.
> >>>>>
> >>>>> So I don't Daniels rationale makes sense (?)
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 27 Aug 2019, at 17:40, Philippe Gagnon <ph...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> I agree with Daniel's rationale but I am also worried about
> >> backwards
> >>>>>> compatibility as this would perhaps be the most disruptive breaking
> >>>>> change
> >>>>>> possible. I think maybe we should write down the different options
> >>>>>> available to us (AIP?) and call for a vote. What does everyone
> >> think?
> >>>>>>
> >>>>>>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jc...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> Can't execution date can already mean different things depending
> >> on
> >>> if
> >>>>> the
> >>>>>>> dag run was initiated via the scheduler or manually via command
> >>>>> line/API?
> >>>>>>> I agree that making it consistent might make it easier to explain
> >> to
> >>>> new
> >>>>>>> users, but should we exchange that for breaking pretty much every
> >>>>> existing
> >>>>>>> dag by re-defining what execution date is?
> >>>>>>> -James
> >>>>>>>
> >>>>>>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
> >>>> dpstandish@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> To Daniel’s concerns, I would argue this is not a change to
> >> what a
> >>>> dag
> >>>>>>>> run
> >>>>>>>>> is, it is rather a change to WHEN that dag run will be
> >> scheduled.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Execution date is part of the definition of a dag_run; it is
> >>> uniquely
> >>>>>>>> identified by an execution_date and dag_id.
> >>>>>>>>
> >>>>>>>> When someone asks what is a dag_run, we should be able to provide
> >>> an
> >>>>>>>> answer.
> >>>>>>>>
> >>>>>>>> Imagine trying to explain what a dag run is, when execution_date
> >>> can
> >>>>> mean
> >>>>>>>> different things.
> >>>>>>>>   Admin: "A dag run is an execution_date and a dag_id".
> >>>>>>>>   New user: "Ok. Clear as a bell. What's an execution_date?"
> >>>>>>>>   Admin: "Well, it can be one of two things.  It *could* be when
> >>> the
> >>>>>>> dag
> >>>>>>>> will be run... but it could *also* be 'the time when dag should
> >> be
> >>>> run
> >>>>>>>> minus one schedule interval".  It depends on whether you choose
> >>> 'end'
> >>>>> or
> >>>>>>>> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> >>>>>>>> execution_date means 'when dag will be run'.  If you choose 'end'
> >>>> then
> >>>>>>>> execution_date means 'when dag will be run minus one interval.'
> >> If
> >>>> you
> >>>>>>>> change the parameter after some time, then we don't necessarily
> >>> know
> >>>>> what
> >>>>>>>> it means at all times".
> >>>>>>>>
> >>>>>>>> Why would we do this to ourselves?
> >>>>>>>>
> >>>>>>>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> >>>>>>>> * dag_run is dag_id + execution_date
> >>>>>>>> * execution_date is when dag will be run (notwithstanding
> >> scheduler
> >>>>>>> delay,
> >>>>>>>> queuing)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Execution_date is defined as "run-at date minus 1 interval".  The
> >>>>>>>> assumption in this is that you tasks care about this particular
> >>> date.
> >>>>>>>> Obviously this makes sense for some tasks but not for others.
> >>>>>>>>
> >>>>>>>> I would prop
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcoder01@gmail.com
> >>>
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> I think this is a great improvement and should be merged. To
> >>>> Daniel’s
> >>>>>>>>> concerns, I would argue this is not a change to what a dag run
> >> is,
> >>>> it
> >>>>>>> is
> >>>>>>>>> rather a change to WHEN that dag run will be scheduled.
> >>>>>>>>> I had implemented a similar change in my own version but
> >>> ultimately
> >>>>>>>> backed
> >>>>>>>>> so I didn’t have to patch after each new release. In my opinion
> >>> the
> >>>>>>> main
> >>>>>>>>> flaw in the current scheduler, and I have brought this up
> >> before,
> >>> is
> >>>>>>> when
> >>>>>>>>> you don’t have a consistent schedule interval (e.g. only run
> >> M-F).
> >>>>>>> After
> >>>>>>>>> backing out the “schedule at interval start” I had to switch to
> >> a
> >>>>> daily
> >>>>>>>>> schedule and go through and put a short circuit operator in each
> >>> of
> >>>> my
> >>>>>>>> M-F
> >>>>>>>>> dags to get the behavior that I wanted. This results in putting
> >>>>>>>> scheduling
> >>>>>>>>> logic inside the dag, when scheduling logic should be in the
> >>>>> scheduler.
> >>>>>>>>>
> >>>>>>>>> -James
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <
> >>> dpstandish@gmail.com
> >>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Re
> >>>>>>>>>>
> >>>>>>>>>>> What are people's feelings on changing the default execution
> >> to
> >>>>>>>> schedule
> >>>>>>>>>>> interval start
> >>>>>>>>>>
> >>>>>>>>>> and
> >>>>>>>>>>
> >>>>>>>>>>> I'm in favor of doing that, but then exposing new variables of
> >>>>>>>>>>> "interval_start" and "interval_end", etc. so that people write
> >>>>>>>>>>> clearer-looking at-a-glance DAGs
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> While I am def on board with the spirit of this PR, I would
> >> vote
> >>> we
> >>>>>>> do
> >>>>>>>>> not
> >>>>>>>>>> accept this PR as is, because it cements a confusing option.
> >>>>>>>>>>
> >>>>>>>>>> *What is the right representation of a dag run?*
> >>>>>>>>>>
> >>>>>>>>>> Right now the representation is "dag run-at date minus 1
> >>> interval".
> >>>>>>> It
> >>>>>>>>>> should just be "dag run-at date".
> >>>>>>>>>>
> >>>>>>>>>> We don't need to address the question of whether execution date
> >>> is
> >>>>>>> the
> >>>>>>>>>> start or the end of an interval; it doesn't matter.
> >>>>>>>>>>
> >>>>>>>>>> In all cases, a given dag run will be targeted for *some*
> >> initial
> >>>>>>>> "run-at
> >>>>>>>>>> time"; so *that* should be the time that is part of the PK of a
> >>> dag
> >>>>>>>> run,
> >>>>>>>>>> and *that *is the time that should be exposed as the dag run
> >>>>>>> "execution
> >>>>>>>>>> date"
> >>>>>>>>>>
> >>>>>>>>>> *Interval of interest is not a dag_run attribute*
> >>>>>>>>>>
> >>>>>>>>>> We also mix in this question of the date interval that the
> >>> *tasks*
> >>>>>>> are
> >>>>>>>>>> interested in.  But the *dag run* need not concern itself with
> >>> this
> >>>>>>> in
> >>>>>>>>> any
> >>>>>>>>>> way.  That is for the tasks to figure out: if they happen to
> >> need
> >>>>>>> "dag
> >>>>>>>>>> run-at date," then they can reference that; if they want the
> >>> prior
> >>>>>>> one,
> >>>>>>>>> ask
> >>>>>>>>>> for the prior one.
> >>>>>>>>>>
> >>>>>>>>>> Previously, I was in the camp that thought it was a great idea
> >> to
> >>>>>>>> rename
> >>>>>>>>>> "execution_date" to "period_start" or "interval_start".  But I
> >>> now
> >>>>>>>> think
> >>>>>>>>>> this is folly.  It invokes this question of the "interval of
> >>>>>>> interest"
> >>>>>>>> or
> >>>>>>>>>> "period of interest".  But the dag doesn't need to know
> >> anything
> >>>>>>> about
> >>>>>>>>>> that.
> >>>>>>>>>>
> >>>>>>>>>> Within the same dag you may have tasks with different intervals
> >>> of
> >>>>>>>>>> interest.  So why make assumptions in the dag; just give the
> >>> facts:
> >>>>>>>> this
> >>>>>>>>> is
> >>>>>>>>>> my run date; this is the prior run date, etc.  It would be a
> >>>>>>> regression
> >>>>>>>>>> from the perspective of providing accurate names.
> >>>>>>>>>>
> >>>>>>>>>> *Proposal*
> >>>>>>>>>>
> >>>>>>>>>> So, I would propose we change "execution_date" to mean "dag
> >>> run-at
> >>>>>>>> date"
> >>>>>>>>> as
> >>>>>>>>>> opposed to "dag run-at date minus 1".  But we should do so
> >>> without
> >>>>>>>>>> reference to interval end or interval start.
> >>>>>>>>>>
> >>>>>>>>>> *Configurability*
> >>>>>>>>>>
> >>>>>>>>>> The more configuration options we have, the more noise there is
> >>> as
> >>>> a
> >>>>>>>> user
> >>>>>>>>>> trying to understand how to use airflow, so I'd rather us not
> >>> make
> >>>>>>> this
> >>>>>>>>>> configurable at all.
> >>>>>>>>>>
> >>>>>>>>>> That said, perhaps a more clear and more explicit means making
> >>> this
> >>>>>>>>>> configurable would be to define an integer param
> >>>>>>>>>> "dag_run_execution_date_interval_offset", which would control
> >> how
> >>>>>>> many
> >>>>>>>>>> intervals back from actual "dag run-at date" the "execution
> >> date"
> >>>>>>>> should
> >>>>>>>>>> be.  (current behavior = 1, new behavior = 0).
> >>>>>>>>>>
> >>>>>>>>>> *Side note*
> >>>>>>>>>>
> >>>>>>>>>> Hopefully not to derail discussion: I think there are
> >> additional,
> >>>>>>>> related
> >>>>>>>>>> task attributes that may want to come into being: namely,
> >>>>>>> low_watermark
> >>>>>>>>> and
> >>>>>>>>>> high_watermark.  There is the potential, with attributes like
> >>> this,
> >>>>>>> for
> >>>>>>>>>> adding better out-of-the-box support for common data workflows
> >>> that
> >>>>>>> we
> >>>>>>>>> now
> >>>>>>>>>> need to use xcom for, namely incremental loads.  But I want to
> >>> give
> >>>>>>> it
> >>>>>>>>> more
> >>>>>>>>>> thought before proposing anything specific.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
> >>>>>>> Jarek.Potiuk@polidea.com
> >>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Good one Damian. I will have a list of issues that can be
> >>> possible
> >>>>>>> to
> >>>>>>>>>>> handle at the workshop, so that one goes there.
> >>>>>>>>>>>
> >>>>>>>>>>> J.
> >>>>>>>>>>>
> >>>>>>>>>>> Principal Software Engineer
> >>>>>>>>>>> Phone: +48660796129
> >>>>>>>>>>>
> >>>>>>>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
> >>>>>>>>>>> damian.shaw.2@credit-suisse.com> napisał:
> >>>>>>>>>>>
> >>>>>>>>>>>> I can't understate what a conceptual improvement this would
> >> be
> >>>> for
> >>>>>>>> the
> >>>>>>>>>>> end
> >>>>>>>>>>>> users of Airflow in our environment. I've written a lot of
> >> code
> >>>> so
> >>>>>>>> all
> >>>>>>>>>>> our
> >>>>>>>>>>>> configuration works like this anyway. But the UI still shows
> >>> the
> >>>>>>>>> Airflow
> >>>>>>>>>>>> dates which still to this day sometimes confuse me.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some
> >> of
> >>>> my
> >>>>>>>>> first
> >>>>>>>>>>>> PRs could be additional test cases around edge cases to do
> >> with
> >>>> DST
> >>>>>>>> and
> >>>>>>>>>>>> cron scheduling that I have concerns about :)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Damian
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
> >>>>>>>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> >>>>>>>>>>>> To: dev@airflow.apache.org
> >>>>>>>>>>>> Subject: Setting to add choice of schedule at end or schedule
> >>> at
> >>>>>>>> start
> >>>>>>>>> of
> >>>>>>>>>>>> interval
> >>>>>>>>>>>>
> >>>>>>>>>>>> This has come up a few times before, someone has now opened a
> >>> PR
> >>>>>>> that
> >>>>>>>>>>>> makes this a global+per-dag setting:
> >>>>>>>>>>>> https://github.com/apache/airflow/pull/5787 and it also
> >>> includes
> >>>>>>>> docs
> >>>>>>>>>>>> that I think do a good job of illustrating the two modes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does anyone object to this being merged? If no one says
> >>> anything
> >>>> by
> >>>>>>>>>>> midday
> >>>>>>>>>>>> on Tuesday I will take that as assent and will merge it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The docs from the PR included below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Ash
> >>>>>>>>>>>>
> >>>>>>>>>>>> Scheduled Time vs Execution Time
> >>>>>>>>>>>> ''''''''''''''''''''''''''''''''
> >>>>>>>>>>>>
> >>>>>>>>>>>> A DAG with a ``schedule_interval`` will execute once per
> >>>> interval.
> >>>>>>> By
> >>>>>>>>>>>> default, the execution of a DAG will occur at the **end** of
> >>> the
> >>>>>>>>>>>> schedule interval.
> >>>>>>>>>>>>
> >>>>>>>>>>>> A few examples:
> >>>>>>>>>>>>
> >>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run
> >> that
> >>>>>>>>> processes
> >>>>>>>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
> >>>> 17:59:59,
> >>>>>>>>>>>> i.e. once that hour is over.
> >>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that
> >>>>>>>> processes
> >>>>>>>>>>>> 2019-08-16 will start running shortly after 2019-08-17 00:00.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The reasoning behind this execution vs scheduling behaviour
> >> is
> >>>> that
> >>>>>>>>>>>> data for the interval to be processed won't be fully
> >> available
> >>>>>>> until
> >>>>>>>>>>>> the interval has elapsed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In cases where you wish the DAG to be executed at the
> >> **start**
> >>>> of
> >>>>>>>> the
> >>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``, either
> >> in
> >>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
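
A minimal sketch of what the per-DAG form of that setting might look
like, assuming the kwarg name used in the PR docs above (it is not part
of any released Airflow version, so this only works with the PR applied):

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="schedule_at_start_example",  # illustrative name
    schedule_interval="@daily",
    start_date=datetime(2019, 8, 1),
    schedule_at_interval_end=False,      # proposed: run at the start of each interval
)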
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
>

Re: Setting to add choice of schedule at end or schedule at start of interval

Posted by James Coder <jc...@gmail.com>.
I can't see how adding a property to DagRun that is essentially
identical to next_execution_date would add any benefit. The way I see
it, the issue at hand here is not the availability of dates. There are
plenty of options in the template context for dates before and after
execution date. My view is that the problem this is trying to solve is
that waiting until the right edge of an interval has passed to schedule
a dag run has some shortcomings. Mainly, if your intervals vary in
length, you are forced to put scheduling logic that should reside in
the scheduler into your DAGs. For example, if I need to run a DAG every
20 minutes between 8 AM and 4 PM, then in its current form the
scheduler won't schedule that 4 PM run until 8 AM the next day. "Just
use next_execution_date," you say. Well, that's all well and good
between 8 AM and 3:40 PM, but when 4:01 PM rolls around and you don't
have the results, because they won't be available until after 8 AM the
next day, that doesn't sound so good, does it? To work around this, you
have to add additional runs and short-circuit operators over and over.
It's a hassle. Allowing DAGs to be scheduled at the left edge of an
interval, so they behave more like cron and run at the time specified
rather than at schedule + interval, would make things much less
complicated for users like myself who can't always wait until the
right edge of the interval.
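
A rough, purely illustrative sketch of the situation described above
(the dag_id is made up, and the cron string is just one way to express
"every 20 minutes between 8 AM and 4 PM"):

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="business_hours_every_20_min",  # illustrative name
    schedule_interval="*/20 8-15 * * *",   # cron ticks at 08:00 .. 15:40
    start_date=datetime(2019, 1, 1),
)

# Under the current end-of-interval behaviour, the run with
# execution_date 15:40 (the interval that ends at 16:00) is not created
# until the next cron tick, i.e. 08:00 the following day.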


James Coder

> On Sep 3, 2019, at 11:14 PM, Daniel Standish <dp...@gmail.com> wrote:
>
> What if we merely add a property "run_date" to DagRun?  At present
> this would be essentially same as "next_execution_date".
>
> Then no change to scheduler would be required, and no new dag parameter or
> config.  Perhaps you could add a toggle to the DAGs UI view that lets you
> choose whether to display "last run" by "run_date" or "execution_date".
>
> If you want your dags to be parameterized by the date when they meant to be
> run -- as opposed to their implicit interval-of-interest -- then you can
> reference "run_date".
>
> One potential source of confusion with this is backfilling: what does
> "run_date" mean in the context of a backfill?  You could say it means
> essentially "initial run date", i.e. "do not run before date", i.e. "run
> after date" or "run-at date".  So, for a daily, job the 2019-01-02
> "run_date" corresponds to a 2019-01-01 execution_date.  This makes sense
> right?
>
> Perhaps in the future, the relationship between "run_date" and
> "execution_date" can be more dynamic.  Perhaps in the future we rename
> "execution_date" for clarity, or to be more generic.  But it makes sense
> that a dag run will always have a run date, so it doesn't seem like a
> terrible idea to add a property representing this.
>
> Would this meet the goals of the PR?
>
>
>
>
> On Wed, Aug 28, 2019 at 11:50 AM James Meickle
> <jm...@quantopian.com.invalid> wrote:
>
>> Totally agree with Daniel here. I think that if we implement this feature
>> as proposed, it will actively discourage us from implementing a better
>> data-aware feature that would remain invisible to most users while neatly
>> addressing a lot of edge cases that currently require really ugly hacks. I
>> believe that having more data awareness features in Airflow (like the data
>> lineage work, or other metadata integrations) is worth investing in if we
>> can do it without too much required user-facing complexity. The Airflow
>> project isn't a full data warehouse suite but it's also not just "cron with
>> a UI", so we should try to be pragmatic and fit in power-user features
>> where we can do so without compromising the project's overall goals.
>>
>> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dp...@gmail.com>
>> wrote:
>>
>>> I am just thinking there is the potential for a more comprehensive
>>> enhancement here, and I worry that this is a band-aid that, like all new
>>> features has the potential to constrain future options.  It does not help
>>> us to do anything we cannot already do.
>>>
>>> The source of this problem is that scheduling and interval-of-interest
>> are
>>> mixed together.
>>>
>>> My thought is there may be a way to separate scheduling and
>>> interval-of-interest to uniformly resolve "execution_date" vs "run_date"
>>> confusion.  We could make *explicit* instead of *implicit* the
>> relationship
>>> between run_date *(not currently a concept in airflow)* and
>>> "interval-of-interest" *(currently represented by execution_date)*.
>>>
>>> I also see in this the potential to unlock some other improvements:
>>> * support a greater diversity of incremental processes
>>> * allow more flexible backfilling
>>> * provide better views of data you have vs data you don't.
>>>
>>> The canonical airflow job is date-partitioned idempotent data pull.  Your
>>> interval of interest is from execution_date to execution_date + 1
>>> interval.  Schedule_interval is not just the scheduling cadence but it is
>>> also your interval-of-interest partition function.   If that doesn't work
>>> for your job, you set catchup=False and roll your own.
>>>
>>> What if there was a way to generalize?  E.g. could we allow for a more
>>> flexible partition function that deviates from the scheduler cadence?  E.g.
>>> what if your interval-of-interest partitions could be governed by "min 1
>>> day, max 30 days"?  Then, on an ongoing basis, your daily loads would
>>> cover a range of 1 day, but if the server is down for a couple of days,
>>> the gap could be caught up in one task, and a backfill could use up to
>>> 30-day batches.
>>>
>>> Perhaps there is an abstraction that could be used by a greater diversity
>>> of incremental processes.  Such a thing could support a nice "data
>>> contiguity view". I imagine a horizontal bar that is solid where we have
>>> the data and empty where we don't.  Then you click on a "missing" section
>>> and you can  trigger a backfill task with that date interval according to
>>> your partitioning rules.
>>>
>>> I can imagine using this for an incremental job where each time we pull
>> the
>>> new data since last time; in the `execute` method the operator could set
>>> `self.high_watermark` with the max datetime processed.  Or maybe a
>> callback
>>> function could be used to gather this value.  This value could be used in
>>> the next run, and could be depicted in a view.
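
A rough sketch of the kind of operator the paragraph above describes.
Purely illustrative: `high_watermark` and `low_watermark` are not
existing Airflow attributes, and the class and field names are made up.

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class IncrementalPullOperator(BaseOperator):
    # Records the max datetime it processed so that (hypothetically)
    # the framework could hand it to the next run instead of the value
    # having to travel through XCom.

    @apply_defaults
    def __init__(self, source_conn_id, *args, **kwargs):
        super(IncrementalPullOperator, self).__init__(*args, **kwargs)
        self.source_conn_id = source_conn_id
        self.high_watermark = None

    def execute(self, context):
        # 'low_watermark' is hypothetical; today you would fetch this
        # from XCom or a Variable yourself.
        since = context.get("low_watermark")
        rows = self._pull_rows_since(since)
        if rows:
            self.high_watermark = max(row["updated_at"] for row in rows)

    def _pull_rows_since(self, since):
        # Placeholder for the actual incremental extract against
        # self.source_conn_id.
        return []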
>>>
>>> Default intervals of interest could be status quo -- i.e. partitions
>> equal
>>> to the schedule interval -- but could be overridden using templating,
>>> callbacks, or by setting it during `execute`.
>>>
>>> So anyway, I don't have a master plan all figured out.  But I think there
>>> is opportunity in this area for more comprehensive enhancement that goes
>>> more directly at the root of the problem.
>>>
>>>
>>>
>>>
>>> On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
>>> maximebeauchemin@gmail.com> wrote:
>>>
>>>> How about an alternative approach that would introduce 2 new keyword
>>>> arguments that are clear (something like, but maybe better than
>>>> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
>>>> unchanged, but plan its deprecation. As a first step, `execution_date`
>>>> would be inferred from the new args, and warn about deprecation when
>>> used.
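
A sketch of how a DAG might read at a glance under that proposal.
`period_start_dttm` and `period_end_dttm` are only the proposed names,
not template variables that exist in Airflow today.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="period_kwargs_example",  # illustrative name
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
)

extract = BashOperator(
    task_id="extract",
    bash_command=(
        "extract.py "
        "--since '{{ period_start_dttm }}' "    # proposed, clearer name
        "--until '{{ period_end_dttm }}' "      # proposed, clearer name
        "--legacy-date '{{ execution_date }}'"  # kept during deprecation
    ),
    dag=dag,
)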
>>>>
>>>> Max
>>>>
>>>> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bd...@gmail.com>
>>> wrote:
>>>>
>>>>> Execution date is execution date for a dag run no matter what. There
>> is
>>>> no
>>>>> end interval or start interval for a dag run. The only time this is
>>>>> relevant is when we calculate the next or previous dagrun.
>>>>>
>>>>> So I don't think Daniel's rationale makes sense (?)
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On 27 Aug 2019, at 17:40, Philippe Gagnon <ph...@gmail.com>
>>>> wrote:
>>>>>>
>>>>>> I agree with Daniel's rationale but I am also worried about
>> backwards
>>>>>> compatibility as this would perhaps be the most disruptive breaking
>>>>> change
>>>>>> possible. I think maybe we should write down the different options
>>>>>> available to us (AIP?) and call for a vote. What does everyone
>> think?
>>>>>>
>>>>>>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jc...@gmail.com>
>>>> wrote:
>>>>>>>
>>>>>>> Can't execution date already mean different things depending on
>>>>>>> whether the dag run was initiated via the scheduler or manually via
>>>>>>> the command line/API?
>>>>>>> I agree that making it consistent might make it easier to explain
>> to
>>>> new
>>>>>>> users, but should we exchange that for breaking pretty much every
>>>>> existing
>>>>>>> dag by re-defining what execution date is?
>>>>>>> -James
>>>>>>>
>>>>>>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
>>>> dpstandish@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> To Daniel’s concerns, I would argue this is not a change to
>> what a
>>>> dag
>>>>>>>> run
>>>>>>>>> is, it is rather a change to WHEN that dag run will be
>> scheduled.
>>>>>>>>
>>>>>>>>
>>>>>>>> Execution date is part of the definition of a dag_run; it is
>>> uniquely
>>>>>>>> identified by an execution_date and dag_id.
>>>>>>>>
>>>>>>>> When someone asks what is a dag_run, we should be able to provide
>>> an
>>>>>>>> answer.
>>>>>>>>
>>>>>>>> Imagine trying to explain what a dag run is, when execution_date
>>> can
>>>>> mean
>>>>>>>> different things.
>>>>>>>>   Admin: "A dag run is an execution_date and a dag_id".
>>>>>>>>   New user: "Ok. Clear as a bell. What's an execution_date?"
>>>>>>>>   Admin: "Well, it can be one of two things.  It *could* be when the
>>>>>>>> dag will be run... but it could *also* be 'the time when the dag
>>>>>>>> should be run minus one schedule interval'.  It depends on whether
>>>>>>>> you choose 'end' or 'start' for 'schedule_interval_edge'.  If you
>>>>>>>> choose 'start' then execution_date means 'when the dag will be run'.
>>>>>>>> If you choose 'end' then execution_date means 'when the dag will be
>>>>>>>> run minus one interval'.  If you change the parameter after some
>>>>>>>> time, then we don't necessarily know what it means at all times."
>>>>>>>>
>>>>>>>> Why would we do this to ourselves?
>>>>>>>>
>>>>>>>> Alternatively, we can give dag_run a clear, unambiguous meaning:
>>>>>>>> * dag_run is dag_id + execution_date
>>>>>>>> * execution_date is when dag will be run (notwithstanding
>> scheduler
>>>>>>> delay,
>>>>>>>> queuing)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Execution_date is defined as "run-at date minus 1 interval".  The
>>>>>>>> assumption in this is that your tasks care about this particular
>>> date.
>>>>>>>> Obviously this makes sense for some tasks but not for others.
>>>>>>>>
>>>>>>>> I would prop
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcoder01@gmail.com
>>>
>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I think this is a great improvement and should be merged. To
>>>> Daniel’s
>>>>>>>>> concerns, I would argue this is not a change to what a dag run
>> is,
>>>> it
>>>>>>> is
>>>>>>>>> rather a change to WHEN that dag run will be scheduled.
>>>>>>>>> I had implemented a similar change in my own version but ultimately
>>>>>>>>> backed it out so I didn’t have to patch after each new release. In
>>>>>>>>> my opinion
>>> the
>>>>>>> main
>>>>>>>>> flaw in the current scheduler, and I have brought this up
>> before,
>>> is
>>>>>>> when
>>>>>>>>> you don’t have a consistent schedule interval (e.g. only run
>> M-F).
>>>>>>> After
>>>>>>>>> backing out the “schedule at interval start” I had to switch to
>> a
>>>>> daily
>>>>>>>>> schedule and go through and put a short circuit operator in each
>>> of
>>>> my
>>>>>>>> M-F
>>>>>>>>> dags to get the behavior that I wanted. This results in putting
>>>>>>>> scheduling
>>>>>>>>> logic inside the dag, when scheduling logic should be in the
>>>>> scheduler.
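
A rough sketch of that short-circuit workaround; the dag_id and task
names are made up, and this is exactly the per-DAG scheduling logic
being objected to above.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import ShortCircuitOperator

dag = DAG(
    dag_id="weekday_only_example",  # illustrative name
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
)


def is_weekday(execution_date, **_):
    # Monday-Friday are weekdays 0-4; skip downstream tasks otherwise.
    return execution_date.weekday() < 5


only_weekdays = ShortCircuitOperator(
    task_id="only_weekdays",
    python_callable=is_weekday,
    provide_context=True,
    dag=dag,
)
# The real work would be wired downstream, e.g. only_weekdays >> extract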
>>>>>>>>>
>>>>>>>>> -James
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <
>>> dpstandish@gmail.com
>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Re
>>>>>>>>>>
>>>>>>>>>>> What are people's feelings on changing the default execution
>> to
>>>>>>>> schedule
>>>>>>>>>>> interval start
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>>> I'm in favor of doing that, but then exposing new variables of
>>>>>>>>>>> "interval_start" and "interval_end", etc. so that people write
>>>>>>>>>>> clearer-looking at-a-glance DAGs
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> While I am def on board with the spirit of this PR, I would
>> vote
>>> we
>>>>>>> do
>>>>>>>>> not
>>>>>>>>>> accept this PR as is, because it cements a confusing option.
>>>>>>>>>>
>>>>>>>>>> *What is the right representation of a dag run?*
>>>>>>>>>>
>>>>>>>>>> Right now the representation is "dag run-at date minus 1
>>> interval".
>>>>>>> It
>>>>>>>>>> should just be "dag run-at date".
>>>>>>>>>>
>>>>>>>>>> We don't need to address the question of whether execution date
>>> is
>>>>>>> the
>>>>>>>>>> start or the end of an interval; it doesn't matter.
>>>>>>>>>>
>>>>>>>>>> In all cases, a given dag run will be targeted for *some*
>> initial
>>>>>>>> "run-at
>>>>>>>>>> time"; so *that* should be the time that is part of the PK of a
>>> dag
>>>>>>>> run,
>>>>>>>>>> and *that* is the time that should be exposed as the dag run
>>>>>>> "execution
>>>>>>>>>> date"
>>>>>>>>>>
>>>>>>>>>> *Interval of interest is not a dag_run attribute*
>>>>>>>>>>
>>>>>>>>>> We also mix in this question of the date interval that the
>>> *tasks*
>>>>>>> are
>>>>>>>>>> interested in.  But the *dag run* need not concern itself with
>>> this
>>>>>>> in
>>>>>>>>> any
>>>>>>>>>> way.  That is for the tasks to figure out: if they happen to
>> need
>>>>>>> "dag
>>>>>>>>>> run-at date," then they can reference that; if they want the
>>> prior
>>>>>>> one,
>>>>>>>>> ask
>>>>>>>>>> for the prior one.
>>>>>>>>>>
>>>>>>>>>> Previously, I was in the camp that thought it was a great idea
>> to
>>>>>>>> rename
>>>>>>>>>> "execution_date" to "period_start" or "interval_start".  But I
>>> now
>>>>>>>> think
>>>>>>>>>> this is folly.  It invokes this question of the "interval of
>>>>>>> interest"
>>>>>>>> or
>>>>>>>>>> "period of interest".  But the dag doesn't need to know
>> anything
>>>>>>> about
>>>>>>>>>> that.
>>>>>>>>>>
>>>>>>>>>> Within the same dag you may have tasks with different intervals
>>> of
>>>>>>>>>> interest.  So why make assumptions in the dag; just give the
>>> facts:
>>>>>>>> this
>>>>>>>>> is
>>>>>>>>>> my run date; this is the prior run date, etc.  It would be a
>>>>>>> regression
>>>>>>>>>> from the perspective of providing accurate names.
>>>>>>>>>>
>>>>>>>>>> *Proposal*
>>>>>>>>>>
>>>>>>>>>> So, I would propose we change "execution_date" to mean "dag
>>> run-at
>>>>>>>> date"
>>>>>>>>> as
>>>>>>>>>> opposed to "dag run-at date minus 1".  But we should do so
>>> without
>>>>>>>>>> reference to interval end or interval start.
>>>>>>>>>>
>>>>>>>>>> *Configurability*
>>>>>>>>>>
>>>>>>>>>> The more configuration options we have, the more noise there is
>>> as
>>>> a
>>>>>>>> user
>>>>>>>>>> trying to understand how to use airflow, so I'd rather we not
>>> make
>>>>>>> this
>>>>>>>>>> configurable at all.
>>>>>>>>>>
>>>>>>>>>> That said, perhaps a clearer and more explicit means of making
>>> this
>>>>>>>>>> configurable would be to define an integer param
>>>>>>>>>> "dag_run_execution_date_interval_offset", which would control
>> how
>>>>>>> many
>>>>>>>>>> intervals back from actual "dag run-at date" the "execution
>> date"
>>>>>>>> should
>>>>>>>>>> be.  (current behavior = 1, new behavior = 0).
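
A tiny sketch of the arithmetic the proposed (hypothetical) offset
parameter would imply, shown here for a daily interval:

from datetime import datetime, timedelta


def execution_date_for(run_at, interval=timedelta(days=1), offset=1):
    # offset=1 reproduces the current behaviour; offset=0 the proposed one.
    return run_at - offset * interval


print(execution_date_for(datetime(2019, 1, 2), offset=1))  # 2019-01-01 00:00:00
print(execution_date_for(datetime(2019, 1, 2), offset=0))  # 2019-01-02 00:00:00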
>>>>>>>>>>
>>>>>>>>>> *Side note*
>>>>>>>>>>
>>>>>>>>>> Hopefully not to derail discussion: I think there are
>> additional,
>>>>>>>> related
>>>>>>>>>> task attributes that may want to come into being: namely,
>>>>>>> low_watermark
>>>>>>>>> and
>>>>>>>>>> high_watermark.  There is the potential, with attributes like
>>> this,
>>>>>>> for
>>>>>>>>>> adding better out-of-the-box support for common data workflows
>>> that
>>>>>>> we
>>>>>>>>> now
>>>>>>>>>> need to use xcom for, namely incremental loads.  But I want to
>>> give
>>>>>>> it
>>>>>>>>> more
>>>>>>>>>> thought before proposing anything specific.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
>>>>>>> Jarek.Potiuk@polidea.com
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Good one Damian. I will have a list of issues that it may be
>>>>>>>>>>> possible to handle at the workshop, so that one goes there.
>>>>>>>>>>>
>>>>>>>>>>> J.
>>>>>>>>>>>
>>>>>>>>>>> Principal Software Engineer
>>>>>>>>>>> Phone: +48660796129
>>>>>>>>>>>
>>>>>>>>>>> Fri, 23 Aug 2019, 11:09, user Shaw, Damian P. <
>>>>>>>>>>> damian.shaw.2@credit-suisse.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I can't overstate what a conceptual improvement this would
>> be
>>>> for
>>>>>>>> the
>>>>>>>>>>> end
>>>>>>>>>>>> users of Airflow in our environment. I've written a lot of
>> code
>>>> so
>>>>>>>> all
>>>>>>>>>>> our
>>>>>>>>>>>> configuration works like this anyway. But the UI still shows
>>> the
>>>>>>>>> Airflow
>>>>>>>>>>>> dates which still to this day sometimes confuse me.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some
>> of
>>>> my
>>>>>>>>> first
>>>>>>>>>>>> PRs could be additional test cases around edge cases to do
>> with
>>>> DST
>>>>>>>> and
>>>>>>>>>>>> cron scheduling that I have concerns about :)
>>>>>>>>>>>>
>>>>>>>>>>>> Damian
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Ash Berlin-Taylor [mailto:ash@apache.org]
>>>>>>>>>>>> Sent: Friday, August 23, 2019 6:50 AM
>>>>>>>>>>>> To: dev@airflow.apache.org
>>>>>>>>>>>> Subject: Setting to add choice of schedule at end or schedule
>>> at
>>>>>>>> start
>>>>>>>>> of
>>>>>>>>>>>> interval
>>>>>>>>>>>>
>>>>>>>>>>>> This has come up a few times before, someone has now opened a
>>> PR
>>>>>>> that
>>>>>>>>>>>> makes this a global+per-dag setting:
>>>>>>>>>>>> https://github.com/apache/airflow/pull/5787 and it also
>>> includes
>>>>>>>> docs
>>>>>>>>>>>> that I think do a good job of illustrating the two modes.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone object to this being merged? If no one says
>>> anything
>>>> by
>>>>>>>>>>> midday
>>>>>>>>>>>> on Tuesday I will take that as assent and will merge it.
>>>>>>>>>>>>
>>>>>>>>>>>> The docs from the PR included below.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ash
>>>>>>>>>>>>
>>>>>>>>>>>> Scheduled Time vs Execution Time
>>>>>>>>>>>> ''''''''''''''''''''''''''''''''
>>>>>>>>>>>>
>>>>>>>>>>>> A DAG with a ``schedule_interval`` will execute once per
>>>> interval.
>>>>>>> By
>>>>>>>>>>>> default, the execution of a DAG will occur at the **end** of
>>> the
>>>>>>>>>>>> schedule interval.
>>>>>>>>>>>>
>>>>>>>>>>>> A few examples:
>>>>>>>>>>>>
>>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run
>> that
>>>>>>>>> processes
>>>>>>>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
>>>> 17:59:59,
>>>>>>>>>>>> i.e. once that hour is over.
>>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that
>>>>>>>> processes
>>>>>>>>>>>> 2019-08-16 will start running shortly after 2019-08-17 00:00.
>>>>>>>>>>>>
>>>>>>>>>>>> The reasoning behind this execution vs scheduling behaviour
>> is
>>>> that
>>>>>>>>>>>> data for the interval to be processed won't be fully
>> available
>>>>>>> until
>>>>>>>>>>>> the interval has elapsed.
>>>>>>>>>>>>
>>>>>>>>>>>> In cases where you wish the DAG to be executed at the
>> **start**
>>>> of
>>>>>>>> the
>>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``, either
>> in
>>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>