Posted to dev@airflow.apache.org by Shubham Gupta <sh...@gmail.com> on 2018/07/20 21:58:27 UTC

Fwd: Catchup By default = False vs LatestOnlyOperator

---------- Forwarded message ---------
From: Shubham Gupta <sh...@gmail.com>
Date: Fri, Jul 20, 2018 at 2:38 PM
Subject: Catchup By default = False vs LatestOnlyOperator
To: <de...@airflow.incubator.apache.org>


Hi!

Can someone please explain the difference between catchup by default = False
and LatestOnlyOperator?

Regards
Shubham Gupta

Re: Catchup By default = False vs LatestOnlyOperator

Posted by Shubham Gupta <sh...@gmail.com>.
Thanks a lot for the useful info.

Regards
Shubham Gupta

On Wed, Jul 25, 2018 at 7:48 PM Sid Anand <sa...@apache.org> wrote:

> I will +1 James's comment and add to it. At Agari, one of our DAGs had as a
> final step the sending of an alert. The alerts only made sense when the DAG
> was current. But, sometimes, we did need to recompute some metrics based on
> historical data, but not alert on them. The LatestOnlyOperator was a good
> fit for this case.
>
> George/Ben,
> It would be great to document this discussion -- i.e. when to use one over
> another.
>
> -s

Re: Catchup By default = False vs LatestOnlyOperator

Posted by Sid Anand <sa...@apache.org>.
I will +1 James's comment and add to it. At Agari, one of our DAGs had as a
final step the sending of an alert. The alerts only made sense when the DAG
was current. Sometimes, though, we needed to recompute some metrics based on
historical data without alerting on them. The LatestOnlyOperator was a good
fit for this case.
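
The gating behaviour behind this use case can be sketched in plain Python.
This is a simplified model and not Airflow's own code: the function name
`should_run_downstream` and the fixed `timedelta` interval are assumptions
made for illustration, and the real operator works from the DAG's schedule.
The idea is that a run counts as "latest" only while the current time falls
in the interval right after the one the run covers, so a re-run of a
historical day skips the alert.

```python
from datetime import datetime, timedelta


def should_run_downstream(execution_date, interval, now):
    """Return True if this run covers the most recent completed interval."""
    left_window = execution_date + interval   # when this run becomes due
    right_window = left_window + interval     # when the next run becomes due
    return left_window < now <= right_window


daily = timedelta(days=1)
now = datetime(2018, 7, 25, 0, 5)  # illustrative "current time"

# A run for yesterday's data is current; a run for July 1 is historical.
current_run = should_run_downstream(datetime(2018, 7, 24), daily, now)
historical_rerun = should_run_downstream(datetime(2018, 7, 1), daily, now)

print("alert on current run:", current_run)
print("alert on historical re-run:", historical_rerun)
```

In this model the metrics tasks would still run for July 1; only the tasks
placed after the gate are skipped.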

George/Ben,
It would be great to document this discussion -- i.e. when to use one over
another.

-s


On Mon, Jul 23, 2018 at 2:03 PM George Leslie-Waksman <wa...@gmail.com>
wrote:

> Ok, not so fringe; I'm glad it's working well for your use case, James.
>
> I retract my suggestion of deprecation.

Re: Catchup By default = False vs LatestOnlyOperator

Posted by George Leslie-Waksman <wa...@gmail.com>.
Ok, not so fringe; I'm glad it's working well for your use case, James.

I retract my suggestion of deprecation.

On Mon, Jul 23, 2018 at 12:58 PM James Meickle
<jm...@quantopian.com.invalid> wrote:

> We use LatestOnlyOperator in production. Generally our data is available on
> a regular schedule, and we update production services with it as soon as it
> is available; we might occasionally want to re-run historical days, in
> which case we want to run the same DAG but without interacting with live
> production services at all.

Re: Catchup By default = False vs LatestOnlyOperator

Posted by James Meickle <jm...@quantopian.com.INVALID>.
We use LatestOnlyOperator in production. Generally our data is available on
a regular schedule, and we update production services with it as soon as it
is available; we might occasionally want to re-run historical days, in
which case we want to run the same DAG but without interacting with live
production services at all.

On Mon, Jul 23, 2018 at 2:18 PM, George Leslie-Waksman <wa...@gmail.com>
wrote:

> As the author of LatestOnlyOperator, the goal was as a stopgap until
> catchup=False landed.
>
> There are some (very) fringe use cases where you might still want
> LatestOnlyOperator but in almost all cases what you want is probably
> catchup=False.
>
> The situations where LatestOnlyOperator is still useful are where you want
> to run most of your DAG for every schedule interval but you want some of
> the tasks to run only on the latest run (not catching up, not backfilling).
>
> It may be best to deprecate LatestOnlyOperator at this point to avoid
> confusion.
>
> --George

Re: Catchup By default = False vs LatestOnlyOperator

Posted by George Leslie-Waksman <wa...@gmail.com>.
As the author of LatestOnlyOperator, I intended it as a stopgap until
catchup=False landed.

There are some (very) fringe use cases where you might still want
LatestOnlyOperator but in almost all cases what you want is probably
catchup=False.

The situations where LatestOnlyOperator is still useful are where you want
to run most of your DAG for every schedule interval but you want some of
the tasks to run only on the latest run (not catching up, not backfilling).
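
A minimal sketch of that pattern as a DAG definition file, assuming Airflow
2.x import paths (in the 1.x releases current when this thread was written,
the operator lived in `airflow.operators.latest_only_operator`; newer
releases may move it again and prefer `schedule` over `schedule_interval`).
The DAG id, task ids and echo commands are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.latest_only import LatestOnlyOperator

with DAG(
    dag_id="metrics_with_gated_alert",  # hypothetical name
    start_date=datetime(2018, 7, 1),
    schedule_interval="@daily",
    catchup=True,  # every schedule interval still gets a DAG run
) as dag:
    # Runs for every interval, including catch-up and backfill runs.
    compute_metrics = BashOperator(
        task_id="compute_metrics",
        bash_command="echo 'compute metrics for {{ ds }}'",
    )

    # Gate: tasks downstream of this are skipped unless the run is the latest.
    latest_only = LatestOnlyOperator(task_id="latest_only")

    # Only meaningful when the DAG is current.
    send_alert = BashOperator(
        task_id="send_alert",
        bash_command="echo 'alert for {{ ds }}'",
    )

    compute_metrics >> latest_only >> send_alert
```

Setting `catchup=False` on the same DAG would instead stop the older runs
from being created at all, so `compute_metrics` would not run for them
either.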

It may be best to deprecate LatestOnlyOperator at this point to avoid
confusion.

--George

On Sat, Jul 21, 2018 at 7:34 PM Ben Tallman <bt...@gmail.com> wrote:

> As the author of catch-up, the idea is that in many cases your data
> doesn't "window" nicely and you want instead to just run as if it were a
> brilliant Cron...
>
> Ben
>
> Sent from my iPhone

Re: Catchup By default = False vs LatestOnlyOperator

Posted by Ben Tallman <bt...@gmail.com>.
As the author of catch-up, I can say the idea is that in many cases your data doesn't "window" nicely and you instead want to just run as if it were a brilliant cron...

Ben

Sent from my iPhone

> On Jul 20, 2018, at 11:39 PM, Shah Altaf <me...@gmail.com> wrote:
> 
> Hi my understanding is: if you use the LatestOnlyOperator then when you run
> the DAG for the first time you'll see a whole bunch of DAG runs queued up,
> and in each run the LatestOnlyOperator will cause the rest of the DAG run
> to be skipped.  Only the latest DAG will run in 'full'.
> 
> With catchup = False, you should get just the latest DAG run.

Re: Catchup By default = False vs LatestOnlyOperator

Posted by Shah Altaf <me...@gmail.com>.
Hi, my understanding is: if you use the LatestOnlyOperator, then when you run
the DAG for the first time you'll see a whole bunch of DAG runs queued up,
and in each of the older runs the LatestOnlyOperator will cause the tasks
downstream of it to be skipped.  Only the latest DAG run will run in full.

With catchup = False, you should get just the latest DAG run.
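
To make the difference concrete, here is a small pure-Python model of the
two behaviours. It does not import Airflow; the function names
(`completed_intervals`, `first_unpause`), the daily interval and the dates
are made up for illustration, and the model deliberately ignores scheduler
details such as concurrency limits.

```python
from datetime import datetime, timedelta


def completed_intervals(start, interval, now):
    """Return the execution dates whose schedule interval has ended by `now`."""
    dates = []
    current = start
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


def first_unpause(start, interval, now, catchup, latest_only_gate):
    """Model which DAG runs appear and whether gated downstream tasks run.

    Returns a list of (execution_date, downstream_state) tuples.
    """
    dates = completed_intervals(start, interval, now)
    if not catchup:
        # catchup=False: only the most recent interval gets a DAG run.
        dates = dates[-1:]
    result = []
    for date in dates:
        is_latest = date == dates[-1]
        # The gate skips downstream tasks on every run except the latest.
        state = "skipped" if latest_only_gate and not is_latest else "ran"
        result.append((date, state))
    return result


start = datetime(2018, 7, 16)
now = datetime(2018, 7, 20, 12, 0)
daily = timedelta(days=1)

with_gate = first_unpause(start, daily, now, catchup=True, latest_only_gate=True)
no_catchup = first_unpause(start, daily, now, catchup=False, latest_only_gate=False)

for date, state in with_gate:
    print("LatestOnlyOperator:", date.date(), state)
for date, state in no_catchup:
    print("catchup=False:     ", date.date(), state)
```

In the first case four runs exist and only the last one gets past the gate;
in the second case only one run exists at all.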


On Fri, Jul 20, 2018 at 10:58 PM Shubham Gupta <sh...@gmail.com>
wrote:

> ---------- Forwarded message ---------
> From: Shubham Gupta <sh...@gmail.com>
> Date: Fri, Jul 20, 2018 at 2:38 PM
> Subject: Catchup By default = False vs LatestOnlyOperator
> To: <de...@airflow.incubator.apache.org>
>
>
> Hi!
>
> Can someone please explain the difference between catchup by default = False
> and LatestOnlyOperator?
>
> Regards
> Shubham Gupta
>