You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by siddharth anand <sa...@apache.org> on 2016/09/28 00:57:49 UTC

New Operator : LatestOnlyOperator

@gwax added the LatestOnlyOperator
https://github.com/apache/incubator-airflow/pull/1752


This is a really nifty operator, so I wanted to let folks know about it. A
lot of people run cron for a mix of workloads. Some jobs map to traditional
ETL workloads (e.g. load hourly data summarization for 2016-10-01T00:00:00Z).
Some are simple cron tasks -- run a database backup every night. In the
latter case, if you miss 3 runs (e.g. your dag is paused or your start date
is a few days/weeks/months/ago), you don't want to make up for lost time
and backfill all of those days. Essentially, running N database backups at
once will take your database down... We'd prefer traditional cron behavior
in these cases, not ETL behavior.

*Enter the LatestOnlyOperator.*

Place this operator upstream of any tasks that you want to skip unless the
Dagrun is the latest. You can place a trigger rule downstream to "end" its
effect. By combining a Trigger Rule with this operator, you can ensure only
portions of your dag honor this "latest only" requirement. Or simply, have
an entire DAG run in "latest only" mode by using the LatestOnlyOperator
alone, i.e. not pairing it with a TriggerRule downstream.

This is a useful pattern that I have been coding around for some time by
using a ShortCircuitOperator with a python callable, where the callable
evaluates the "latest"-ness of the dag run. I suspect we have all been
re-inventing this wheel, which is where Airflow's Operators shine.

Thanks to @gwax for implementing this and sticking with a long and often
delayed review/merge process.

-s

Re: New Operator : LatestOnlyOperator

Posted by Chris Riccomini <cr...@apache.org>.
This is really really awesome work. We will definitely be using this!

On Tue, Sep 27, 2016 at 5:57 PM, siddharth anand <sa...@apache.org> wrote:
> @gwax added the LatestOnlyOperator
> https://github.com/apache/incubator-airflow/pull/1752
>
>
> This is a really nifty operator, so I wanted to let folks know about it. A
> lot of people run cron for a mix of workloads. Some jobs map to traditional
> ETL workloads (e.g. load hourly data summarization for 2016-10-01T00:00:00Z).
> Some are simple cron tasks -- run a database backup every night. In the
> latter case, if you miss 3 runs (e.g. your dag is paused or your start date
> is a few days/weeks/months/ago), you don't want to make up for lost time
> and backfill all of those days. Essentially, running N database backups at
> once will take your database down... We'd prefer traditional cron behavior
> in these cases, not ETL behavior.
>
> *Enter the LatestOnlyOperator.*
>
> Place this operator upstream of any tasks that you want to skip unless the
> Dagrun is the latest. You can place a trigger rule downstream to "end" its
> effect. By combining a Trigger Rule with this operator, you can ensure only
> portions of your dag honor this "latest only" requirement. Or simply, have
> an entire DAG run in "latest only" mode by using the LatestOnlyOperator
> alone, i.e. not pairing it with a TriggerRule downstream.
>
> This is a useful pattern that I have been coding around for some time by
> using a ShortCircuitOperator with a python callable, where the callable
> evaluates the "latest"-ness of the dag run. I suspect we have all been
> re-inventing this wheel, which is where Airflow's Operators shine.
>
> Thanks to @gwax for implementing this and sticking with a long and often
> delayed review/merge process.
>
> -s

Re: New Operator : LatestOnlyOperator

Posted by siddharth anand <sa...@apache.org>.
Have a look at my last comment on the PR. I included 5 screen shots from my
own test DAGs. The dag runs are all shown as success in my example since
the LOO completes successfully and some portion of downstream are skipped
while other downstream operators are successfully run. In short, it acts
the same as using a ShortCircuitOperator with a custom callable to
determine "latest"-ness.

On Tuesday, September 27, 2016, Bolke de Bruin <bd...@gmail.com> wrote:

> Great stuff! What happens to state of the DagRuns that are not part of
> “latest”?
>
> - Bolke
>
> > Op 28 sep. 2016, om 02:57 heeft siddharth anand <sanand@apache.org
> <javascript:;>> het volgende geschreven:
> >
> > @gwax added the LatestOnlyOperator
> > https://github.com/apache/incubator-airflow/pull/1752
> >
> >
> > This is a really nifty operator, so I wanted to let folks know about it.
> A
> > lot of people run cron for a mix of workloads. Some jobs map to
> traditional
> > ETL workloads (e.g. load hourly data summarization for
> 2016-10-01T00:00:00Z).
> > Some are simple cron tasks -- run a database backup every night. In the
> > latter case, if you miss 3 runs (e.g. your dag is paused or your start
> date
> > is a few days/weeks/months/ago), you don't want to make up for lost time
> > and backfill all of those days. Essentially, running N database backups
> at
> > once will take your database down... We'd prefer traditional cron
> behavior
> > in these cases, not ETL behavior.
> >
> > *Enter the LatestOnlyOperator.*
> >
> > Place this operator upstream of any tasks that you want to skip unless
> the
> > Dagrun is the latest. You can place a trigger rule downstream to "end"
> its
> > effect. By combining a Trigger Rule with this operator, you can ensure
> only
> > portions of your dag honor this "latest only" requirement. Or simply,
> have
> > an entire DAG run in "latest only" mode by using the LatestOnlyOperator
> > alone, i.e. not pairing it with a TriggerRule downstream.
> >
> > This is a useful pattern that I have been coding around for some time by
> > using a ShortCircuitOperator with a python callable, where the callable
> > evaluates the "latest"-ness of the dag run. I suspect we have all been
> > re-inventing this wheel, which is where Airflow's Operators shine.
> >
> > Thanks to @gwax for implementing this and sticking with a long and often
> > delayed review/merge process.
> >
> > -s
>
>

-- 
Sent from Gmail Mobile

Re: New Operator : LatestOnlyOperator

Posted by Bolke de Bruin <bd...@gmail.com>.
Great stuff! What happens to state of the DagRuns that are not part of “latest”?

- Bolke

> Op 28 sep. 2016, om 02:57 heeft siddharth anand <sa...@apache.org> het volgende geschreven:
> 
> @gwax added the LatestOnlyOperator
> https://github.com/apache/incubator-airflow/pull/1752
> 
> 
> This is a really nifty operator, so I wanted to let folks know about it. A
> lot of people run cron for a mix of workloads. Some jobs map to traditional
> ETL workloads (e.g. load hourly data summarization for 2016-10-01T00:00:00Z).
> Some are simple cron tasks -- run a database backup every night. In the
> latter case, if you miss 3 runs (e.g. your dag is paused or your start date
> is a few days/weeks/months/ago), you don't want to make up for lost time
> and backfill all of those days. Essentially, running N database backups at
> once will take your database down... We'd prefer traditional cron behavior
> in these cases, not ETL behavior.
> 
> *Enter the LatestOnlyOperator.*
> 
> Place this operator upstream of any tasks that you want to skip unless the
> Dagrun is the latest. You can place a trigger rule downstream to "end" its
> effect. By combining a Trigger Rule with this operator, you can ensure only
> portions of your dag honor this "latest only" requirement. Or simply, have
> an entire DAG run in "latest only" mode by using the LatestOnlyOperator
> alone, i.e. not pairing it with a TriggerRule downstream.
> 
> This is a useful pattern that I have been coding around for some time by
> using a ShortCircuitOperator with a python callable, where the callable
> evaluates the "latest"-ness of the dag run. I suspect we have all been
> re-inventing this wheel, which is where Airflow's Operators shine.
> 
> Thanks to @gwax for implementing this and sticking with a long and often
> delayed review/merge process.
> 
> -s