Posted to dev@airflow.apache.org by Gerard Toonstra <gt...@gmail.com> on 2016/10/16 21:23:34 UTC

ETL best practices for airflow

Hi all,

About a year ago, I contributed the HTTPOperator/Sensor and I've been
tracking airflow since. Right now it looks like we're going to adopt
airflow at the company I'm currently working at.

In preparation for that, I've done a bit of research into how airflow
pipelines should fit together and how important ETL principles are covered,
and decided to write this up on a documentation site. The airflow
documentation site covers how airflow works and the constructs you have
available to build pipelines, but it can still be a challenge for
newcomers to figure out how to put those constructs together and use them
effectively.

The articles I found online don't go into a lot of detail either. Airflow
is built around an important philosophy towards ETL, and there's a risk that
newcomers simply pick up a really great tool and start off using it in the
wrong way.


This weekend, I set off to write some documentation to try to fill this
gap. It starts off with a general overview of important ETL principles,
and I'm currently working on a practical step-by-step example that adheres
to these principles, with DAG implementations in airflow; i.e. showing how
it can all fit together.

You can find the current version here:

https://gtoonstra.github.io/etl-with-airflow/index.html


Looking forward to your comments. If you have better ideas about how I can
improve this contribution, don't hesitate to contact me with your suggestions.

Best regards,

Gerard

Re: ETL best practices for airflow

Posted by Gerard Toonstra <gt...@gmail.com>.
Hi Laura,

Looks very good. The first thing I had to do when I started was figure out
the relevant concepts for ETL, since I don't have a BI background.
When I follow the tutorial and look at the examples, it's clear what
airflow can do conceptually, but as soon as I want to get started on
something, there's no clear idea what real-life DAGs look like at all. So I
see myself, and probably others, staring at blank screens a lot,
getting lost.

I like your approach and thinking on putting data in a common place between
operators and worrying about where data resides between steps.
Huge monolithic operators aren't reusable; it makes sense to break code
down into logical units of generic pieces that are easy to configure
through lambda functions or parameters.
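
To make that concrete, here's a minimal sketch of the kind of generic unit
I mean (hypothetical table names, 1.x-era import paths): one small
callable configured per task through parameters, instead of one monolithic
operator.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id='generic_units_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')


def copy_table(source_table, target_table, **context):
    # Placeholder body: a real task would move data for context['ds'].
    print('copying %s -> %s for %s'
          % (source_table, target_table, context['ds']))


# The same callable can back many tasks; only the configuration differs.
copy_orders = PythonOperator(
    task_id='copy_orders',
    python_callable=copy_table,
    op_kwargs={'source_table': 'staging_orders',
               'target_table': 'dwh_orders'},
    provide_context=True,
    dag=dag)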

I'm not sure if the dev list is the right place for these 'user' type
discussions, but let's see what others have to say about this.

Rgds,

Gerard

Re: ETL best practices for airflow

Posted by Gerard Toonstra <gt...@gmail.com>.
Hi Boris,

Thanks very much!  These are all valid points and I'll work them out over
the next couple of days.
Indeed, the documentation of what the code actually does is rather limited;
I should probably have described the overall strategy, so the code is
easier to follow.

Thanks for the ideas for another page (support). That's also a very useful
thing to explain and work on.
If anyone has additional maintenance tasks that come up regularly,
please add them to the discussion.

Rgds,

Gerard

Re: ETL best practices for airflow

Posted by Boris Tyukin <bo...@boristyukin.com>.
Hi Gerard, I like your examples, but compared to the first sections of your
document, that last section felt a bit rushed. By looking at the actual
example and your comments above (the issues you discovered) I was able to
follow it, but for people new to Airflow it might be a bit confusing. Why
not document these pitfalls/issues right in the doc? All of them are very
valid points and would save time for other people.

The big thing for me is exactly this:

  a better strategy to process a large backfill when the desired schedule
  is 1 day. Processing 700+ days is going to take a lot of time and
  overhead when processing per month is an option. Is a duplicate of the
  DAG with a different interval a better choice, or are there strategies
  to detect this in an operator and use the output of that to specify the
  date window boundaries?

Personally, the more I think about it, the more I am inclined to use Sid's
new feature (only run latest) and store ETL timestamps outside of Airflow
- so very much a traditional incremental ETL process. I am also reading
that many people stay away from backfills.

Something else for you to consider for your document is to describe typical
support scenarios once you have your pipeline running:

1) retries of tasks / DAGs (it's not possible to retry an entire DAG right
now, I think)
2) reloads of already loaded data (exact steps to do that)
3) typical troubleshooting steps


Re: ETL best practices for airflow

Posted by Gerard Toonstra <gt...@gmail.com>.
Hi all,

So I worked out a full pipeline for a toy data warehouse on postgres:

https://gtoonstra.github.io/etl-with-airflow/fullexample.html

https://github.com/gtoonstra/etl-with-airflow/tree/master/
examples/full-example

It demonstrates pretty much all the listed principles for ETL work, except
alerting and monitoring.
There's still some work TBD on the DDL and a full code review on naming
conventions.

Things I ran into:
- Issue 137: max_active_runs doesn't work after clearing tasks; it does in
  the very first run.
- parameters for the standard PostgresOperator are not templated, so I
  couldn't use the core operator.
- it's a good idea to specify "depends_on_past" when using sensors,
  otherwise sensors can exhaust the available processing slots (see the
  sketch below).
- a better strategy to process a large backfill when the desired schedule
  is 1 day. Processing 700+ days is going to take a lot of time and
  overhead when processing per month is an option. Is a duplicate of the
  DAG with a different interval a better choice, or are there strategies
  to detect this in an operator and use the output of that to specify the
  date window boundaries?
- when pooling is active, scheduling takes a lot more time. Even when the
  pool is 10 and the number of instances is 7, it takes longer for the
  instances to actually run.
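
To illustrate the sensor point, here is a minimal sketch (hypothetical
connection and table names, 1.x-era import paths, and assuming the
sensor's sql field is templated like the core operators'):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import SqlSensor

default_args = {
    'owner': 'etl',
    # The sensor for a run only starts once the previous run's sensor has
    # succeeded, so a backfill can't fill every slot with waiting sensors.
    'depends_on_past': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(dag_id='staging_ready_example',
          default_args=default_args,
          start_date=datetime(2015, 1, 1),
          schedule_interval='@daily')

wait_for_staging = SqlSensor(
    task_id='wait_for_staging',
    conn_id='postgres_dwh',
    # Poke until the source has rows for this execution date window.
    sql="SELECT 1 FROM orders WHERE create_dtm >= '{{ ds }}' LIMIT 1",
    poke_interval=300,
    timeout=6 * 60 * 60,
    dag=dag)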

Looking forward to your comments on how some approaches could be improved.

Rgds,

Gerard

Re: ETL best practices for airflow

Posted by Gerard Toonstra <gt...@gmail.com>.
Thanks Max,

I think it always helps when new people start using software to see what
their issues are.

Some of it was also taken from the best-practices video from November 2015
on this page:

https://www.youtube.com/watch?v=dgaoqOZlvEA&feature=youtu.be

----

I made some more progress yesterday, but ran into issue 137. I think I
solved it with depends_on_past, but I'm going to rely on the
LatestOnlyOperator instead (it's better) and then work out something better
from there.
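
For reference, a minimal sketch of how I plan to use it (the import path
is from the new operator on master, and the dag/task names are made up):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator

dag = DAG(dag_id='latest_only_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)

# Stand-in for the real work; skipped for every run except the latest one.
load = DummyOperator(task_id='load', dag=dag)
load.set_upstream(latest_only)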

Rgds,

Gerard

Re: ETL best practices for airflow

Posted by Maxime Beauchemin <ma...@gmail.com>.
This is an amazing thread to follow! I'm really interested to watch best
practices documentation emerge out of the community.

Gerard, I enjoyed reading your docs and would love to see this grow. I've
been meaning to write a series of blog posts on the subject for quite some
time. It seems like you have a really good start. We could integrate this
as a "Best Practice" section to our current documentation once we build
consensus about the content.

Laura, please post on this mailing list once the talk is up as a video, I'd
love to watch it.

A related best practice I'd like to write about is the idea of applying
some concepts of functional programming to ETL. The idea is to use
immutable datasets/datablocks systematically as sources to your
computations, in ways that any task instance sources from immutable
datasets that are persisted in your backend. That allows you to satisfy
the guarantee that re-running any chunk of ETL at a different point in
time should lead to the exact same result. It also usually means that you
need to 1) do incremental loads, and 2) "snapshot" your
dimension/referential/small tables in time, to make sure that running the
ETL from 26 days ago sources from the dimension snapshot as it was back
then and yields the exact same result.
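
To sketch that idea (a hedged illustration with made-up table and
connection names, not a reference implementation): each run persists a
dated snapshot of the dimension and rebuilds only its own fact partition,
delete-then-insert, so replaying an old day reproduces the old result.

from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(dag_id='functional_etl_sketch',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# 1) Persist an immutable snapshot of the small dimension for this date.
snapshot_dim = PostgresOperator(
    task_id='snapshot_customer_dim',
    postgres_conn_id='postgres_dwh',
    sql="""
CREATE TABLE IF NOT EXISTS dim_customer_{{ ds_nodash }} AS
SELECT * FROM dim_customer;
""",
    dag=dag)

# 2) Rebuild this date's fact partition from that snapshot,
#    delete-then-insert, so a re-run overwrites instead of appending.
load_fact = PostgresOperator(
    task_id='load_fact_orders',
    postgres_conn_id='postgres_dwh',
    sql="""
DELETE FROM fact_orders WHERE order_date = '{{ ds }}';
INSERT INTO fact_orders
SELECT o.order_id, o.order_date, c.customer_segment, o.amount
FROM staging_orders o
JOIN dim_customer_{{ ds_nodash }} c USING (customer_id)
WHERE o.create_dtm >= '{{ ds }}' AND o.create_dtm < '{{ tomorrow_ds }}';
""",
    dag=dag)

load_fact.set_upstream(snapshot_dim)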

Anyhow, it's a complex and important subject I should probably write about
in a structured way sometime.

Max

Re: ETL best practices for airflow

Posted by Boris Tyukin <bo...@boristyukin.com>.
Thanks for sharing your slides, Laura! I think I've watched all the airflow
related slides I could find, and you did a very good job - adding your
slides to my collection :)  I especially liked how you explained the
execution date concept, but I wish you could elaborate on the backfill
concept and on running the same dag in parallel (if you guys do this sort
of thing) - I think this is the most confusing part of Airflow and it
needs good explanation / examples.


Re: ETL best practices for airflow

Posted by Laura Lorenz <ll...@industrydive.com>.
Same! I actually recently gave a talk about how my company uses airflow at
PyData DC. The video isn't live yet, but the slides are here
<http://www.slideshare.net/LauraLorenz4/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow>.
In substance it's actually very similar to what you've written.

I have some airflow-specific ideas about ways to write custom sensors that
poll job APIs (pretty common for us). We do dynamic generation of tasks
using external metadata by embedding an API call in the DAG definition
file, which I'm not sure is a best practice or not...
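
As a stripped-down sketch of the custom-sensor idea (the job API endpoint
and response shape are hypothetical, and the apply_defaults import
location has moved between airflow versions; this assumes the current
layout):

import requests

from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class JobApiSensor(BaseSensorOperator):
    # Waits until an external job API reports that a job has finished.

    @apply_defaults
    def __init__(self, api_url, job_id, *args, **kwargs):
        super(JobApiSensor, self).__init__(*args, **kwargs)
        self.api_url = api_url
        self.job_id = job_id

    def poke(self, context):
        # Called every poke_interval seconds; returning True ends the wait.
        resp = requests.get('%s/jobs/%s' % (self.api_url, self.job_id))
        resp.raise_for_status()
        return resp.json().get('status') == 'done'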

Anyways, if it makes sense to contribute these case studies for
consideration as a 'best practice', and if this is the place or way to do
it, I'm game. I agree that the resources and thought leadership on ETL
design are fragmented, and think the Airflow community is fertile ground
for discussion about it.


Re: ETL best practices for airflow

Posted by Gerard Toonstra <gt...@gmail.com>.
Hi all,

Today I was trying to work out a very basic example and quickly spent an
hour trying to solve a problem that ought to be really easy.
I didn't expect that.  I posted about this on gitter.im and someone helped
me out there.

All the simple database operators (mysql, postgres, mssql, etc.) run a
simple query against a database, which I attempted to test. But I couldn't
work out how to pass a date window to my sql script based on the execution
date. Then I noticed that the "parameters" dict isn't being templated.

So I couldn't do:

oper_1 = PostgresOperator(
    sql='copy_order_info.sql',
    parameters={"window_start_date": "{{ ds }}",
                "window_end_date": "{{ tomorrow_ds }}"},
    task_id='ingest_order',
    dag=dag)

But instead I'd have to work that into the SQL template, looking something
like this (without specifying the parameters dict at all):

WHERE
      o.create_dtm >= '{{ ds }}'
AND   o.create_dtm <  '{{ tomorrow_ds }}'

What I don't like about this approach is that if you have other parameters
coming from somewhere else, you end up with different ways of resolving
these variables:

oper_1 = PostgresOperator(
    sql='copy_order_info.sql',
    parameters={"client_id":"some_value"},

and resolving client_id in postgres like this:

WHERE
      o.create_dtm >= '{{ ds }}'
AND   o.create_dtm <  '{{ tomorrow_ds }}'
AND   client_id = %(client_id)s

This mixes up variable substitution methods in the same query, which looks
really ugly.

If there's a really nice way to access "ds" and "tomorrow_ds" when the
operator is initialized, then this can become the method of choice.
One obvious way to do that is to also template "parameters" for all the
database operators:

    template_fields = ('sql', 'parameters')
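
For illustration, a minimal sketch of that idea as a subclass rather than
a patch to the core operator (and assuming render_template recurses into
dict values the way it does for the other template fields):

from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(dag_id='ingest_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')


class TemplatedPostgresOperator(PostgresOperator):
    # 'sql' is already templated in the core operator; adding 'parameters'
    # lets values like "{{ ds }}" resolve per execution date before they
    # reach the database driver's own substitution.
    template_fields = ('sql', 'parameters')


oper_1 = TemplatedPostgresOperator(
    task_id='ingest_order',
    sql='copy_order_info.sql',
    parameters={"window_start_date": "{{ ds }}",
                "window_end_date": "{{ tomorrow_ds }}"},
    dag=dag)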

The reason why I prefer the database-specific substitution method is that
it provides a single way to do this. The macro method doesn't work well in
this case, because that particular style of coding is subject to SQL
injection attacks and shouldn't be taught as a good way of doing things.
Another reason is that the regular substitution methods for the database
of choice allow for unit testing the code in other frameworks beyond
airflow, and for easier code reuse should another scheduler be chosen in
the future.


Should I create a bug or a PR for this?

Rgds,

Gerard

Re: ETL best practices for airflow

Posted by Boris Tyukin <bo...@boristyukin.com>.
I really look forward to it, Gerard! I've read what you wrote so far
and I really liked it - please keep up the great job!

I am hoping to see some best practices for the design of incremental loads
and for using timestamps from source database systems (ours are not on
UTC, so I'm still confused about how that works in Airflow). Also the
practical use of subdags and dynamic generation of tasks using some
external metadata (maybe describe in detail something similar to what
wepay did: https://wecode.wepay.com/posts/airflow-wepay)
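
Something like this minimal sketch is what I have in mind (the table list
is a made-up stand-in for the external metadata, and the connection and
table names are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

# Stand-in for external metadata; in practice this could be fetched from
# an API or a config store whenever the DAG file is parsed.
TABLES = ['orders', 'customers', 'products']

dag = DAG(dag_id='dynamic_tasks_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# One load task per table, generated at parse time.
for table in TABLES:
    PostgresOperator(
        task_id='load_%s' % table,
        postgres_conn_id='postgres_dwh',
        sql=("INSERT INTO dwh_%s SELECT * FROM staging_%s "
             "WHERE create_dtm >= '{{ ds }}' "
             "AND create_dtm < '{{ tomorrow_ds }}';") % (table, table),
        dag=dag)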

