Posted to dev@airflow.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2016/10/18 15:40:37 UTC

Best practices for dynamically generated tasks and dags

Sorry again for posting in the DEV group since this is a user-type question, but
I do not think we have a user mailing list and I do not feel that gitter is
appropriate for this sort of discussion.

I am actively testing Airflow for a specific use case: generating
workflows/tasks for 200-300 tables in my source database. Every table
will have a set of pretty standard tasks (~6-8), and the tasks will have some
conditions (e.g. on some runs some of the tasks will be skipped).

After looking at Oozie, Luigi, Pinball and Airflow and watching numerous
presentations, I thought Airflow was a perfect match for that. I am really
blown away by features and potential and use cases.

I know many of the committers are doing something similar and I'd love to hear
the details and some guidance in terms of best practices. I have heard of 3
options, I think:

#1 Create one big DAG (in my case it would be a DAG with 300x8 tasks)

#2 Create one DAG that will generate smaller DAGs (as described here
https://wecode.wepay.com/posts/airflow-wepay)

#3 A combination of #1 and #2? Like an external .py file that generates a
"static" DAG on demand (e.g. when adding a new table or removing one)

Second question, related to this concept. The Airflow webserver and scheduler
poll the DAG folder to fill up the DagBag, and they do that very frequently by
default (every minute?). The problem is that it takes time to generate a DAG
dynamically - in my case I will be using some metadata from YAML or a
database, and this process might well take a minute or two. How do you deal
with this?
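
One way to keep a slow metadata lookup out of every DagBag refresh is to cache its result on disk and reuse it until it goes stale. The sketch below is not something proposed in this thread; it only illustrates the idea with the standard library. The cache path, the 24-hour TTL, and the load_metadata_slowly() stand-in are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile
import time

# Hypothetical location and lifetime for the cached metadata.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "table_metadata_cache.json")
CACHE_TTL_SECONDS = 24 * 60 * 60


def load_metadata_slowly():
    """Stand-in for the slow YAML/database lookup (hypothetical)."""
    return {"tables": ["customers", "orders"]}


def get_metadata(cache_path=CACHE_PATH, ttl=CACHE_TTL_SECONDS,
                 loader=load_metadata_slowly):
    """Return cached metadata if it is fresh enough, else reload and cache it."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < ttl:
            with open(cache_path) as f:
                return json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable cache: fall through and rebuild it.
        pass

    metadata = loader()

    # Write to a per-process temp file, then rename, so that another process
    # parsing the DAG file at the same time never reads a half-written cache.
    tmp_path = "{}.{}.tmp".format(cache_path, os.getpid())
    with open(tmp_path, "w") as f:
        json.dump(metadata, f)
    os.replace(tmp_path, cache_path)
    return metadata
```

A DAG file would then call get_metadata() at import time and loop over the returned tables, so only the first parse after the cache expires pays the full cost.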

Thanks again for such an amazing project!

Re: Best practices for dynamically generated tasks and dags

Posted by Boris Tyukin <bo...@boristyukin.com>.

thanks Laura, it helps! i was hoping you would reply :) very good points
about UI / logs / restarts - I think at this point I really like #2 option
myself.

I still wonder if people do something creative to generate complex DAGs
outside of the DAG folder - for example, when it takes
significant time to poll metadata/databases to generate all the tasks. I do
not know if it is possible as I am not strong with Python (actually I have
been learning Python as I am learning Airflow!). The idea is to have an
outside .py script generate static .py files for the DAG(s) and place these
generated files under the Airflow dag_folder once a day or on some schedule.
Is anyone doing this, or am I over-complicating things and #2 should just
work?
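
The "outside .py" idea could look roughly like the sketch below: a standalone script, run from cron or by hand, that renders one static DAG file per table. This is only an illustration under stated assumptions. The template, the load_<table> naming, and the render_dag_files() helper are hypothetical, and the generated files use the Airflow 1.x import path for DummyOperator as a placeholder for the real per-table tasks. The generator itself needs only the standard library.

```python
import os
import re
from string import Template

# Template for one generated DAG file. The Airflow imports live in the
# generated text, so this script does not need Airflow installed.
DAG_TEMPLATE = Template('''\
# Generated file: do not edit by hand.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="load_$table",
    start_date=datetime(2016, 10, 1),
    schedule_interval=timedelta(days=1),
)

# Chain the standard steps one after another.
previous = None
for step in $steps:
    task = DummyOperator(task_id=step, dag=dag)
    if previous is not None:
        previous.set_downstream(task)
    previous = task
''')


def render_dag_files(tables, steps, out_dir):
    """Write one static DAG file per table into out_dir; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for table in tables:
        # Table names end up in dag_ids and file names, so keep them simple.
        if not re.match(r"^[A-Za-z0-9_]+$", table):
            raise ValueError("unsupported table name: {!r}".format(table))
        source = DAG_TEMPLATE.substitute(table=table, steps=repr(list(steps)))
        path = os.path.join(out_dir, "load_{}.py".format(table))
        # Syntax check only; the generated code is not executed here.
        compile(source, path, "exec")
        with open(path, "w") as f:
            f.write(source)
        written.append(path)
    return written
```

Calling render_dag_files(["customers", "orders"], ["extract", "load"], "/path/to/dag_folder") would leave two small files for the scheduler to parse, with the slow metadata lookup done once by the generator rather than on every poll.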

I think in my case it might take a good minute to parse the metadata files
and some database tables to actually generate the DAG tasks. Also I imagine it
will produce a heck of a lot of log records, since the scheduler polls the DAG
folder every minute and this process will repeat itself a minute later - so it
will be running non-stop unless I change the Airflow scheduler settings.

On Fri, Oct 21, 2016 at 11:39 AM, Laura Lorenz <ll...@industrydive.com>
wrote:

> We've been evolving from type 1 you describe to a pull/poll version of the
> type 2 you describe. For type 1, it is really hard to tell what's going on
> (all the UI views become useless because they are so huge). Having one big
> dag also means you can't turn off the scheduler for individual parts, and
> the whole DAG fails if one task does, so if you can functionally separate
> them I think that gives you more configuration options. Our biggest DAG now
> is more like 22*10 tasks, which is still too big in our opinion. We
> leverage ExternalTaskSensors to link dags together which is more of a
> pull/poll paradigm, but you could use a TriggerDagRunOperator if you wanted
> more of a push/trigger paradigm which is what I hear you saying in type 2.
>
> To your second question, our DAGs are dynamic based on the results of an
> API call we embed in the DAG and our scheduler is on a 5-second timelapse
> for each attempt to refill the DagBag. I think because of the frequency of
> the scheduler polling the files, because our API call is relatively fast,
> we are working with DAGs that are on a 24 hour schedule_interval, and the
> resultant DAG structure is not too large or complicated, we haven't had any
> issues with that or done anything special. I think it's just the fact of
> the matter that if you give the scheduler a lot of work to do to determine
> the DAG shape, it will take a while.
>
> Laura
>
> On Fri, Oct 21, 2016 at 10:52 AM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
> > Guys, would you mind chiming in and sharing your experience?
> >
>

Re: Best practices for dynamically generated tasks and dags

Posted by Laura Lorenz <ll...@industrydive.com>.

We've been evolving from type 1 you describe to a pull/poll version of the
type 2 you describe. For type 1, it is really hard to tell what's going on
(all the UI views become useless because they are so huge). Having one big
dag also means you can't turn off the scheduler for individual parts, and
the whole DAG fails if one task does, so if you can functionally separate
them I think that gives you more configuration options. Our biggest DAG now
is more like 22*10 tasks, which is still too big in our opinion. We
leverage ExternalTaskSensors to link dags together which is more of a
pull/poll paradigm, but you could use a TriggerDagRunOperator if you wanted
more of a push/trigger paradigm which is what I hear you saying in type 2.

To your second question, our DAGs are dynamic based on the results of an
API call we embed in the DAG and our scheduler is on a 5-second timelapse
for each attempt to refill the DagBag. I think because of the frequency of
the scheduler polling the files, because our API call is relatively fast,
we are working with DAGs that are on a 24 hour schedule_interval, and the
resultant DAG structure is not too large or complicated, we haven't had any
issues with that or done anything special. I think it's just the fact of
the matter that if you give the scheduler a lot of work to do to determine
the DAG shape, it will take a while.

Laura

On Fri, Oct 21, 2016 at 10:52 AM, Boris Tyukin <bo...@boristyukin.com>
wrote:

> Guys, would you mind chiming in and sharing your experience?
>

Re: Best practices for dynamically generated tasks and dags

Posted by Boris Tyukin <bo...@boristyukin.com>.

Guys, would you mind chiming in and sharing your experience?