Posted to dev@airflow.apache.org by Jon McKenzie <jc...@gmail.com> on 2016/07/11 21:46:46 UTC

Dynamic DAG Input/Seed Data?

Unless I'm missing it, it appears that it isn't possible to launch a DAG
job with initial inputs to the first task instance in the workflow (without
specifying those inputs in the DAG definition).

Am I missing something?

So for instance, I want user A to be able to launch the DAG with
parameter foo = bar, and user B to be able to launch the same DAG with foo
= baz. In my use case, this would be hooked up to a RESTful API, and the
users wouldn't necessarily know anything about DAGs or what's happening
behind the scenes.

The closest I can think of to accomplishing this is to generate run IDs in
my REST API, store the (run ID, input) pair in a database, and retrieve the
inputs in the first task of my DAG. But this seems like a very ham-handed,
roundabout way of doing it. I'd much rather just create a DagRun with
task_params that the scheduler automatically associates with the first task
instance.

Any thoughts?

Re: Dynamic DAG Input/Seed Data?

Posted by Chris Riccomini <cr...@apache.org>.
This is a really fascinating idea: a REST API as a plugin. I'll have to
think about how this fits in with security, but it's intriguing
nonetheless.

Re: Dynamic DAG Input/Seed Data?

Posted by Cade Markegard <ca...@gmail.com>.
I've been playing around with creating an HTTP API using Airflow's plugins;
here's a little bit of the code that triggers the DagRun:

https://gist.github.com/cademarkegard/e1adc20baf6fbae89bac2dcca3d2159e

Hopefully that helps clear up how you could pass parameters to the DagRun.
You'd probably also want to add some token-based auth for the route.
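
For anyone who'd rather skim than click through, a rough, simplified sketch
of the shape of it (not the gist's exact code; it assumes the 1.7-era plugin
and DagRun APIs, and the route and plugin names are placeholders):

    from datetime import datetime

    from flask import Blueprint, jsonify, request

    from airflow.models import DagBag
    from airflow.plugins_manager import AirflowPlugin
    from airflow.utils.state import State

    api_bp = Blueprint('trigger_api', __name__, url_prefix='/api')

    @api_bp.route('/trigger/<dag_id>', methods=['POST'])
    def trigger(dag_id):
        # look the DAG up in the dags folder
        dagbag = DagBag()
        if dag_id not in dagbag.dags:
            return jsonify(error='unknown dag_id: %s' % dag_id), 404
        dag = dagbag.get_dag(dag_id)

        # create an externally-triggered DagRun that carries the caller's
        # JSON body as its conf
        execution_date = datetime.now()
        run_id = 'api_%s' % execution_date.isoformat()
        dag.create_dagrun(
            run_id=run_id,
            execution_date=execution_date,
            state=State.RUNNING,
            conf=request.get_json(silent=True) or {},
            external_trigger=True,
        )
        return jsonify(run_id=run_id)

    class TriggerApiPlugin(AirflowPlugin):
        name = 'trigger_api'
        flask_blueprints = [api_bp]

A POST with a JSON body like `{"foo": "bar"}` to `/api/trigger/<dag_id>`
would then create a run whose tasks can read foo from the DagRun's conf.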

Cade

Re: Dynamic DAG Input/Seed Data?

Posted by Paul Minton <pm...@change.org>.
>
> For the use case where the inputs only change the parameters or behavior
> of tasks (but not the shape of the DAG itself)


This is the use case that I'm thinking of, but it's not clear how to change
those parameters from the UI or the REST API (if that's possible at all).

Re: Dynamic DAG Input/Seed Data?

Posted by Maxime Beauchemin <ma...@gmail.com>.
Hi,

A few notes around dynamic DAGs:

We don't really support mutating a DAG's shape or structure based on source
parameters. Think about it: it would be hard to fit into the current
paradigm of the UI, like representing that DAG in the tree view. We like to
think of DAGs as pretty static or "slowly changing", similar to how a
database physical model evolves over the lifecycle of an application or a
data warehouse (at a similar rhythm). For use cases where input parameters
would change the shape of the DAG, we think of those as different
"singleton" DAGs that are expected to run a single time. To get this to
work, we create a "DAG factory": a Python script that outputs many
different DAG objects (with different dag_ids) and with
`schedule_interval='@once'`, based on a config file or something equivalent
(db configuration, an airflow.models.Variable object, ...).
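
A minimal sketch of that factory pattern (the config dict and the task here
are made-up placeholders) would be a single file in the dags folder like:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # stand-in for a config file, db table, or Variable
    CONFIGS = {
        'ingest_customer_a': {'foo': 'bar'},
        'ingest_customer_b': {'foo': 'baz'},
    }

    for dag_id, params in CONFIGS.items():
        dag = DAG(
            dag_id=dag_id,
            schedule_interval='@once',  # singleton: runs a single time
            start_date=datetime(2016, 7, 1),
        )
        BashOperator(
            task_id='seed',
            bash_command='echo {{ params.foo }}',
            params=params,
            dag=dag,
        )
        # the scheduler only discovers DAG objects at module level
        globals()[dag_id] = dag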

For the use case where the inputs only change the parameters or behavior of
tasks (but not the shape of the DAG itself), I recommend using a DAG with
`schedule_interval=None` that is triggered with different parameters in its
conf. Inside templates or operators you can easily access the context to
refer to the related DagRun's conf parameters. You could potentially do
that with a DAG on a schedule using XCom as well, where an early task would
populate some XCom values that following tasks would read.
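
As an illustration, a minimal sketch of such a triggered DAG (the dag and
task names are made up) could look like:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        dag_id='conf_driven',
        schedule_interval=None,  # never scheduled; runs only when triggered
        start_date=datetime(2016, 7, 1),
    )

    def use_conf(**context):
        # conf is whatever dict was supplied at trigger time,
        # e.g. {'foo': 'bar'} for user A and {'foo': 'baz'} for user B
        conf = context['dag_run'].conf or {}
        print('foo = %s' % conf.get('foo'))

    PythonOperator(
        task_id='read_conf',
        python_callable=use_conf,
        provide_context=True,  # pass the context in as kwargs
        dag=dag,
    )

In a template you can reach the same value as `{{ dag_run.conf['foo'] }}`.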

Max

Re: Dynamic DAG Input/Seed Data?

Posted by Paul Minton <pm...@change.org>.
I asked a very similar question in this thread, which might provide a
solution in the form of the --conf option of trigger_dag.

http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/201607.mbox/browser

However, my last comment on that thread suggests exposing similar
functionality through the REST API and the UI.
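
(If memory serves, the invocation is along the lines of
`airflow trigger_dag my_dag_id --conf '{"foo": "bar"}'`, where --conf takes
a JSON string that becomes the DagRun's conf; my_dag_id is a placeholder.)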

Re: Dynamic DAG Input/Seed Data?

Posted by Lance Norskog <la...@gmail.com>.
XCom is a data store for passing data to and between tasks; this is how you
would pass dynamic data to the starting task of a DAG.
Is there a CLI command to add data to XCom?
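
As a rough sketch of the push/pull mechanics (the names here are made up,
and it assumes the usual PythonOperator context passing):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('xcom_demo', schedule_interval='@once',
              start_date=datetime(2016, 7, 1))

    def push_seed(**context):
        # stores the value keyed by (dag_id, task_id, execution_date)
        context['ti'].xcom_push(key='foo', value='bar')

    def pull_seed(**context):
        print(context['ti'].xcom_pull(task_ids='seed', key='foo'))

    seed = PythonOperator(task_id='seed', python_callable=push_seed,
                          provide_context=True, dag=dag)
    read = PythonOperator(task_id='read', python_callable=pull_seed,
                          provide_context=True, dag=dag)
    seed.set_downstream(read)  # read runs after seed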


-- 
Lance Norskog
lance.norskog@gmail.com
Redwood City, CA