Posted to dev@airflow.apache.org by Boris <bo...@gmail.com> on 2016/10/14 15:47:14 UTC

Issue with Dynamically created tasks in a DAG

One of the selling points of Airflow for our new project is the
ability to create tasks programmatically on every run. People mentioned
in various talks that they generate tasks on every run, pulling
a list of files or using some external config (YAML), etc.

I also found this example
https://gist.github.com/mtustin-handy/ecd80c1cc9dcad1c4cf7

I created a quick DAG similar to the one above and I am observing some weird
issues (using Airflow v1.7.1.3 and the Sequential Executor).

My Dag has a list like
tables_list = ['table1','table2']

I create a first (dummy) task, then generate a Bash operator for every
table in the list, with the dummy task as its upstream task.
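
A setup like the one described might look roughly like the sketch below. It
is a guess at the shape of the DAG file, not the actual code: the dag_id and
the "t_" task prefix are taken from the error message quoted further down,
while the "start" task id, owner, start_date, schedule and echo command are
placeholders. The import paths are the Airflow 1.x ones.

```python
from datetime import datetime, timedelta

DAG_ID = "dynamic_job_proto_v1"
tables_list = ["table1", "table2"]


def task_id_for(table):
    """Task id for one table, e.g. 'table3' -> 't_table3'."""
    return "t_{}".format(table)


def build_dag(tables):
    """Build the DAG: one dummy start task, then one Bash task per table."""
    # Imported inside the function only so the helper above can be used
    # without Airflow installed; a DAG file would normally import at the top.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.dummy_operator import DummyOperator

    default_args = {
        "owner": "airflow",                   # placeholder
        "start_date": datetime(2016, 10, 1),  # placeholder
    }
    dag = DAG(DAG_ID, default_args=default_args,
              schedule_interval=timedelta(days=1))

    start = DummyOperator(task_id="start", dag=dag)
    for table in tables:
        task = BashOperator(
            task_id=task_id_for(table),
            bash_command="echo processing {}".format(table),  # placeholder
            dag=dag,
        )
        # Every generated task runs after the dummy start task.
        task.set_upstream(start)
    return dag
```

In a real DAG file, a module-level `dag = build_dag(tables_list)` is what
lets Airflow discover the DAG when it parses the file.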

It works great on the first run - all tasks are created properly.

Then I change the list to add a new table3:
tables_list = ['table1','table2','table3']

The DAG runs again, but I do not see table3 in the Graph or Tree view. I do
see the table3 task under the Task Instance view, so it was generated. But if
I click on it, I get an error like "Task [dynamic_job_proto_v1.t_table3]
doesn't seem to exist at the moment".

Then I restarted the scheduler - same thing. New DAG runs would not show that
task.

Then I restarted the Airflow webserver - this time I was able to see the
table3 task in the views.

After that I removed table2 from my list and the DAG ran again - same issue.
table2 was still in the views until I restarted the webserver. After the
restart, table2 disappeared from previously executed DAG runs, which is bad
because now I cannot go back in history, cannot compare execution times, etc.

Is this a known bug?

Re: Fwd: Issue with Dynamically created tasks in a DAG

Posted by Boris Tyukin <bo...@boristyukin.com>.
Thanks Laura, this confirms exactly what I am seeing. Ben Tallman picked up
the Jira request, so hopefully he can come up with something that won't be
that confusing.

On Tue, Oct 18, 2016 at 10:47 AM, Laura Lorenz <ll...@industrydive.com>
wrote:

> I think the source of the confusion you're experiencing is that the UI is
> based off of the DAG definition file at time of webserver load, which I
> believe is on the one hand defensible since the scheduler operates in a
> somewhat similar way, but on the other hand rightfully confusing (and
> doesn't TOTALLY mimic scheduler activities so it's really the worst of both
> worlds IMHO). When you change the DAG definition file, you have to kick the
> webserver to pick up new graph/tree drawings. In the case of removing
> tasks, I think that if you queried the underlying metadata database
> directly, your 'table2' task instances would still exist, but the UI
> doesn't know that it should show them based on the DAG definition files it
> has on hand during webserver process reload.
>
> I have non-dynamic DAGs too, and when I change a DAG's shape dramatically
> (including removing tasks) I usually create an entirely new DAG (in
> practice this means changing the dag_id of the DAG object in the DAG
> definition file, for example 'my_dag' becomes 'my_dag_v2') so that there is
> no confusion of it being tied to previous history. If you choose to keep
> your previous DAG definition file ('my_dag') but have the scheduler for
> that DAG in the off state, and add in the new DAG in the on state
> ('my_dag_v2') the UI will render both as different DAGs and you can
> navigate through history with the UI as normal.
>
> This has been discussed as the preferred workaround for several different
> types of major DAG configuration changes (such as a start_date further in
> the past than the original version was configured for), but I'm not sure if
> anything has been going on (yet?) to redesign it in any way. I believe as
> mentioned it is basically a side effect of depending on the DAG definition
> files to draw graphs and trees as opposed to history.
>
> Sort of an aside, but relevant if you are changing DAG shape with any
> frequency: when we add tasks to an existing DAG, the tree view/graph view
> fills in the added task for all of history with the state 'no status'. If
> that DAG has depends_on_past=True, this will actually clog up a new DagRun
> unless I do something to force the new task, which has no previous task
> instance history, to execute regardless.
>
> On Sun, Oct 16, 2016 at 7:04 PM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
> > I opened a JIRA - looks like based on comments in other threads, it does
> > not work properly right now.
> >
> > [AIRFLOW-574] Show Graph/Tree view and Task Instance logs using executed
> > DagRun, not current

Re: Fwd: Issue with Dynamically created tasks in a DAG

Posted by Laura Lorenz <ll...@industrydive.com>.
I think the source of the confusion you're experiencing is that the UI is
based off of the DAG definition file at time of webserver load, which I
believe is on the one hand defensible since the scheduler operates in a
somewhat similar way, but on the other hand rightfully confusing (and
doesn't TOTALLY mimic scheduler activities so it's really the worst of both
worlds IMHO). When you change the DAG definition file, you have to kick the
webserver to pick up new graph/tree drawings. In the case of removing
tasks, I think that if you queried the underlying metadata database
directly, your 'table2' task instances would still exist, but the UI
doesn't know that it should show them based on the DAG definition files it
has on hand during webserver process reload.
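
To make the "query the metadata database directly" point concrete, here is a
small self-contained sketch. It uses an in-memory SQLite table as a stand-in
for Airflow's metadata database; the `task_instance` table name and the four
columns follow the real table as I understand it, but treat them as
assumptions and check them against your own database. The rows, dates and
the `current_definition` set are made up for illustration.

```python
import sqlite3

# Stand-in for the metadata database (a real one has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT)"
)

# Hypothetical history: table2 ran on the first date, then was removed.
rows = [
    ("start", "dynamic_job_proto_v1", "2016-10-13", "success"),
    ("t_table1", "dynamic_job_proto_v1", "2016-10-13", "success"),
    ("t_table2", "dynamic_job_proto_v1", "2016-10-13", "success"),
    ("t_table3", "dynamic_job_proto_v1", "2016-10-13", "success"),
    ("start", "dynamic_job_proto_v1", "2016-10-14", "success"),
    ("t_table1", "dynamic_job_proto_v1", "2016-10-14", "success"),
    ("t_table3", "dynamic_job_proto_v1", "2016-10-14", "success"),
]
conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", rows)


def task_ids_with_history(connection, dag_id):
    """Return every task_id that has at least one recorded task instance."""
    cursor = connection.execute(
        "SELECT DISTINCT task_id FROM task_instance "
        "WHERE dag_id = ? ORDER BY task_id",
        (dag_id,),
    )
    return [row[0] for row in cursor.fetchall()]


# Task ids in the DAG definition file as currently loaded (table2 removed).
current_definition = {"start", "t_table1", "t_table3"}

# Tasks that still have history but are no longer in the definition file.
orphaned = [
    task_id
    for task_id in task_ids_with_history(conn, "dynamic_job_proto_v1")
    if task_id not in current_definition
]
print(orphaned)
```

The history rows for the removed task are still there; only the definition
that the UI draws from has changed.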

I have non-dynamic DAGs too, and when I change a DAG's shape dramatically
(including removing tasks) I usually create an entirely new DAG (in
practice this means changing the dag_id of the DAG object in the DAG
definition file, for example 'my_dag' becomes 'my_dag_v2') so that there is
no confusion of it being tied to previous history. If you choose to keep
your previous DAG definition file ('my_dag') but have the scheduler for
that DAG in the off state, and add in the new DAG in the on state
('my_dag_v2') the UI will render both as different DAGs and you can
navigate through history with the UI as normal.

This has been discussed as the preferred workaround for several different
types of major DAG configuration changes (such as a start_date further in
the past than the original version was configured for), but I'm not sure if
anything has been going on (yet?) to redesign it in any way. I believe as
mentioned it is basically a side effect of depending on the DAG definition
files to draw graphs and trees as opposed to history.

Sort of an aside, but relevant if you are changing DAG shape with any
frequency: when we add tasks to an existing DAG, the tree view/graph view
fills in the added task for all of history with the state 'no status'. If
that DAG has depends_on_past=True, this will actually clog up a new DagRun
unless I do something to force the new task, which has no previous task
instance history, to execute regardless.
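
The depends_on_past problem can be pictured with a toy check like the one
below. This is not Airflow's scheduler code, just a simplified model of the
rule as described above; the function name and the
`ignore_depends_on_past` parameter are made up for the illustration, and
the real scheduler handles more cases (such as a task's very first
scheduled run).

```python
def can_run(previous_state, depends_on_past, ignore_depends_on_past=False):
    """Toy depends_on_past check for a single task instance.

    previous_state is the state of the same task in the previous DagRun,
    or None when no task instance exists there ('no status' in the UI).
    """
    if not depends_on_past or ignore_depends_on_past:
        return True
    # With depends_on_past, the previous instance must have succeeded.
    return previous_state == "success"


# An existing task whose previous run succeeded is free to run.
existing_task_ok = can_run("success", depends_on_past=True)

# A newly added task has no previous instance, so it is blocked...
new_task_blocked = not can_run(None, depends_on_past=True)

# ...unless something forces it to run regardless of history.
new_task_forced = can_run(None, depends_on_past=True,
                          ignore_depends_on_past=True)
```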

On Sun, Oct 16, 2016 at 7:04 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

> I opened a JIRA - looks like based on comments in other threads, it does
> not work properly right now.
>
> [AIRFLOW-574] Show Graph/Tree view and Task Instance logs using executed
> DagRun, not current

Re: Fwd: Issue with Dynamically created tasks in a DAG

Posted by Boris Tyukin <bo...@boristyukin.com>.
I opened a JIRA. Based on comments in other threads, it looks like this does not work properly right now.

[AIRFLOW-574] Show Graph/Tree view and Task Instance logs using executed DagRun, not current



