You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Chao-Han Tsai <mi...@gmail.com> on 2019/07/20 00:26:47 UTC
Re: [2.0 spring cleaning] Deprecate subdags
Hi all,
Just want to bump this thread up again as I have a PR (https://github.com/apache/airflow/pull/5498 <https://github.com/apache/airflow/pull/5498>) that makes SubDagOperator use normal scheduler instead of backfill scheduler to schedule tasks in subdags and would love to have some feedback from you guys.
Chao-Han
On 2019/05/16 05:40:53, Chao-Han Tsai <m....@gmail.com> wrote:
> Hi all,>
>
> I have a WIP PR <https://github.com/apache/airflow/pull/5279/files> that>
> aims to make SubDagOperator to use normal scheduler instead of the backfill>
> scheduler and would love to have some feedbacks from you with regard to the>
> implementation.>
>
> The basic idea is that we create a DagRun when executing the SubDagOperator>
> and wait until the DagRun to finish. Airflow scheduler picks up the DagRun>
> and executes the tasks.>
>
> Also I have another PR (https://github.com/apache/airflow/pull/5283) that>
> consistently set the is_paused state for parent DAGs and all of its>
> subdags, which I believe is required if we want to utilize normal scheduler>
> to schedule tasks in SubDagOperator.>
>
> Thanks,>
>
> Chao-Han>
>
>
>
> On Mon, Apr 15, 2019 at 4:39 AM Dan Davydov <dd...@twitter.com.invalid>>
> wrote:>
>
> > I don't think fixing subdags to run in the scheduler is enough, although>
> > it's a huge improvement over the current implementation (especially the>
> > part that lets Subdags specify custom executors). From my experience with>
> > Subdags, I think what makes more sense is adding various operators to allow>
> > combining regular DAGs.>
> >>
> > Here are a some other issues with Subdags off the top of my head:>
> > - Confusing/separate UI and clearing/running semantics (e.g. tasks in the>
> > Subdag will not get scheduled if you clear them but not the parent>
> > operator)>
> > - Nested Subdags are hard to work with in the UI (and IIRC don't behave>
> > correctly but I might be wrong on this).>
> > - The abstraction is confusing, e.g. looking at the log for the>
> > SubdagOperator task can be a bit confusing as>
> > - Tons of custom special-case logic in the Airflow code and schemas in the>
> > DB to handle Subdags which have led to a lot of complexity and a constant>
> > source of tricky bugs and upgrade issues>
> > - Additional abstraction that users have to learn>
> >>
> > An alternative would be allowing combining DAGs, e.g. something like:>
> > dag = DAG()>
> > dag_task1 = Op(dag = dag)>
> > dag_task2 = Op(dag = dag)>
> >>
> > subdag = DAG()>
> > subdag_task1 = Op(dag = subdag)>
> > subdag_task2 = Op(dag = subdag)>
> >>
> > dag_task1 >> subdag >> dag_task2>
> > # Results in the following topology:>
> > # ,-> subdag_task1 ---v>
> > # dag_task1 dag_task2>
> > # '-> subdag_task2----^>
> >>
> > This is also a lot more easily composable than subdags, and provides a more>
> > powerful abstraction, e.g. you don't need additional boilerplate to create>
> > subdags such as setting up an operator.>
> >>
> > On Sat, Apr 13, 2019 at 12:14 PM Felix Uellendall <fe...@gmx.de>
> > >>
> > wrote:>
> >>
> > > -1 on deprecating subdags, because of the extra level of abstraction>
> > > some of you already mentioned.>
> > >>
> > > We also use subdags in production.>
> > >>
> > > For example in cases where we get json data from an API but since we>
> > > mostly need it to be in csv format we have a subdag like>
> > > /specific_ap//i_specific_endpoint/_to_s3 that has two tasks one for>
> > > retrieving data from the API and loading it to s3 and one for>
> > > transforming it into csv.>
> > > This has the advantage that you don't need to think of how you transform>
> > > the json to csv. In our case Data Analyst don't want to think about>
> > > that. They want to work with tabular data.>
> > >>
> > > We also have subdags that are handling API cursoring/pagination (by>
> > > using xcom) and merging these multiple API response data into one file.>
> > > So you call one Task, a subdag operator with this subdag and get only>
> > > what you really need - the data.>
> > >>
> > > I really like subdags and I am for improving / maybe redesigning or>
> > > reimplementing of subdags.>
> > >>
> > > -feluelle>
> > >>
> > > Am 13/04/2019 um 07:52 schrieb Chao-Han Tsai:>
> > > > +1 on keeping it.>
> > > >>
> > > > I think we should keep the SubDags as it provides a good abstraction>
> > > layer.>
> > > > It just need some love from us to fix the underlying>
> > > > performance/reliability issues.>
> > > >>
> > > > On Fri, Apr 12, 2019 at 12:06 PM Ash Berlin-Taylor <as...@apache.org>>
> > > wrote:>
> > > >>
> > > >> This is what I was thinking - the dag collector in the scheduler>
> > should>
> > > >> "just" be able to collect the tasks for subdags up to the parent dag.>
> > > I'd>
> > > >> possibly go as far as saying no DagRun object for subdags too.>
> > > >>>
> > > >> (Yes, "just" will never be that simple).>
> > > >>>
> > > >> -a>
> > > >>>
> > > >> On 12 April 2019 18:37:24 BST, Bolke de Bruin <bd...@gmail.com>>
> > > wrote:>
> > > >>> +1>
> > > >>>>
> > > >>> Sub dags should be fixed within the scheduler and run normally.>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>> On 12 April 2019 at 19:36:27, Feng Lu (fenglu@google.com.invalid)>
> > > >>> wrote:>
> > > >>>>
> > > >>> Agree with others who think SubDag should stay, we should fix the>
> > > >>> SubDag>
> > > >>> implementation but not remove the abstraction itself.>
> > > >>>>
> > > >>> On Fri, Apr 12, 2019 at 8:42 AM Chen Tong <ci...@gmail.com> wrote:>
> > > >>>>
> > > >>>> Is it possible to re-implement it in the view-level, not in operator>
> > > >>> level?>
> > > >>>> And this operator is just define a different view in GUI, that these>
> > > >>> tasks>
> > > >>>> will be collapsed into another view.>
> > > >>>>>
> > > >>>> On Fri, Apr 12, 2019 at 11:31 AM James Meickle>
> > > >>>> <jm...@quantopian.com.invalid> wrote:>
> > > >>>>>
> > > >>>>> I have avoided using them because of outstanding issues like the>
> > > >>> open>
> > > >>>> JIRA>
> > > >>>>> issues I linked above, or similar issues that I've read about on>
> > > >>> blog>
> > > >>>>> posts. If it were just GUI or UX issues I'd use them, but many>
> > > >>> people>
> > > >>>> have>
> > > >>>>> reported issues which affect concurrency/stability, consistency, or>
> > > >>>>> correctness of results. I believe that it's working for you, but>
> > > >>> for>
> > > >>> me,>
> > > >>>>> it's not worth the risk to build using them in my environment (even>
> > > >>>> though>
> > > >>>>> they could be handy for many of our workflows).>
> > > >>>>>>
> > > >>>>> On Fri, Apr 12, 2019 at 11:18 AM Kaxil Naik <ka...@gmail.com>>
> > > >>> wrote:>
> > > >>>>>> I have been using SubDags in production and haven't had much>
> > > >>> problem>
> > > >>>> with>
> > > >>>>>> it.>
> > > >>>>>>>
> > > >>>>>> Can you list the issues you had?>
> > > >>>>>>>
> > > >>>>>> Regards,>
> > > >>>>>> Kaxil>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>> On Fri, Apr 12, 2019, 16:16 James Meickle>
> > > >>> <jm...@quantopian.com>
> > > >>>>>> .invalid>>
> > > >>>>>> wrote:>
> > > >>>>>>>
> > > >>>>>>> Given their bad reputation, would it be appropriate to>
> > > >>> deprecate>
> > > >>>>> subDAGs>
> > > >>>>>>> now to advertise that they're no longer considered a suitable>
> > > >>>>>>> implementation? If a new and better implementation is created,>
> > > >>> would>
> > > >>>> it>
> > > >>>>>>> even be similar enough to subDAGs that we'd want to continue to>
> > > >>> call>
> > > >>>>> the>
> > > >>>>>>> feature that?>
> > > >>>>>>>>
> > > >>>>>>> They feel like a "new Airflow user trap" right now - I have had>
> > > >>> to>
> > > >>>> tell>
> > > >>>>>> my>
> > > >>>>>>> team never to use them, because they seem appealing and are in>
> > > >>> the>
> > > >>>>>> official>
> > > >>>>>>> docs.>
> > > >>>>>>>>
> > > >>>>>>> On Fri, Apr 12, 2019 at 10:51 AM Ash Berlin-Taylor>
> > > >>> <as...@apache.org>>
> > > >>>>>> wrote:>
> > > >>>>>>>> I'd like to find time to fix subdags as they do provide a>
> > > >>> useful>
> > > >>>>>>>> abstraction - but I agree right now they aren't great (I>
> > > >>> avoid>
> > > >>> them>
> > > >>>>>>> because>
> > > >>>>>>>> of this)>
> > > >>>>>>>>>
> > > >>>>>>>> I have half thoughts of how to it should work, I just need to>
> > > >>> look>
> > > >>>> at>
> > > >>>>>> the>
> > > >>>>>>>> code in depth to see if that makes sense. Now 1.10.3 is out I>
> > > >>> might>
> > > >>>>>> have>
> > > >>>>>>> a>
> > > >>>>>>>> bit more time to do this.>
> > > >>>>>>>>>
> > > >>>>>>>> -ash>
> > > >>>>>>>>>
> > > >>>>>>>>> On 12 Apr 2019, at 15:48, James Meickle>
> > > >>> <jm...@quantopian.com>
> > > >>>>>>> .INVALID>>
> > > >>>>>>>> wrote:>
> > > >>>>>>>>> I think we should deprecate SubDAGs given the complexity>
> > > >>> they>
> > > >>> add>
> > > >>>>> and>
> > > >>>>>>> the>
> > > >>>>>>>>> limited usage and use cases. Or, we should invest effort in>
> > > >>>>>> redesigning>
> > > >>>>>>>>> their API and implementation. I think that having to>
> > > >>> account>
> > > >>> for>
> > > >>>>>>>>> subdag-introduced complexity makes Airflow's code much>
> > > >>> harder>
> > > >>> to>
> > > >>>>>>> maintain>
> > > >>>>>>>>> and buggier, looking at how many open issues there are that>
> > > >>>>> reference>
> > > >>>>>>>>> subdags (and how unrelated in topic they are):>
> > > >>>>>>>>>>
> > > >>>
> > >>
> > https://issues.apache.org/jira/browse/AIRFLOW-3292?jql=project%20%3D%20AIRFLOW%20AND%20status%20%3D%20Open%20AND%20text%20~%20%22subdag%22>
> > > >>>>>>>>>
> > > >>
> > >>
> >>
>
>
> -- >
>
> Chao-Han Tsai>
>