Posted to dev@airflow.apache.org by Soma S Dhavala <so...@gmail.com> on 2018/11/22 10:04:09 UTC

programmatically creating and airflow quirks

Hey AirFlow Devs:
In our organization, we build a Machine Learning WorkBench with AirFlow as
an orchestrator of the ML workflows, and have wrapped AirFlow Python
operators to customize the behaviour. These workflows are specified in
YAML.

We drop a DAG loader (written in Python) in the default location where
airflow expects the DAG files. This DAG loader reads the specified YAML
files and converts them into airflow DAG objects. Essentially, we are
programmatically creating the DAG objects. In order to support multiple
parsers (YAML, JSON, etc.), we separated the DAG creation from the loading.
But when a DAG is created (in a separate module) and made available to the
DAG loader, airflow does not pick it up. As an example, consider that I
created a DAG, pickled it, and will simply unpickle the DAG and give it to
airflow.

However, in the current avatar of airflow, the very creation of the DAG has
to happen in the loader itself. As far as I am concerned, airflow should not
care where and how the DAG object is created, so long as it is a valid DAG
object. The workaround for us is to mix the parser and loader in the same
file and drop it in the airflow default dags folder. During dag_bag
creation, this file is loaded with the import_modules utility and shows up
in the UI. While this is a solution, it is not clean.
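
For concreteness, a minimal sketch of such a combined parser + loader file
(the YAML layout, the glob path and the create_dag_from_yaml helper here are
illustrative placeholders, not our actual wrapper code):

# yaml_dag_loader.py -- dropped into the default dags folder.
# The words "airflow" and "DAG" appear literally in this file, so the
# DagBag string heuristic does not skip it.
import glob
from datetime import datetime

import yaml

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def create_dag_from_yaml(path):
    """Illustrative parser: build one DAG from one YAML spec."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    dag = DAG(dag_id=spec["dag_id"],
              start_date=datetime(2018, 1, 1),
              schedule_interval=spec.get("schedule"))
    for task_id in spec.get("tasks", []):
        DummyOperator(task_id=task_id, dag=dag)
    return dag


# Register each generated DAG under its own module-level name so that
# every one of them is picked up, not just the last assignment.
for yaml_file in glob.glob("/path/to/specs/*.yaml"):  # placeholder path
    dag = create_dag_from_yaml(yaml_file)
    globals()[dag.dag_id] = dag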

What do DEVs think about a solution to this problem? Will saving the DAG to
the db and reading it back from the db work? Or do some core changes need to
happen in the dag_bag creation? Can dag_bag take a bunch of "created" DAGs?

thanks,
-soma

Re: programmatically creating and airflow quirks

Posted by soma dhavala <so...@gmail.com>.
Great inputs, James. I was premature in saying we need micro-services. Any solution should depend on the problem(s) being solved and the promise(s) being made.

thanks,
-soma

> On Nov 28, 2018, at 11:24 PM, James Meickle <jm...@quantopian.com.INVALID> wrote:
> 
> I would be very interested in helping draft a rearchitecting AIP. Of
> course, that's a vague statement. I am interested in several specific areas
> of Airflow functionality that would be hard to modify without some
> refactoring taking place first:
> 
> 1) Improving Airflow's data model so it's easier to have functional data
> pipelines (such as addressing information propagation and artifacts via a
> non-xcom mechanism)
> 
> 2) Having point-in-timeness for DAGs: a concept of which revision of a DAG
> was in use at which date, represented in-Airflow.
> 
> 3) Better idioms and loading capabilities for DAG factories (either
> config-driven, or non-Python creation of DAGs, like with boundary-layer).
> 
> 4) Flexible execution dates: in finance we operate day over day, and have
> valid use cases for "t-1", "t+0", and "t+1" dates. The current execution
> date status is incredibly confusing for literally every developer we've
> brought onto Airflow (they understand it eventually but do make mistakes at
> first).
> 
> 5) Scheduler-integrated sensors
> 
> 6) Making Airflow more operator-friendly with better alerting, health
> checks, notifications, deploy-time configuration, etc.
> 
> 7) Improving testability of various components (both within the Airflow
> repo, as well as making it easier to test DAGs and plugins)
> 
> 8) Deprecating "newbie trap" or excess complexity features (like skips), by
> fixing their internal implementation or by providing alternatives that
> address their use cases in more sound ways.
> 
> To my mind, I would need Airflow to be more modular to accomplish several
> of those. Even if these aims don't happen in Airflow contrib (as some are
> quite contentious and have been discussed on this list before), it would
> currently be nearly impossible to maintain an in-house branch that
> attempted to implement them.
> 
> That being said, saying that it requires microservices is IMO incorrect.
> Airflow already scales quite well, so while it needs more modularization,
> we probably would see no benefit from immediately breaking those modules
> into independent services.
> 
> On Wed, Nov 28, 2018 at 11:38 AM Ash Berlin-Taylor <as...@apache.org> wrote:
> 
>> I have similar feelings around the "core" of Airflow and would _love_ to
>> somehow find time to spend a month really getting to grips with the
>> scheduler and the dagbag and see what comes to light with fresh eyes and
>> the benefits of hindsight.
>> 
>> Finding that time is going to be.... A Challenge though.
>> 
>> (Oh, except no to microservices. Airflow is hard enough to operator right
>> now without splitting things in to even more daemons)
>> 
>> -ash
>>> On 26 Nov 2018, at 03:06, soma dhavala <so...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>>> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <
>> maximebeauchemin@gmail.com> wrote:
>>>> 
>>>> The historical reason is that people would check in scripts in the repo
>>>> that had actual compute or other forms or undesired effect in module
>> scope
>>>> (scripts with no "if __name__ == '__main__':") and Airflow would just
>> run
>>>> this script while seeking for DAGs. So we added this mitigation patch
>> that
>>>> would confirm that there's something Airflow-related in the .py file.
>> Not
>>>> elegant, and confusing at times, but it also probably prevented some
>> issues
>>>> over the years.
>>>> 
>>>> The solution here is to have a more explicit way of adding DAGs to the
>>>> DagBag (instead of the folder-crawling approach). The DagFetcher
>> proposal
>>>> offers solutions around that, having a central "manifest" file that
>>>> provides explicit pointers to all DAGs in the environment.
>>> 
>>> Some rebasing needs to happen. When I looked at 1.8 code base almost an
>> year ago, it felt like more complex than necessary.  What airflow is trying
>> to promise from an architectural standpoint — that was not clear to me. It
>> is trying to do too many things, scattered in too many places, is the
>> feeling I got. As a result, I stopped peeping, and just trust that it works
>> — which it does, btw. I tend to think that, airflow outgrew its original
>> intents. A sort of micro-services architecture has to be brought in. I may
>> sound critical, but no offense. I truly appreciate the contributions.
>>> 
>>>> 
>>>> Max
>>>> 
>>>> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <be...@gmail.com>
>>>> wrote:
>>>> 
>>>>> In my opinion this searching for dags is not ideal.
>>>>> 
>>>>> We should be explicitly specifying the dags to load somewhere.
>>>>> 
>>>>> 
>>>>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yr...@gmail.com> wrote:
>>>>>> 
>>>>>> I believe that is mostly because we want to skip parsing/loading .py
>>>>> files
>>>>>> that doesn't contain DAG defs to save time, as scheduler is going to
>>>>>> parse/load the .py files over and over again and some files can take
>>>>> quite
>>>>>> long to load.
>>>>>> 
>>>>>> Cheers,
>>>>>> Kevin Y
>>>>>> 
>>>>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhavala@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> happy to report that the “fix” worked. thanks Alex.
>>>>>>> 
>>>>>>> btw, wondering why was it there in the first place? how does it help
>> —
>>>>>>> saves time, early termination — what?
>>>>>>> 
>>>>>>> 
>>>>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> Yup.
>>>>>>>> 
>>>>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <
>> soma.dhavala@gmail.com
>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
>>>>>>> <ma...@airbnb.com>> wrote:
>>>>>>>>> 
>>>>>>>>> It’s because of this
>>>>>>>>> 
>>>>>>>>> “When searching for DAGs, Airflow will only consider files where
>> the
>>>>>>> string “airflow” and “DAG” both appear in the contents of the .py
>> file.”
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Have not noticed it.  From airflow/models.py, in process_file —
>> (both
>>>>> in
>>>>>>> 1.9 and 1.10)
>>>>>>>> ..
>>>>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>>>>> ..
>>>>>>>> is looking for those strings and if they are not found, it is
>> returning
>>>>>>> without loading the DAGs.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> So having “airflow” and “DAG”  dummy strings placed somewhere will
>> make
>>>>>>> it work?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <
>> soma.dhavala@gmail.com
>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
>>>>>>> <ma...@airbnb.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> I think this is what is going on. The dags are picked by local
>>>>>>> variables. I.E. if you do
>>>>>>>>>> dag = Dag(...)
>>>>>>>>>> dag = Dag(…)
>>>>>>>>> 
>>>>>>>>> from my_module import create_dag
>>>>>>>>> 
>>>>>>>>> for file in yaml_files:
>>>>>>>>>  dag = create_dag(file)
>>>>>>>>>  globals()[dag.dag_id] = dag
>>>>>>>>> 
>>>>>>>>> You notice that create_dag is in a different module. If it is in
>> the
>>>>>>> same scope (file), it will be fine.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Only the second dag will be picked up.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
>>>>> soma.dhavala@gmail.com
>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>> Hey AirFlow Devs:
>>>>>>>>>> In our organization, we build a Machine Learning WorkBench with
>>>>>>> AirFlow as
>>>>>>>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow
>> python
>>>>>>>>>> operators to customize the behaviour. These work flows are
>> specified
>>>>> in
>>>>>>>>>> YAML.
>>>>>>>>>> 
>>>>>>>>>> We drop a DAG loader (written python) in the default location
>> airflow
>>>>>>>>>> expects the DAG files.  This DAG loader reads the specified YAML
>>>>> files
>>>>>>> and
>>>>>>>>>> converts them into airflow DAG objects. Essentially, we are
>>>>>>>>>> programmatically creating the DAG objects. In order to support
>>>>> muliple
>>>>>>>>>> parsers (yaml, json etc), we separated the DAG creation from
>> loading.
>>>>>>> But
>>>>>>>>>> when a DAG is created (in a separate module) and made available to
>>>>> the
>>>>>>> DAG
>>>>>>>>>> loaders, airflow does not pick it up. As an example, consider
>> that I
>>>>>>>>>> created a DAG picked it, and will simply unpickle the DAG and
>> give it
>>>>>>> to
>>>>>>>>>> airflow.
>>>>>>>>>> 
>>>>>>>>>> However, in current avatar of airfow, the very creation of DAG
>> has to
>>>>>>>>>> happen in the loader itself. As far I am concerned, airflow should
>>>>> not
>>>>>>> care
>>>>>>>>>> where and how the DAG object is created, so long as it is a valid
>> DAG
>>>>>>>>>> object. The workaround for us is to mix parser and loader in the
>> same
>>>>>>> file
>>>>>>>>>> and drop it in the airflow default dags folder. During dag_bag
>>>>>>> creation,
>>>>>>>>>> this file is loaded up with import_modules utility and shows up in
>>>>> the
>>>>>>> UI.
>>>>>>>>>> While this is a solution, but it is not clean.
>>>>>>>>>> 
>>>>>>>>>> What do DEVs think about a solution to this problem? Will saving
>> the
>>>>>>> DAG to
>>>>>>>>>> the db and reading it from the db work? Or some core changes need
>> to
>>>>>>> happen
>>>>>>>>>> in the dag_bag creation. Can dag_bag take a bunch of "created"
>> DAGs.
>>>>>>>>>> 
>>>>>>>>>> thanks,
>>>>>>>>>> -soma
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>> 


Re: programmatically creating and airflow quirks

Posted by James Meickle <jm...@quantopian.com.INVALID>.
I would be very interested in helping draft a rearchitecting AIP. Of
course, that's a vague statement. I am interested in several specific areas
of Airflow functionality that would be hard to modify without some
refactoring taking place first:

1) Improving Airflow's data model so it's easier to have functional data
pipelines (such as addressing information propagation and artifacts via a
non-xcom mechanism)

2) Having point-in-timeness for DAGs: a concept of which revision of a DAG
was in use at which date, represented in-Airflow.

3) Better idioms and loading capabilities for DAG factories (either
config-driven, or non-Python creation of DAGs, like with boundary-layer).

4) Flexible execution dates: in finance we operate day over day, and have
valid use cases for "t-1", "t+0", and "t+1" dates. The current execution
date status is incredibly confusing for literally every developer we've
brought onto Airflow (they understand it eventually but do make mistakes at
first).

5) Scheduler-integrated sensors

6) Making Airflow more operator-friendly with better alerting, health
checks, notifications, deploy-time configuration, etc.

7) Improving testability of various components (both within the Airflow
repo, as well as making it easier to test DAGs and plugins)

8) Deprecating "newbie trap" or excess complexity features (like skips), by
fixing their internal implementation or by providing alternatives that
address their use cases in more sound ways.

To my mind, I would need Airflow to be more modular to accomplish several
of those. Even if these aims don't happen in Airflow contrib (as some are
quite contentious and have been discussed on this list before), it would
currently be nearly impossible to maintain an in-house branch that
attempted to implement them.

That being said, saying that it requires microservices is IMO incorrect.
Airflow already scales quite well, so while it needs more modularization,
we probably would see no benefit from immediately breaking those modules
into independent services.

On Wed, Nov 28, 2018 at 11:38 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> I have similar feelings around the "core" of Airflow and would _love_ to
> somehow find time to spend a month really getting to grips with the
> scheduler and the dagbag and see what comes to light with fresh eyes and
> the benefits of hindsight.
>
> Finding that time is going to be.... A Challenge though.
>
> (Oh, except no to microservices. Airflow is hard enough to operator right
> now without splitting things in to even more daemons)
>
> -ash
> > On 26 Nov 2018, at 03:06, soma dhavala <so...@gmail.com> wrote:
> >
> >
> >
> >> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
> >>
> >> The historical reason is that people would check in scripts in the repo
> >> that had actual compute or other forms or undesired effect in module
> scope
> >> (scripts with no "if __name__ == '__main__':") and Airflow would just
> run
> >> this script while seeking for DAGs. So we added this mitigation patch
> that
> >> would confirm that there's something Airflow-related in the .py file.
> Not
> >> elegant, and confusing at times, but it also probably prevented some
> issues
> >> over the years.
> >>
> >> The solution here is to have a more explicit way of adding DAGs to the
> >> DagBag (instead of the folder-crawling approach). The DagFetcher
> proposal
> >> offers solutions around that, having a central "manifest" file that
> >> provides explicit pointers to all DAGs in the environment.
> >
> > Some rebasing needs to happen. When I looked at 1.8 code base almost an
> year ago, it felt like more complex than necessary.  What airflow is trying
> to promise from an architectural standpoint — that was not clear to me. It
> is trying to do too many things, scattered in too many places, is the
> feeling I got. As a result, I stopped peeping, and just trust that it works
> — which it does, btw. I tend to think that, airflow outgrew its original
> intents. A sort of micro-services architecture has to be brought in. I may
> sound critical, but no offense. I truly appreciate the contributions.
> >
> >>
> >> Max
> >>
> >> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <be...@gmail.com>
> >> wrote:
> >>
> >>> In my opinion this searching for dags is not ideal.
> >>>
> >>> We should be explicitly specifying the dags to load somewhere.
> >>>
> >>>
> >>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yr...@gmail.com> wrote:
> >>>>
> >>>> I believe that is mostly because we want to skip parsing/loading .py
> >>> files
> >>>> that doesn't contain DAG defs to save time, as scheduler is going to
> >>>> parse/load the .py files over and over again and some files can take
> >>> quite
> >>>> long to load.
> >>>>
> >>>> Cheers,
> >>>> Kevin Y
> >>>>
> >>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhavala@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> happy to report that the “fix” worked. thanks Alex.
> >>>>>
> >>>>> btw, wondering why was it there in the first place? how does it help
> —
> >>>>> saves time, early termination — what?
> >>>>>
> >>>>>
> >>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com>
> >>> wrote:
> >>>>>>
> >>>>>> Yup.
> >>>>>>
> >>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <
> soma.dhavala@gmail.com
> >>>>> <ma...@gmail.com>> wrote:
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
> >>>>> <ma...@airbnb.com>> wrote:
> >>>>>>>
> >>>>>>> It’s because of this
> >>>>>>>
> >>>>>>> “When searching for DAGs, Airflow will only consider files where
> the
> >>>>> string “airflow” and “DAG” both appear in the contents of the .py
> file.”
> >>>>>>>
> >>>>>>
> >>>>>> Have not noticed it.  From airflow/models.py, in process_file —
> (both
> >>> in
> >>>>> 1.9 and 1.10)
> >>>>>> ..
> >>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
> >>>>>> ..
> >>>>>> is looking for those strings and if they are not found, it is
> returning
> >>>>> without loading the DAGs.
> >>>>>>
> >>>>>>
> >>>>>> So having “airflow” and “DAG”  dummy strings placed somewhere will
> make
> >>>>> it work?
> >>>>>>
> >>>>>>
> >>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <
> soma.dhavala@gmail.com
> >>>>> <ma...@gmail.com>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
> >>>>> <ma...@airbnb.com>> wrote:
> >>>>>>>>
> >>>>>>>> I think this is what is going on. The dags are picked by local
> >>>>> variables. I.E. if you do
> >>>>>>>> dag = Dag(...)
> >>>>>>>> dag = Dag(…)
> >>>>>>>
> >>>>>>> from my_module import create_dag
> >>>>>>>
> >>>>>>> for file in yaml_files:
> >>>>>>>   dag = create_dag(file)
> >>>>>>>   globals()[dag.dag_id] = dag
> >>>>>>>
> >>>>>>> You notice that create_dag is in a different module. If it is in
> the
> >>>>> same scope (file), it will be fine.
> >>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>> Only the second dag will be picked up.
> >>>>>>>>
> >>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
> >>> soma.dhavala@gmail.com
> >>>>> <ma...@gmail.com>> wrote:
> >>>>>>>> Hey AirFlow Devs:
> >>>>>>>> In our organization, we build a Machine Learning WorkBench with
> >>>>> AirFlow as
> >>>>>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow
> python
> >>>>>>>> operators to customize the behaviour. These work flows are
> specified
> >>> in
> >>>>>>>> YAML.
> >>>>>>>>
> >>>>>>>> We drop a DAG loader (written python) in the default location
> airflow
> >>>>>>>> expects the DAG files.  This DAG loader reads the specified YAML
> >>> files
> >>>>> and
> >>>>>>>> converts them into airflow DAG objects. Essentially, we are
> >>>>>>>> programmatically creating the DAG objects. In order to support
> >>> muliple
> >>>>>>>> parsers (yaml, json etc), we separated the DAG creation from
> loading.
> >>>>> But
> >>>>>>>> when a DAG is created (in a separate module) and made available to
> >>> the
> >>>>> DAG
> >>>>>>>> loaders, airflow does not pick it up. As an example, consider
> that I
> >>>>>>>> created a DAG picked it, and will simply unpickle the DAG and
> give it
> >>>>> to
> >>>>>>>> airflow.
> >>>>>>>>
> >>>>>>>> However, in current avatar of airfow, the very creation of DAG
> has to
> >>>>>>>> happen in the loader itself. As far I am concerned, airflow should
> >>> not
> >>>>> care
> >>>>>>>> where and how the DAG object is created, so long as it is a valid
> DAG
> >>>>>>>> object. The workaround for us is to mix parser and loader in the
> same
> >>>>> file
> >>>>>>>> and drop it in the airflow default dags folder. During dag_bag
> >>>>> creation,
> >>>>>>>> this file is loaded up with import_modules utility and shows up in
> >>> the
> >>>>> UI.
> >>>>>>>> While this is a solution, but it is not clean.
> >>>>>>>>
> >>>>>>>> What do DEVs think about a solution to this problem? Will saving
> the
> >>>>> DAG to
> >>>>>>>> the db and reading it from the db work? Or some core changes need
> to
> >>>>> happen
> >>>>>>>> in the dag_bag creation. Can dag_bag take a bunch of "created"
> DAGs.
> >>>>>>>>
> >>>>>>>> thanks,
> >>>>>>>> -soma
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >
>
>

Re: programmatically creating and airflow quirks

Posted by Ash Berlin-Taylor <as...@apache.org>.
I have similar feelings around the "core" of Airflow and would _love_ to somehow find time to spend a month really getting to grips with the scheduler and the dagbag and see what comes to light with fresh eyes and the benefits of hindsight.

Finding that time is going to be.... A Challenge though.

(Oh, except no to microservices. Airflow is hard enough to operate right now without splitting things into even more daemons)

-ash
> On 26 Nov 2018, at 03:06, soma dhavala <so...@gmail.com> wrote:
> 
> 
> 
>> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <ma...@gmail.com> wrote:
>> 
>> The historical reason is that people would check in scripts in the repo
>> that had actual compute or other forms or undesired effect in module scope
>> (scripts with no "if __name__ == '__main__':") and Airflow would just run
>> this script while seeking for DAGs. So we added this mitigation patch that
>> would confirm that there's something Airflow-related in the .py file. Not
>> elegant, and confusing at times, but it also probably prevented some issues
>> over the years.
>> 
>> The solution here is to have a more explicit way of adding DAGs to the
>> DagBag (instead of the folder-crawling approach). The DagFetcher proposal
>> offers solutions around that, having a central "manifest" file that
>> provides explicit pointers to all DAGs in the environment.
> 
> Some rebasing needs to happen. When I looked at 1.8 code base almost an year ago, it felt like more complex than necessary.  What airflow is trying to promise from an architectural standpoint — that was not clear to me. It is trying to do too many things, scattered in too many places, is the feeling I got. As a result, I stopped peeping, and just trust that it works — which it does, btw. I tend to think that, airflow outgrew its original intents. A sort of micro-services architecture has to be brought in. I may sound critical, but no offense. I truly appreciate the contributions.    
> 
>> 
>> Max
>> 
>> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <be...@gmail.com>
>> wrote:
>> 
>>> In my opinion this searching for dags is not ideal.
>>> 
>>> We should be explicitly specifying the dags to load somewhere.
>>> 
>>> 
>>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yr...@gmail.com> wrote:
>>>> 
>>>> I believe that is mostly because we want to skip parsing/loading .py
>>> files
>>>> that doesn't contain DAG defs to save time, as scheduler is going to
>>>> parse/load the .py files over and over again and some files can take
>>> quite
>>>> long to load.
>>>> 
>>>> Cheers,
>>>> Kevin Y
>>>> 
>>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <so...@gmail.com>
>>>> wrote:
>>>> 
>>>>> happy to report that the “fix” worked. thanks Alex.
>>>>> 
>>>>> btw, wondering why was it there in the first place? how does it help —
>>>>> saves time, early termination — what?
>>>>> 
>>>>> 
>>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com>
>>> wrote:
>>>>>> 
>>>>>> Yup.
>>>>>> 
>>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com
>>>>> <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
>>>>> <ma...@airbnb.com>> wrote:
>>>>>>> 
>>>>>>> It’s because of this
>>>>>>> 
>>>>>>> “When searching for DAGs, Airflow will only consider files where the
>>>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>>>>> 
>>>>>> 
>>>>>> Have not noticed it.  From airflow/models.py, in process_file — (both
>>> in
>>>>> 1.9 and 1.10)
>>>>>> ..
>>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>>> ..
>>>>>> is looking for those strings and if they are not found, it is returning
>>>>> without loading the DAGs.
>>>>>> 
>>>>>> 
>>>>>> So having “airflow” and “DAG”  dummy strings placed somewhere will make
>>>>> it work?
>>>>>> 
>>>>>> 
>>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com
>>>>> <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
>>>>> <ma...@airbnb.com>> wrote:
>>>>>>>> 
>>>>>>>> I think this is what is going on. The dags are picked by local
>>>>> variables. I.E. if you do
>>>>>>>> dag = Dag(...)
>>>>>>>> dag = Dag(…)
>>>>>>> 
>>>>>>> from my_module import create_dag
>>>>>>> 
>>>>>>> for file in yaml_files:
>>>>>>>   dag = create_dag(file)
>>>>>>>   globals()[dag.dag_id] = dag
>>>>>>> 
>>>>>>> You notice that create_dag is in a different module. If it is in the
>>>>> same scope (file), it will be fine.
>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>>> Only the second dag will be picked up.
>>>>>>>> 
>>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
>>> soma.dhavala@gmail.com
>>>>> <ma...@gmail.com>> wrote:
>>>>>>>> Hey AirFlow Devs:
>>>>>>>> In our organization, we build a Machine Learning WorkBench with
>>>>> AirFlow as
>>>>>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>>>>>>>> operators to customize the behaviour. These work flows are specified
>>> in
>>>>>>>> YAML.
>>>>>>>> 
>>>>>>>> We drop a DAG loader (written python) in the default location airflow
>>>>>>>> expects the DAG files.  This DAG loader reads the specified YAML
>>> files
>>>>> and
>>>>>>>> converts them into airflow DAG objects. Essentially, we are
>>>>>>>> programmatically creating the DAG objects. In order to support
>>> muliple
>>>>>>>> parsers (yaml, json etc), we separated the DAG creation from loading.
>>>>> But
>>>>>>>> when a DAG is created (in a separate module) and made available to
>>> the
>>>>> DAG
>>>>>>>> loaders, airflow does not pick it up. As an example, consider that I
>>>>>>>> created a DAG picked it, and will simply unpickle the DAG and give it
>>>>> to
>>>>>>>> airflow.
>>>>>>>> 
>>>>>>>> However, in current avatar of airfow, the very creation of DAG has to
>>>>>>>> happen in the loader itself. As far I am concerned, airflow should
>>> not
>>>>> care
>>>>>>>> where and how the DAG object is created, so long as it is a valid DAG
>>>>>>>> object. The workaround for us is to mix parser and loader in the same
>>>>> file
>>>>>>>> and drop it in the airflow default dags folder. During dag_bag
>>>>> creation,
>>>>>>>> this file is loaded up with import_modules utility and shows up in
>>> the
>>>>> UI.
>>>>>>>> While this is a solution, but it is not clean.
>>>>>>>> 
>>>>>>>> What do DEVs think about a solution to this problem? Will saving the
>>>>> DAG to
>>>>>>>> the db and reading it from the db work? Or some core changes need to
>>>>> happen
>>>>>>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>>>>>>> 
>>>>>>>> thanks,
>>>>>>>> -soma
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
> 


Re: programmatically creating and airflow quirks

Posted by soma dhavala <so...@gmail.com>.

> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <ma...@gmail.com> wrote:
> 
> The historical reason is that people would check in scripts in the repo
> that had actual compute or other forms or undesired effect in module scope
> (scripts with no "if __name__ == '__main__':") and Airflow would just run
> this script while seeking for DAGs. So we added this mitigation patch that
> would confirm that there's something Airflow-related in the .py file. Not
> elegant, and confusing at times, but it also probably prevented some issues
> over the years.
> 
> The solution here is to have a more explicit way of adding DAGs to the
> DagBag (instead of the folder-crawling approach). The DagFetcher proposal
> offers solutions around that, having a central "manifest" file that
> provides explicit pointers to all DAGs in the environment.

Some rebasing needs to happen. When I looked at the 1.8 code base almost a year ago, it felt more complex than necessary. What airflow is trying to promise from an architectural standpoint — that was not clear to me. The feeling I got is that it is trying to do too many things, scattered in too many places. As a result, I stopped peeping, and just trust that it works — which it does, btw. I tend to think that airflow outgrew its original intents. A sort of micro-services architecture has to be brought in. I may sound critical, but no offense. I truly appreciate the contributions.

> 
> Max
> 
> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <be...@gmail.com>
> wrote:
> 
>> In my opinion this searching for dags is not ideal.
>> 
>> We should be explicitly specifying the dags to load somewhere.
>> 
>> 
>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yr...@gmail.com> wrote:
>>> 
>>> I believe that is mostly because we want to skip parsing/loading .py
>> files
>>> that doesn't contain DAG defs to save time, as scheduler is going to
>>> parse/load the .py files over and over again and some files can take
>> quite
>>> long to load.
>>> 
>>> Cheers,
>>> Kevin Y
>>> 
>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <so...@gmail.com>
>>> wrote:
>>> 
>>>> happy to report that the “fix” worked. thanks Alex.
>>>> 
>>>> btw, wondering why was it there in the first place? how does it help —
>>>> saves time, early termination — what?
>>>> 
>>>> 
>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com>
>> wrote:
>>>>> 
>>>>> Yup.
>>>>> 
>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>> 
>>>>> 
>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
>>>> <ma...@airbnb.com>> wrote:
>>>>>> 
>>>>>> It’s because of this
>>>>>> 
>>>>>> “When searching for DAGs, Airflow will only consider files where the
>>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>>>> 
>>>>> 
>>>>> Have not noticed it.  From airflow/models.py, in process_file — (both
>> in
>>>> 1.9 and 1.10)
>>>>> ..
>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>> ..
>>>>> is looking for those strings and if they are not found, it is returning
>>>> without loading the DAGs.
>>>>> 
>>>>> 
>>>>> So having “airflow” and “DAG”  dummy strings placed somewhere will make
>>>> it work?
>>>>> 
>>>>> 
>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
>>>> <ma...@airbnb.com>> wrote:
>>>>>>> 
>>>>>>> I think this is what is going on. The dags are picked by local
>>>> variables. I.E. if you do
>>>>>>> dag = Dag(...)
>>>>>>> dag = Dag(…)
>>>>>> 
>>>>>> from my_module import create_dag
>>>>>> 
>>>>>> for file in yaml_files:
>>>>>>    dag = create_dag(file)
>>>>>>    globals()[dag.dag_id] = dag
>>>>>> 
>>>>>> You notice that create_dag is in a different module. If it is in the
>>>> same scope (file), it will be fine.
>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>>> Only the second dag will be picked up.
>>>>>>> 
>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
>> soma.dhavala@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>>>> Hey AirFlow Devs:
>>>>>>> In our organization, we build a Machine Learning WorkBench with
>>>> AirFlow as
>>>>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>>>>>>> operators to customize the behaviour. These work flows are specified
>> in
>>>>>>> YAML.
>>>>>>> 
>>>>>>> We drop a DAG loader (written python) in the default location airflow
>>>>>>> expects the DAG files.  This DAG loader reads the specified YAML
>> files
>>>> and
>>>>>>> converts them into airflow DAG objects. Essentially, we are
>>>>>>> programmatically creating the DAG objects. In order to support
>> muliple
>>>>>>> parsers (yaml, json etc), we separated the DAG creation from loading.
>>>> But
>>>>>>> when a DAG is created (in a separate module) and made available to
>> the
>>>> DAG
>>>>>>> loaders, airflow does not pick it up. As an example, consider that I
>>>>>>> created a DAG picked it, and will simply unpickle the DAG and give it
>>>> to
>>>>>>> airflow.
>>>>>>> 
>>>>>>> However, in current avatar of airfow, the very creation of DAG has to
>>>>>>> happen in the loader itself. As far I am concerned, airflow should
>> not
>>>> care
>>>>>>> where and how the DAG object is created, so long as it is a valid DAG
>>>>>>> object. The workaround for us is to mix parser and loader in the same
>>>> file
>>>>>>> and drop it in the airflow default dags folder. During dag_bag
>>>> creation,
>>>>>>> this file is loaded up with import_modules utility and shows up in
>> the
>>>> UI.
>>>>>>> While this is a solution, but it is not clean.
>>>>>>> 
>>>>>>> What do DEVs think about a solution to this problem? Will saving the
>>>> DAG to
>>>>>>> the db and reading it from the db work? Or some core changes need to
>>>> happen
>>>>>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> -soma
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 


Re: programmatically creating and airflow quirks

Posted by Maxime Beauchemin <ma...@gmail.com>.
The historical reason is that people would check scripts into the repo
that had actual compute or other forms of undesired side effects in module
scope (scripts with no "if __name__ == '__main__':") and Airflow would just
run these scripts while searching for DAGs. So we added this mitigation
patch that would confirm that there's something Airflow-related in the .py
file. Not elegant, and confusing at times, but it has also probably
prevented some issues over the years.

The solution here is to have a more explicit way of adding DAGs to the
DagBag (instead of the folder-crawling approach). The DagFetcher proposal
offers solutions around that, with a central "manifest" file that provides
explicit pointers to all DAGs in the environment.
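
To make the manifest idea concrete, here is a sketch of the concept only
(not the actual DagFetcher design; the manifest format, file path and
"factory" key are made up for illustration):

# dags/load_from_manifest.py -- mentions "airflow" and "DAG", so it is parsed.
# Instead of crawling a folder, read a manifest that explicitly lists the
# modules and factory functions that produce DAG objects.
import importlib

import yaml

with open("/path/to/dag_manifest.yaml") as f:  # placeholder path
    manifest = yaml.safe_load(f)

for entry in manifest["dags"]:
    module = importlib.import_module(entry["module"])  # e.g. "my_pkg.pipelines"
    dag = getattr(module, entry["factory"])()          # factory returns a DAG
    globals()[dag.dag_id] = dag                        # expose it to the DagBag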

Max

On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <be...@gmail.com>
wrote:

> In my opinion this searching for dags is not ideal.
>
> We should be explicitly specifying the dags to load somewhere.
>
>
> > On 25 Nov 2018, at 10:41 am, Kevin Yang <yr...@gmail.com> wrote:
> >
> > I believe that is mostly because we want to skip parsing/loading .py
> files
> > that doesn't contain DAG defs to save time, as scheduler is going to
> > parse/load the .py files over and over again and some files can take
> quite
> > long to load.
> >
> > Cheers,
> > Kevin Y
> >
> > On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <so...@gmail.com>
> > wrote:
> >
> >> happy to report that the “fix” worked. thanks Alex.
> >>
> >> btw, wondering why was it there in the first place? how does it help —
> >> saves time, early termination — what?
> >>
> >>
> >>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com>
> wrote:
> >>>
> >>> Yup.
> >>>
> >>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com
> >> <ma...@gmail.com>> wrote:
> >>>
> >>>
> >>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
> >> <ma...@airbnb.com>> wrote:
> >>>>
> >>>> It’s because of this
> >>>>
> >>>> “When searching for DAGs, Airflow will only consider files where the
> >> string “airflow” and “DAG” both appear in the contents of the .py file.”
> >>>>
> >>>
> >>> Have not noticed it.  From airflow/models.py, in process_file — (both
> in
> >> 1.9 and 1.10)
> >>> ..
> >>> if not all([s in content for s in (b'DAG', b'airflow')]):
> >>> ..
> >>> is looking for those strings and if they are not found, it is returning
> >> without loading the DAGs.
> >>>
> >>>
> >>> So having “airflow” and “DAG”  dummy strings placed somewhere will make
> >> it work?
> >>>
> >>>
> >>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com
> >> <ma...@gmail.com>> wrote:
> >>>>
> >>>>
> >>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
> >> <ma...@airbnb.com>> wrote:
> >>>>>
> >>>>> I think this is what is going on. The dags are picked by local
> >> variables. I.E. if you do
> >>>>> dag = Dag(...)
> >>>>> dag = Dag(…)
> >>>>
> >>>> from my_module import create_dag
> >>>>
> >>>> for file in yaml_files:
> >>>>     dag = create_dag(file)
> >>>>     globals()[dag.dag_id] = dag
> >>>>
> >>>> You notice that create_dag is in a different module. If it is in the
> >> same scope (file), it will be fine.
> >>>>
> >>>>>
> >>>>
> >>>>> Only the second dag will be picked up.
> >>>>>
> >>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
> soma.dhavala@gmail.com
> >> <ma...@gmail.com>> wrote:
> >>>>> Hey AirFlow Devs:
> >>>>> In our organization, we build a Machine Learning WorkBench with
> >> AirFlow as
> >>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
> >>>>> operators to customize the behaviour. These work flows are specified
> in
> >>>>> YAML.
> >>>>>
> >>>>> We drop a DAG loader (written python) in the default location airflow
> >>>>> expects the DAG files.  This DAG loader reads the specified YAML
> files
> >> and
> >>>>> converts them into airflow DAG objects. Essentially, we are
> >>>>> programmatically creating the DAG objects. In order to support
> muliple
> >>>>> parsers (yaml, json etc), we separated the DAG creation from loading.
> >> But
> >>>>> when a DAG is created (in a separate module) and made available to
> the
> >> DAG
> >>>>> loaders, airflow does not pick it up. As an example, consider that I
> >>>>> created a DAG picked it, and will simply unpickle the DAG and give it
> >> to
> >>>>> airflow.
> >>>>>
> >>>>> However, in current avatar of airfow, the very creation of DAG has to
> >>>>> happen in the loader itself. As far I am concerned, airflow should
> not
> >> care
> >>>>> where and how the DAG object is created, so long as it is a valid DAG
> >>>>> object. The workaround for us is to mix parser and loader in the same
> >> file
> >>>>> and drop it in the airflow default dags folder. During dag_bag
> >> creation,
> >>>>> this file is loaded up with import_modules utility and shows up in
> the
> >> UI.
> >>>>> While this is a solution, but it is not clean.
> >>>>>
> >>>>> What do DEVs think about a solution to this problem? Will saving the
> >> DAG to
> >>>>> the db and reading it from the db work? Or some core changes need to
> >> happen
> >>>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
> >>>>>
> >>>>> thanks,
> >>>>> -soma
> >>>>
> >>>
> >>
> >>
>

Re: programmatically creating and airflow quirks

Posted by Beau Barker <be...@gmail.com>.
In my opinion, this searching for DAGs is not ideal.

We should be explicitly specifying the DAGs to load somewhere.


> On 25 Nov 2018, at 10:41 am, Kevin Yang <yr...@gmail.com> wrote:
> 
> I believe that is mostly because we want to skip parsing/loading .py files
> that doesn't contain DAG defs to save time, as scheduler is going to
> parse/load the .py files over and over again and some files can take quite
> long to load.
> 
> Cheers,
> Kevin Y
> 
> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <so...@gmail.com>
> wrote:
> 
>> happy to report that the “fix” worked. thanks Alex.
>> 
>> btw, wondering why was it there in the first place? how does it help —
>> saves time, early termination — what?
>> 
>> 
>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com> wrote:
>>> 
>>> Yup.
>>> 
>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com
>> <ma...@gmail.com>> wrote:
>>> 
>>> 
>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
>> <ma...@airbnb.com>> wrote:
>>>> 
>>>> It’s because of this
>>>> 
>>>> “When searching for DAGs, Airflow will only consider files where the
>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>> 
>>> 
>>> Have not noticed it.  From airflow/models.py, in process_file — (both in
>> 1.9 and 1.10)
>>> ..
>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>> ..
>>> is looking for those strings and if they are not found, it is returning
>> without loading the DAGs.
>>> 
>>> 
>>> So having “airflow” and “DAG”  dummy strings placed somewhere will make
>> it work?
>>> 
>>> 
>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com
>> <ma...@gmail.com>> wrote:
>>>> 
>>>> 
>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
>> <ma...@airbnb.com>> wrote:
>>>>> 
>>>>> I think this is what is going on. The dags are picked by local
>> variables. I.E. if you do
>>>>> dag = Dag(...)
>>>>> dag = Dag(…)
>>>> 
>>>> from my_module import create_dag
>>>> 
>>>> for file in yaml_files:
>>>>     dag = create_dag(file)
>>>>     globals()[dag.dag_id] = dag
>>>> 
>>>> You notice that create_dag is in a different module. If it is in the
>> same scope (file), it will be fine.
>>>> 
>>>>> 
>>>> 
>>>>> Only the second dag will be picked up.
>>>>> 
>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@gmail.com
>> <ma...@gmail.com>> wrote:
>>>>> Hey AirFlow Devs:
>>>>> In our organization, we build a Machine Learning WorkBench with
>> AirFlow as
>>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>>>>> operators to customize the behaviour. These work flows are specified in
>>>>> YAML.
>>>>> 
>>>>> We drop a DAG loader (written python) in the default location airflow
>>>>> expects the DAG files.  This DAG loader reads the specified YAML files
>> and
>>>>> converts them into airflow DAG objects. Essentially, we are
>>>>> programmatically creating the DAG objects. In order to support muliple
>>>>> parsers (yaml, json etc), we separated the DAG creation from loading.
>> But
>>>>> when a DAG is created (in a separate module) and made available to the
>> DAG
>>>>> loaders, airflow does not pick it up. As an example, consider that I
>>>>> created a DAG picked it, and will simply unpickle the DAG and give it
>> to
>>>>> airflow.
>>>>> 
>>>>> However, in current avatar of airfow, the very creation of DAG has to
>>>>> happen in the loader itself. As far I am concerned, airflow should not
>> care
>>>>> where and how the DAG object is created, so long as it is a valid DAG
>>>>> object. The workaround for us is to mix parser and loader in the same
>> file
>>>>> and drop it in the airflow default dags folder. During dag_bag
>> creation,
>>>>> this file is loaded up with import_modules utility and shows up in the
>> UI.
>>>>> While this is a solution, but it is not clean.
>>>>> 
>>>>> What do DEVs think about a solution to this problem? Will saving the
>> DAG to
>>>>> the db and reading it from the db work? Or some core changes need to
>> happen
>>>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>>>> 
>>>>> thanks,
>>>>> -soma
>>>> 
>>> 
>> 
>> 

Re: programmatically creating and airflow quirks

Posted by Kevin Yang <yr...@gmail.com>.
I believe that is mostly because we want to skip parsing/loading .py files
that don't contain DAG defs, to save time: the scheduler is going to
parse/load the .py files over and over again, and some files can take quite
a long time to load.
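
For reference, the check being discussed (quoted elsewhere in this thread
from process_file in airflow/models.py) boils down to a substring test on
the raw file contents, done before any import; a simplified stand-in (the
helper name here is just for illustration):

# Simplified sketch of the DagBag.process_file guard in 1.9/1.10: the file
# is skipped entirely unless both magic byte strings appear in its contents.
def might_contain_dag(filepath):
    with open(filepath, 'rb') as f:
        content = f.read()
    return all([s in content for s in (b'DAG', b'airflow')])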

Cheers,
Kevin Y

On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <so...@gmail.com>
wrote:

> happy to report that the “fix” worked. thanks Alex.
>
> btw, wondering why was it there in the first place? how does it help —
> saves time, early termination — what?
>
>
> > On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com> wrote:
> >
> > Yup.
> >
> > On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com
> <ma...@gmail.com>> wrote:
> >
> >
> >> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com
> <ma...@airbnb.com>> wrote:
> >>
> >> It’s because of this
> >>
> >> “When searching for DAGs, Airflow will only consider files where the
> string “airflow” and “DAG” both appear in the contents of the .py file.”
> >>
> >
> > Have not noticed it.  From airflow/models.py, in process_file — (both in
> 1.9 and 1.10)
> > ..
> > if not all([s in content for s in (b'DAG', b'airflow')]):
> > ..
> > is looking for those strings and if they are not found, it is returning
> without loading the DAGs.
> >
> >
> > So having “airflow” and “DAG”  dummy strings placed somewhere will make
> it work?
> >
> >
> >> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com
> <ma...@gmail.com>> wrote:
> >>
> >>
> >>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com
> <ma...@airbnb.com>> wrote:
> >>>
> >>> I think this is what is going on. The dags are picked by local
> variables. I.E. if you do
> >>> dag = Dag(...)
> >>> dag = Dag(…)
> >>
> >> from my_module import create_dag
> >>
> >> for file in yaml_files:
> >>      dag = create_dag(file)
> >>      globals()[dag.dag_id] = dag
> >>
> >> You notice that create_dag is in a different module. If it is in the
> same scope (file), it will be fine.
> >>
> >>>
> >>
> >>> Only the second dag will be picked up.
> >>>
> >>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@gmail.com
> <ma...@gmail.com>> wrote:
> >>> Hey AirFlow Devs:
> >>> In our organization, we build a Machine Learning WorkBench with
> AirFlow as
> >>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
> >>> operators to customize the behaviour. These work flows are specified in
> >>> YAML.
> >>>
> >>> We drop a DAG loader (written python) in the default location airflow
> >>> expects the DAG files.  This DAG loader reads the specified YAML files
> and
> >>> converts them into airflow DAG objects. Essentially, we are
> >>> programmatically creating the DAG objects. In order to support muliple
> >>> parsers (yaml, json etc), we separated the DAG creation from loading.
> But
> >>> when a DAG is created (in a separate module) and made available to the
> DAG
> >>> loaders, airflow does not pick it up. As an example, consider that I
> >>> created a DAG picked it, and will simply unpickle the DAG and give it
> to
> >>> airflow.
> >>>
> >>> However, in current avatar of airfow, the very creation of DAG has to
> >>> happen in the loader itself. As far I am concerned, airflow should not
> care
> >>> where and how the DAG object is created, so long as it is a valid DAG
> >>> object. The workaround for us is to mix parser and loader in the same
> file
> >>> and drop it in the airflow default dags folder. During dag_bag
> creation,
> >>> this file is loaded up with import_modules utility and shows up in the
> UI.
> >>> While this is a solution, but it is not clean.
> >>>
> >>> What do DEVs think about a solution to this problem? Will saving the
> DAG to
> >>> the db and reading it from the db work? Or some core changes need to
> happen
> >>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
> >>>
> >>> thanks,
> >>> -soma
> >>
> >
>
>

Re: programmatically creating and airflow quirks

Posted by soma dhavala <so...@gmail.com>.
happy to report that the “fix” worked. thanks Alex.

btw, wondering why it was there in the first place? how does it help — saving time, early termination — what?


> On Nov 23, 2018, at 8:18 AM, Alex Guziel <al...@airbnb.com> wrote:
> 
> Yup. 
> 
> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com <ma...@gmail.com>> wrote:
> 
> 
>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com <ma...@airbnb.com>> wrote:
>> 
>> It’s because of this 
>> 
>> “When searching for DAGs, Airflow will only consider files where the string “airflow” and “DAG” both appear in the contents of the .py file.”
>> 
> 
> Have not noticed it.  From airflow/models.py, in process_file — (both in 1.9 and 1.10)
> ..
> if not all([s in content for s in (b'DAG', b'airflow')]):
> ..
> is looking for those strings and if they are not found, it is returning without loading the DAGs.
> 
> 
> So having “airflow” and “DAG”  dummy strings placed somewhere will make it work?
> 
> 
>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com <ma...@gmail.com>> wrote:
>> 
>> 
>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com <ma...@airbnb.com>> wrote:
>>> 
>>> I think this is what is going on. The dags are picked by local variables. I.E. if you do
>>> dag = Dag(...)
>>> dag = Dag(…)
>> 
>> from my_module import create_dag
>> 
>> for file in yaml_files:
>> 	dag = create_dag(file)
>> 	globals()[dag.dag_id] = dag
>> 
>> You notice that create_dag is in a different module. If it is in the same scope (file), it will be fine.
>> 
>>> 
>> 
>>> Only the second dag will be picked up.
>>> 
>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@gmail.com <ma...@gmail.com>> wrote:
>>> Hey AirFlow Devs:
>>> In our organization, we build a Machine Learning WorkBench with AirFlow as
>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>>> operators to customize the behaviour. These work flows are specified in
>>> YAML.
>>> 
>>> We drop a DAG loader (written python) in the default location airflow
>>> expects the DAG files.  This DAG loader reads the specified YAML files and
>>> converts them into airflow DAG objects. Essentially, we are
>>> programmatically creating the DAG objects. In order to support muliple
>>> parsers (yaml, json etc), we separated the DAG creation from loading. But
>>> when a DAG is created (in a separate module) and made available to the DAG
>>> loaders, airflow does not pick it up. As an example, consider that I
>>> created a DAG picked it, and will simply unpickle the DAG and give it to
>>> airflow.
>>> 
>>> However, in current avatar of airfow, the very creation of DAG has to
>>> happen in the loader itself. As far I am concerned, airflow should not care
>>> where and how the DAG object is created, so long as it is a valid DAG
>>> object. The workaround for us is to mix parser and loader in the same file
>>> and drop it in the airflow default dags folder. During dag_bag creation,
>>> this file is loaded up with import_modules utility and shows up in the UI.
>>> While this is a solution, but it is not clean.
>>> 
>>> What do DEVs think about a solution to this problem? Will saving the DAG to
>>> the db and reading it from the db work? Or some core changes need to happen
>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>> 
>>> thanks,
>>> -soma
>> 
> 


Re: programmatically creating and airflow quirks

Posted by Alex Guziel <al...@airbnb.com.INVALID>.
Yup.

On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <so...@gmail.com> wrote:

>
>
> On Nov 23, 2018, at 3:28 AM, Alex Guziel <al...@airbnb.com> wrote:
>
> It’s because of this
>
> “When searching for DAGs, Airflow will only consider files where the
> string “airflow” and “DAG” both appear in the contents of the .py file.”
>
>
> Have not noticed it.  From airflow/models.py, in process_file — (both in
> 1.9 and 1.10)
> ..
> if not all([s in content for s in (b'DAG', b'airflow')]):
> ..
> is looking for those strings and if they are not found, it is returning
> without loading the DAGs.
>
>
> So having “airflow” and “DAG”  dummy strings placed somewhere will make it
> work?
>
>
> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <so...@gmail.com>
> wrote:
>
>>
>>
>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <al...@airbnb.com> wrote:
>>
>> I think this is what is going on. The dags are picked by local variables.
>> I.E. if you do
>> dag = Dag(...)
>> dag = Dag(…)
>>
>>
>> from my_module import create_dag
>>
>> for file in yaml_files:
>> dag = create_dag(file)
>> globals()[dag.dag_id] = dag
>>
>> You notice that create_dag is in a different module. If it is in the
>> same scope (file), it will be fine.
>>
>>
>>
>> Only the second dag will be picked up.
>>
>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <so...@gmail.com>
>> wrote:
>>
>>> Hey AirFlow Devs:
>>> In our organization, we build a Machine Learning WorkBench with AirFlow
>>> as
>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>>> operators to customize the behaviour. These work flows are specified in
>>> YAML.
>>>
>>> We drop a DAG loader (written python) in the default location airflow
>>> expects the DAG files.  This DAG loader reads the specified YAML files
>>> and
>>> converts them into airflow DAG objects. Essentially, we are
>>> programmatically creating the DAG objects. In order to support muliple
>>> parsers (yaml, json etc), we separated the DAG creation from loading. But
>>> when a DAG is created (in a separate module) and made available to the
>>> DAG
>>> loaders, airflow does not pick it up. As an example, consider that I
>>> created a DAG picked it, and will simply unpickle the DAG and give it to
>>> airflow.
>>>
>>> However, in current avatar of airfow, the very creation of DAG has to
>>> happen in the loader itself. As far I am concerned, airflow should not
>>> care
>>> where and how the DAG object is created, so long as it is a valid DAG
>>> object. The workaround for us is to mix parser and loader in the same
>>> file
>>> and drop it in the airflow default dags folder. During dag_bag creation,
>>> this file is loaded up with import_modules utility and shows up in the
>>> UI.
>>> While this is a solution, but it is not clean.
>>>
>>> What do DEVs think about a solution to this problem? Will saving the DAG
>>> to
>>> the db and reading it from the db work? Or some core changes need to
>>> happen
>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>>
>>> thanks,
>>> -soma
>>>
>>
>>
>

Re: programmatically creating and airflow quirks

Posted by soma dhavala <so...@gmail.com>.

> On Nov 23, 2018, at 3:28 AM, Alex Guziel <al...@airbnb.com> wrote:
> 
> It’s because of this 
> 
> “When searching for DAGs, Airflow will only consider files where the string “airflow” and “DAG” both appear in the contents of the .py file.”
> 

Had not noticed it. From airflow/models.py, in process_file (both in 1.9 and 1.10):
..
if not all([s in content for s in (b'DAG', b'airflow')]):
..
is looking for those strings, and if they are not found, it returns without loading the DAGs.


So having the “airflow” and “DAG” dummy strings placed somewhere will make it work?
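
i.e. something like this (reusing the my_module / create_dag names from the
loop quoted below; the glob path is a placeholder):

# airflow DAG -- this comment alone makes both magic strings appear in the
# loader file, so the DagBag heuristic will not skip it.
import glob

from my_module import create_dag  # DAG factory living in a separate module

for yaml_file in glob.glob("/path/to/specs/*.yaml"):
    dag = create_dag(yaml_file)
    globals()[dag.dag_id] = dag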


> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com <ma...@gmail.com>> wrote:
> 
> 
>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com <ma...@airbnb.com>> wrote:
>> 
>> I think this is what is going on. The dags are picked by local variables. I.E. if you do
>> dag = Dag(...)
>> dag = Dag(…)
> 
> from my_module import create_dag
> 
> for file in yaml_files:
> 	dag = create_dag(file)
> 	globals()[dag.dag_id] = dag
> 
> You notice that create_dag is in a different module. If it is in the same scope (file), it will be fine.
> 
>> 
> 
>> Only the second dag will be picked up.
>> 
>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@gmail.com <ma...@gmail.com>> wrote:
>> Hey AirFlow Devs:
>> In our organization, we build a Machine Learning WorkBench with AirFlow as
>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>> operators to customize the behaviour. These work flows are specified in
>> YAML.
>> 
>> We drop a DAG loader (written python) in the default location airflow
>> expects the DAG files.  This DAG loader reads the specified YAML files and
>> converts them into airflow DAG objects. Essentially, we are
>> programmatically creating the DAG objects. In order to support muliple
>> parsers (yaml, json etc), we separated the DAG creation from loading. But
>> when a DAG is created (in a separate module) and made available to the DAG
>> loaders, airflow does not pick it up. As an example, consider that I
>> created a DAG picked it, and will simply unpickle the DAG and give it to
>> airflow.
>> 
>> However, in current avatar of airfow, the very creation of DAG has to
>> happen in the loader itself. As far I am concerned, airflow should not care
>> where and how the DAG object is created, so long as it is a valid DAG
>> object. The workaround for us is to mix parser and loader in the same file
>> and drop it in the airflow default dags folder. During dag_bag creation,
>> this file is loaded up with import_modules utility and shows up in the UI.
>> While this is a solution, but it is not clean.
>> 
>> What do DEVs think about a solution to this problem? Will saving the DAG to
>> the db and reading it from the db work? Or some core changes need to happen
>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>> 
>> thanks,
>> -soma
> 


Re: programmatically creating and airflow quirks

Posted by Alex Guziel <al...@airbnb.com.INVALID>.
It’s because of this

“When searching for DAGs, Airflow will only consider files where the string
“airflow” and “DAG” both appear in the contents of the .py file.”

On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <so...@gmail.com> wrote:

>
>
> On Nov 22, 2018, at 3:37 PM, Alex Guziel <al...@airbnb.com> wrote:
>
> I think this is what is going on. The dags are picked by local variables.
> I.E. if you do
> dag = Dag(...)
> dag = Dag(…)
>
>
> from my_module import create_dag
>
> for file in yaml_files:
> dag = create_dag(file)
> globals()[dag.dag_id] = dag
>
> You notice that create_dag is in a different module. If it is in the same
> scope (file), it will be fine.
>
>
>
> Only the second dag will be picked up.
>
> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <so...@gmail.com>
> wrote:
>
>> Hey AirFlow Devs:
>> In our organization, we build a Machine Learning WorkBench with AirFlow as
>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>> operators to customize the behaviour. These work flows are specified in
>> YAML.
>>
>> We drop a DAG loader (written python) in the default location airflow
>> expects the DAG files.  This DAG loader reads the specified YAML files and
>> converts them into airflow DAG objects. Essentially, we are
>> programmatically creating the DAG objects. In order to support muliple
>> parsers (yaml, json etc), we separated the DAG creation from loading. But
>> when a DAG is created (in a separate module) and made available to the DAG
>> loaders, airflow does not pick it up. As an example, consider that I
>> created a DAG picked it, and will simply unpickle the DAG and give it to
>> airflow.
>>
>> However, in current avatar of airfow, the very creation of DAG has to
>> happen in the loader itself. As far I am concerned, airflow should not
>> care
>> where and how the DAG object is created, so long as it is a valid DAG
>> object. The workaround for us is to mix parser and loader in the same file
>> and drop it in the airflow default dags folder. During dag_bag creation,
>> this file is loaded up with import_modules utility and shows up in the UI.
>> While this is a solution, but it is not clean.
>>
>> What do DEVs think about a solution to this problem? Will saving the DAG
>> to
>> the db and reading it from the db work? Or some core changes need to
>> happen
>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>
>> thanks,
>> -soma
>>
>
>

Re: programmatically creating and airflow quirks

Posted by soma dhavala <so...@gmail.com>.

> On Nov 22, 2018, at 3:37 PM, Alex Guziel <al...@airbnb.com> wrote:
> 
> I think this is what is going on. The dags are picked by local variables. I.E. if you do
> dag = Dag(...)
> dag = Dag(…)

from my_module import create_dag

for file in yaml_files:
	dag = create_dag(file)
	globals()[dag.dag_id] = dag

Notice that create_dag is in a different module. If it is in the same scope (file), it works fine.

> 

> Only the second dag will be picked up.
> 
> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@gmail.com <ma...@gmail.com>> wrote:
> Hey AirFlow Devs:
> In our organization, we build a Machine Learning WorkBench with AirFlow as
> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
> operators to customize the behaviour. These work flows are specified in
> YAML.
> 
> We drop a DAG loader (written python) in the default location airflow
> expects the DAG files.  This DAG loader reads the specified YAML files and
> converts them into airflow DAG objects. Essentially, we are
> programmatically creating the DAG objects. In order to support muliple
> parsers (yaml, json etc), we separated the DAG creation from loading. But
> when a DAG is created (in a separate module) and made available to the DAG
> loaders, airflow does not pick it up. As an example, consider that I
> created a DAG picked it, and will simply unpickle the DAG and give it to
> airflow.
> 
> However, in current avatar of airfow, the very creation of DAG has to
> happen in the loader itself. As far I am concerned, airflow should not care
> where and how the DAG object is created, so long as it is a valid DAG
> object. The workaround for us is to mix parser and loader in the same file
> and drop it in the airflow default dags folder. During dag_bag creation,
> this file is loaded up with import_modules utility and shows up in the UI.
> While this is a solution, but it is not clean.
> 
> What do DEVs think about a solution to this problem? Will saving the DAG to
> the db and reading it from the db work? Or some core changes need to happen
> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
> 
> thanks,
> -soma


Re: programmatically creating and airflow quirks

Posted by Alex Guziel <al...@airbnb.com.INVALID>.
I think this is what is going on. The DAGs are picked up from module-level
(global) variables, i.e. if you do

dag = DAG(...)
dag = DAG(...)

only the second DAG will be picked up.
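
A minimal illustration of that rebinding behaviour, and the usual way
around it (the dag ids here are just placeholders):

from datetime import datetime

from airflow import DAG

# Rebinding: after this loop the name `dag` refers only to the last DAG,
# so the DagBag would see just one of the two.
for dag_id in ("etl_a", "etl_b"):
    dag = DAG(dag_id=dag_id, start_date=datetime(2018, 1, 1))

# Workaround: give every DAG its own module-level name.
for dag_id in ("etl_a", "etl_b"):
    globals()[dag_id] = DAG(dag_id=dag_id, start_date=datetime(2018, 1, 1))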

On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <so...@gmail.com>
wrote:

> Hey AirFlow Devs:
> In our organization, we build a Machine Learning WorkBench with AirFlow as
> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
> operators to customize the behaviour. These work flows are specified in
> YAML.
>
> We drop a DAG loader (written python) in the default location airflow
> expects the DAG files.  This DAG loader reads the specified YAML files and
> converts them into airflow DAG objects. Essentially, we are
> programmatically creating the DAG objects. In order to support muliple
> parsers (yaml, json etc), we separated the DAG creation from loading. But
> when a DAG is created (in a separate module) and made available to the DAG
> loaders, airflow does not pick it up. As an example, consider that I
> created a DAG picked it, and will simply unpickle the DAG and give it to
> airflow.
>
> However, in current avatar of airfow, the very creation of DAG has to
> happen in the loader itself. As far I am concerned, airflow should not care
> where and how the DAG object is created, so long as it is a valid DAG
> object. The workaround for us is to mix parser and loader in the same file
> and drop it in the airflow default dags folder. During dag_bag creation,
> this file is loaded up with import_modules utility and shows up in the UI.
> While this is a solution, but it is not clean.
>
> What do DEVs think about a solution to this problem? Will saving the DAG to
> the db and reading it from the db work? Or some core changes need to happen
> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>
> thanks,
> -soma
>