Posted to dev@airflow.apache.org by Gabriel Silk <gs...@dropbox.com.INVALID> on 2018/11/01 06:18:02 UTC

Re: Deployment / Execution Model

Our DAG deployment is already a separate deployment from Airflow itself.

The issue is that the Airflow binary (whether acting as webserver,
scheduler, or worker) is the one that *reads* the DAG files. So if you have,
for example, a DAG that has this import statement in it:

import mylib.foobar

Then the only way to successfully interpret this DAG definition in the
Airflow process is to package the Airflow binary with the mylib.foobar
dependency.

This implies that every time you add a new dependency in one of your DAG
definitions, you have to re-deploy Airflow itself, not just the DAG
definitions.


On Wed, Oct 31, 2018 at 2:45 PM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> Deploying the DAGs should be decoupled from deploying Airflow itself. You
> can just use a resource that syncs the DAGs repo to the boxes on the
> Airflow cluster periodically (say every minute). Resource orchestrators
> like Chef, Ansible, or Puppet should have some easy way to do that. Either
> that or some sort of mount or mount-equivalent (k8s has constructs for
> that, EFS on Amazon).
>
> Also note that the DagFetcher abstraction that's been discussed before on
> the mailing list would solve this and more.
>
> Max
>
> On Wed, Oct 31, 2018 at 2:37 PM Gabriel Silk <gs...@dropbox.com.invalid>
> wrote:
>
> > Hello Airflow community,
> >
> >
> > I'm currently putting Airflow into production at my company of 2000+
> > people. The most significant sticking point so far is the deployment /
> > execution model. I wanted to write up my experience so far in this matter
> > and see how other people are dealing with this issue.
> >
> > First of all, our goal is to allow engineers to author DAGs and easily
> > deploy them. That means they should be able to make changes to their DAGs,
> > add/remove dependencies, and not have to redeploy any of the core
> > components (scheduler, webserver, workers).
> >
> > Our first attempt at productionizing Airflow used the vanilla DAGs folder,
> > and included all the deps of all the DAGs with the airflow binary itself.
> > Unfortunately, that meant we had to redeploy the scheduler, webserver
> > and/or workers every time a dependency changed, which will definitely not
> > work for us long term.
> >
> > The next option we considered was to use the "packaged DAGs" approach,
> > whereby you place dependencies in a zip file. This would not work for us,
> > due to the lack of support for dynamic libraries (see
> > https://airflow.apache.org/concepts.html#packaged-dags)
> >
> > We have finally arrived at an option that seems reasonable, which is to use
> > a single Operator that shells out to various binary targets that we build
> > independently of Airflow, and which include their own dependencies.
> > Configuration is serialized via protobuf and passed over stdin to the
> > subprocess. The parent process (which is in Airflow's memory space) streams
> > the logs from stdout and stderr.
> >
> > This approach has the advantage of working seamlessly with our build
> > system, and allowing us to redeploy DAGs even when dependencies in the
> > operator implementations change.
> >
> > Any thoughts / comments / feedback? Have people faced similar issues out
> > there?
> >
> > Many thanks,
> >
> >
> > -G Silk
> >
>
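
A minimal sketch of the shell-out pattern described above, for readers who
want to picture it. The operator name, the binary path, and the way the
config bytes are produced are illustrative assumptions, not the actual
Dropbox implementation:

    import subprocess

    from airflow.models import BaseOperator


    class ShellOutOperator(BaseOperator):
        """Hypothetical sketch: run a self-contained binary built outside Airflow.

        The binary ships its own dependencies, so the Airflow process only has
        to serialize the configuration and stream the logs back.
        """

        def __init__(self, binary_path, config_bytes, **kwargs):
            super().__init__(**kwargs)
            self.binary_path = binary_path    # e.g. a target built by your build system
            self.config_bytes = config_bytes  # e.g. a protobuf message's SerializeToString()

        def execute(self, context):
            proc = subprocess.Popen(
                [self.binary_path],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
            )
            # Pass the serialized config over stdin, then stream the subprocess
            # output into the task log. (For very large configs or outputs you
            # would want communicate() or reader threads to avoid pipe deadlocks.)
            proc.stdin.write(self.config_bytes)
            proc.stdin.close()
            for line in proc.stdout:
                self.log.info(line.decode(errors="replace").rstrip())
            if proc.wait() != 0:
                raise RuntimeError("%s exited with code %s"
                                   % (self.binary_path, proc.returncode))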

Re: Deployment / Execution Model

Posted by "Daniel (Daniel Lamblin) [BDP - Seoul]" <la...@coupang.com>.
So, if mylib is from PyPI or something, then yes, all the workers, webservers, and the scheduler would need to have it installed.
If mylib is something that was authored just for a couple of DAG files to share, then (in this example, assuming you are using the DAG folder /dag/) lay out the files like this:
/dag/team/dag_file_imports_mylib_foobar.py
/dag/mylib/__init__.py
/dag/mylib/foobar.py
Have the engineers add that as one commit, then have your CI trigger the deployed instances to pull the updated files after checking them for … well, correctness.
-Daniel

Re: Deployment / Execution Model

Posted by Michel Goldstein <mi...@gmail.com>.
Hi Gabriel,

Not that I'm necessarily advocating for this approach, but one solution
that has worked for us is to separate dependencies into two different
classes: external libraries and "owned DAG dependencies".

For external libraries, we force common library versions per Airflow
deployment, and those are deployed through Salt to our Airflow fleet.

For owned dependencies, we deploy them with the DAG in a subdirectory of
the DAGS_FOLDER, and all imports of those libraries go through a function
that first tries the import with the DAG directory prepended (e.g. "import
a.b" in a DAG in directory dir becomes "import dir.a.b"). If that is not
found, it falls back to the normal import ("import a.b").

The benefit of this approach is that you actually can have:

/dag/dag1/util/util.py
/dag/dag2/util/util.py

be different versions of util.py, used by dag1 and dag2 independently. This
made it easier to support deploying ad-hoc DAGs where developers are trying
out different versions of those "shared" libraries at the same time.

Hope this inspires you toward maybe a more elegant solution! We spent some
time playing around with a lot of different options, but none of them
worked well enough. Maybe all these ideas of supporting the equivalent of
multiple DAGS_FOLDERs that have been floating around (mostly to support
distributing the scheduler) might give us a path to something a little
better.

Michel

Re: Deployment / Execution Model

Posted by James Meickle <jm...@quantopian.com.INVALID>.
We're running into a lot of pain with this. We have a CI system that
enables very rapid iteration on DAG code, but whenever you need to modify
plugin code, it requires re-shipping all of the infrastructure, which takes
at least 10x longer than a DAG-deployment Jenkins build.

I think that Airflow should have multiple DAG parsing backends, the same
way that it has multiple executors. It's fine for subprocess to be the
default, but it would be immensely helpful if DAG parsing could take place
in a virtualenv, Docker container, or Kubernetes pod.
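
Purely as an illustration of the idea (nothing like this exists in Airflow;
the class and method names are invented here), such a pluggable parsing
backend might look roughly like this:

    import abc


    class DagParsingBackend(abc.ABC):
        """Hypothetical plug point, mirroring how executors are pluggable.

        Concrete backends could parse in a local subprocess (today's default),
        a virtualenv, a Docker container, or a Kubernetes pod, and hand the
        scheduler back a serialized representation of the DAGs.
        """

        @abc.abstractmethod
        def parse(self, dag_file_path):
            """Parse one DAG file in an isolated environment and return the
            serialized DAGs it defines."""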

Re: Deployment / Execution Model

Posted by Gabriel Silk <gs...@dropbox.com.INVALID>.
I can see how my first email was confusing, where I said:

"Our first attempt at productionizing Airflow used the vanilla DAGs folder,
including all the deps of all the DAGs with the airflow binary itself"

What I meant is that we have a separate DAGs deployment, but we are being
forced to package the *dependencies of the DAGs* with the Airflow binary,
because that's the only way to make the DAG definitions work.
