Posted to dev@airflow.apache.org by Daniel Imberman <da...@gmail.com> on 2017/07/05 18:25:09 UTC

Airflow kubernetes executor

Hello Airflow community!

My name is Daniel Imberman, and I have been working on behalf of Bloomberg
LP to create an airflow kubernetes executor/operator. We wanted to allow
for maximum throughput/scalability, while keeping a lot of the kubernetes
details abstracted away from the users. Below I have a link to the WIP PR
and the PDF of the initial proposal. If anyone has any comments/questions I
would be glad to discuss this feature further.

Thank you,

Daniel

https://github.com/apache/incubator-airflow/pull/2414

Re: Airflow kubernetes executor

Posted by Daniel Imberman <da...@gmail.com>.
@Grant @Gerard: WRT static workers: I think that the downside of a slightly
longer start-up time is significantly less important than the massive
gain in scalability offered by having non-static workers. This also opens
up really interesting opportunities (i.e. being able to identify how many
resources to give a task, launching tasks with specific dependencies
installed, etc.).

For our current plan for k8s deployment we're looking into the following
(a minimal sketch of step 3 follows below):

1. Launch an NFS cluster with a sidecar that continually polls
github/artifactory.
2. If there is any change to github/artifactory, pull down the latest
version of the code.
3. Every time a worker task starts, have it attach to the NFS cluster as a
volume mount.

This method would work with pretty much any system that kubernetes allows
as a persistentVolumeClaim.
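
To make step 3 concrete, here is a minimal sketch using the official
kubernetes Python client; the claim name, image, and mount path are
hypothetical stand-ins, not taken from the actual PR:

    from kubernetes import client, config

    def launch_worker_pod(task_id, command):
        """Launch a one-off worker pod that mounts the shared DAGs volume."""
        config.load_kube_config()  # use load_incluster_config() in-cluster

        # Volume backed by the NFS-backed PersistentVolumeClaim (name assumed).
        dags_volume = client.V1Volume(
            name="dags",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="airflow-dags"))

        worker = client.V1Container(
            name="airflow-worker",
            image="airflow-worker:latest",  # hypothetical image
            command=command,
            volume_mounts=[client.V1VolumeMount(
                name="dags",
                mount_path="/usr/local/airflow/dags",
                read_only=True)])

        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name="worker-" + task_id),
            spec=client.V1PodSpec(containers=[worker], restart_policy="Never"))

        client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)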

@Jeremiah: Your system actually sounds a lot like ours (except we have
pretty heavy regulations against cloud services, so we're doing our stuff a
lot more bare-metal). git-sync definitely works for a lot of use-cases. The
main issue for companies like mine is that there are certain
robustness/availability issues with pulling code straight from github ->
production (i.e. if our GitHub Enterprise instance goes down). I might
speak to the k8s guys about implementing an artifactory PVC. Until then
we'll probably just create an "artifactory-sync" and have a jenkins job
that continually polls github and mirrors to artifactory.

I'm glad to see this topic has sparked so much conversation :).

On Thu, Jul 13, 2017 at 5:24 AM Jeremiah Lowin <jl...@apache.org> wrote:

> p.s. it looks like git-sync has received an "official" release since the
> last time I looked at it: https://github.com/kubernetes/git-sync
>
> On Thu, Jul 13, 2017 at 8:18 AM Jeremiah Lowin <jl...@apache.org> wrote:
>
> > Hi Gerard (and anyone else for whom this might be helpful),
> >
> > We've run Airflow on GCP for a few years. The structure has changed over
> > time but at the moment we use the following basic outline:
> >
> > 1. Build a container that includes all Airflow and DAG dependencies and
> > push it to Google container registry. If you need to add/update
> > dependencies or update airflow.cfg, simply push a new image
> > 2. All DAGs are pushed to a git repo
> > 3. Host the AirflowDB in Google Cloud SQL
> > 4. Create a Kubernetes deployment that runs the following containers:
> > -- Airflow scheduler (using the dependencies image)
> > -- Airflow webserver (using the dependencies image)
> > -- Airflow maintenance (using the dependencies image) - this container
> > does nothing (sleep infinity) but since it shares the same setup as the
> > scheduler/webserver, it's an easy place to `exec` into the cluster to
> > investigate any issues that might be crashing the main containers. We
> limit
> > its CPU to minimize impact on cluster resources. Hacky but effective.
> > -- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy
> )
> > - to connect to the Airflow DB
> > -- git-sync (https://github.com/jlowin/git-sync)
> >
> > The last container (git-sync) is a small library I wrote to solve the
> > issue of syncing DAGs. It's not perfect and ***I am NOT offering any
> > support for it*** but it gets the job done. It's meant to be a sidecar
> > container and does one thing: constantly fetch a git repo to a local
> > folder. In your deployment, create an EmptyDir volume and mount it in all
> > containers (except cloud sql). Git-sync should use that volume as its
> > target, and scheduler/webserver should use the volume as the DAGs folder.
> > That way, every 30 seconds, git-sync will fetch the git repo into that
> > volume, and the Airflow containers will immediately see the latest files
> > appear.
> >
> > 5. Create a Kubernetes service to expose the webserver UI
> >
> > Our actual implementation is considerably more complicated than this
> since
> > we have extensive custom modules that are loaded via git-sync rather than
> > being baked into the image, as well as a few other GCP service
> > integrations, but this overview should point in the right direction.
> > Getting it running the first time requires a little elbow grease but once
> > built, it's easy to automate the process.
> >
> > Best,
> > Jeremiah
> >
> >
> >
> > On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <gt...@gmail.com>
> > wrote:
> >
> >> It would be really good if you'd share experiences on how to run this on
> >> kubernetes and ECS. I'm not aware of a good guide on how to run it on
> >> either of them, for example, but it's a very useful and quick setup to
> >> start with, especially combining that with deployment manager and
> >> (probably) cloudformation.
> >>
> >> I'm talking to someone else who's looking at running on kubernetes and
> >> potentially open-sourcing a generic template for kubernetes deployments.
> >>
> >> Would it be possible to share your experiences?  What tech are you using
> >> for specific issues?
> >>
> >> - how do you deploy and sync dags?  Are you using EFS?
> >> - how do you build the container with airflow + executables?
> >> - where do you send log files or log lines to?
> >> - High Availability, and how?
> >>
> >> Really looking forward to how that's done, so we can put this on the wiki.
> >>
> >> Especially since GCP is now also starting to embrace airflow, it'd be good
> >> to have a better understanding of how easily and quickly airflow can be
> >> deployed on gcp:
> >>
> >>
> >>
> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
> >>
> >> Rgds,
> >>
> >> Gerard
> >>
> >>
> >> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <ap...@lumoslabs.com>
> >> wrote:
> >>
> >> > for what it's worth we've been running airflow on ECS for a few years
> >> > already.
> >> >
> >> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
> >> > grantnicholas2015@u.northwestern.edu> wrote:
> >> >
> >> > > Is having a static set of workers necessary? Launching a job on
> >> > Kubernetes
> >> > > from a cached docker image takes a few seconds max. I think this is
> an
> >> > > acceptable delay for a batch processing system like airflow.
> >> > >
> >> > > Additionally, if you dynamically launch workers you can start
> >> dynamically
> >> > > launching *any type* of worker and you don't have to statically
> >> allocate
> >> > > pools of worker types. I.e., a single DAG could use a scala docker
> >> image to
> >> > > do spark calculations, a C++ docker image to use some low level
> >> numerical
> >> > > library,  and a python docker image by default to do any generic
> >> airflow
> >> > > stuff. Additionally, you can size workers according to their usage.
> >> Maybe
> >> > > the spark driver program only needs a few GBs of RAM but the C++
> >> > numerical
> >> > > library needs many hundreds.
> >> > >
> >> > > I agree there is a bit of extra book-keeping that needs to be done,
> >> but
> >> > > the tradeoff is an important one to explicitly make.
> >> > >
> >> >
> >>
> >
>

Re: Airflow kubernetes executor

Posted by Jeremiah Lowin <jl...@apache.org>.
p.s. it looks like git-sync has received an "official" release since the
last time I looked at it: https://github.com/kubernetes/git-sync

On Thu, Jul 13, 2017 at 8:18 AM Jeremiah Lowin <jl...@apache.org> wrote:

> Hi Gerard (and anyone else for whom this might be helpful),
>
> We've run Airflow on GCP for a few years. The structure has changed over
> time but at the moment we use the following basic outline:
>
> 1. Build a container that includes all Airflow and DAG dependencies and
> push it to Google container registry. If you need to add/update
> dependencies or update airflow.cfg, simply push a new image
> 2. All DAGs are pushed to a git repo
> 3. Host the AirflowDB in Google Cloud SQL
> 4. Create a Kubernetes deployment that runs the following containers:
> -- Airflow scheduler (using the dependencies image)
> -- Airflow webserver (using the dependencies image)
> -- Airflow maintenance (using the dependencies image) - this container
> does nothing (sleep infinity) but since it shares the same setup as the
> scheduler/webserver, it's an easy place to `exec` into the cluster to
> investigate any issues that might be crashing the main containers. We limit
> its CPU to minimize impact on cluster resources. Hacky but effective.
> -- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy)
> - to connect to the Airflow DB
> -- git-sync (https://github.com/jlowin/git-sync)
>
> The last container (git-sync) is a small library I wrote to solve the
> issue of syncing DAGs. It's not perfect and ***I am NOT offering any
> support for it*** but it gets the job done. It's meant to be a sidecar
> container and does one thing: constantly fetch a git repo to a local
> folder. In your deployment, create an EmptyDir volume and mount it in all
> containers (except cloud sql). Git-sync should use that volume as its
> target, and scheduler/webserver should use the volume as the DAGs folder.
> That way, every 30 seconds, git-sync will fetch the git repo into that
> volume, and the Airflow containers will immediately see the latest files
> appear.
>
> 5. Create a Kubernetes service to expose the webserver UI
>
> Our actual implementation is considerably more complicated than this since
> we have extensive custom modules that are loaded via git-sync rather than
> being baked into the image, as well as a few other GCP service
> integrations, but this overview should point in the right direction.
> Getting it running the first time requires a little elbow grease but once
> built, it's easy to automate the process.
>
> Best,
> Jeremiah
>
>
>
> On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <gt...@gmail.com>
> wrote:
>
>> It would be really good if you'd share experiences on how to run this on
>> kubernetes and ECS. I'm not aware of a good guide on how to run it on
>> either of them, for example, but it's a very useful and quick setup to
>> start with, especially combining that with deployment manager and
>> (probably) cloudformation.
>>
>> I'm talking to someone else who's looking at running on kubernetes and
>> potentially open-sourcing a generic template for kubernetes deployments.
>>
>> Would it be possible to share your experiences?  What tech are you using
>> for specific issues?
>>
>> - how do you deploy and sync dags?  Are you using EFS?
>> - how do you build the container with airflow + executables?
>> - where do you send log files or log lines to?
>> - High Availability, and how?
>>
>> Really looking forward to how that's done, so we can put this on the wiki.
>>
>> Especially since GCP is now also starting to embrace airflow, it'd be good
>> to have a better understanding of how easily and quickly airflow can be
>> deployed on gcp:
>>
>>
>> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>>
>> Rgds,
>>
>> Gerard
>>
>>
>> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <ap...@lumoslabs.com>
>> wrote:
>>
>> > for what it's worth we've been running airflow on ECS for a few years
>> > already.
>> >
>> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
>> > grantnicholas2015@u.northwestern.edu> wrote:
>> >
>> > > Is having a static set of workers necessary? Launching a job on
>> > Kubernetes
>> > > from a cached docker image takes a few seconds max. I think this is an
>> > > acceptable delay for a batch processing system like airflow.
>> > >
>> > > Additionally, if you dynamically launch workers you can start
>> dynamically
>> > > launching *any type* of worker and you don't have to statically
>> allocate
>> > > pools of worker types. I.e., a single DAG could use a scala docker
>> image to
>> > > do spark calculations, a C++ docker image to use some low level
>> numerical
>> > > library,  and a python docker image by default to do any generic
>> airflow
>> > > stuff. Additionally, you can size workers according to their usage.
>> Maybe
>> > > the spark driver program only needs a few GBs of RAM but the C++
>> > numerical
>> > > library needs many hundreds.
>> > >
>> > > I agree there is a bit of extra book-keeping that needs to be done,
>> but
>> > > the tradeoff is an important one to explicitly make.
>> > >
>> >
>>
>

Re: Airflow kubernetes executor

Posted by Jeremiah Lowin <jl...@apache.org>.
Hi Gerard (and anyone else for whom this might be helpful),

We've run Airflow on GCP for a few years. The structure has changed over
time but at the moment we use the following basic outline:

1. Build a container that includes all Airflow and DAG dependencies and
push it to Google container registry. If you need to add/update
dependencies or update airflow.cfg, simply push a new image
2. All DAGs are pushed to a git repo
3. Host the AirflowDB in Google Cloud SQL
4. Create a Kubernetes deployment that runs the following containers:
-- Airflow scheduler (using the dependencies image)
-- Airflow webserver (using the dependencies image)
-- Airflow maintenance (using the dependencies image) - this container
does nothing (sleep infinity), but since it shares the same setup as the
scheduler/webserver, it's an easy place to `exec` into the cluster to
investigate any issues that might be crashing the main containers. We limit
its CPU to minimize impact on cluster resources. Hacky but effective.
-- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy) -
to connect to the Airflow DB
-- git-sync (https://github.com/jlowin/git-sync)

The last container (git-sync) is a small library I wrote to solve the issue
of syncing DAGs. It's not perfect and ***I am NOT offering any support for
it*** but it gets the job done. It's meant to be a sidecar container and
does one thing: constantly fetch a git repo to a local folder. In your
deployment, create an EmptyDir volume and mount it in all containers
(except cloud sql). Git-sync should use that volume as its target, and
scheduler/webserver should use the volume as the DAGs folder. That way,
every 30 seconds, git-sync will fetch the git repo into that volume, and
the Airflow containers will immediately see the latest files appear (a
sketch of this sidecar wiring follows at the end of this outline).

5. Create a Kubernetes service to expose the webserver UI

Our actual implementation is considerably more complicated than this since
we have extensive custom modules that are loaded via git-sync rather than
being baked into the image, as well as a few other GCP service
integrations, but this overview should point in the right direction.
Getting it running the first time requires a little elbow grease but once
built, it's easy to automate the process.
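
For readers who want the shape of that sidecar wiring, here is a rough
sketch with the kubernetes Python client; the image names, repo URL, and
git-sync flag names are assumptions, not taken from the actual deployment:

    from kubernetes import client

    # Shared scratch volume: git-sync writes into it, Airflow reads DAGs from it.
    dags_volume = client.V1Volume(
        name="dags", empty_dir=client.V1EmptyDirVolumeSource())
    dags_mount = client.V1VolumeMount(name="dags", mount_path="/dags")

    scheduler = client.V1Container(
        name="scheduler",
        image="my-airflow:latest",          # hypothetical dependencies image
        command=["airflow", "scheduler"],
        volume_mounts=[dags_mount])         # DAGs folder configured as /dags

    git_sync = client.V1Container(
        name="git-sync",
        image="my-git-sync:latest",         # hypothetical git-sync build
        args=["--repo=https://github.com/example/dags.git",  # hypothetical repo
              "--dest=/dags",
              "--wait=30"],                 # flag names are assumptions
        volume_mounts=[dags_mount])

    pod_spec = client.V1PodSpec(
        containers=[scheduler, git_sync], volumes=[dags_volume])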

Best,
Jeremiah



On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <gt...@gmail.com> wrote:

> It would be really good if you'd share experiences on how to run this on
> kubernetes and ECS. I'm not aware of a good guide on how to run it on
> either of them, for example, but it's a very useful and quick setup to
> start with, especially combining that with deployment manager and
> (probably) cloudformation.
>
> I'm talking to someone else who's looking at running on kubernetes and
> potentially open-sourcing a generic template for kubernetes deployments.
>
> Would it be possible to share your experiences?  What tech are you using
> for specific issues?
>
> - how do you deploy and sync dags?  Are you using EFS?
> - how do you build the container with airflow + executables?
> - where do you send log files or log lines to?
> - High Availability, and how?
>
> Really looking forward to how that's done, so we can put this on the wiki.
>
> Especially since GCP is now also starting to embrace airflow, it'd be good
> to have a better understanding of how easily and quickly airflow can be
> deployed on gcp:
>
>
> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>
> Rgds,
>
> Gerard
>
>
> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <ap...@lumoslabs.com>
> wrote:
>
> > for what it's worth we've been running airflow on ECS for a few years
> > already.
> >
> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
> > grantnicholas2015@u.northwestern.edu> wrote:
> >
> > > Is having a static set of workers necessary? Launching a job on
> > Kubernetes
> > > from a cached docker image takes a few seconds max. I think this is an
> > > acceptable delay for a batch processing system like airflow.
> > >
> > > Additionally, if you dynamically launch workers you can start
> dynamically
> > > launching *any type* of worker and you don't have to statically
> allocate
> > > pools of worker types. I.e., a single DAG could use a scala docker image
> to
> > > do spark calculations, a C++ docker image to use some low level
> numerical
> > > library,  and a python docker image by default to do any generic
> airflow
> > > stuff. Additionally, you can size workers according to their usage.
> Maybe
> > > the spark driver program only needs a few GBs of RAM but the C++
> > numerical
> > > library needs many hundreds.
> > >
> > > I agree there is a bit of extra book-keeping that needs to be done, but
> > > the tradeoff is an important one to explicitly make.
> > >
> >
>

Re: Airflow kubernetes executor

Posted by Gerard Toonstra <gt...@gmail.com>.
It would be really good if you'd share experiences on how to run this on
kubernetes and ECS. I'm not aware of a good guide on how to run it on
either of them, for example, but it's a very useful and quick setup to
start with, especially combining that with deployment manager and
(probably) cloudformation.

I'm talking to someone else who's looking at running on kubernetes and
potentially open-sourcing a generic template for kubernetes deployments.


Would it be possible to share your experiences?  What tech are you using
for specific issues?

- how do you deploy and sync dags?  Are you using EFS?
- how do you build the container with airflow + executables?
- where do you send log files or log lines to?
- High Availability, and how?

Really looking forward to how that's done, so we can put this on the wiki.

Especially since GCP is now also starting to embrace airflow, it'd be good
to have a better understanding of how easily and quickly airflow can be
deployed on gcp:

https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow

Rgds,

Gerard


On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <ap...@lumoslabs.com>
wrote:

> for what it's worth we've been running airflow on ECS for a few years
> already.
>
> On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
> grantnicholas2015@u.northwestern.edu> wrote:
>
> > Is having a static set of workers necessary? Launching a job on
> Kubernetes
> > from a cached docker image takes a few seconds max. I think this is an
> > acceptable delay for a batch processing system like airflow.
> >
> > Additionally, if you dynamically launch workers you can start dynamically
> > launching *any type* of worker and you don't have to statically allocate
> > pools of worker types. I.e., a single DAG could use a scala docker image to
> > do spark calculations, a C++ docker image to use some low level numerical
> > library,  and a python docker image by default to do any generic airflow
> > stuff. Additionally, you can size workers according to their usage. Maybe
> > the spark driver program only needs a few GBs of RAM but the C++
> numerical
> > library needs many hundreds.
> >
> > I agree there is a bit of extra book-keeping that needs to be done, but
> > the tradeoff is an important one to explicitly make.
> >
>

Re: Airflow kubernetes executor

Posted by Arthur Purvis <ap...@lumoslabs.com>.
For what it's worth, we've been running airflow on ECS for a few years
already.

On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
grantnicholas2015@u.northwestern.edu> wrote:

> Is having a static set of workers necessary? Launching a job on Kubernetes
> from a cached docker image takes a few seconds max. I think this is an
> acceptable delay for a batch processing system like airflow.
>
> Additionally, if you dynamically launch workers you can start dynamically
> launching *any type* of worker and you don't have to statically allocate
> pools of worker types. I.e., a single DAG could use a scala docker image to
> do spark calculations, a C++ docker image to use some low level numerical
> library,  and a python docker image by default to do any generic airflow
> stuff. Additionally, you can size workers according to their usage. Maybe
> the spark driver program only needs a few GBs of RAM but the C++ numerical
> library needs many hundreds.
>
> I agree there is a bit of extra book-keeping that needs to be done, but
> the tradeoff is an important one to explicitly make.
>

Re: Airflow kubernetes executor

Posted by Grant Nicholas <gr...@u.northwestern.edu>.
Is having a static set of workers necessary? Launching a job on Kubernetes from a cached docker image takes a few seconds at most. I think this is an acceptable delay for a batch-processing system like airflow.

Additionally, if you dynamically launch workers you can start dynamically launching *any type* of worker, and you don't have to statically allocate pools of worker types. I.e., a single DAG could use a scala docker image to do spark calculations, a C++ docker image to use some low-level numerical library, and a python docker image by default to do any generic airflow stuff. Additionally, you can size workers according to their usage: maybe the spark driver program only needs a few GBs of RAM but the C++ numerical library needs many hundreds (a hypothetical sketch follows).

I agree there is a bit of extra book-keeping that needs to be done, but the tradeoff is an important one to make explicitly.
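
To illustrate the per-task sizing Grant describes, here is a hypothetical
sketch with the kubernetes Python client; every image name and resource
figure below is invented:

    from kubernetes import client

    def task_pod(name, image, cpu, memory, command):
        """Build a pod spec where each task picks its own image and resources."""
        container = client.V1Container(
            name=name,
            image=image,
            command=command,
            resources=client.V1ResourceRequirements(
                requests={"cpu": cpu, "memory": memory}))
        return client.V1Pod(
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1PodSpec(containers=[container], restart_policy="Never"))

    # One DAG, three differently shaped workers:
    spark_driver = task_pod("spark-task", "my-scala-spark:latest", "2", "4Gi",
                            ["spark-submit", "job.jar"])
    cpp_solver = task_pod("solver-task", "my-cpp-solver:latest", "8", "256Gi",
                          ["/opt/solver/run"])
    generic = task_pod("generic-task", "my-airflow-worker:latest", "1", "1Gi",
                       ["airflow", "run", "some_dag", "some_task", "2017-07-12"])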

Re: Airflow kubernetes executor

Posted by Gerard Toonstra <gt...@gmail.com>.
On Thu, Jul 6, 2017 at 12:35 AM, Daniel Imberman <da...@gmail.com>
wrote:

> Hi  Gerard,
>
> Thank you for your feedback/details of your current set up. I would
> actually really like to jump on a skype/hangout call with you to see where
> we can collaborate on this effort.
>

Send me a PM to work something out. I don't mind weekends; this Saturday is OK.


>
> > We deploy dags using an Elastic File System (shared across all
> instances), which then map this read-only into the docker
> container.
>
> EFS appears to be one of the options that you can use to create a
> persistent volume on kubernetes. One piece of feedback I had received from
> the airflow team is that they wanted as much flexibility as possible for
> users to decide where to store their dags/plugins. They might want
> something similar even if you are running airflow in an ECS environment:
> https://ngineered.co.uk/blog/using-amazon-efs-to-persist-and-share-between-conatiners-data-in-kubernetes
>
>
I think we had to make the volume rw to get this to work, but today's run
was able to process everything.


> In terms of tooling:  The current airflow config is somewhat static in the
> sense that it does not reconfigure itself to the (now) dynamic environment.
> You'd think that airflow should have to query the environment to figure out
> parallelism instead of statically specifying this.
>
> I'm not super strongly opinionated on whether the parallelism should be
> handled via the environment or the config file. The config file seems to
> make sense since airflow users already expect to fill out these configs in
> the config file.
>
> That said, it definitely would make sense to allow multiple namespaces to
> use the same airflow.cfg and just define the parallelism in its own
> configuration. Would like further feedback on this.
>

I think initially airflow was provisioned with a static set of workers, and
most setups use a specific set of worker machines that are maintained.
What we intend to do is experiment with using file change notifications
inside python processes to detect changes to, specifically, the airflow.cfg
file. There are libs in python to detect this (a sketch follows below); I
don't know how this works out on shared volumes though.

The idea is that when the airflow.cfg changes, the workers and schedulers
reload the new config, so that processes do not need to be restarted.
The scheduler can then reconfigure itself to use higher degrees of
parallelism. We still need to investigate which attributes specifically
should be changeable.

If that works, we can deploy a simple tool to manipulate only the safe
attributes, so that it's possible to reconfigure the entire cluster on the
fly; in the backend, it would then rewrite the airflow.cfg file. That would
mean others can just log into the scheduler box and adjust settings while
it's running, so it's a change that would benefit everyone.
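
One such lib is watchdog; a minimal sketch of the idea follows (the reload
callback is hypothetical, and whether inotify-style events fire on shared
volumes is exactly the open question above):

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    AIRFLOW_CFG = "/usr/local/airflow/airflow.cfg"

    def reload_safe_settings():
        # Hypothetical: re-parse airflow.cfg and apply only whitelisted
        # attributes (e.g. parallelism) to the running scheduler/workers.
        pass

    class ConfigReloader(FileSystemEventHandler):
        """Re-read airflow.cfg whenever it changes on disk."""
        def on_modified(self, event):
            if event.src_path == AIRFLOW_CFG:
                reload_safe_settings()

    observer = Observer()
    observer.schedule(ConfigReloader(), path="/usr/local/airflow",
                      recursive=False)
    observer.start()  # watches in a background thread
    while True:
        time.sleep(60)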


> > About redeploying instances:  We see this as a potential issue for our
> > setup. My take is that jobs simply shouldn't take that much time in
> > principle to start with, which avoids having to worry about this. If
> > that's ridiculous, shouldn't it be a concern of the environment airflow
> > runs in rather than airflow itself?  I.e.... further tool out kubernetes
> > CLI's / operators to query the environment to plan/deny/schedule this
> > kind of work automatically. Because k8s was probably built from the
> > perspective of handling short-running queries, running anything
> > long-term on that is going to naturally compete with the architecture.
>
> I would disagree that jobs "shouldn't take that much time." Multiple use
> cases that we are developing our airflow system for can take over a week to
> run. This does raise an interesting question of what to do if airflow dies
> but the tasks are still running.
> One aspect of how we're running this implementation that would help WRT
> restarting the scheduler is that each pod is its own task with its own
> heartbeat to the SQL source-of-truth.
> This means that even if the scheduler is re-started, as long as we can
> scan for currently running jobs, we can technically continue the DAG
> execution with no interruption. Would want further feedback on whether the
> community wants this ability.
>

Hmm... so maybe that was a bit too quick on the trigger. What I mean is
that I wouldn't advise anyone to have airflow workers running for more than
an hour, for reliability reasons. A bigdata job would typically by itself
have plenty of failover capabilities in the cluster to redo parts of the
job when instances die or become unstable, but here you introduce a single
point of failure on a week's usage of resources.

I'd advise for that reason to redesign the workflow itself to be able to
deal with dying workers: kick off the job, save the job id somewhere, then
frequently poll on that job id, exit early, and build in the necessary
number of retries (a rough sketch follows). Another option is to use
LatestOnly and branching to make the dag succeed until the job id has a
known exit status.
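
A rough sketch of that submit-then-poll shape with 2017-era Airflow
operators; submit_job and get_job_status stand in for whatever client the
cluster exposes (both hypothetical):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.sensors import BaseSensorOperator

    def submit_job():
        raise NotImplementedError  # hypothetical cluster client call

    def get_job_status(job_id):
        raise NotImplementedError  # hypothetical cluster client call

    dag = DAG("poll_example", start_date=datetime(2017, 7, 1),
              schedule_interval=None)

    def submit(**context):
        # Kick off the long-running job and save its id for later polls.
        context["ti"].xcom_push(key="job_id", value=submit_job())

    class JobDoneSensor(BaseSensorOperator):
        """Exit early on each poke; retries make dying workers survivable."""
        def poke(self, context):
            job_id = context["ti"].xcom_pull(task_ids="submit", key="job_id")
            status = get_job_status(job_id)
            if status == "FAILED":
                raise ValueError("job %s failed" % job_id)
            return status == "SUCCEEDED"

    submit_task = PythonOperator(task_id="submit", python_callable=submit,
                                 provide_context=True, retries=3, dag=dag)
    wait_task = JobDoneSensor(task_id="wait", poke_interval=60,
                              timeout=7 * 24 * 3600, retries=3, dag=dag)
    submit_task >> wait_task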



One last thing I looked at today: using splunk to send all logs to a
single place. I like how it's all available, but it's also all over the
place. If you intend to use splunk as well and have dashboards or examples
of how to 'bucket' the data in the right locations (maybe something where
the entire DAG flow becomes available as one long sequential log), I'd be
happy to hear about such cases and how this was done. Probably the
currently executing DAG id and the current task id have to be used
somewhere to realign the loglines.

The nice part about splunk is that it can be realtime and you can see
things in larger contexts, like "cluster-wide".



>
> On Wed, Jul 5, 2017 at 1:26 PM Daniel Imberman <da...@gmail.com>
> wrote:
>
>> Thanks Chris, will do!
>>
>> On Wed, Jul 5, 2017 at 1:26 PM Chris Riccomini <cr...@apache.org>
>> wrote:
>>
>>> @Daniel, done! Should have access. Please create the wiki as a subpage
>>> under:
>>>
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Roadmap
>>>
>>> On Wed, Jul 5, 2017 at 1:20 PM, Daniel Imberman <
>>> daniel.imberman@gmail.com>
>>> wrote:
>>>
>>> > @chris: Thank you! My wiki name is dimberman.
>>> > @gerard: I've started writing out my reply but there's a fair amount to
>>> > respond to so I'll need a few minutes :).
>>> >
>>> > On Wed, Jul 5, 2017 at 1:17 PM Chris Riccomini <cr...@apache.org>
>>> > wrote:
>>> >
>>> > > @daniel, what's your wiki username? I can grant you access.
>>> > >
>>> > > On Wed, Jul 5, 2017 at 12:35 PM, Gerard Toonstra <
>>> gtoonstra@gmail.com>
>>> > > wrote:
>>> > >
>>> > > > Hey Daniel,
>>> > > >
>>> > > > Great work. We're looking at running airflow on AWS ECS inside
>>> docker
>>> > > > containers and making great progress on this.
>>> > > > We use redis and RDS as managed services to form a comms backbone
>>> and
>>> > > then
>>> > > > just spawn webserver, scheduler, worker and flower containers
>>> > > > as needed on ECS. We deploy dags using an Elastic File System
>>> (shared
>>> > > > across all instances), which then map this read-only into the
>>> docker
>>> > > > container.
>>> > > > We're now evaluating this setup going forward in more earnest.
>>> > > >
>>> > > > Good idea to use queues to separate dependencies or some other
>>> concerns
>>> > > > (high-mem pods?), there are many ways this way that it's possible
>>> to
>>> > > > customize where and on which hw a DAG is going to run. We're
>>> looking at
>>> > > > Cycle scaling to temporarily increase resources in a morning run
>>> and
>>> > > create
>>> > > > larger worker containers for data science tasks and perhaps some
>>> other
>>> > > > tasks.
>>> > > >
>>> > > >
>>> > > > - In terms of tooling:  The current airflow config is somewhat
>>> static
>>> > in
>>> > > > the sense that it does not reconfigure itself to the (now) dynamic
>>> > > > environment.
>>> > > >   You'd think that airflow should have to query the environment to
>>> > figure
>>> > > > out parallelism instead of statically specifying this.
>>> > > >
>>> > > > - Sometimes DAGs import hooks or operators that import
>>> dependencies at
>>> > > the
>>> > > > top. The only reason, (I think) that a scheduler needs to
>>> physically
>>> > > >   import and parse a DAG is because there may be dynamically built
>>> > > elements
>>> > > > within the DAG. If there wouldn't be static elements, it is
>>> > theoretically
>>> > > >    possible to optimize this.  Your PDF sort of hints towards a
>>> system
>>> > > > where a worker where a DAG will eventually run could parse the DAG
>>> and
>>> > > > report
>>> > > >    back a meta description of the DAG, which could simplify and
>>> > optimize
>>> > > > performance of the scheduler at the cost of network roundtrips.
>>> > > >
>>> > > > - About redeploying instances:  We see this as a potential issue
>>> for
>>> > our
>>> > > > setup. My take is that jobs simply shouldn't take that much time in
>>> > > > principle to start with,
>>> > > >    which avoids having to worry about this. If that's ridiculous,
>>> > > shouldn't
>>> > > > it be a concern of the environment airflow runs in rather than
>>> airflow
>>> > > > itself?  I.e....
>>> > > >    further tool out kubernetes CLI's / operators to query the
>>> > environment
>>> > > > to plan/deny/schedule this kind of work automatically. Because k8s
>>> was
>>> > > > probably
>>> > > >     built from the perspective of handling short-running queries,
>>> > running
>>> > > > anything long-term on that is going to naturally compete with the
>>> > > > architecture.
>>> > > >
>>> > > > - About failures and instances disappearing on failure: it's not
>>> > > desirable
>>> > > > to keep those instances around for a long time, we really do need
>>> to
>>> > > depend
>>> > > > on
>>> > > >    client logging and other services available to tell us what
>>> > happened.
>>> > > > The difference in thinking is that a pod/container is just a
>>> temporary
>>> > > > thing that runs a job
>>> > > >    and we should be interested in how the job did vs. how the
>>> > > container/pod
>>> > > > ran this. From my little experience with k8s though, I do see that
>>> it
>>> > > tends
>>> > > > to
>>> > > >    get rid of everything a little bit too quick on failure. One
>>> thing
>>> > you
>>> > > > could look into is to log onto a commonly shared volume with a
>>> specific
>>> > > > 'key' for that container,
>>> > > >    so you can always refer back to the important log file and fish
>>> this
>>> > > > out, with measures to clean up the shared filesystem on a regular
>>> > basis.
>>> > > >
>>> > > > - About rescaling and starting jobs:  it doesn't come for free as
>>> you
>>> > > > mention. I think it's a great idea to be able to scale out on busy
>>> > > > intervals (we intend to just use cycle scaling here),
>>> > > >   but a hint towards what policy or scaling strategy you intend to
>>> use
>>> > on
>>> > > > k8s is welcome there.
>>> > > >
>>> > > >
>>> > > > Gerard
>>> > > >
>>> > > >
>>> > > > On Wed, Jul 5, 2017 at 8:43 PM, Daniel Imberman <
>>> > > daniel.imberman@gmail.com
>>> > > > >
>>> > > > wrote:
>>> > > >
>>> > > > > @amit
>>> > > > >
>>> > > > > I've added the proposal to the PR for now. Should make it easier
>>> for
>>> > > > people
>>> > > > > to get to it. Will delete once I add it to the wiki.
>>> > > > >
>>> > > https://github.com/bloomberg/airflow/blob/
>>> 29694ae9903c4dad3f18fb8eb767c4
>>> > > > > 922dbef2e8/dimberman-KubernetesExecutorProposal-
>>> 050717-1423-36.pdf
>>> > > > >
>>> > > > > Daniel
>>> > > > >
>>> > > > > On Wed, Jul 5, 2017 at 11:36 AM Daniel Imberman <
>>> > > > daniel.imberman@gmail.com
>>> > > > > >
>>> > > > > wrote:
>>> > > > >
>>> > > > > > Hi Amit,
>>> > > > > >
>>> > > > > > For now the design doc is included as an attachment to the
>>> original
>>> > > > > email.
>>> > > > > > Once I am able to get permission to edit the wiki I would like
>>> add
>>> > it
>>> > > > > there
>>> > > > > > but for now I figured that this would get the ball rolling.
>>> > > > > >
>>> > > > > >
>>> > > > > > Daniel
>>> > > > > >
>>> > > > > >
>>> > > > > > On Wed, Jul 5, 2017 at 11:33 AM Amit Kulkarni <amitk@wepay.com
>>> >
>>> > > wrote:
>>> > > > > >
>>> > > > > >> Hi Daniel,
>>> > > > > >>
>>> > > > > >> I don't see link to design PDF.
>>> > > > > >>
>>> > > > > >>
>>> > > > > >> Amit Kulkarni
>>> > > > > >> Site Reliability Engineer
>>> > > > > >> Mobile:  (716)-352-3270
>>> > > > > >>
>>> > > > > >> Payments partner to the platform economy
>>> > > > > >>
>>> > > > > >> On Wed, Jul 5, 2017 at 11:25 AM, Daniel Imberman <
>>> > > > > >> daniel.imberman@gmail.com>
>>> > > > > >> wrote:
>>> > > > > >>
>>> > > > > >> > Hello Airflow community!
>>> > > > > >> >
>>> > > > > >> > My name is Daniel Imberman, and I have been working on
>>> behalf of
>>> > > > > >> Bloomberg
>>> > > > > >> > LP to create an airflow kubernetes executor/operator. We
>>> wanted
>>> > to
>>> > > > > allow
>>> > > > > >> > for maximum throughput/scalability, while keeping a lot of
>>> the
>>> > > > > >> kubernetes
>>> > > > > >> > details abstracted away from the users. Below I have a link
>>> to
>>> > the
>>> > > > WIP
>>> > > > > >> PR
>>> > > > > >> > and the PDF of the initial proposal. If anyone has any
>>> > > > > >> comments/questions I
>>> > > > > >> > would be glad to discuss this feature further.
>>> > > > > >> >
>>> > > > > >> > Thank you,
>>> > > > > >> >
>>> > > > > >> > Daniel
>>> > > > > >> >
>>> > > > > >> > https://github.com/apache/incubator-airflow/pull/2414
>>> > > > > >> >
>>> > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>

Re: Airflow kubernetes executor

Posted by Daniel Imberman <da...@gmail.com>.
Hi Gerard,

Thank you for your feedback/details of your current set up. I would
actually really like to jump on a skype/hangout call with you to see where
we can collaborate on this effort.


> Great work. We're looking at running airflow on AWS ECS inside docker
containers and making great progress on this.

That's great! I definitely think there should be a decent amount of
overlap there. We could abstract the executor to be a general "docker
deployment executor" and then have a kubernetes mode and an ECS mode. (I
would of course want to get this PR scale tested/accepted first, but that
could definitely be something we add to the roadmap).

> We deploy dags using an Elastic File System (shared across all
instances), which then map this read-only into the docker
container.

EFS appears to be one of the options that you can use to create a
persistent volume on kubernetes. One piece of feedback I had received from
the airflow team is that they wanted as much flexibility as possible for
users to decide where to store their dags/plugins. They might want
something similar even if you are running airflow in an ECS environment:
https://ngineered.co.uk/blog/using-amazon-efs-to-persist-and-share-between-conatiners-data-in-kubernetes
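
As a concrete (and entirely hypothetical) illustration, such a claim could
be declared with the kubernetes Python client like this; the name, access
mode, and size are placeholders:

    from kubernetes import client, config

    config.load_kube_config()

    # Claim against an EFS/NFS-style backing store (all names hypothetical).
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="airflow-dags"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadOnlyMany"],  # many workers read the same DAGs
            resources=client.V1ResourceRequirements(
                requests={"storage": "1Gi"})))

    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc)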



> In terms of tooling:  The current airflow config is somewhat static in the
> sense that it does not reconfigure itself to the (now) dynamic environment.
> You'd think that airflow should have to query the environment to figure out
> parallelism instead of statically specifying this.

I'm not super strongly opinionated on whether the parallelism should be
handled via the environment or the config file. The config file seems to
make sense since airflow users already expect to fill out these configs in
the config file.

That said, it definitely would make sense to allow multiple namespaces to
use the same airflow.cfg and just define the parallelism in its own
configuration. Would like further feedback on this.


> Sometimes DAGs import hooks or operators that import dependencies at the
> top. The only reason (I think) that a scheduler needs to physically import
> and parse a DAG is because there may be dynamically built elements within
> the DAG. If there wouldn't be static elements, it is theoretically
> possible to optimize this. Your PDF sort of hints towards a system where
> the worker on which a DAG will eventually run could parse the DAG and
> report back a meta description of the DAG, which could simplify and
> optimize performance of the scheduler at the cost of network roundtrips.

One thing I had enquired about early on was the possibility of saving DAGs
as individual eggs or wheels. I've gotten responses that this might cause
issues with using jinja templates; however, I am not well-versed enough on
that subject to make that statement with any authority.



> About redeploying instances:  We see this as a potential issue for our
> setup. My take is that jobs simply shouldn't take that much time in
> principle to start with, which avoids having to worry about this. If
> that's ridiculous, shouldn't it be a concern of the environment airflow
> runs in rather than airflow itself?  I.e.... further tool out kubernetes
> CLI's / operators to query the environment to plan/deny/schedule this
> kind of work automatically. Because k8s was probably built from the
> perspective of handling short-running queries, running anything long-term
> on that is going to naturally compete with the architecture.

I would disagree that jobs "shouldn't take that much time." Multiple use
cases that we are developing our airflow system for can take over a week to
run. This does raise an interesting question of what to do if airflow dies
but the tasks are still running.
One aspect of how we're running this implementation that would help WRT
restarting the scheduler is that each pod is its own task with its own
heartbeat to the SQL source-of-truth (a hypothetical sketch follows).
This means that even if the scheduler is re-started, as long as we can scan
for currently running jobs, we can technically continue the DAG execution
with no interruption. Would want further feedback on whether the community
wants this ability.
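
A hypothetical sketch of such a per-pod heartbeat against the SQL
source-of-truth; the table, columns, and connection string are made up for
illustration:

    import time
    import sqlalchemy as sa

    engine = sa.create_engine("postgresql://airflow:airflow@airflow-db/airflow")

    def heartbeat_loop(task_instance_id, interval=30):
        """Record that this pod's task is alive, so a restarted scheduler can
        scan for running jobs and adopt them instead of re-running them."""
        while True:
            with engine.begin() as conn:
                conn.execute(
                    sa.text("UPDATE task_heartbeat SET last_seen = NOW() "
                            "WHERE task_instance_id = :id"),
                    {"id": task_instance_id})
            time.sleep(interval)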


> About failures and instances disappearing on failure: it's not desirable
> to keep those instances around for a long time, we really do need to
> depend on client logging and other services available to tell us what
> happened.

There is currently an intern on the airflow team at airbnb investigating
multiple options for logging from kubernetes containers. With external
logging, users should be able to get at task logs even after the pods are
gone.


> The difference in thinking is that a pod/container is just a temporary
> thing that runs a job, and we should be interested in how the job did vs.
> how the container/pod ran this. From my little experience with k8s though,
> I do see that it tends to get rid of everything a little bit too quick on
> failure. One thing you could look into is to log onto a commonly shared
> volume with a specific 'key' for that container, so you can always refer
> back to the important log file and fish this out, with measures to clean
> up the shared filesystem on a regular basis.





> About rescaling and starting jobs:  it doesn't come for free as you
> mention. I think it's a great idea to be able to scale out on busy
> intervals (we intend to just use cycle scaling here), but a hint towards
> what policy or scaling strategy you intend to use on k8s is welcome there.

I agree that scaling out is not free, which is why we would offer a set of
options to match your scale. If you have a smaller instance, you can easily
use the github-backed persistent volume to store your code with minimal
effort. As you get to larger DAG folders, distributed file systems like
EFS, cinder, NFS, etc. will match your needs.
We are also discussing the possibility of creating a "persistent worker
mode", which would be similar to the current mesos executor, though this
would be at the cost of flexibility of resources (so it would be dependent
on the use-case).


On Wed, Jul 5, 2017 at 1:26 PM Daniel Imberman <da...@gmail.com>
wrote:

> Thanks Chris, will do!
>
> On Wed, Jul 5, 2017 at 1:26 PM Chris Riccomini <cr...@apache.org>
> wrote:
>
>> @Daniel, done! Should have access. Please create the wiki as a subpage
>> under:
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/Roadmap
>>
>> On Wed, Jul 5, 2017 at 1:20 PM, Daniel Imberman <
>> daniel.imberman@gmail.com>
>> wrote:
>>
>> > @chris: Thank you! My wiki name is dimberman.
>> > @gerard: I've started writing out my reply but there's a fair amount to
>> > respond to so I'll need a few minutes :).
>> >
>> > On Wed, Jul 5, 2017 at 1:17 PM Chris Riccomini <cr...@apache.org>
>> > wrote:
>> >
>> > > @daniel, what's your wiki username? I can grant you access.
>> > >
>> > > On Wed, Jul 5, 2017 at 12:35 PM, Gerard Toonstra <gtoonstra@gmail.com
>> >
>> > > wrote:
>> > >
>> > > > Hey Daniel,
>> > > >
>> > > > Great work. We're looking at running airflow on AWS ECS inside
>> docker
>> > > > containers and making great progress on this.
>> > > > We use redis and RDS as managed services to form a comms backbone
>> and
>> > > then
>> > > > just spawn webserver, scheduler, worker and flower containers
>> > > > as needed on ECS. We deploy dags using an Elastic File System
>> (shared
>> > > > across all instances), which then map this read-only into the docker
>> > > > container.
>> > > > We're now evaluating this setup going forward in more earnest.
>> > > >
>> > > > Good idea to use queues to separate dependencies or some other
>> concerns
>> > > > (high-mem pods?), there are many ways this way that it's possible to
>> > > > customize where and on which hw a DAG is going to run. We're
>> looking at
>> > > > Cycle scaling to temporarily increase resources in a morning run and
>> > > create
>> > > > larger worker containers for data science tasks and perhaps some
>> other
>> > > > tasks.
>> > > >
>> > > >
>> > > > - In terms of tooling:  The current airflow config is somewhat
>> static
>> > in
>> > > > the sense that it does not reconfigure itself to the (now) dynamic
>> > > > environment.
>> > > >   You'd think that airflow should have to query the environment to
>> > figure
>> > > > out parallelism instead of statically specifying this.
>> > > >
>> > > > - Sometimes DAGs import hooks or operators that import dependencies
>> at
>> > > the
>> > > > top. The only reason, (I think) that a scheduler needs to physically
>> > > >   import and parse a DAG is because there may be dynamically built
>> > > elements
>> > > > within the DAG. If there wouldn't be static elements, it is
>> > theoretically
>> > > >    possible to optimize this.  Your PDF sort of hints towards a
>> system
>> > > > where a worker where a DAG will eventually run could parse the DAG
>> and
>> > > > report
>> > > >    back a meta description of the DAG, which could simplify and
>> > optimize
>> > > > performance of the scheduler at the cost of network roundtrips.
>> > > >
>> > > > - About redeploying instances:  We see this as a potential issue for
>> > our
>> > > > setup. My take is that jobs simply shouldn't take that much time in
>> > > > principle to start with,
>> > > >    which avoids having to worry about this. If that's ridiculous,
>> > > shouldn't
>> > > > it be a concern of the environment airflow runs in rather than
>> airflow
>> > > > itself?  I.e....
>> > > >    further tool out kubernetes CLI's / operators to query the
>> > environment
>> > > > to plan/deny/schedule this kind of work automatically. Because k8s
>> was
>> > > > probably
>> > > >     built from the perspective of handling short-running queries,
>> > running
>> > > > anything long-term on that is going to naturally compete with the
>> > > > architecture.
>> > > >
>> > > > - About failures and instances disappearing on failure: it's not
>> > > desirable
>> > > > to keep those instances around for a long time, we really do need to
>> > > depend
>> > > > on
>> > > >    client logging and other services available to tell us what
>> > happened.
>> > > > The difference in thinking is that a pod/container is just a
>> temporary
>> > > > thing that runs a job
>> > > >    and we should be interested in how the job did vs. how the
>> > > container/pod
>> > > > ran this. From my little experience with k8s though, I do see that
>> it
>> > > tends
>> > > > to
>> > > >    get rid of everything a little bit too quick on failure. One
>> thing
>> > you
>> > > > could look into is to log onto a commonly shared volume with a
>> specific
>> > > > 'key' for that container,
>> > > >    so you can always refer back to the important log file and fish
>> this
>> > > > out, with measures to clean up the shared filesystem on a regular
>> > basis.
>> > > >
>> > > > - About rescaling and starting jobs:  it doesn't come for free as
>> you
>> > > > mention. I think it's a great idea to be able to scale out on busy
>> > > > intervals (we intend to just use cycle scaling here),
>> > > >   but a hint towards what policy or scaling strategy you intend to
>> use
>> > on
>> > > > k8s is welcome there.
>> > > >
>> > > >
>> > > > Gerard
>> > > >
>> > > >
>> > > > On Wed, Jul 5, 2017 at 8:43 PM, Daniel Imberman <
>> > > daniel.imberman@gmail.com
>> > > > >
>> > > > wrote:
>> > > >
>> > > > > @amit
>> > > > >
>> > > > > I've added the proposal to the PR for now. Should make it easier
>> for
>> > > > people
>> > > > > to get to it. Will delete once I add it to the wiki.
>> > > > >
>> > >
>> https://github.com/bloomberg/airflow/blob/29694ae9903c4dad3f18fb8eb767c4
>> > > > > 922dbef2e8/dimberman-KubernetesExecutorProposal-050717-1423-36.pdf
>> > > > >
>> > > > > Daniel
>> > > > >
>> > > > > On Wed, Jul 5, 2017 at 11:36 AM Daniel Imberman <
>> > > > daniel.imberman@gmail.com
>> > > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Amit,
>> > > > > >
>> > > > > > For now the design doc is included as an attachment to the
>> original
>> > > > > email.
>> > > > > > Once I am able to get permission to edit the wiki I would like
>> add
>> > it
>> > > > > there
>> > > > > > but for now I figured that this would get the ball rolling.
>> > > > > >
>> > > > > >
>> > > > > > Daniel
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Jul 5, 2017 at 11:33 AM Amit Kulkarni <am...@wepay.com>
>> > > wrote:
>> > > > > >
>> > > > > >> Hi Daniel,
>> > > > > >>
>> > > > > >> I don't see link to design PDF.
>> > > > > >>
>> > > > > >>
>> > > > > >> Amit Kulkarni
>> > > > > >> Site Reliability Engineer
>> > > > > >> Mobile:  (716)-352-3270
>> > > > > >>
>> > > > > >> Payments partner to the platform economy
>> > > > > >>
>> > > > > >> On Wed, Jul 5, 2017 at 11:25 AM, Daniel Imberman <
>> > > > > >> daniel.imberman@gmail.com>
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >> > Hello Airflow community!
>> > > > > >> >
>> > > > > >> > My name is Daniel Imberman, and I have been working on
>> behalf of
>> > > > > >> Bloomberg
>> > > > > >> > LP to create an airflow kubernetes executor/operator. We
>> wanted
>> > to
>> > > > > allow
>> > > > > >> > for maximum throughput/scalability, while keeping a lot of
>> the
>> > > > > >> kubernetes
>> > > > > >> > details abstracted away from the users. Below I have a link
>> to
>> > the
>> > > > WIP
>> > > > > >> PR
>> > > > > >> > and the PDF of the initial proposal. If anyone has any
>> > > > > >> comments/questions I
>> > > > > >> > would be glad to discuss this feature further.
>> > > > > >> >
>> > > > > >> > Thank you,
>> > > > > >> >
>> > > > > >> > Daniel
>> > > > > >> >
>> > > > > >> > https://github.com/apache/incubator-airflow/pull/2414
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: Airflow kubernetes executor

Posted by Daniel Imberman <da...@gmail.com>.
Thanks Chris, will do!

On Wed, Jul 5, 2017 at 1:26 PM Chris Riccomini <cr...@apache.org>
wrote:

> @Daniel, done! Should have access. Please create the wiki as a subpage
> under:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/Roadmap
>
> On Wed, Jul 5, 2017 at 1:20 PM, Daniel Imberman <daniel.imberman@gmail.com
> >
> wrote:
>
> > @chris: Thank you! My wiki name is dimberman.
> > @gerard: I've started writing out my reply but there's a fair amount to
> > respond to so I'll need a few minutes :).
> >
> > On Wed, Jul 5, 2017 at 1:17 PM Chris Riccomini <cr...@apache.org>
> > wrote:
> >
> > > @daniel, what's your wiki username? I can grant you access.
> > >
> > > On Wed, Jul 5, 2017 at 12:35 PM, Gerard Toonstra <gt...@gmail.com>
> > > wrote:
> > >
> > > > Hey Daniel,
> > > >
> > > > Great work. We're looking at running airflow on AWS ECS inside docker
> > > > containers and making great progress on this.
> > > > We use redis and RDS as managed services to form a comms backbone and
> > > then
> > > > just spawn webserver, scheduler, worker and flower containers
> > > > as needed on ECS. We deploy dags using an Elastic File System (shared
> > > > across all instances), which then map this read-only into the docker
> > > > container.
> > > > We're now evaluating this setup going forward in more earnest.
> > > >
> > > > Good idea to use queues to separate dependencies or some other
> concerns
> > > > (high-mem pods?), there are many ways this way that it's possible to
> > > > customize where and on which hw a DAG is going to run. We're looking
> at
> > > > Cycle scaling to temporarily increase resources in a morning run and
> > > create
> > > > larger worker containers for data science tasks and perhaps some
> other
> > > > tasks.
> > > >
> > > >
> > > > - In terms of tooling:  The current airflow config is somewhat static
> > in
> > > > the sense that it does not reconfigure itself to the (now) dynamic
> > > > environment.
> > > >   You'd think that airflow should have to query the environment to
> > figure
> > > > out parallelism instead of statically specifying this.
> > > >
> > > > - Sometimes DAGs import hooks or operators that import dependencies
> at
> > > the
> > > > top. The only reason, (I think) that a scheduler needs to physically
> > > >   import and parse a DAG is because there may be dynamically built
> > > elements
> > > > within the DAG. If there wouldn't be static elements, it is
> > theoretically
> > > >    possible to optimize this.  Your PDF sort of hints towards a
> system
> > > > where a worker where a DAG will eventually run could parse the DAG
> and
> > > > report
> > > >    back a meta description of the DAG, which could simplify and
> > optimize
> > > > performance of the scheduler at the cost of network roundtrips.
> > > >
> > > > - About redeploying instances:  We see this as a potential issue for
> > our
> > > > setup. My take is that jobs simply shouldn't take that much time in
> > > > principle to start with,
> > > >    which avoids having to worry about this. If that's ridiculous,
> > > shouldn't
> > > > it be a concern of the environment airflow runs in rather than
> airflow
> > > > itself?  I.e....
> > > >    further tool out kubernetes CLI's / operators to query the
> > environment
> > > > to plan/deny/schedule this kind of work automatically. Because k8s
> was
> > > > probably
> > > >     built from the perspective of handling short-running queries,
> > running
> > > > anything long-term on that is going to naturally compete with the
> > > > architecture.
> > > >
> > > > - About failures and instances disappearing on failure: it's not
> > > desirable
> > > > to keep those instances around for a long time, we really do need to
> > > depend
> > > > on
> > > >    client logging and other services available to tell us what
> > happened.
> > > > The difference in thinking is that a pod/container is just a
> temporary
> > > > thing that runs a job
> > > >    and we should be interested in how the job did vs. how the
> > > container/pod
> > > > ran this. From my little experience with k8s though, I do see that it
> > > tends
> > > > to
> > > >    get rid of everything a little bit too quick on failure. One thing
> > you
> > > > could look into is to log onto a commonly shared volume with a
> specific
> > > > 'key' for that container,
> > > >    so you can always refer back to the important log file and fish
> this
> > > > out, with measures to clean up the shared filesystem on a regular
> > basis.
> > > >
> > > > - About rescaling and starting jobs:  it doesn't come for free as you
> > > > mention. I think it's a great idea to be able to scale out on busy
> > > > intervals (we intend to just use cycle scaling here),
> > > >   but a hint towards what policy or scaling strategy you intend to
> use
> > on
> > > > k8s is welcome there.
> > > >
> > > >
> > > > Gerard
> > > >
> > > >
> > > > On Wed, Jul 5, 2017 at 8:43 PM, Daniel Imberman <
> > > daniel.imberman@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > @amit
> > > > >
> > > > > I've added the proposal to the PR for now. Should make it easier
> for
> > > > people
> > > > > to get to it. Will delete once I add it to the wiki.
> > > > >
> > >
> https://github.com/bloomberg/airflow/blob/29694ae9903c4dad3f18fb8eb767c4
> > > > > 922dbef2e8/dimberman-KubernetesExecutorProposal-050717-1423-36.pdf
> > > > >
> > > > > Daniel
> > > > >
> > > > > On Wed, Jul 5, 2017 at 11:36 AM Daniel Imberman <
> > > > daniel.imberman@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Amit,
> > > > > >
> > > > > > For now the design doc is included as an attachment to the
> original
> > > > > email.
> > > > > > Once I am able to get permission to edit the wiki I would like
> add
> > it
> > > > > there
> > > > > > but for now I figured that this would get the ball rolling.
> > > > > >
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 5, 2017 at 11:33 AM Amit Kulkarni <am...@wepay.com>
> > > wrote:
> > > > > >
> > > > > >> Hi Daniel,
> > > > > >>
> > > > > >> I don't see link to design PDF.
> > > > > >>
> > > > > >>
> > > > > >> Amit Kulkarni
> > > > > >> Site Reliability Engineer
> > > > > >> Mobile:  (716)-352-3270
> > > > > >>
> > > > > >> Payments partner to the platform economy
> > > > > >>
> > > > > >> On Wed, Jul 5, 2017 at 11:25 AM, Daniel Imberman <
> > > > > >> daniel.imberman@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hello Airflow community!
> > > > > >> >
> > > > > >> > My name is Daniel Imberman, and I have been working on behalf
> of
> > > > > >> Bloomberg
> > > > > >> > LP to create an airflow kubernetes executor/operator. We
> wanted
> > to
> > > > > allow
> > > > > >> > for maximum throughput/scalability, while keeping a lot of the
> > > > > >> kubernetes
> > > > > >> > details abstracted away from the users. Below I have a link to
> > the
> > > > WIP
> > > > > >> PR
> > > > > >> > and the PDF of the initial proposal. If anyone has any
> > > > > >> comments/questions I
> > > > > >> > would be glad to discuss this feature further.
> > > > > >> >
> > > > > >> > Thank you,
> > > > > >> >
> > > > > >> > Daniel
> > > > > >> >
> > > > > >> > https://github.com/apache/incubator-airflow/pull/2414
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Airflow kubernetes executor

Posted by Chris Riccomini <cr...@apache.org>.
@Daniel, done! Should have access. Please create the wiki as a subpage
under:

https://cwiki.apache.org/confluence/display/AIRFLOW/Roadmap

Re: Airflow kubernetes executor

Posted by Daniel Imberman <da...@gmail.com>.
@chris: Thank you! My wiki name is dimberman.
@gerard: I've started writing out my reply but there's a fair amount to
respond to so I'll need a few minutes :).

Re: Airflow kubernetes executor

Posted by Chris Riccomini <cr...@apache.org>.
@daniel, what's your wiki username? I can grant you access.

Re: Airflow kubernetes executor

Posted by Gerard Toonstra <gt...@gmail.com>.
Hey Daniel,

Great work. We're looking at running airflow on AWS ECS inside docker
containers and making great progress on this.
We use redis and RDS as managed services to form a comms backbone, and then
just spawn webserver, scheduler, worker and flower containers as needed on
ECS. We deploy dags using an Elastic File System (shared across all
instances), which is then mapped read-only into the docker container.
We're now evaluating this setup in earnest going forward.
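
Concretely, the read-only mapping is along these lines; this is just a
sketch using the docker SDK for python, and the paths and image name are
illustrative rather than our actual ECS task definition:

    import docker

    client = docker.from_env()

    # Worker container with the EFS-backed dag folder mapped read-only;
    # the same host path is mounted on every instance in the cluster.
    client.containers.run(
        image="our-registry/airflow-worker:latest",
        command="airflow worker",
        volumes={"/mnt/efs/dags": {"bind": "/usr/local/airflow/dags",
                                   "mode": "ro"}},
        detach=True,
    )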

Good idea to use queues to separate dependencies or other concerns
(high-mem pods?); that gives you many ways to customize where and on which
hardware a DAG is going to run. We're looking at cycle scaling to
temporarily increase resources during the morning run and to create larger
worker containers for data science tasks and perhaps some other tasks.


- In terms of tooling:  The current airflow config is somewhat static in
the sense that it does not reconfigure itself to the (now) dynamic
environment. You'd think that airflow should query the environment to
figure out parallelism instead of having it statically specified.
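
A minimal sketch of the kind of query I mean, assuming the kubernetes
python client is available; the cpus_per_task knob is made up for the
example:

    from kubernetes import client, config

    def discover_parallelism(cpus_per_task=1):
        # Derive parallelism from what the cluster can actually schedule,
        # instead of a static value in airflow.cfg.
        config.load_incluster_config()  # running inside the cluster
        v1 = client.CoreV1Api()
        total = sum(_cpus(n.status.allocatable["cpu"])
                    for n in v1.list_node().items)
        return max(1, total // cpus_per_task)

    def _cpus(quantity):
        # k8s reports cpu as whole cores ("4") or millicores ("3800m")
        if quantity.endswith("m"):
            return int(quantity[:-1]) // 1000
        return int(quantity)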

- Sometimes DAGs import hooks or operators that import dependencies at the
top. The only reason (I think) that a scheduler needs to physically import
and parse a DAG is that there may be dynamically built elements within the
DAG. If a DAG had no dynamically built elements, it would theoretically be
possible to optimize this.  Your PDF sort of hints towards a system where
the worker on which a DAG will eventually run could parse the DAG and
report back a meta-description of it, which could simplify and optimize
the performance of the scheduler at the cost of network roundtrips.
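
Roughly what I'm picturing, sketched with airflow's DagBag; how the
meta-description gets shipped back to the scheduler is left open here:

    import json

    from airflow.models import DagBag

    def describe_dags(dag_folder):
        # Parse the DAG files on the worker and boil each DAG down to a
        # meta-description the scheduler could use without importing it.
        dagbag = DagBag(dag_folder)
        meta = {
            dag_id: {
                "tasks": [t.task_id for t in dag.tasks],
                "upstream": {t.task_id: sorted(t.upstream_task_ids)
                             for t in dag.tasks},
            }
            for dag_id, dag in dagbag.dags.items()
        }
        return json.dumps(meta)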

- About redeploying instances:  We see this as a potential issue for our
setup. My take is that, in principle, jobs simply shouldn't take that much
time to begin with, which avoids having to worry about this. If that's
ridiculous, shouldn't it be a concern of the environment airflow runs in
rather than airflow itself?  I.e., further tool out kubernetes CLIs /
operators to query the environment and plan/deny/schedule this kind of
work automatically. Because k8s was probably built from the perspective of
handling short-running workloads, running anything long-term on it is
going to naturally compete with the architecture.

- About failures and instances disappearing on failure: it's not desirable
to keep those instances around for a long time; we really do need to
depend on client logging and other available services to tell us what
happened. The difference in thinking is that a pod/container is just a
temporary thing that runs a job, and we should be interested in how the
job did vs. how the container/pod ran it. From my little experience with
k8s though, I do see that it tends to get rid of everything a little bit
too quickly on failure. One thing you could look into is logging to a
commonly shared volume with a specific 'key' for that container, so you
can always refer back to the important log file and fish it out, with
measures to clean up the shared filesystem on a regular basis.
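
As a minimal sketch, assuming a shared volume mounted at /mnt/airflow-logs
and the pod name as the 'key':

    import logging
    import os

    def attach_shared_volume_handler(logger_name="airflow.task"):
        # Inside a k8s pod, HOSTNAME defaults to the pod name, which makes
        # a convenient per-container key for the file on the shared volume.
        key = os.environ.get("HOSTNAME", "unknown-pod")
        handler = logging.FileHandler("/mnt/airflow-logs/%s.log" % key)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logging.getLogger(logger_name).addHandler(handler)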

- About rescaling and starting jobs:  it doesn't come for free, as you
mention. I think it's a great idea to be able to scale out during busy
intervals (we intend to just use cycle scaling here), but a hint towards
what policy or scaling strategy you intend to use on k8s would be welcome
there.
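
For what it's worth, our idea of cycle scaling is nothing fancier than a
cron-driven replica bump; the deployment and namespace names below are
made up:

    from kubernetes import client, config

    def scale_workers(replicas):
        # Patch the worker Deployment's replica count; run from cron,
        # e.g. up to 20 before the morning run, back down to 5 after.
        config.load_kube_config()
        client.AppsV1Api().patch_namespaced_deployment(
            name="airflow-worker",
            namespace="airflow",
            body={"spec": {"replicas": replicas}},
        )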


Gerard

Re: Airflow kubernetes executor

Posted by Daniel Imberman <da...@gmail.com>.
@amit

I've added the proposal to the PR for now. Should make it easier for people
to get to it. Will delete once I add it to the wiki.
https://github.com/bloomberg/airflow/blob/29694ae9903c4dad3f18fb8eb767c4922dbef2e8/dimberman-KubernetesExecutorProposal-050717-1423-36.pdf

Daniel

Re: Airflow kubernetes executor

Posted by Daniel Imberman <da...@gmail.com>.
Hi Amit,

For now the design doc is included as an attachment to the original email.
Once I am able to get permission to edit the wiki I would like to add it
there, but for now I figured that this would get the ball rolling.

Daniel

Re: Airflow kubernetes executor

Posted by Amit Kulkarni <am...@wepay.com>.
Hi Daniel,

I don't see a link to the design PDF.


Amit Kulkarni
Site Reliability Engineer
Mobile:  (716)-352-3270

Payments partner to the platform economy
>