Posted to dev@airflow.apache.org by da...@ssense.com on 2019/03/06 14:58:03 UTC

[Discuss] Airflow Kubernetes worker configuration should be parsed from YAML

Hi,

I would like to discuss parsing YAML for the Kubernetes worker configuration, instead of the current approach of programmatically generating it via the Pod and PodRequestFactory classes.

*Motivation:*

Kubernetes configuration is quite complex. Instead of using the configuration format Kubernetes offers natively (YAML), the current method programmatically recreates this YAML file. Fully re-implementing the configuration surface in Airflow takes a lot of time, and at the moment many features available through YAML configuration are not available in Airflow. Furthermore, as the Kubernetes API evolves, the Airflow codebase will have to change with it, leaving Airflow in a constant state of catching up with missing features. All of this can be avoided by simply parsing the YAML file.

*Idea:*

Either pass in the YAML as a string, or accept a path to the YAML file.
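For illustration, here is a minimal sketch of the parsing side, assuming PyYAML; the helper name load_worker_pod_template is made up, and nothing here is an existing Airflow API:

    import os
    import yaml  # PyYAML

    def load_worker_pod_template(yaml_or_path):
        """Return the worker pod spec as a dict, from a YAML string or a file path."""
        if os.path.isfile(yaml_or_path):
            with open(yaml_or_path) as f:
                return yaml.safe_load(f)
        return yaml.safe_load(yaml_or_path)

The executor could then hand the resulting dict to the Kubernetes client as-is, instead of rebuilding the spec field by field.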

Re: [Discuss] Airflow Kubernetes worker configuration should be parsed from YAML

Posted by Eamon Keane <ea...@gmail.com>.
Thanks for starting the discussion, David.

Any templating should apply to both Kubernetes Airflow workers and
Kubernetes pod operators. I estimate there are currently around 20 objects
missing from the pod spec (see kubectl explain pod.spec --recursive).

The main challenge would probably be getting Airflow up to speed with the
full current spec, rather than keeping up with future changes, as the
kitchen sink already appears to be in there.

For comparison, Jenkins uses a combination of four sources for its pod
templates:

* Built-in Java objects covering most but not all of the pod spec
* YAML strings
* YAML files
* Inheritance from base templates

https://github.com/jenkinsci/kubernetes-plugin/blob/master/README.md

Something along the lines of the CRDs you mention, James, might be Tekton
(formerly knative/build-pipeline). Tekton is still at an early stage, but
Jenkins X, for example, is switching to it for its pipelines. I haven't
examined it in enough detail to know whether it would fit neatly with Airflow.

https://github.com/knative/build-pipeline/releases/tag/v0.1.0

On Wed, Mar 6, 2019 at 3:18 PM James Meickle
<jm...@quantopian.com.invalid> wrote:

> [...]

Re: [Discuss] Airflow Kubernetes worker configuration should be parsed from YAML

Posted by James Meickle <jm...@quantopian.com.INVALID>.
I'm in favor of having a YAML-based option for Kubernetes. We've had to
internally subclass the Kubernetes operator because it doesn't do what we
need out of the box, for instance letting us intercept the object it creates
right before submission so that we can patch in missing features. I think
it would make sense to make this a sibling class to the existing operator,
since it can reuse the same watching/submitting logic but accept YAML
instead. Using the existing Airflow templating system here would make sense
too, of course.
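
For illustration, the patching step looks roughly like this (the helper name
and the specific toleration are made up for this sketch, not our actual code):

    import copy

    def patch_pod(pod_spec):
        """Patch fields the operator does not expose, e.g. add a toleration."""
        patched = copy.deepcopy(pod_spec)
        tolerations = patched.setdefault("spec", {}).setdefault("tolerations", [])
        tolerations.append({
            "key": "dedicated",
            "operator": "Equal",
            "value": "airflow",
            "effect": "NoSchedule",
        })
        return patched

The subclass calls something like this on the generated pod object right
before submission; the exact hook point depends on the Airflow version.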

However, what I'd really like to see is a Helm operator!

Airflow tasks often require temporary resources. Here's an example: we run
the same container in ~12 different configurations, each requiring slightly
different ConfigMaps. Right now we have to manage those ConfigMaps out of
band, because Airflow has no way to maintain or update them. This can lead
to pushing new code to Airflow but forgetting to update the ConfigMaps.

What would be ideal for us is to define the task _and_ its necessary
resources in a Helm chart (either in the same repo as the DAG, or pointing
to a semver tag). Then the operator would wait for the entire chart to
finish successfully, including creating and tearing down resources as
required.
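
To make that concrete, here is a hypothetical sketch; HelmOperator and its
parameters do not exist anywhere, and it simply shells out to
"helm upgrade --install --wait", which blocks until the chart's resources
are ready:

    import subprocess
    from airflow.models import BaseOperator

    class HelmOperator(BaseOperator):
        """Hypothetical operator: install/upgrade a chart and wait for it."""

        def __init__(self, release, chart, values_file=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.release = release
            self.chart = chart
            self.values_file = values_file

        def execute(self, context):
            cmd = ["helm", "upgrade", "--install", "--wait",
                   self.release, self.chart]
            if self.values_file:
                cmd += ["-f", self.values_file]
            subprocess.run(cmd, check=True)  # non-zero exit fails the task

Tearing the chart's resources back down would need a corresponding
"helm delete" step, which is part of what makes one-off task semantics
awkward (more on that below).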

This would also help in scenarios where we want to run a task outside of
Airflow. Right now, a lot of our tasks are "baked into" the DAG and can't
be run without either going through Airflow, or manually copying config
options from the DAG code. Declaring a task as a resource, and then just
referencing that resource from Airflow, would allow us to also reference
that resource in other systems in our infrastructure and ensure that it
gets invoked in an identical way.

Unfortunately, Helm itself lacks a real concept of "one-off" tasks. We
started to build something like this in-house but ran into roadblocks. We
looked into hacks like storing task definitions in a CronJob, but I came to
the conclusion that a TaskTemplate CRD would be needed to support this kind
of workflow.

On Wed, Mar 6, 2019 at 10:06 AM david.lum@ssense.com <da...@ssense.com>
wrote:

> [...]