Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/21 17:31:31 UTC

[GitHub] [airflow] jaketf opened a new issue #10454: Add Terraform Hook

jaketf opened a new issue #10454:
URL: https://github.com/apache/airflow/issues/10454


   
   **Description**
   
   Create a terraform integration for apache airflow.
   
   **Use case / motivation**
   
   Use terraform to manage ephemeral infrastructure used in airflow DAGs, taking advantage of its "drift" detection features and wide array of existing integrations. For teams who use terraform this could replace tasks like the create / delete dataproc cluster operators. This could be really interesting for automating nightly large-scale e2e integration tests of your terraform and data pipeline code bases (terraform apply >> run data pipelines >> terraform destroy).
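
The nightly e2e pattern above can be sketched as three ordered stages. The concrete commands and the pipeline entry point below are hypothetical placeholders, not a committed design:

```python
# Sketch of the nightly e2e flow: provision, run pipelines, always destroy.
PROVISION = "terraform apply -auto-approve"
RUN_PIPELINES = "run_data_pipelines.sh"  # hypothetical pipeline entry point
DESTROY = "terraform destroy -auto-approve"

def e2e_stages():
    """Return the ordered stages; destroy must always run last."""
    return [PROVISION, RUN_PIPELINES, DESTROY]
```

In a DAG each stage would map to one task (e.g. a BashOperator), with the destroy task configured to run even when upstream fails (trigger_rule='all_done').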
   
   
   **Related Issues**
   
   Inspired by [this discussion](https://github.com/apache/airflow/pull/9593#issuecomment-667906041)  in #9593
   
   cc: @brandonjbjelland @potiuk 
   
   Brandon and I will discuss a design for this next week.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #10454: Add Terraform Hook

potiuk commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-678424669


   Looking forward to it!





[GitHub] [airflow] kaxil commented on issue #10454: Add native Terraform integration

kaxil commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-685675965


   > How would you pull your terraform configuration source?
   > in the bash operator or the setup environment or was this very small terraform configurations embeded in your DAG code?
   
   We had everything in the bash script; our terraform modules were on a private Gitlab repo, but the ssh key of our Airflow box was added to Gitlab, and the native Terraform + git integration worked fine when running terraform init for us.





[GitHub] [airflow] kaxil commented on issue #10454: Add Terraform Hook

kaxil commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-678670024


   I would definitely recommend using BashOperator for this :). Just to clear the air, I am not against a TerraformOperator for sure, but I have used a lot of terraform with Airflow in the past and BashOperator has just worked perfectly without having to care about different TF versions
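
As a rough illustration of the BashOperator approach (the path and flags below are assumptions, not kaxil's actual script), the whole run can be a single shell command that Airflow merely schedules:

```python
def terraform_bash_command(workdir, action="apply"):
    """Build the one-liner a BashOperator task could run (sketch).

    `workdir` is a hypothetical checkout containing the terraform
    configuration; the flags keep the run non-interactive.
    """
    return (
        f"cd {workdir} && "
        "terraform init -input=false && "
        f"terraform {action} -auto-approve -input=false"
    )
```

A task would then be something like `BashOperator(task_id='tf_apply', bash_command=terraform_bash_command('/opt/tf/my_stack'))`.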





[GitHub] [airflow] potiuk commented on issue #10454: Add Terraform Hook

potiuk commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-678671949


   I think it would be nice to have a Hook/Operator that could manage installing Terraform automatically and expose a Python API. There is the https://pypi.org/project/python-terraform/ wrapper, and installing terraform is basically downloading the right binary from https://www.terraform.io/downloads.html (you can even specify the version of terraform). This way you could use the power of terraform without worrying about having it installed on your worker.
   
   I think just mentioning that there is a "Terraform" hook/operator is something that can make more Airflow users aware that they can actually use Terraform rather than dedicated actions. Plus I think terraform scripts are often rather complex - usually they are stored somewhere in a repository, not necessarily in the DAG's folder, so it would be great to have an option to somehow download (git-sync?) a specified set of terraform scripts. Or maybe we can think up some more "airflow-y" way of distributing such scripts.
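
Since installing terraform is just fetching a zip of the right binary, a hook could derive the download URL from the requested version and platform. The URL scheme below follows HashiCorp's releases site; the helper itself is a sketch:

```python
import platform

RELEASES_BASE = "https://releases.hashicorp.com/terraform"

def terraform_download_url(version, system=None, arch="amd64"):
    """Build the download URL for a given terraform version (sketch).

    HashiCorp publishes zips named terraform_<ver>_<os>_<arch>.zip.
    """
    system = (system or platform.system()).lower()  # 'linux', 'darwin', ...
    return f"{RELEASES_BASE}/{version}/terraform_{version}_{system}_{arch}.zip"
```

A hook would download this zip, unpack the single `terraform` binary into a cache directory, and put it on the task's PATH.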





[GitHub] [airflow] jaketf edited a comment on issue #10454: Add native Terraform integration

jaketf edited a comment on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-680139027


   @kaxil How would you pull your terraform configuration source?
   In the bash operator or the setup-environment step, or were these very small terraform configurations embedded in your DAG code?
   
   Our idea would be to provide some abstraction of that "set up the environment" step to make it easy to have a terraform binary running in your airflow execution environment.
   
   My first thought was to essentially create a subclass of KubernetesPodOperator with a git-sync initialization container and a terraform container.
   The idea being we could provide reasonable defaults for the terraform image (e.g. the official image from Docker Hub), or the user could override this with a container w/ additional binaries (gcloud, providers, etc.). In theory the user could also specify an image for popular wrappers like [terragrunt](https://terragrunt.gruntwork.io/).
   
   This would give a lot of flexibility to the advanced user and takes care of a lot of boilerplate.
   ```python
   tf_task = TerraformOperator(
     command='terraform apply -auto-approve',
     git_ssh_key_secret_name='my_tf_git_sync',
     sub_path='terraform/my_dir',
     terraform_image='hashicorp/terraform:latest',
     gcp_secret_name='gcp-terraform-key',
     aws_secret_path=None,
     azure_secret_path=None,
   )
   ``` 
   
   The drawback naturally is for non-k8s based airflow deployments.
   
   I think to make this really useful at most enterprises we need to think about how to best handle secrets.
   Terraform often needs a lot of permissions, so if there was an opportunity to manage the secret for terraform's credentials outside of airflow this would be ideal (so not every DAG can bootstrap the god-like permissions for terraform).
   
   The idea here is that the user manages a k8s secret for the god-like terraform credentials and just tells our operator this secret name so we can mount it as a volume in our pod definition.





[GitHub] [airflow] kaxil edited a comment on issue #10454: Add native Terraform integration

kaxil edited a comment on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-685675965


   > How would you pull your terraform configuration source?
   > in the bash operator or the setup environment or was this very small terraform configurations embedded in your DAG code?
   
   We had everything in the bash script; our terraform modules were on a private Gitlab repo, but the ssh key of our Airflow box was added to Gitlab, and the native Terraform + git integration worked fine when running terraform init for us.
   
   Although I definitely feel there are more users out there who would be happy with your solution too :) so yes, please feel free to PR the TerraformOperator.





[GitHub] [airflow] jaketf edited a comment on issue #10454: Add native Terraform integration

jaketf edited a comment on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-680139027


   @kaxil How would you pull your terraform configuration source?
   In the bash operator or the setup-environment step, or were these very small terraform configurations embedded in your DAG code?
   
   Our idea would be to provide some abstraction of that "set up the environment" step to make it easy to have a terraform binary running in your airflow execution environment.
   
   My first thought was to essentially create a wrapper of KubernetesPodOperator with a git-sync initialization container and a terraform container.
   The idea being we could provide reasonable defaults for the terraform image (e.g. the official image from Docker Hub), or the user could override this with a container w/ additional binaries (gcloud, providers, etc.). In theory the user could also specify an image for popular wrappers like [terragrunt](https://terragrunt.gruntwork.io/).
   
   This would give a lot of flexibility to the advanced user and takes care of a lot of boilerplate.
   ```python
   tf_task = TerraformOperator(
     command='apply -auto-approve',
     git_ssh_key_secret_name='my_tf_git_sync',
     sub_path='terraform/my_dir',
     terraform_image='hashicorp/terraform:latest',
     gcp_secret_path='/var/secrets/key.json',
   )
   ``` 
   
   The drawback naturally is for non-k8s based airflow deployments.
   
   I think to make this really useful at most enterprises we need to think about how to best handle secrets.
   Terraform often needs a lot of permissions, so if there was an opportunity to manage the secret for terraform's credentials outside of airflow this would be ideal (so not every DAG can bootstrap the god-like permissions for terraform).





[GitHub] [airflow] kaxil commented on issue #10454: Add Terraform Hook

kaxil commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-678672286


   Just an FYI: https://pypi.org/project/python-terraform/ (https://github.com/beelit94/python-terraform) hasn't been actively maintained, so if someone wants to work on this one, try finding another library





[GitHub] [airflow] jaketf edited a comment on issue #10454: Add native Terraform integration

jaketf edited a comment on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-680139027


   @kaxil How would you pull your terraform configuration source?
   In the bash operator or the setup-environment step, or were these very small terraform configurations embedded in your DAG code?
   
   Our idea would be to provide some abstraction of that "set up the environment" step to make it easy to have a terraform binary running in your airflow execution environment.
   
   My first thought was to essentially create a subclass of KubernetesPodOperator with a git-sync initialization container and a terraform container.
   The idea being we could provide reasonable defaults for the terraform image (e.g. the official image from Docker Hub), or the user could override this with a container w/ additional binaries (gcloud, providers, etc.). In theory the user could also specify an image for popular wrappers like [terragrunt](https://terragrunt.gruntwork.io/).
   
   This would give a lot of flexibility to the advanced user and takes care of a lot of boilerplate.
   ```python
   tf_task = TerraformOperator(
     command='terraform apply -auto-approve',
     git_ssh_key_secret_name='my_tf_git_sync',
     sub_path='terraform/my_dir',
     terraform_image='hashicorp/terraform:latest',
     gcp_secret_name='gcp-terraform-key',
     aws_secret_path=None,
     azure_secret_path=None,
   )
   ``` 
   
   The drawback naturally is for non-k8s based airflow deployments.
   
   I think to make this really useful at most enterprises we need to think about how to best handle secrets.
   Terraform often needs a lot of permissions, so if there was an opportunity to manage the secret for terraform's credentials outside of airflow this would be ideal (so not every DAG can bootstrap the god-like permissions for terraform).
   
   The idea here is that the user manages a k8s secret for the god-like terraform credentials and just tells our operator this secret name so we can mount it as a volume in our pod definition. Alternatively these secret names could be omitted and we could fall back on provider-specific magic like [workload identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).





[GitHub] [airflow] jaketf commented on issue #10454: Add Terraform Hook

jaketf commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-679143760


   @kaxil I'd love to hear more about your use of terraform in airflow.
   Did you bake terraform installation into your airflow image?
   I agree w/ @potiuk that we want to make this easier to install
   > without having to care about different TF versions
   
   How does using the BashOperator eliminate the need to care about the terraform version? Did your bash script install terraform each time?
   
   Something that comes to mind for me is that, beyond the terraform binary itself, terraform scripts sometimes depend on other binaries, e.g. custom providers, or shell out to do hacky things from the null provider.
   
   The first thing that came to mind was KubernetesPodOperator, making it easy to "bring your own terraform image, or use the default image for this version (pulled from Docker Hub)".





[GitHub] [airflow] jaketf edited a comment on issue #10454: Add native Terraform integration

jaketf edited a comment on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-680139027


   @kaxil How would you pull your terraform configuration source?
   In the bash operator or the setup-environment step, or were these very small terraform configurations embedded in your DAG code?
   
   Our idea would be to provide some abstraction of that "set up the environment" step to make it easy to have a terraform binary running in your airflow execution environment.
   
   My first thought was to essentially create a subclass of KubernetesPodOperator with a git-sync initialization container and a terraform container.
   The idea being we could provide reasonable defaults for the terraform image (e.g. the official image from Docker Hub), or the user could override this with a container w/ additional binaries (gcloud, providers, etc.). In theory the user could also specify an image for popular wrappers like [terragrunt](https://terragrunt.gruntwork.io/).
   
   This would give a lot of flexibility to the advanced user and takes care of a lot of boilerplate.
   ```python
   tf_task = TerraformOperator(
     command='terraform apply -auto-approve',
     git_ssh_key_secret_name='my_tf_git_sync',
     sub_path='terraform/my_dir',
     terraform_image='gcr.io/my/terragrunt-image',  # defaults to 'hashicorp/terraform:latest'
     gcp_secret_name='gcp-terraform-key',
     aws_secret_path=None,
     azure_secret_path=None,
   )
   ``` 
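
Rendered down to a pod spec, the operator sketched above might produce something like the following. Plain dicts are used for illustration; every image pin, path, and field name here is hypothetical, and a real implementation would build kubernetes client objects via KubernetesPodOperator instead:

```python
def terraform_pod_spec(command,
                       git_ssh_key_secret_name,
                       sub_path,
                       terraform_image="hashicorp/terraform:latest",
                       gcp_secret_name=None):
    """Sketch of the pod a TerraformOperator could render (names hypothetical)."""
    volumes = [
        {"name": "workspace", "emptyDir": {}},
        {"name": "git-ssh-key",
         "secret": {"secretName": git_ssh_key_secret_name}},
    ]
    # The git-sync init container clones the terraform source into the
    # shared workspace volume before the terraform container starts.
    init_containers = [{
        "name": "git-sync",
        "image": "registry.k8s.io/git-sync/git-sync:v4.2.3",  # hypothetical pin
        "volumeMounts": [
            {"name": "workspace", "mountPath": "/workspace"},
            {"name": "git-ssh-key", "mountPath": "/etc/git-secret",
             "readOnly": True},
        ],
    }]
    main = {
        "name": "terraform",
        "image": terraform_image,
        "workingDir": f"/workspace/{sub_path}",
        "command": ["sh", "-c", f"terraform init && terraform {command}"],
        "volumeMounts": [{"name": "workspace", "mountPath": "/workspace"}],
        "env": [],
    }
    if gcp_secret_name:
        # The god-like credentials stay in a k8s secret managed outside
        # Airflow; the operator only mounts it by name.
        volumes.append({"name": "gcp-key",
                        "secret": {"secretName": gcp_secret_name}})
        main["volumeMounts"].append({"name": "gcp-key",
                                     "mountPath": "/var/secrets/gcp",
                                     "readOnly": True})
        main["env"].append({"name": "GOOGLE_APPLICATION_CREDENTIALS",
                            "value": "/var/secrets/gcp/key.json"})
    return {"volumes": volumes,
            "initContainers": init_containers,
            "containers": [main]}
```

The operator's job then reduces to rendering this spec from its constructor arguments and delegating execution and log streaming to KubernetesPodOperator.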
   
   The drawback naturally is for non-k8s based airflow deployments.
   
   I think to make this really useful at most enterprises we need to think about how to best handle secrets.
   Terraform often needs a lot of permissions, so if there was an opportunity to manage the secret for terraform's credentials outside of airflow this would be ideal (so not every DAG can bootstrap the god-like permissions for terraform).
   
   The idea here is that the user manages a k8s secret for the god-like terraform credentials and just tells our operator this secret name so we can mount it as a volume in our pod definition. Alternatively these secret names could be omitted and we could fall back on provider specific magic like [workload identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).





[GitHub] [airflow] jaketf commented on issue #10454: Add native Terraform integration

jaketf commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-685232736


   I think the approach of subclassing the k8s pod operator is far from how we typically build airflow integrations, and the heavy reliance on k8s secrets, as opposed to leaning on airflow secrets managers, might be overly opinionated.
   
   I guess I'd like to ask does this seem like something that belongs in airflow core or something that belongs as an opinionated plugin released elsewhere?
   
   Due to the wide variety of binaries folks end up relying on in their terraform scripts (perhaps even different terraform versions in different DAGs) and the large permissions footprint required, I think it is difficult to develop a very general integration that gives the user enough flexibility to meet their security needs.
   
   Taking a step back:
   I am concerned that I'm sort of making up these requirements based on my own speculation (I have no real connection to hashi and haven't worked w/ a gcp customer who directly asked for this feature; rather it evolved from [this discussion](https://github.com/apache/airflow/pull/9593#issuecomment-667906041)).
   Is there a good way to crowdsource from the community what their interest in / requirements for an airflow/terraform integration are? Is this best done on the dev-list? user-list? slack sig?
   Is there a good way for airflow community to engage a provider like hashi in helping design / develop this based on what they might hear from their customers?





[GitHub] [airflow] kaxil commented on issue #10454: Add Terraform Hook

kaxil commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-679178232


   Hey Jacob, in our setup the DAG author had the first task be a "set up the environment" step, where they downloaded necessary binaries from an internal [Nexus](https://www.sonatype.com/product-nexus-repository), which hosted multiple versions of the binaries too, as different teams rely on different versions of Terraform and Ansible. These binaries were stored in a temporary location and were cleared by a task at the end of the DAG.
   
   We had bash scripts to run Terraform and Ansible, as there were clients who would run these things on an ad-hoc basis via their local machine or through Jenkins. Running things via BashOperator was an easy way for us, without the maintenance overhead of both a bash script and a custom operator. But it would be different for different teams with different use-cases I suppose :)
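
The "set up the environment" / clean-up pair described above can be sketched with stdlib temp directories. The actual downloads from the internal mirror are elided, and the version pins are hypothetical examples:

```python
import shutil
import tempfile

def setup_binaries(pins=None):
    """First task: create a per-run temp dir for pinned binaries (sketch)."""
    pins = pins or {"terraform": "0.12.29", "ansible": "2.9.13"}  # hypothetical
    bin_dir = tempfile.mkdtemp(prefix="dag-bins-")
    # ...download each pinned binary from the internal mirror (e.g. Nexus)
    # into bin_dir here, then prepend bin_dir to PATH for downstream tasks...
    return bin_dir

def teardown_binaries(bin_dir):
    """Final task: clear the temporary binaries, as in the setup above."""
    shutil.rmtree(bin_dir, ignore_errors=True)
```

Because each run gets its own directory, teams can pin different tool versions per DAG without stepping on each other.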





[GitHub] [airflow] jaketf commented on issue #10454: Add Terraform Hook

jaketf commented on issue #10454:
URL: https://github.com/apache/airflow/issues/10454#issuecomment-680139027


   @kaxil How would you pull your terraform configuration source?
   In the bash operator or the setup-environment step, or were these very small terraform configurations embedded in your DAG code?
   
   Our idea would be to provide some abstraction of that "set up the environment" step to make it easy to have a terraform binary running in your airflow execution environment.
   
   My first thought was to essentially create a wrapper of KubernetesPodOperator with a git-sync initialization container and a terraform container.
   The idea being we could provide reasonable defaults for the terraform image (e.g. the official image from Docker Hub), or the user could override this with a container w/ additional binaries (gcloud, providers, etc.). In theory the user could also specify an image for popular wrappers like [terragrunt](https://terragrunt.gruntwork.io/).
   
   This would give a lot of flexibility to the advanced user and takes care of a lot of boilerplate.
   ```python
   tf_task = TerraformOperator(
     command='apply',
     git_ssh_key_secret_name='my_tf_git_sync',
     sub_path='terraform/my_dir',
     terraform_image='hashicorp/terraform:latest',
     gcp_secret_path='/var/secrets/key.json',
   )
   ``` 
   
   The drawback naturally is for non-k8s based airflow deployments.
   
   I think to make this really useful at most enterprises we need to think about how to best handle secrets.
   Terraform often needs a lot of permissions, so if there was an opportunity to manage the secret for terraform's credentials outside of airflow this would be ideal (so not every DAG can bootstrap the god-like permissions for terraform).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org