You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/10/09 17:31:38 UTC

[GitHub] [airflow] potiuk opened a new pull request #11383: Spark Integration Configuration in Helm Chart

potiuk opened a new pull request #11383:
URL: https://github.com/apache/airflow/pull/11383


   Adds configuration that allows to configure workers remotely
   submitting Spark/Hadoop jobs to a Hadoop Cluster. The configuration
   can be provided as parameters to the Helm Chart and it can create
   secrets that can be used to keep authentication and configuration
   for the Hadoop/Spark jobs.
   
   Co-authored-by: Kamil BreguĊ‚a <ka...@polidea.com>
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-711980683


   Closing in favour of linked PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706312275


   [The Workflow run](https://github.com/apache/airflow/actions/runs/297946414) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks$,^Build docs$,^Spell check docs$,^Backport packages$,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-708062447


   https://github.com/apache/airflow/pull/11426 has been merged. Should we close this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706315893


   I am not sure if we need such specialized configuration options, I would prefer fewer and more generic ones. Helm for Keycloak can be an inspiration for you. 
   See: 
   https://github.com/codecentric/helm-charts/blob/1c6149a2a9c49d2c5e56dcede7e2c0de3f20f8fe/charts/keycloak/values.yaml#L189-L192
   https://github.com/codecentric/helm-charts/tree/master/charts/keycloak#using-google-cloud-sql-proxy
   
   This configuration option would be useful if it were used by Airflow Core, or we would like to enforce some good practice. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706739772


   > @potiuk well looking at this one, if we allow for arbitrary env variables and volumes, I don't think that it would be hard to create a "airflow-hadoop" helm chart that inherits the airflow helm chart and injects values as needed.
   > 
   > I'm all for us making the chart as extensible as possible so these use-cases can be easily addressed, I still have nightmares about getting hadoop out of our unit tests/docker image.
   
   Works for me :+1: 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706673660


   I've added the schema now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706741175


   I agree too, making chart extensible would be nice. Otherwise it will just get difficult to support all these tools


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706672921


   > @jaketf I'm with @mik-laj on this one. I think this would be a pandora's box to a whole HOST of add-ons that would eventually make the helm chart really difficult to maintain.
   
   I had a second look at that one. And I think it won't be as easy to separate as KeyCloak. The problem is that we have to make it part of the worker deployment, not a separate template @mik-laj - so this is not possible too have separate template for that.
   
   Also @dimberman - That's true that the helm chart becomes rather complex now, and we might think of further modularizing it, but think Haadoop/Spark authentication is really common case and possibly it's worth including it. 
   
   I will add the missing schema parts (they were part of a separate commit, but I am open to discuss on how this can be modularized. @mik-laj -> maybe you can propose something taking into account that we have to make it part of the worker deployment. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706726515


   Sure. If we can add generic extension, this might be even better, I just wonder how difficult/ generic we can make it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706675482


   Yeah. With the JSON schema it looks better.
   
   And I would be all for merging it even without further modularization. I think what we've done with our project was a real-life example that corporates heavily relying on secure authentication mechanisms that can use the helm chart easily :). 
   
   We have now:
   * Possibility to rebuild all the dependent images by the customer
   * Kerberos support for authentication of workers including Kerberos sidecar
   * Presto DB supporting kerboeros
   * PostgreSQL mTLS authentication
   * Hadoop/Spark seamless integration (including Kerberos authentication)
   * Workfloo Identity capability via annotations
   * Capability of building the Airflow custom image using vetted binary dependencies on an air-gaped system
   * Custom Cluster Policy Violation per-Dag
   
   All this tested with a real Enterprise customer, based on their expectations and the limitations they had.
   
   I'd say all together this might be a super-powerful message to Enterprise customers once we release the Helm Chart. In many ways with all those changes, it became production-ready from the Enterprise customers point of view.
   
   We need MySQL support (But I believe someone is working on it), support for multiple schedulers for 2.0, a bit more comprehensive tests and I think we should be getting closer to releasing it as "production-ready" @dimberman.
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman edited a comment on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
dimberman edited a comment on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706739055


   @potiuk well looking at this one, if we allow for arbitrary env variables and volumes, I don't think that it would be hard to create a "airflow-hadoop" helm chart that inherits the airflow helm chart and injects values as needed.
   
   I'm all for us making the chart as extensible as possible so these use-cases can be easily addressed, I still have nightmares about getting hadoop out of our unit tests/docker image. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
dimberman commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706739055


   @potiuk well looking at this one, if we allow for arbitrary env variables and volumes, I don't think that it would be hard to create a "airflow-hadoop" helm chart that inherits the airflow helm chart and injects values as needed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706440222


   Kinda agree as well. Extracted it from our common work as part of the 20 or so commits, but indeed it does look a bit of out-of-the-blue change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706318488


   @jaketf -> I think it's up to you if you would like to rework that one as @mik-laj  suggests :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706675482


   Yeah. With the JSON schema it looks better.
   
   And I would be all for merging it even without further modularization. I think what we've done with our project was a real-life example that corporates heavily relying on secure authentication mechanisms that can use the helm chart easily :). 
   
   We have now:
   * Possibility to rebuild all the dependent images by the customer
   * Kerberos support for authentication of workers including Kerberos sidecar
   * Presto DB supporting kerboeros
   * PostgreSQL mTLS authentication
   * Hadoop/Spark seamless integration (including Kerberos authentication)
   * Workfloo Identity capability via annotations
   * Capability of building the Airflow custom image using vetted binary dependencies on an air-gaped system
   * Custom Cluster Policy Violation per-Dag
   
   All this tested with a real Enterprise customer, based on their expectations and the limitations they had.
   
   I'd say all together this might be a super-powerful message to Enterprise customers once we release the Helm Chart. In many ways with all those changes, it became production-ready from the Enterrprise customers point of view.
   
   We need MySQL support (But I believe someone is working on it), support for multiple schedulers for 2.0, a bit more comprehensive tests and I think we should be getting closer to releasing it as "production-ready" @dimberman.
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706310698


   cc: @jaketf 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb closed pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
ashb closed pull request #11383:
URL: https://github.com/apache/airflow/pull/11383


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on pull request #11383: Spark Integration Configuration in Helm Chart

Posted by GitBox <gi...@apache.org>.
dimberman commented on pull request #11383:
URL: https://github.com/apache/airflow/pull/11383#issuecomment-706742629


   https://github.com/apache/airflow/pull/11426 this change will make it pretty straightforward to add volumes as needed for hadoop-related stuff. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org