Posted to commits@airflow.apache.org by "Joseph McCartin (Jira)" <ji...@apache.org> on 2019/12/11 20:18:00 UTC

[jira] [Commented] (AIRFLOW-5744) Environment variables not correctly set in Spark submit operator

    [ https://issues.apache.org/jira/browse/AIRFLOW-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993869#comment-16993869 ] 

Joseph McCartin commented on AIRFLOW-5744:
------------------------------------------

The fix is fairly simple, but it is unclear in which cases the '_env_vars' variable should be handed down to the Popen process.

*yarn:* [from the docs|https://spark.apache.org/docs/latest/running-on-yarn.html] _"Unlike other cluster managers supported by Spark in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration."_ That configuration is located via one or more of the env vars (e.g. HADOOP_CONF_DIR).

*k8s:* the master is set in the spark-submit arguments in the form _k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>_, not in the Hadoop configuration [link to documentation|https://spark.apache.org/docs/latest/running-on-kubernetes.html].

To minimise disruption and avoid unwanted environment variables being present at runtime, it is probably best to add this behaviour only for the yarn case; extending it to the k8s case later should be trivial.
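As a rough illustration, the yarn-only behaviour could look like the following. This is a minimal sketch, not the actual hook code: `build_popen_kwargs` and its arguments are hypothetical names.

```python
import os


def build_popen_kwargs(env_vars, master):
    # Hypothetical helper: forward user-supplied env vars to the
    # spark-submit subprocess only for YARN, where the ResourceManager's
    # address is resolved from Hadoop configuration that is itself
    # located via env vars such as HADOOP_CONF_DIR.
    kwargs = {}
    if env_vars and master.startswith("yarn"):
        env = os.environ.copy()
        env.update(env_vars)
        kwargs["env"] = env
    return kwargs
```

The result would then be splatted into the existing call, e.g. subprocess.Popen(spark_submit_cmd, **build_popen_kwargs(self._env_vars, self._connection['master'])), leaving all non-YARN cases untouched.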

> Environment variables not correctly set in Spark submit operator
> ----------------------------------------------------------------
>
>                 Key: AIRFLOW-5744
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5744
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: contrib, operators
>    Affects Versions: 1.10.5
>            Reporter: Joseph McCartin
>            Priority: Trivial
>
> AIRFLOW-2380 added support for setting environment variables at runtime for the SparkSubmitOperator. The intention was to allow for dynamic configuration paths (such as HADOOP_CONF_DIR). The pull request, however, only set these env vars at runtime when a standalone cluster with a client deploy mode was chosen. For kubernetes and yarn modes, the env vars would instead be sent to the driver via the spark argument _spark.yarn.appMasterEnv_ (and its equivalent for k8s).
> If one wishes to dynamically set the yarn master address (via a _yarn-site.xml_ file), then one or more environment variables need to be present at runtime, and this is not currently done.
> The SparkSubmitHook class var `_env` is assigned the `_env_vars` variable from the SparkSubmitOperator in the `_build_spark_submit_command` method. When running in YARN mode, however, this assignment does not happen as it should, and therefore `_env` is never passed to the Popen process.
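
The conf-based path described in the quoted report can be sketched roughly as follows. This is a hypothetical illustration, not the hook's actual code; `env_vars_to_spark_conf` is an invented name, and the k8s property prefix `spark.kubernetes.driverEnv.` is taken from my reading of the Spark configuration docs.

```python
def env_vars_to_spark_conf(env_vars, master):
    # Hypothetical sketch: for yarn and k8s masters, env vars reach the
    # driver as --conf arguments on the spark-submit command line rather
    # than being set in the spark-submit process's own environment.
    args = []
    if master.startswith("yarn"):
        prefix = "spark.yarn.appMasterEnv."
    elif master.startswith("k8s"):
        prefix = "spark.kubernetes.driverEnv."
    else:
        return args
    for key, value in env_vars.items():
        args += ["--conf", f"{prefix}{key}={value}"]
    return args
```

Note that this path sets the variables in the driver's environment only, which is why config located via HADOOP_CONF_DIR is never visible to the local spark-submit process itself.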



--
This message was sent by Atlassian Jira
(v8.3.4#803005)