Posted to commits@airflow.apache.org by "Ken Melms (JIRA)" <ji...@apache.org> on 2019/01/07 23:01:00 UTC
[jira] [Created] (AIRFLOW-3647) Contributed SparkSubmitOperator doesn't honor --archives configuration
Ken Melms created AIRFLOW-3647:
----------------------------------
Summary: Contributed SparkSubmitOperator doesn't honor --archives configuration
Key: AIRFLOW-3647
URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
Project: Apache Airflow
Issue Type: Improvement
Components: contrib
Affects Versions: 1.10.1
Environment: Linux (RHEL 7)
Python 3.5 (using a virtual environment)
spark-2.1.3-bin-hadoop2.6
Airflow 1.10.1
CDH 5.14 Hadoop (YARN) cluster (no end-user / dev modifications allowed)
Reporter: Ken Melms
The contributed SparkSubmitOperator has no way to pass the spark-submit option "--archives". That option is treated subtly differently from "--files" or "--py-files": Spark unzips the archive into the application's working directory, and an optional alias (appended after "#") names the unzipped folder so it can be referenced elsewhere in the submission.
For example:
spark-submit --archives=hdfs:////user/someone/python35_venv.zip#PYTHON --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" run_me.py
In our case, this behavior lets multiple Python virtual environments be sourced from HDFS without paying the cost of pushing the whole virtual environment to the cluster on every submission. It solves (for us) the problem of running Python-based Spark jobs on a cluster where end users cannot choose which Python modules are installed.
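A possible interim workaround, not part of the original report: on YARN, Spark exposes "--archives" as the configuration key "spark.yarn.dist.archives", which the operator's generic conf dict can carry even though there is no dedicated archives parameter. The sketch below only assembles the equivalent spark-submit command line to show the mapping; the helper name and paths are illustrative, not Airflow API.

```python
def build_spark_submit_args(application, conf=None):
    """Assemble a spark-submit command line from a conf dict.

    Each conf entry becomes a --conf key=value pair, mirroring how a
    SparkSubmitOperator-style conf dict would reach spark-submit.
    """
    args = ["spark-submit"]
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(application)
    return args


# Equivalent of:
#   spark-submit --archives hdfs:///user/someone/python35_venv.zip#PYTHON \
#     --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" run_me.py
# "spark.yarn.dist.archives" is Spark's YARN-mode config equivalent of --archives;
# the "#PYTHON" suffix is the alias for the unzipped folder.
conf = {
    "spark.yarn.dist.archives": "hdfs:///user/someone/python35_venv.zip#PYTHON",
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./PYTHON/python35/bin/python3",
}
cmd = build_spark_submit_args("run_me.py", conf)
```

This only helps on YARN deployments; a proper fix would add an archives parameter to the operator and hook so the flag works on all cluster managers.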
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)