Posted to user@spark.apache.org by Jason White <ja...@shopify.com> on 2017/06/29 16:37:14 UTC

spark.pyspark.python is ignored?

According to the documentation, `spark.pyspark.python` configures which
python executable is run on the workers. It seems to be ignored in my
simple test case. I'm running a pip-installed PySpark 2.1.1, completely
stock. The only customization at this point is my Hadoop configuration
directory.

In the code below, the `PYSPARK_PYTHON` value is used, so `session` is a
functioning SparkSession. However, it shouldn't be: `spark.pyspark.python`
is set to a nonsense value, and that setting should take priority. If I
take out the env variable, it just loads python2 - this value doesn't
appear to have any impact for me.

Any suggestions?


import os
import pprint
import pyspark

ip = '10.30.50.73'

# Point at my Hadoop/YARN config and set the worker python via the env var.
conf_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'conf',
                                        'cloudera.yarn'))
os.environ['YARN_CONF_DIR'] = conf_dir
os.environ['HADOOP_CONF_DIR'] = conf_dir
os.environ['PYSPARK_PYTHON'] = '/u/pyenv/versions/3.6.1/bin/python3'

# spark.pyspark.python is deliberately a nonsense value; if it took priority
# over PYSPARK_PYTHON as expected, session creation should fail.
config = pyspark.SparkConf(loadDefaults=False)
config.set('spark.driver.host', ip)
config.set('spark.master', 'yarn')
config.set('spark.submit.deployMode', 'client')
config.set('spark.pyspark.python', 'foo/bar')

spark_builder = pyspark.sql.SparkSession.builder.config(conf=config)
session = spark_builder.getOrCreate()

# Print the effective configuration the context actually ended up with.
context = session.sparkContext
config_string = pprint.pformat({key: value for key, value in
                                context.getConf().getAll()})
print(config_string)

import IPython
IPython.embed()
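
For what it's worth, a quick way to confirm which interpreter the executors
actually pick up (just a rough sketch against the `session` above) is to have
a task report `sys.executable` and compare it with the driver's:

import sys

# Run a couple of trivial tasks and collect the interpreter path each
# executor reports. (Sketch only: assumes `session` above came up cleanly.)
worker_pythons = (
    session.sparkContext
    .parallelize(range(2), 2)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print('driver python: ', sys.executable)
print('worker pythons:', worker_pythons)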



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-pyspark-python-is-ignored-tp28808.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org