Posted to user@spark.apache.org by Diana Carroll <dc...@cloudera.com> on 2014/01/16 22:28:33 UTC

python on YARN

A few weeks ago there was a question about whether you could run a Python
Spark program on YARN.  I didn't see anyone answer except Josh Rosen, who
suggested "maybe yarn-client" and filed a JIRA:
https://spark-project.atlassian.net/browse/SPARK-1004

I needed to get this working, so I played around, and I think I figured it
out.  I have a very simple Python program called test1.py.  I'm using
CDH5.0b1 for YARN installed in "pseudo-distributed" mode,
and spark-0.8.1-incubating-bin-hadoop2 for Spark.

Given that, this command worked:
SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=~/testdata.txt \
pyspark test1.py

In the program, I set up my sc as:
sc = SparkContext("yarn-client", "Simple App")

And...that's it.
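
For anyone who wants to try the same thing, a minimal test1.py could look
something like this (the input path and the count action are just
placeholders, not necessarily what mine does):

from pyspark import SparkContext

# "yarn-client" as the master string runs the driver locally and asks
# YARN to launch the executors
sc = SparkContext("yarn-client", "Simple App")

# any small action works as a smoke test; "testdata.txt" is a
# placeholder path
lines = sc.textFile("testdata.txt")
print(lines.count())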

The SPARK_YARN_APP_JAR file is irrelevant.  I happened to point it to a
random text file I had lying around, because Spark complains if the variable
is not set or if it points to a non-existent file, but as far as I can
tell the contents of the file aren't used in any way.
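
In other words, if that behavior holds, any file that exists should satisfy
the check, e.g. an empty dummy file created on the spot (the path here is
made up):

touch /tmp/dummy-app.jar
SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=/tmp/dummy-app.jar \
pyspark test1.py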

I haven't yet tried this on a real cluster...I'm a little worried!

Diana
Senior Curriculum Developer @ Cloudera