Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2014/09/20 21:51:26 UTC
understanding rdd pipe() and bin/spark-submit --master
Hi
I am new to Spark and started writing some simple test code to figure out
how things work. I am very interested in Spark Streaming and Python. It
appears that streaming is not supported in Python yet. The workaround I
found by googling is to write your streaming code in either Scala or Java
and use RDD pipe() to fetch the data into your Python app. I do not think I
am getting parallel execution. In my current experiment I am using a
MacBook Pro with 8 cores. I wrote a Java job that processes a small data
file on disk, applies a couple of transformations, and writes the data to
standard out. I have a simple Python script that calls pipe():
import sys
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="pySparkPipeJava")

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
output = distData.pipe("../bin/runJavaJob.sh").collect()

# pipe() returns strings
for num in output:
    print "pySpark: value from java job %s" % (num)
I submit the Python job as follows:
$ $SPARK_HOME/bin/spark-submit --master local[1] pySparkPipeJava.py
Everything works as expected as long as I only use a single core. If I use 4
cores, I get back 4 copies of the data. My understanding is that the shell
script will execute on all the workers. In general I want all the transforms
and actions to run in parallel; however, I only want to process a single set
of data in my Python script.
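For illustration, here is a minimal sketch, reusing the sc from the script
above, of what I believe pipe() is doing: it forks the external command once
per partition. (/bin/cat is just an illustrative stand-in that echoes its
stdin back.)

rdd = sc.parallelize([1, 2, 3, 4, 5], 4)

# glom() shows how the elements are grouped into partitions; pipe()
# forks the external command once for each of these groups.
print rdd.glom().collect()            # e.g. [[1], [2], [3], [4, 5]]

# /bin/cat reads stdin and echoes it back, so the union of the 4
# invocations is still a single copy of the data. A command that
# ignores stdin and writes a fixed payload would instead return that
# payload once per partition -- 4 copies with 4 partitions.
print rdd.pipe("/bin/cat").collect()  # ['1', '2', '3', '4', '5']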
Here is the code for runJavaJob.sh:
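# Note: pipe() forks this script once per input partition, so each
# partition launches a complete nested spark-submit application.
# ($jarPath is assumed to be set elsewhere to the jar containing myJavaSrc.)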
$SPARK_HOME/bin/spark-submit \
  --class "myJavaSrc" \
  --master local[4] \
  $jarPath
What is really strange is that if I replace runJavaJob.sh with a simple shell
script that basically just echoes its input, I can run with 4 cores and only
get back one set of data. Any idea what the difference is? (My Java child
does not read anything from the Python app; the echo script does.)
I tried changing the number of cores in runJavaJob.sh, but that does not seem
to change how much data I get back.
Being stuck on a single core would be severely limiting.
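If it helps, here is a minimal sketch of what I have gathered so far: the
number of times the piped script runs is set by the parent RDD's partition
count, not by the cores given to the child. (The numbers and /bin/cat are
illustrative.)

rdd = sc.parallelize(range(100), 8)  # 8 partitions -> 8 pipe() invocations

# Coalescing to one partition just before pipe() should yield a single
# invocation of the external script, while the upstream map() still
# runs in parallel across all 8 partitions.
piped = rdd.map(lambda x: x * 2).coalesce(1).pipe("/bin/cat")
print len(piped.glom().collect())    # 1 -- one partition, one invocation
print len(piped.collect())           # 100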
Here is the code for my "echo" script:
#!/bin/sh
#
# Use this shell script to figure out how spark RDD pipe() works
#
#set -x # turns shell debugging on
#set +x # turns shell debugging off
while read x ;
do
    echo "RDDPipe.sh $x" ;
done
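For comparison, a rough Python analogue of what pipe() does with this echo
script (fake_pipe is a made-up stand-in, not part of the Spark API): each
partition's elements are fed, one per line, to a single forked process, and
that process's stdout lines become the partition's new elements.

# Hypothetical stand-in that mimics pipe()'s per-partition contract.
def fake_pipe(iterator):
    for x in iterator:
        yield "RDDPipe.sh %s" % x

# Because the echo script only transforms what it reads from stdin,
# 4 partitions still produce exactly one copy of the data overall;
# a child that never reads stdin would contribute its full output
# once per partition, which would explain the 4 copies above.
print distData.mapPartitions(fake_pipe).collect()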
Thanks in advance
Andy