Posted to user@spark.apache.org by Andrew Ash <an...@andrewash.com> on 2014/01/03 07:33:44 UTC

Is spark-env.sh supposed to be stateless?

In my spark-env.sh I append to the SPARK_CLASSPATH variable rather than
overriding it, because I want to support both adding a jar to all instances
of a shell (in spark-env.sh) and adding a jar to a single shell instance
(SPARK_CLASSPATH=/path/to/my.jar /path/to/spark-shell).

That looks like this:

# spark-env.sh
export SPARK_CLASSPATH+=":/path/to/hadoop-lzo.jar"

However, when my Master and workers run, they have duplicates of the
SPARK_CLASSPATH jars.  There are 3 copies of hadoop-lzo on the classpath, 2
of which are unnecessary.

The resulting command line in ps looks like this:
/path/to/java -cp
:/path/to/hadoop-lzo.jar:/path/to/hadoop-lzo.jar:/path/to/hadoop-lzo.jar:[core
spark jars] ... -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker
spark://my-host:7077

I tracked it down and the problem is that spark-env.sh is sourced 3 times:
in spark-daemon.sh, in compute-classpath.sh, and in spark-class.  Each of
those adds to the SPARK_CLASSPATH until its contents are in triplicate.

Are all of those calls necessary?  Is it possible to edit the daemon
scripts to only call spark-env.sh once?
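
One shell pattern that would get it down to once is a sentinel variable, so
that only the first script to reach spark-env.sh actually sources it.  A
rough sketch (SPARK_ENV_LOADED and the conf path here are only illustrative,
not names the current scripts define):

# sketch: in each script that currently sources spark-env.sh,
# skip the sourcing if an earlier script has already done it
if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1
  if [ -f "${SPARK_HOME}/conf/spark-env.sh" ]; then
    . "${SPARK_HOME}/conf/spark-env.sh"
  fi
fi

With something like that in place it wouldn't matter how many of
spark-daemon.sh, compute-classpath.sh, and spark-class end up running the
snippet.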

FYI I'm starting the daemons with ./bin/start-master.sh and
./bin/start-slave.sh 1 $SPARK_URL
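
In the meantime, a workaround on the spark-env.sh side would be to only
append the jar when it isn't already on the classpath, so sourcing the file
repeatedly is harmless.  A bash-specific sketch, with the same placeholder
path as above:

# spark-env.sh -- append hadoop-lzo only if it is not already present
# (LZO_JAR is just a local helper name for the placeholder path)
LZO_JAR="/path/to/hadoop-lzo.jar"
if [[ ":${SPARK_CLASSPATH}:" != *":${LZO_JAR}:"* ]]; then
  export SPARK_CLASSPATH="${SPARK_CLASSPATH}:${LZO_JAR}"
fi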

Thanks,
Andrew

Re: Is spark-env.sh supposed to be stateless?

Posted by Christopher Nguyen <ct...@adatao.com>.
How about this: https://github.com/apache/incubator-spark/pull/326

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Jan 2, 2014 at 11:07 PM, Matei Zaharia <ma...@gmail.com> wrote:

> I agree that it would be good to do it only once, if you can find a nice
> way of doing so.
>
> Matei

Re: Is spark-env.sh supposed to be stateless?

Posted by Matei Zaharia <ma...@gmail.com>.
I agree that it would be good to do it only once, if you can find a nice way of doing so.

Matei
