Posted to user@spark.apache.org by redocpot <ju...@gmail.com> on 2014/07/31 13:58:11 UTC

set spark.local.dir on driver program doesn't take effect

Hi,

When running Spark on an EC2 cluster, I find that setting spark.local.dir in the
driver program doesn't take effect.

INFO:
- standalone mode
- cluster launched via the Python script shipped with Spark (spark-ec2)
- instance type: r3.large
- EBS attached (using persistent-hdfs)
- Spark version: 1.0.0, prebuilt for Hadoop 1; sbt downloaded
- program run with: sbt package run

*Here is my setting:*

import org.apache.spark.SparkConf

val conf = new SparkConf()
    .setAppName("RecoSys")
    .setMaster(masterURL)
    .set("spark.local.dir", "/mnt")
    .set("spark.executor.memory", "10g")
    .set("spark.logConf", "true")
    .setJars(Seq("target/scala-2.10/recosys_2.10-0.1.jar"))

After checking the log, I find this:

14/07/31 08:46:04 INFO spark.SparkContext: Spark configuration:
spark.app.name=RecoSys
spark.executor.memory=10g
spark.jars=target/scala-2.10/recosys_2.10-0.1.jar
spark.local.dir=/mnt
spark.logConf=true

The Environment tab on port 4040 shows the same thing, so it looks like
"spark.local.dir=/mnt" is being used.

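For reference, reading the setting back on the driver (a quick sketch, assuming
sc is the SparkContext built from the conf above) only confirms the driver-side
value; it says nothing about what the executors actually use:

// Read the setting back from the driver-side configuration.
// This reflects what the driver thinks, not the directories the
// executors end up spilling to.
println(sc.getConf.get("spark.local.dir"))
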
My program needs to persist an RDD with StorageLevel.MEMORY_AND_DISK, so some
data will spill to local.dir. It is supposed to store the RDD *ONLY* under /mnt.
However, I find a big spark/ directory in /mnt2.

[root@ip-10-186-147-175 mnt2]$ du -ah --max-depth=1 | sort -n
2.4G	.
2.4G	./spark
32K	./ephemeral-hdfs
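
For reference, the persist in my program looks roughly like this (a simplified
sketch; the input path and RDD name below are made up):

import org.apache.spark.storage.StorageLevel

// Keep the RDD in memory, spilling partitions that don't fit to the
// directories configured via spark.local.dir / SPARK_LOCAL_DIRS.
val ratings = sc.textFile("hdfs:///data/ratings")
ratings.persist(StorageLevel.MEMORY_AND_DISK)
ratings.count()  // forces materialization, so spilled blocks hit disk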

Since */mnt/spark* and */mnt2/spark* are the default local.dir values set in
spark-env.sh, I am quite sure that my local.dir setting in the driver program
is not being used. So I think spark-env.sh overrides my settings in the driver
program. (Can anyone confirm this?)
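
A quick way to check what the executors themselves see (a rough sketch; it just
prints the SPARK_LOCAL_DIRS environment variable inside each executor JVM):

// If this prints the worker-side directories instead of /mnt, the worker
// environment is overriding the driver's spark.local.dir setting.
sc.parallelize(1 to 100, sc.defaultParallelism)
  .map(_ => sys.env.getOrElse("SPARK_LOCAL_DIRS", "<not set>"))
  .distinct()
  .collect()
  .foreach(println)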

So I changed it in spark-env.sh like this:
# export SPARK_LOCAL_DIRS="/mnt/spark, /mnt2/spark"
export SPARK_LOCAL_DIRS="/mnt/spark"

After re-running the program, nothing changed: /mnt2/spark still filled with
data. It seems that editing spark-env.sh alone cannot change environment
variables that were already loaded when the cluster was booted.

My workaround is: change spark-env.sh -> restart all Spark daemons in the
cluster -> re-run the program. This time it works: the RDD is stored only
under /mnt.

This is quite different from what I read at
http://spark.apache.org/docs/latest/configuration.html:
"In Standalone and Mesos modes, this file can give machine specific
information such as hostnames. *It is also sourced when running local Spark
applications or submission scripts*."
According to what I observed, spark-env.sh is not sourced when running local
Spark applications.

So here are my questions:
1) When exactly is spark-env.sh loaded? (maybe show me some code in the block
manager)
2) Does the config loaded from spark-env.sh overwrite config set in the driver
program?
3) Which config is used: the one in the driver program or the one in
spark-env.sh?
4) When should I use config in the driver program? When is spark-env.sh useful?

Thank you.

Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-local-dir-on-driver-program-doesn-t-take-effect-tp11040.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: set spark.local.dir on driver program doesn't take effect

Posted by gphil <gp...@gphil.net>.
Hao --

Did you ever figure this out? I just ran into the same issue, changed
spark-env.sh and got it working, but I'd much rather keep this configuration
in my application code.

-- Greg 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-local-dir-on-driver-program-doesn-t-take-effect-tp11040p14477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org