Posted to user@spark.apache.org by Andrew Lee <al...@hotmail.com> on 2014/06/18 20:05:12 UTC
HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Hi All,
Has anyone run into the same problem? Looking at the source code in the official release (rc11), this property is set to false by default; however, I'm seeing the .sparkStaging folder remain on HDFS, filling up the disk quickly, since SparkContext uploads the fat JAR file (~115 MB) for every job and it is never cleaned up.
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala: val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx------   - test users          0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx------   - test users          0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx------   - test users          0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx------   - test users          0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx------   - test users          0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx------   - test users          0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx------   - test users          0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx------   - test users          0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx------   - test users          0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx------   - test users          0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx------   - test users          0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
…
drwx------   - test users          0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx------   - test users          0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx------   - test users          0 2014-06-16 15:16 .sparkStaging/application_1402001910637_0136
drwx------   - test users          0 2014-06-16 15:46 .sparkStaging/application_1402001910637_0138
drwx------   - test users          0 2014-06-16 23:57 .sparkStaging/application_1402001910637_0157
drwx------   - test users          0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161
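As a stopgap while the root cause is unclear, leftover staging directories can be pruned by parsing the listing and selecting entries older than a cutoff date. A minimal sketch (the cutoff date is an assumption to adapt, and the final removal step is left to you to run by hand after verifying the paths):

```shell
# Sketch: select .sparkStaging entries older than a cutoff date from a
# captured `hdfs dfs -ls` listing. Field 6 of each entry is the ISO date,
# so a plain string comparison works in awk. Once verified, the printed
# paths could be piped to `hdfs dfs -rm -r -f` (hypothetical cleanup step,
# not executed here).
listing='drwx------   - test users          0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx------   - test users          0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161'
echo "$listing" | awk -v cutoff="2014-06-01" '$6 < cutoff { print $NF }'
```

This prints only the first path, since its date sorts before the cutoff.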
Is this something that needs to be explicitly set, e.g.:
SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false"
The documentation (http://spark.apache.org/docs/latest/running-on-yarn.html) describes the property as:
spark.yarn.preserve.staging.files (default: false): Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
Or is this a bug where the default value is not honored and is overridden to true somewhere?
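One thing worth noting: SPARK_YARN_USER_ENV is documented as a list of environment variables to add to the YARN processes, not a way to set Spark configuration properties, so putting the property there likely has no effect. A way to set it as a real Spark property is through the defaults file that spark-submit reads at launch. A sketch, assuming a conf/spark-defaults.conf relative to SPARK_HOME (false is already the documented default, so this only makes the intent explicit):

```shell
# Write the property into spark-defaults.conf so spark-submit picks it up
# as a Spark configuration property. The conf/ path is an assumption;
# adjust to your Spark installation.
mkdir -p conf
echo "spark.yarn.preserve.staging.files false" >> conf/spark-defaults.conf
grep "spark.yarn.preserve.staging.files" conf/spark-defaults.conf
```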
Thanks.
RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Posted by Andrew Lee <al...@hotmail.com>.
I checked the source code; it looks like this was re-added as part of JIRA SPARK-1588, but I don't know whether there is any test case associated with it:
SPARK-1588. Restore SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS for YARN.
Sandy Ryza <sa...@cloudera.com>
2014-04-29 12:54:02 -0700
Commit: 5f48721, github.com/apache/spark/pull/586
From: alee526@hotmail.com
To: user@spark.apache.org
Subject: RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date: Wed, 18 Jun 2014 11:24:36 -0700
Forgot to mention that I am using spark-submit to submit jobs; a verbose-mode printout with the SparkPi example looks like the following. The .sparkStaging directory is not deleted. My thought is that it should be treated as part of the staging area and cleaned up as well when the SparkContext terminates.
[test@spark]$ SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false" \
    SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar \
    ./bin/spark-submit --verbose \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --driver-memory 512M \
    --driver-library-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar \
    --executor-memory 512M \
    --executor-cores 1 \
    --queue research \
    --num-executors 2 \
    examples/target/spark-examples_2.10-1.0.0.jar
Using properties file: null
Using properties file: null
Parsed arguments:
master yarn
deployMode cluster
executorMemory 512M
executorCores 1
totalExecutorCores null
propertiesFile null
driverMemory 512M
driverCores null
driverExtraClassPath null
driverExtraLibraryPath /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
driverExtraJavaOptions null
supervise false
queue research
numExecutors 2
files null
pyFiles null
archives null
mainClass org.apache.spark.examples.SparkPi
primaryResource file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
name org.apache.spark.examples.SparkPi
childArgs []
jars null
verbose true
Default properties from null:
Using properties file: null
Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--jar
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
--class
org.apache.spark.examples.SparkPi
--name
org.apache.spark.examples.SparkPi
--driver-memory
512M
--queue
research
--num-executors
2
--executor-memory
512M
--executor-cores
1
System properties:
spark.driver.extraLibraryPath -> /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
SPARK_SUBMIT -> true
spark.app.name -> org.apache.spark.examples.SparkPi
Classpath elements: