Posted to commits@spark.apache.org by an...@apache.org on 2015/06/19 20:03:09 UTC
spark git commit: [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Repository: spark
Updated Branches:
refs/heads/master 68a2dca29 -> 4be53d039
[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Author: Sean Owen <so...@cloudera.com>
Closes #6901 from srowen/SPARK-5836 and squashes the following commits:
a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4be53d03
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4be53d03
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4be53d03
Branch: refs/heads/master
Commit: 4be53d0395d3c7f61eef6b7d72db078e2e1199a7
Parents: 68a2dca
Author: Sean Owen <so...@cloudera.com>
Authored: Fri Jun 19 11:03:04 2015 -0700
Committer: Andrew Or <an...@databricks.com>
Committed: Fri Jun 19 11:03:04 2015 -0700
----------------------------------------------------------------------
docs/programming-guide.md | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/4be53d03/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index d5ff416..ae712d6 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
to disk, incurring the additional overhead of disk I/O and increased garbage collection.
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
-are not cleaned up from Spark's temporary storage until Spark is stopped, which means that
-long-running Spark jobs may consume available disk space. This is done so the shuffle doesn't need
-to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the
+are preserved until the corresponding RDDs are no longer used and are garbage collected.
+This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
+Garbage collection may happen only after a long period time, if the application retains references
+to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
+consume a large amount of disk space. The temporary storage directory is specified by the
`spark.local.dir` configuration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
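
Note on the behavior described in the diff above: the new wording ties the lifetime of shuffle files to garbage collection of the RDDs that produced them. The following is a toy model of that idea in plain Python, not Spark code and not Spark's actual cleaner. The names ToyDataset, shuffle_files and demo are hypothetical, and a temporary directory stands in for `spark.local.dir`. It only shows that a file tied to an object's lifetime stays on disk for as long as the application holds a reference to that object.

```python
import gc
import os
import tempfile
import weakref


class ToyDataset:
    """Hypothetical stand-in for an RDD whose shuffle wrote a file to disk."""

    def __init__(self, local_dir):
        # Write a fake "shuffle file" into the temporary storage directory.
        fd, self.shuffle_path = tempfile.mkstemp(prefix="shuffle_", dir=local_dir)
        os.close(fd)
        # Remove the file only once this object has been garbage collected.
        # The callback receives the path string, so it holds no reference to self.
        weakref.finalize(self, os.remove, self.shuffle_path)


def shuffle_files(local_dir):
    """List the fake shuffle files currently present in local_dir."""
    return sorted(f for f in os.listdir(local_dir) if f.startswith("shuffle_"))


def demo():
    """Return the file counts while referenced and after collection."""
    with tempfile.TemporaryDirectory() as local_dir:
        ds = ToyDataset(local_dir)
        while_referenced = len(shuffle_files(local_dir))
        del ds        # drop the only reference held by the application
        gc.collect()  # ask the collector to run now
        after_collection = len(shuffle_files(local_dir))
    return while_referenced, after_collection
```

In this sketch the file disappears only after the last reference is dropped and the object is collected. If the application kept `ds` alive, or the collector ran rarely, the file would remain on disk, which is the disk-usage concern the revised paragraph describes for long-running jobs.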