Posted to issues@spark.apache.org by "Burak Yavuz (JIRA)" <ji...@apache.org> on 2014/09/22 19:34:33 UTC
[jira] [Commented] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143484#comment-14143484 ]
Burak Yavuz commented on SPARK-3631:
------------------------------------
Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen] could you please confirm/correct/add to my explanation above. Thanks!
> Add docs for checkpoint usage
> -----------------------------
>
> Key: SPARK-3631
> URL: https://issues.apache.org/jira/browse/SPARK-3631
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.1.0
> Reporter: Andrew Ash
> Assignee: Andrew Ash
>
> We should include general documentation on using checkpoints. Right now the docs only cover checkpoints in the Spark Streaming use case which is slightly different from Core.
> Some content to consider for inclusion from [~brkyvz]:
> {quote}
> If you set the checkpointing directory, however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick up from there.
> You won't need to keep the shuffle data from before the checkpointed state, so it can be safely removed (it will be removed automatically).
> However, checkpoint must be called explicitly, as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 ; just setting the directory will not be enough.
> {quote}
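> The two-step usage described above (set the directory, then explicitly mark and materialize the RDD) might be sketched as follows. This is an illustrative sketch, not text from the issue; the app name and HDFS path are placeholders:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
>
> object CheckpointSketch {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
>
>     // Step 1: set the checkpoint directory. On its own this does nothing;
>     // no RDD is checkpointed until checkpoint() is called on it.
>     sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder path
>
>     val rdd = sc.parallelize(1 to 1000).map(_ * 2)
>
>     // Step 2: mark the RDD for checkpointing. Still lazy at this point.
>     rdd.checkpoint()
>
>     // Step 3: run an action. This materializes the RDD, writes it to the
>     // checkpoint directory, and truncates its lineage.
>     rdd.count()
>
>     sc.stop()
>   }
> }
> {code}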
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus of checkpointing; it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(): until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row without checkpointing in the meantime (say, before any shuffle happens), you will still get a StackOverflowError, because the lineage is too long.
> I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS, and resets all its dependencies, so you start
> a fresh lineage. My understanding is that checkpointing should still be done every N operations to reset the lineage, and that an action must be
> performed before the lineage grows too long.
> {quote}
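> The "checkpoint every N operations, then force an action" pattern described above might look like the following sketch. The interval value and iteration count are made up for illustration; in practice they would be tuned per job (compare ALS's checkpointInterval):
> {code}
> // Assumes an existing SparkContext `sc` with a checkpoint directory set.
> val checkpointInterval = 50  // hypothetical value; tune per job
>
> var data = sc.parallelize(1 to 1000)
> for (i <- 1 to 1000) {
>   data = data.map(_ + 1)  // each iteration adds a lineage step
>   if (i % checkpointInterval == 0) {
>     data.checkpoint()  // mark for checkpointing (lazy, like cache())
>     data.count()       // action materializes the RDD and truncates the lineage
>   }
> }
> {code}
> Without the periodic count() (or some other action), the checkpoint() calls never take effect and the lineage keeps growing.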
> A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org