Posted to issues@spark.apache.org by "Burak Yavuz (JIRA)" <ji...@apache.org> on 2014/09/22 19:34:33 UTC

[jira] [Commented] (SPARK-3631) Add docs for checkpoint usage

    [ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143484#comment-14143484 ] 

Burak Yavuz commented on SPARK-3631:
------------------------------------

Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen], could you please confirm, correct, or add to my explanation above? Thanks!

> Add docs for checkpoint usage
> -----------------------------
>
>                 Key: SPARK-3631
>                 URL: https://issues.apache.org/jira/browse/SPARK-3631
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Andrew Ash
>
> We should include general documentation on using checkpoints.  Right now the docs only cover checkpoints in the Spark Streaming use case, which differs slightly from checkpointing in Spark Core.
> Some content to consider for inclusion from [~brkyvz]:
> {quote}
> If you set the checkpointing directory, however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick up from there.
> You won't need to keep the shuffle data from before the checkpointed state, so it can be safely removed (and will be removed automatically).
> However, checkpoint must be called explicitly, as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 ; just setting the directory is not enough.
> {quote}
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay compared to hitting a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus of checkpointing; it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(): until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row and you don't checkpoint in the meantime (say, you wait for a shuffle to happen), you will still get a StackOverflowError, because the lineage is too long.
> I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS and resets all of the RDD's dependencies, so you start
> a fresh lineage. My understanding is that checkpointing should still be done every N operations to reset the lineage, and an action must be
> performed before the lineage grows too long.
> {quote}
> A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html
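
The first quoted point above — setting the directory alone does nothing; checkpoint() must be called explicitly, and even then nothing is written until an action runs — can be sketched as a minimal local example. The app name, the temp directory, and the local[*] master are illustrative only; on a real cluster setCheckpointDir would point at an HDFS path:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.nio.file.Files

val sc = new SparkContext(
  new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

// Setting the directory enables checkpointing but writes nothing by itself.
sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()                        // only *marks* the RDD for checkpointing
val n = rdd.count()                     // the first action actually writes to the dir
val wasCheckpointed = rdd.isCheckpointed // true only after the action ran
sc.stop()
```

Before the count(), isCheckpointed would have returned false — which is exactly the .cache()-like laziness described in the second quote.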
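
The "checkpoint every N operations" advice in the second quote can be sketched as below. checkpointInterval is a hypothetical value chosen for illustration; the right interval depends on the job, and each checkpoint needs an action after it to materialize the data and truncate the lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import java.nio.file.Files

val sc = new SparkContext(
  new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))
sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

val checkpointInterval = 10 // hypothetical; tune for the job at hand
var rdd: RDD[Int] = sc.parallelize(1 to 5)

// Many narrow transformations in a row grow the lineage without any shuffle,
// so we periodically checkpoint and force an action to reset it.
for (i <- 1 to 30) {
  rdd = rdd.map(_ + 1)
  if (i % checkpointInterval == 0) {
    rdd.checkpoint()
    rdd.count() // action materializes the checkpoint; lineage restarts here
  }
}

val result = rdd.collect().toSeq // five elements, each incremented 30 times
sc.stop()
```

Without the periodic checkpoint-plus-action, a long enough chain of maps would eventually hit the StackOverflowError described above, because the full lineage must be traversed at every action.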



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org