Posted to issues@spark.apache.org by "Baris ERGUN (JIRA)" <ji...@apache.org> on 2018/08/23 21:47:00 UTC

[jira] [Comment Edited] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

    [ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590733#comment-16590733 ] 

Baris ERGUN edited comment on SPARK-8582 at 8/23/18 9:46 PM:
-------------------------------------------------------------

+1 When is this issue planned to be resolved?

I am facing it on Spark 2.3.1 when using the Dataset API. It has been a long time since this was reported, and against an old version. Is this perhaps more complex than it seems? Thanks for the help


was (Author: bergun):
+1 when this issue is planned to be resolved. I am facing it on Spark 2.3.1 when using with Dataset Api. It has been long time and old version since it has been reported? Is this maybe more complex than it seems? Thanks for the help

> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>
>                 Key: SPARK-8582
>                 URL: https://issues.apache.org/jira/browse/SPARK-8582
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Shixiong Zhu
>            Priority: Major
>
> In Spark, checkpointing allows the user to truncate the lineage of an RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented very efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). See this line for more detail: https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.)
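The double-computation described above, and the caching workaround, can be illustrated with a small toy model. This is not Spark code, just a sketch: `compute_partition`, `partitions`, and the call counter are invented here to mimic how an action computes partitions once and the checkpoint write iterates them a second time, while a cached copy lets the checkpoint read already-materialized data.

```python
# Toy model of Spark's checkpoint behavior (hypothetical names, not Spark APIs).
# Counts how many times partition computation runs with and without caching.

compute_calls = 0

def compute_partition(p):
    """Stand-in for recomputing an RDD partition from its lineage."""
    global compute_calls
    compute_calls += 1
    return [x * 2 for x in p]

partitions = [[1, 2], [3, 4]]

# Without caching: the action computes every partition, then the
# checkpoint write recomputes them all from the lineage.
action_result = [compute_partition(p) for p in partitions]        # action
checkpoint_copy = [compute_partition(p) for p in partitions]      # checkpoint write
calls_without_cache = compute_calls                               # 4 calls for 2 partitions

# With caching: partitions are computed once and kept; the checkpoint
# write just iterates the cached copies, so no recomputation happens.
compute_calls = 0
cached = [compute_partition(p) for p in partitions]               # compute once, keep
action_result = cached                                            # action reads cache
checkpoint_copy = [list(p) for p in cached]                       # checkpoint reads cache
calls_with_cache = compute_calls                                  # 2 calls for 2 partitions

print(calls_without_cache, calls_with_cache)                      # → 4 2
```

This is exactly the 2X cost the issue describes: every partition is materialized twice unless the user caches first, which is what a `CheckpointingIterator` writing to HDFS during the action would avoid.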



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org