Posted to user@spark.apache.org by Nilesh Chakraborty <ni...@nileshc.com> on 2014/06/28 18:36:58 UTC

Alternative to checkpointing and materialization for truncating lineage in high iteration jobs

Hello,

In a thread about "java.lang.StackOverflowError when calling count()" [1] I
saw Tathagata Das share an interesting approach for truncating RDD lineage;
it helps prevent StackOverflowErrors in high-iteration jobs while avoiding
the disk-writing performance penalty of checkpointing. Here's an excerpt from TD's post:

If you are brave enough, you can try the following. Instead of relying on
checkpointing to HDFS for truncating lineage, you can do this:
1. Persist the Nth RDD with replication (see the different StorageLevels);
this replicates the in-memory RDD between workers within Spark. Let's call
this RDD R.
2. Force it to materialize in memory.
3. Create a modified RDD R` which has the same data as RDD R but does not
have the lineage. This is done by creating a new BlockRDD using the ids of
the blocks of data representing the in-memory R (can elaborate on that if you
want).

This will avoid writing to HDFS (the replication stays in Spark memory),
truncate the lineage (by creating new BlockRDDs), and avoid the
StackOverflowError.

---------------------------------------------------------------------

Now I'm not sure how to do step 3. Any ideas? I'm CC'ing Tathagata too.
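For what it's worth, here is a rough, untested sketch of what steps 1-3 might look like in Scala. It leans on Spark internals: BlockRDD is private[spark], so this would have to live in a package under org.apache.spark, and the object/method names (LineageTruncation, truncateLineage) are just illustrative, not anything from the Spark API:

```scala
// Must sit under org.apache.spark.* because BlockRDD is private[spark].
package org.apache.spark.lineagehack

import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.rdd.{BlockRDD, RDD}
import org.apache.spark.storage.{BlockId, RDDBlockId, StorageLevel}

object LineageTruncation {
  /** Return an RDD with the same data as `rdd` but no parent lineage. */
  def truncateLineage[T: ClassTag](sc: SparkContext, rdd: RDD[T]): RDD[T] = {
    // 1. Persist with 2x in-memory replication: once the lineage is gone
    //    there is nothing to recompute from, so a lost replica must have a copy.
    rdd.persist(StorageLevel.MEMORY_ONLY_2)

    // 2. Force materialization of every partition.
    rdd.count()

    // 3. Cached RDD blocks are identified as RDDBlockId(rddId, partitionIndex);
    //    a BlockRDD built over those ids reads the cached data directly and
    //    has an empty dependency list, i.e. the lineage is truncated.
    val blockIds: Array[BlockId] =
      rdd.partitions.indices.map(i => RDDBlockId(rdd.id, i): BlockId).toArray
    new BlockRDD[T](sc, blockIds)
  }
}
```

The trade-off versus checkpointing to HDFS is durability: if enough workers die that both replicas of a block are lost, the data is gone for good, since the BlockRDD has no lineage to recompute it from.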

Cheers,
Nilesh

[1]:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201405.mbox/%3CCAMwrk0kiQXhktFUaAMHBOROv5LV+D8Y+c5NyCmsXTqASze4_dg@mail.gmail.com%3E



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Alternative-to-checkpointing-and-materialization-for-truncating-lineage-in-high-iteration-jobs-tp8488.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Alternative to checkpointing and materialization for truncating lineage in high iteration jobs

Posted by "Baoxu Shi(Dash)" <bs...@nd.edu>.
I’m facing the same situation. It would be great if someone could provide a code snippet as example.
