
Posted to dev@spark.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/05/29 08:46:18 UTC

Suggestion: RDD cache depth

It would be nice if the RDD cache() method incorporated depth information.

That is,

void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // To depth 1. Actual caching happens.
    rdd.cache();    // To depth 2. No-op as long as the storage level is the same;
                    // otherwise, an exception.
    ...
    rdd.uncache();  // To depth 1. No-op.
    rdd.uncache();  // To depth 0. Actual unpersist happens.
}

This can be useful when writing code in a modular way.

When a function receives an RDD as an argument, it doesn't necessarily know
the cache status of that RDD.

It may still want to cache the RDD, since it will use it multiple times.

But with the current RDD API, the function cannot determine whether it should
unpersist the RDD when it is done or leave it alone (so that the caller can
continue to use the RDD without rebuilding it).

Thanks.

RE: Suggestion: RDD cache depth

Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.

Opened a JIRA issue. (https://issues.apache.org/jira/browse/SPARK-1962)

Thanks.

-----Original Message-----
From: Matei Zaharia [mailto:matei.zaharia@gmail.com] 
Sent: Thursday, May 29, 2014 3:54 PM
To: dev@spark.apache.org
Subject: Re: Suggestion: RDD cache depth

This is a pretty cool idea - instead of cache depth I'd call it something
like reference counting. Would you mind opening a JIRA issue about it?

The issue of really composing together libraries that use RDDs nicely isn't
fully explored, but this is certainly one thing that would help with it. I'd
love to look at other ones too, e.g. how to allow libraries to share scans
over the same dataset.

Unfortunately using multiple cache() calls for this is probably not feasible
because it would change the current meaning of multiple calls. But we can
add a new API, or a parameter to the method.

Matei

Re: Suggestion: RDD cache depth

Posted by Matei Zaharia <ma...@gmail.com>.

This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it?

The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it. I’d love to look at other ones too, e.g. how to allow libraries to share scans over the same dataset.

Unfortunately using multiple cache() calls for this is probably not feasible because it would change the current meaning of multiple calls. But we can add a new API, or a parameter to the method.

Matei
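
The reference-counting idea discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not an existing or planned Spark API: Persistable, RefCountedCache, retain() and release() are hypothetical names, Persistable is a stand-in for an RDD so that the sketch runs without Spark, and the storage level is simplified to a plain string. It follows the rule from the original proposal that a repeated request is a no-op when the storage level matches and an error otherwise.

```java
import java.util.ArrayList;
import java.util.List;

final class RefCountedCacheSketch {

    /** Hypothetical stand-in for the two RDD operations the sketch needs (not a Spark type). */
    interface Persistable {
        void persist(String storageLevel);
        void unpersist();
    }

    /** Hypothetical wrapper: only the first retain() persists, only the last release() unpersists. */
    static final class RefCountedCache {
        private final Persistable target;
        private int refCount = 0;
        private String level = null;

        RefCountedCache(Persistable target) {
            this.target = target;
        }

        synchronized void retain(String storageLevel) {
            if (refCount == 0) {
                // Depth 0 -> 1: actual caching happens.
                target.persist(storageLevel);
                level = storageLevel;
            } else if (!level.equals(storageLevel)) {
                // Already cached at a different level: reject, as in the proposal.
                throw new IllegalStateException(
                        "already cached at " + level + ", requested " + storageLevel);
            }
            refCount++;
        }

        synchronized void release() {
            if (refCount == 0) {
                throw new IllegalStateException("release() without a matching retain()");
            }
            refCount--;
            if (refCount == 0) {
                // Depth 1 -> 0: actual unpersist happens.
                target.unpersist();
                level = null;
            }
        }

        synchronized int refCount() {
            return refCount;
        }
    }

    /** A library function that reuses its input several times without knowing the caller's cache state. */
    static void libraryFunction(RefCountedCache cache) {
        cache.retain("MEMORY_ONLY");
        try {
            // ... use the underlying data multiple times ...
        } finally {
            cache.release(); // Never evicts data that the caller still holds.
        }
    }

    /** Runs the caller/library scenario and returns the calls that reached the stand-in RDD. */
    static List<String> demo() {
        final List<String> log = new ArrayList<>();
        Persistable fakeRdd = new Persistable() {
            @Override
            public void persist(String storageLevel) {
                log.add("persist(" + storageLevel + ")");
            }

            @Override
            public void unpersist() {
                log.add("unpersist");
            }
        };
        RefCountedCache cache = new RefCountedCache(fakeRdd);
        cache.retain("MEMORY_ONLY"); // caller caches first
        libraryFunction(cache);      // nested retain/release are no-ops on the target
        cache.release();             // last release unpersists
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [persist(MEMORY_ONLY), unpersist]
    }
}
```

Keeping the count in a separate wrapper, rather than in cache() itself, leaves the current meaning of repeated cache() calls unchanged, which is the concern raised above.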
