Posted to user@spark.apache.org by charles li <ch...@gmail.com> on 2016/02/05 04:15:09 UTC

rdd cache priority

Say I have 2 RDDs, RDD1 and RDD2.

Both take about 20 GB in memory.

I cache both of them in memory using RDD1.cache() and RDD2.cache().


Then, in the later steps of my app, I never use RDD1 again, but I use RDD2
many times.


Here is my question:

If there is only 40 GB of memory in my cluster, and I have another 20 GB
RDD, RDD3, what happens if I cache it using RDD3.cache()?


As the documentation says, cache() uses the default storage level,
MEMORY_ONLY. That means RDD3 is not guaranteed to be cached: partitions
that do not fit in memory are simply not stored, and are re-computed every
time they are needed.
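
Here is a rough sketch of what I'm doing (the app name and input paths are
made up, and the sizes are just my estimates of each RDD once cached):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cache-priority"))

val rdd1 = sc.textFile("hdfs:///data/input1")  // ~20 GB once cached
val rdd2 = sc.textFile("hdfs:///data/input2")  // ~20 GB once cached
val rdd3 = sc.textFile("hdfs:///data/input3")  // ~20 GB once cached

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd1.cache()
rdd2.cache()

// ... later steps only ever touch rdd2 ...

// with ~40 GB of storage memory already used, what happens here?
rdd3.cache()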

I feel a little confused: will Spark evict RDD1 for me and put RDD3 into
memory instead?

Or is there any concept like a "priority cache" in Spark?


Many thanks



-- 
*--------------------------------------*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao

Re: rdd cache priority

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

You're right; rdd3 is not fully cached, and the uncached partitions are
re-computed every time they are used. With MEMORY_AND_DISK, rdd3 would be
written to disk instead.
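
For example, just a sketch (rdd3 stands for the third RDD in your example):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK: partitions that don't fit in memory are spilled to
// local disk and read back from there instead of being re-computed.
rdd3.persist(StorageLevel.MEMORY_AND_DISK)
rdd3.count()  // first action materializes the cache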

Also, the current Spark does not automatically unpersist RDDs based on how
frequently they are used.
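
So if you know you no longer need rdd1, you have to drop it yourself before
caching rdd3, for instance (again, only a sketch):

// Explicitly release rdd1's cached blocks to free storage memory; Spark
// will not do this re-prioritization for you.
rdd1.unpersist()
rdd3.cache()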

-- 
---
Takeshi Yamamuro