Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2014/07/07 19:02:10 UTC

tiers of caching

I noticed that some algorithms, such as GraphX, liberally cache RDDs for
efficiency, which makes sense. However, this can also leave a long trail of
unused yet still-cached RDDs that might push other RDDs out of memory.

In a long-lived Spark context I would like to decide which RDDs stick
around. Would it make sense to create tiers of caching, distinguishing
RDDs explicitly cached by the application from RDDs temporarily cached
by algorithms, so that these temporary caches don't push application
RDDs out of memory?
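To make the idea concrete, here is a minimal sketch of what tiered eviction could look like. This is purely illustrative (the class, tier names, and sizes are hypothetical, not Spark APIs): under memory pressure, the cache evicts the lowest tier first, and within a tier the least-recently-used entry first, so application-cached data outlives algorithm-internal caches.

```python
# Hypothetical sketch of tiered caching; not a Spark API.
# Application-cached entries (high tier) outlive algorithm-internal
# caches (low tier) under eviction pressure.

APPLICATION = 1  # evicted last
INTERNAL = 0     # evicted first (e.g. GraphX's internal caches)

class TieredCache:
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        self.entries = {}  # id -> (size_mb, tier, last_used)

    def put(self, entry_id, size_mb, tier, last_used):
        # Evict lowest-tier, least-recently-used entries until the new one fits.
        while self.used_mb + size_mb > self.capacity_mb and self.entries:
            victim = min(
                self.entries,
                key=lambda k: (self.entries[k][1], self.entries[k][2]),
            )
            self.used_mb -= self.entries[victim][0]
            del self.entries[victim]
        self.entries[entry_id] = (size_mb, tier, last_used)
        self.used_mb += size_mb

    def __contains__(self, entry_id):
        return entry_id in self.entries
```

With this policy an algorithm's temporary caches can still fill whatever memory is free, but they are the first to go when the application caches more data.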

Re: tiers of caching

Posted by Andrew Or <an...@databricks.com>.
Others have also asked for this on the mailing list, and hence there's a
related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur
brings up a good point in that any current implementation of in-memory
shuffles will compete with application RDD blocks. I think we should
definitely add this at some point. As for a timeline, we already have
many features lined up for 1.1, so it will likely land after that.


2014-07-07 10:13 GMT-07:00 Ankur Dave <an...@gmail.com>:

> I think tiers/priorities for caching are a very good idea and I'd be
> interested to see what others think. In addition to letting libraries cache
> RDDs liberally, it could also unify memory management across other parts of
> Spark. For example, small shuffles benefit from explicitly keeping the
> shuffle outputs in memory rather than writing them to disk, possibly due to
> filesystem overhead. To prevent in-memory shuffle outputs from competing
> with application RDDs, Spark could mark them as lower-priority and specify
> that they should be dropped to disk when memory runs low.
>
> Ankur <http://www.ankurdave.com/>
>
>

Re: tiers of caching

Posted by Ankur Dave <an...@gmail.com>.
I think tiers/priorities for caching are a very good idea and I'd be
interested to see what others think. In addition to letting libraries cache
RDDs liberally, it could also unify memory management across other parts of
Spark. For example, small shuffles benefit from explicitly keeping the
shuffle outputs in memory rather than writing them to disk, possibly due to
filesystem overhead. To prevent in-memory shuffle outputs from competing
with application RDDs, Spark could mark them as lower-priority and specify
that they should be dropped to disk when memory runs low.

Ankur <http://www.ankurdave.com/>
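The drop-to-disk behavior Ankur describes could be sketched as follows. This is an illustrative toy, not Spark's block manager (the class and names are hypothetical): when memory runs low, low-priority blocks such as in-memory shuffle outputs are spilled to a disk store rather than discarded, so they remain available without competing with application RDD blocks.

```python
# Illustrative sketch, not Spark internals: under memory pressure,
# low-priority blocks (e.g. in-memory shuffle outputs) are moved to a
# disk store instead of being dropped entirely.

LOW, HIGH = 0, 1

class SpillingStore:
    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = {}  # block_id -> (size, priority)
        self.disk = {}    # block_id -> size

    def mem_used(self):
        return sum(size for size, _ in self.memory.values())

    def put(self, block_id, size, priority):
        # Spill low-priority blocks to disk until the new block fits in memory.
        while self.mem_used() + size > self.memory_capacity:
            low_blocks = [b for b, (_, p) in self.memory.items() if p == LOW]
            if not low_blocks:
                raise MemoryError("no low-priority blocks left to spill")
            victim = low_blocks[0]
            self.disk[victim] = self.memory.pop(victim)[0]
        self.memory[block_id] = (size, priority)

    def location(self, block_id):
        if block_id in self.memory:
            return "memory"
        if block_id in self.disk:
            return "disk"
        return "missing"
```

The key property is that high-priority application blocks are never spilled by this policy; only the lower tier is demoted to disk.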