Posted to issues@spark.apache.org by "holdenk (JIRA)" <ji...@apache.org> on 2016/10/08 04:20:22 UTC

[jira] [Commented] (SPARK-1762) Add functionality to pin RDDs in cache

    [ https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15557133#comment-15557133 ] 

holdenk commented on SPARK-1762:
--------------------------------

Is this something we are still interested in? I could see it becoming more important with `Datasets`/`DataFrames`, where a partial cache miss is potentially much more expensive than with `RDD`s.
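
To make that cost concrete, here is a minimal sketch (the input path and column name are hypothetical) of the scenario: a cached `Dataset` built from a scan plus a shuffle, where rebuilding any evicted partitions means re-running the whole query plan rather than a single narrow lineage step.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-miss-sketch").getOrCreate()

    // A derived Dataset: scan + shuffle + aggregation.
    val counts = spark.read.parquet("hdfs:///events")   // hypothetical path
      .groupBy("userId")                                // hypothetical column
      .count()

    // Cache and materialize it.
    counts.persist(StorageLevel.MEMORY_ONLY)
    counts.count()

    // If some cached partitions are later evicted, any action touching them
    // re-executes the scan and the shuffle to rebuild just those partitions --
    // the "partial cache miss" cost described above.
    counts.show()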

> Add functionality to pin RDDs in cache
> --------------------------------------
>
>                 Key: SPARK-1762
>                 URL: https://issues.apache.org/jira/browse/SPARK-1762
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>
> Right now, all RDDs are created equal, and there is no mechanism to mark one RDD as more important than the rest. This is a problem when the storage fraction reserved for cached RDDs is small, because caching just a few RDDs can evict more important ones.
> A side benefit of this feature is that we could more safely allocate a smaller spark.storage.memoryFraction once we know how large our important RDDs are, without worrying about them being evicted. That would free up more memory for shuffles, for instance, and avoid disk spills.
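
As of this writing there is no pinning API; a minimal sketch, assuming the pre-1.6 legacy memory manager described above, of the closest approximation: persist the important RDD with a storage level that spills to disk instead of dropping partitions, and shrink spark.storage.memoryFraction once the important RDDs' sizes are known. Paths and the fraction value below are illustrative, not recommendations.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Legacy (pre-unified-memory) tuning: shrink the storage fraction once
    // the important RDDs' sizes are known, leaving more room for shuffles.
    val conf = new SparkConf()
      .setAppName("pinning-workaround-sketch")
      .set("spark.storage.memoryFraction", "0.3")   // default was 0.6

    val sc = new SparkContext(conf)

    // "Important" RDD: MEMORY_AND_DISK means evicted partitions spill to
    // disk instead of being discarded, so they are never fully recomputed.
    val important = sc.textFile("hdfs:///lookup-table")   // hypothetical path
      .map(_.split('\t'))
    important.persist(StorageLevel.MEMORY_AND_DISK)

    // Less important RDD: with MEMORY_ONLY its partitions may simply be
    // evicted and recomputed from lineage if needed again.
    val scratch = sc.textFile("hdfs:///raw-events")       // hypothetical path
    scratch.persist(StorageLevel.MEMORY_ONLY)

This only approximates pinning: the important RDD still loses its in-memory residency under pressure, but it degrades to disk reads rather than full recomputation.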



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org