You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Effi Ofer <ef...@gmail.com> on 2017/03/29 10:19:43 UTC

Why is shuffle data always persisted to disk?

Greetings, I was wondering why Spark's Shuffler always persists the shuffle
data to disk?  I understand that the persisted data can be used by the
scheduler to truncate the lineage of the RDD graph if an existing RDD has
been materialized as a side effect of an earlier shuffle.  But that does
not explain why Spark is not keeping the shuffle RDD in memory until memory
becomes sufficiently low to trigger victim selection and spilling.  Any
hints and pointers would be appreciated.

Thanks,
Effi

Re: Why is shuffle data always persisted to disk?

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
There's been some discussion of this on this JIRA
<https://issues.apache.org/jira/browse/SPARK-3376> and the associated PR.
The short summary is that, in theory / to a VERY rough approximation, the
OS buffer cache does everything we'd want an in-memory shuffle to do, and
is simple.

On Wed, Mar 29, 2017 at 3:19 AM, Effi Ofer <ef...@gmail.com> wrote:

> Greetings, I was wondering why Spark's Shuffler always persists the
> shuffle data to disk?  I understand that the persisted data can be used by
> the scheduler to truncate the lineage of the RDD graph if an existing RDD
> has been materialized as a side effect of an earlier shuffle.  But that
> does not explain why Spark is not keeping the shuffle RDD in memory until
> memory becomes sufficiently low to trigger victim selection and spilling.
> Any hints and pointers would be appreciated.
>
> Thanks,
> Effi
>
>