Posted to dev@spark.apache.org by huan zhang <zh...@gmail.com> on 2015/11/24 02:36:58 UTC

why does shuffle in spark write shuffle data to disk by default?

Hi All,
    I'm wondering why shuffle in Spark writes shuffle data to disk by
default.
    On Stack Overflow, someone said it is done for fault tolerance, but a
node going down is the most common cause of failure, and writing shuffle
data to that node's local disk doesn't help recover from it either.
    So why not use a ramdisk as the default instead of only SSD or HDD?

Thanks
Hubert Zhang

Re: why does shuffle in spark write shuffle data to disk by default?

Posted by Reynold Xin <rx...@databricks.com>.
I think for most jobs the bottleneck isn't in writing shuffle data to disk,
since shuffle data needs to be "shuffled" and sent across the network.

You can always use a ramdisk yourself. Requiring ramdisk by default would
significantly complicate configuration and platform portability.
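
If you do want to try that, a minimal sketch (assuming a tmpfs mount
already exists at /mnt/ramdisk; the path and app name are hypothetical) is
to point Spark's scratch space there via spark.local.dir:

    // Scala: direct shuffle and spill files to an assumed tmpfs mount.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-on-ramdisk")
      // spark.local.dir is where map output and spill files are written;
      // /mnt/ramdisk is an assumed mount point, not a Spark default.
      .set("spark.local.dir", "/mnt/ramdisk/spark-local")
    val sc = new SparkContext(conf)

The same key can also be set in conf/spark-defaults.conf; note that on
YARN or standalone clusters the cluster manager's local-directory settings
take precedence over spark.local.dir.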

