Posted to user@spark.apache.org by Kartik Mathur <ka...@bluedata.com> on 2015/10/01 21:36:03 UTC

Shuffle Write v/s Shuffle Read

Hi

I am trying to better understand shuffle in Spark.

Based on my understanding thus far:

*Shuffle Write*: writes the output of an intermediate stage to local disk
if memory is not sufficient.
For example, if each worker has 200 MB of memory for intermediate results and
the results are 300 MB, then each executor will keep 200 MB in memory and will
write the remaining 100 MB to local disk.

*Shuffle Read*: each executor will read from the other executors' memory +
disk, so the total read in the above case will be 300 MB (200 from memory and
100 from disk) * number of executors?

Is my understanding correct?

Thanks,
Kartik

Re: Shuffle Write v/s Shuffle Read

Posted by Zoltán Zvara <zo...@gmail.com>.

Hi,

As far as I know, shuffle output is written to local disk every time, never
kept only in memory.
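
To make the point above concrete, here is a minimal toy sketch in plain Python (not Spark code) of a shuffle whose map side always writes its partitioned output to local files and whose reduce side fetches those files. The function names, the file naming scheme and the simple partitioner are all invented for illustration; Spark's actual on-disk layout and partitioning differ.

```python
import os
import tempfile
from collections import defaultdict


def shuffle_write(map_id, records, num_reducers, shuffle_dir):
    """Toy map side: bucket records by key, write one file per reducer."""
    buckets = defaultdict(list)
    for key, value in records:
        # Hypothetical partitioner: sum of code points, stable across runs.
        reducer_id = sum(map(ord, key)) % num_reducers
        buckets[reducer_id].append(f"{key}\t{value}\n")
    bytes_written = 0
    for reducer_id, lines in buckets.items():
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reducer_id}.data")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)
        bytes_written += os.path.getsize(path)
    return bytes_written


def shuffle_read(reducer_id, num_maps, shuffle_dir):
    """Toy reduce side: fetch this reducer's file from every map output."""
    records, bytes_read = [], 0
    for map_id in range(num_maps):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reducer_id}.data")
        if not os.path.exists(path):
            continue  # this map task had no records for this reducer
        bytes_read += os.path.getsize(path)
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                records.append((key, int(value)))
    return records, bytes_read


def run_demo():
    # Two hypothetical map tasks and two reduce tasks.
    map_inputs = [
        [("a", 1), ("b", 2), ("c", 3)],
        [("a", 4), ("c", 5), ("d", 6)],
    ]
    num_reducers = 2
    with tempfile.TemporaryDirectory() as shuffle_dir:
        written = sum(
            shuffle_write(i, recs, num_reducers, shuffle_dir)
            for i, recs in enumerate(map_inputs)
        )
        files = sorted(os.listdir(shuffle_dir))
        total_read, grouped = 0, {}
        for reducer_id in range(num_reducers):
            records, n = shuffle_read(reducer_id, len(map_inputs), shuffle_dir)
            total_read += n
            for key, value in records:
                grouped.setdefault(key, []).append(value)
    return written, total_read, grouped, files
```

Calling `run_demo()` returns the bytes written by the map side, the bytes read by the reduce side, the values grouped by key, and the list of shuffle files that existed on disk between the two phases. In this toy model every shuffled byte passes through a local file, regardless of how much memory is available.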

On Fri, Oct 2, 2015 at 1:26 PM Adrian Tanase <at...@adobe.com> wrote:

> I’m not sure this is related to memory management – the shuffle is the
> central act of moving data between nodes when a computation needs data
> that lives on another node (e.g. group by, sort, etc.).
>
> Shuffle write and shuffle read should mirror each other on the two sides
> of a shuffle between 2 stages: what the upstream stage writes, the
> downstream stage reads.
>
> -adrian

Re: Shuffle Write v/s Shuffle Read

Posted by Adrian Tanase <at...@adobe.com>.

I’m not sure this is related to memory management – the shuffle is the central act of moving data between nodes when a computation needs data that lives on another node (e.g. group by, sort, etc.).

Shuffle write and shuffle read should mirror each other on the two sides of a shuffle between 2 stages: what the upstream stage writes, the downstream stage reads.

-adrian
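
As a rough illustration of the "mirrored" point, the arithmetic can be sketched in plain Python. The numbers are assumptions built on the 300 MB figure from the question: 3 map tasks, each producing 300 MB of shuffle output spread evenly over 3 reduce tasks. `shuffle_totals` is a hypothetical helper, not a Spark API.

```python
def shuffle_totals(output_mb):
    """output_mb[m][r] = MB that map task m produces for reduce task r."""
    # Shuffle write of a map task: everything it produces (row sum).
    write_per_map = [sum(row) for row in output_mb]
    # Shuffle read of a reduce task: its slice from every map task (column sum).
    read_per_reduce = [sum(col) for col in zip(*output_mb)]
    return write_per_map, read_per_reduce


# Assumed layout: 3 map tasks x 3 reduce tasks, 100 MB per (map, reduce) pair.
output_mb = [[100, 100, 100] for _ in range(3)]
write_per_map, read_per_reduce = shuffle_totals(output_mb)

print(write_per_map, sum(write_per_map))      # [300, 300, 300] 900
print(read_per_reduce, sum(read_per_reduce))  # [300, 300, 300] 900
```

Under these assumptions the total read equals the total written (900 MB on both sides), and each reduce task reads about 300 MB rather than 300 MB times the number of executors. With skewed keys the per-task reads would differ, but the two totals would still match.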
