Posted to user@spark.apache.org by Ufuk Celebi <u....@fu-berlin.de> on 2013/10/28 17:25:33 UTC

Task output before a shuffle

Hey everybody,

I just watched the Spark Internals presentation [1] from the December 2012 dev meetup and have a couple of questions regarding the output of tasks before a shuffle.

1. Can anybody confirm that the default is still to persist stage output to RAM/disk and then have the following tasks pull it (see [1] around 45:40)? I guess a couple of things have changed since last year. I just want to be sure that this is not one of those things. ;-)

2. Is it possible to switch to a "push" model between stages instead of having the following tasks "pull" the result? I guess this is equivalent to asking whether it is possible to turn off persisting the results.

3. Does the data need to be fully persisted before the next stage can start? Or will the following tasks start pulling data before everything is written out?

4. Is the main motivation for persisting to have faster recovery after failures (i.e. a form of checkpointing)?

Best wishes,

Ufuk

[1] http://www.youtube.com/watch?v=49Hr5xZyTEA

Re: Task output before a shuffle

Posted by Ufuk Celebi <u....@fu-berlin.de>.
On 29 Oct 2013, at 02:47, Matei Zaharia <ma...@gmail.com> wrote:
> Yes, we still write out data after these tasks in Spark 0.8, and it needs to be written out before any stage that reads it can start. The main reasons are simplicity in the presence of faults and more flexible scheduling (you don't have to decide in advance where each reduce task runs, you can have more reduce tasks than you have CPU cores, etc.).

Thank you for the answer! I have a follow-up:

In which fraction of the heap (the RDD storage fraction or the rest) will the output be stored before it spills to disk?

I have a job where I read over a large data set once and don't persist anything. Would it make sense to set "spark.storage.memoryFraction" to a smaller value in order to avoid spilling to disk?
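
For reference, this is how I would try it (a minimal sketch against Spark 0.8, where configuration goes through Java system properties set before the SparkContext is created; the 0.3 value and the paths are made-up examples):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair RDD operations such as reduceByKey

    object NoCacheJob {
      def main(args: Array[String]) {
        // Shrink the RDD cache fraction before the context exists; this job
        // never calls persist()/cache(), so the storage space would sit idle.
        System.setProperty("spark.storage.memoryFraction", "0.3")

        val sc = new SparkContext("local[4]", "no-cache-job")

        // Single pass over the data set, nothing persisted.
        sc.textFile("hdfs:///data/input")
          .flatMap(_.split(" "))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///data/output")

        sc.stop()
      }
    }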

- Ufuk

Re: Task output before a shuffle

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Ufuk,

Yes, we still write out data after these tasks in Spark 0.8, and it needs to be written out before any stage that reads it can start. The main reasons are simplicity in the presence of faults and more flexible scheduling (you don't have to decide in advance where each reduce task runs, you can have more reduce tasks than you have CPU cores, etc.).
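
To make the scheduling point concrete, here is a small sketch (made-up paths and numbers): the number of reduce tasks is just the numPartitions argument of the shuffle operation, so you can ask for far more tasks than you have cores, and the scheduler runs them in waves, each task fetching its slice of the map output when it gets a slot:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversions for pair RDDs

    val sc = new SparkContext("local[16]", "shuffle-waves")

    // The map output is written out first; then 200 reduce tasks are
    // scheduled over the 16 cores, each pulling its own slice of it.
    val counts = sc.textFile("hdfs:///logs")
      .map(line => (line.take(8), 1L))  // key on a (made-up) line prefix
      .reduceByKey(_ + _, 200)          // numPartitions = number of reduce tasks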

Matei
