Posted to user@spark.apache.org by Yang <te...@gmail.com> on 2016/10/17 23:11:56 UTC

previous stage results are not saved?

I'm trying out 2.0, and ran a long job with 10 stages in spark-shell.

It seems that after all 10 finished successfully, if I run the last stage,
or the 9th, again, Spark reruns all the previous stages from scratch
instead of reusing the partial results.

This is quite serious, since I can't experiment while making small changes
to the code.

Any idea what part of the Spark framework might have caused this?

thanks
Yang

Re: previous stage results are not saved?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
There is no need to do that if:

1) the stage that you are concerned with either made use of or produced
   MapOutputs/shuffle files;
2) reuse of those shuffle files (which may very well be in the OS buffer
   cache of the worker nodes) is sufficient for your needs;
3) the relevant Stage objects haven't gone out of scope, which would allow
   the shuffle files to be removed;
4) you reuse the exact same Stage objects that were used previously.

If all of that is true, then Spark will reuse the prior stage with
performance very similar to if you had explicitly cached an equivalent RDD.
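The conditions above can be sketched in spark-shell (where `sc` is
predefined); the names `raw`, `reduced`, and `rebuilt` are illustrative,
not from the thread:

```scala
// A shuffle (reduceByKey) produces map outputs / shuffle files.
val raw     = sc.parallelize(1 to 1000).map(i => (i % 10, i))
val reduced = raw.reduceByKey(_ + _)   // keep THIS exact reference around

reduced.count()  // first action: runs both stages
reduced.count()  // same RDD object: the shuffle-map stage is skipped
                 // (it shows up as "skipped" in the Spark UI)

// Rebuilding the lineage creates new Stage objects, so all stages rerun:
val rebuilt = sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _)
rebuilt.count()
```

The key point is condition 4: reuse only happens when the second action
runs on the same RDD reference, not on a freshly rebuilt lineage.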

On Mon, Oct 17, 2016 at 4:53 PM, ayan guha <gu...@gmail.com> wrote:

> You can use cache or persist.
>
> On Tue, Oct 18, 2016 at 10:11 AM, Yang <te...@gmail.com> wrote:
>
>> I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell
>>
>> it seems that after all 10 finished successfully, if I run the last, or
>> the 9th again,
>> spark reruns all the previous stages from scratch, instead of utilizing
>> the partial results.
>>
>> this is quite serious since I can't experiment while making small changes
>> to the code.
>>
>> any idea what part of the spark framework might have caused this ?
>>
>> thanks
>> Yang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: previous stage results are not saved?

Posted by ayan guha <gu...@gmail.com>.
You can use cache or persist.
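A minimal spark-shell sketch of the cache/persist suggestion, assuming the
predefined `sc`; `expensive` is an illustrative name standing in for the
intermediate RDD you want to keep:

```scala
import org.apache.spark.storage.StorageLevel

// Stand-in for a costly intermediate result you will reuse.
val expensive = sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _)

expensive.cache()   // for RDDs this is persist(StorageLevel.MEMORY_ONLY)
// Or, to spill to disk instead of recomputing under memory pressure:
// expensive.persist(StorageLevel.MEMORY_AND_DISK)

expensive.count()   // first action materializes and caches the partitions
expensive.count()   // later actions read the cached partitions
```

Call `expensive.unpersist()` when you no longer need the cached copy.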

On Tue, Oct 18, 2016 at 10:11 AM, Yang <te...@gmail.com> wrote:

> I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell
>
> it seems that after all 10 finished successfully, if I run the last, or
> the 9th again,
> spark reruns all the previous stages from scratch, instead of utilizing
> the partial results.
>
> this is quite serious since I can't experiment while making small changes
> to the code.
>
> any idea what part of the spark framework might have caused this ?
>
> thanks
> Yang
>



-- 
Best Regards,
Ayan Guha