
Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2016/09/29 05:09:22 UTC

spark persistence doubt

Hi

I have a flow like the one below:

1. rdd1 = some source.transform();
2. transformrdd1 = rdd1.transform(..);
3. transformrdd2 = rdd1.transform(..);

4. transformrdd1.action();

Do I need to persist rdd1 to optimise steps 2 and 3? Or, since there is no
break in the lineage, will it work without persist?

Thanks

Re: spark persistence doubt

Posted by Bedrytski Aliaksandr <sp...@bedryt.ski>.

Hi,

The 4th step should contain "transformrdd2", right?

Considering that transformations are chained together and executed only
when an action is called (also known as lazy execution), I would say that
adding persist() to step 1 would not do any good (and may even be
harmful, as you may lose the optimisations gained by lining up the 3 steps
in one operation).
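
To make the "lining up" idea concrete, here is a minimal sketch in plain
Python (not Spark code). It uses generators as a stand-in for lazy
transformations; the names source, step1, step2 and calls are hypothetical
and exist only for this illustration. Nothing runs until the final list()
call, which plays the role of the action, and each element then flows
through all the steps in a single pass without an intermediate result
being stored.

```python
calls = []  # records which step ran, in order


def source():
    # Stand-in for reading from "some source".
    for x in range(3):
        calls.append("read")
        yield x


def step1(items):
    # First lazy transformation.
    for x in items:
        calls.append("step1")
        yield x + 1


def step2(items):
    # Second lazy transformation.
    for x in items:
        calls.append("step2")
        yield x * 10


pipeline = step2(step1(source()))  # only builds the chain, nothing runs
assert calls == []

result = list(pipeline)  # the "action": triggers the whole chain
print(result)  # [10, 20, 30]
print(calls)   # read, step1, step2 repeated once per element
```

The interleaved order in calls is the point: the steps are fused into one
pass over the data. Spark's actual scheduling is more involved, so treat
this only as an analogy.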

If a second action is executed on any of the transformations,
persisting the farthest common transformation would be a good idea.
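
The two-action case can be sketched with a toy model, again in plain
Python rather than Spark. LazyDataset, load and run are hypothetical names
invented for this illustration; the class only mimics the idea of lazy
lineage plus an optional cache and is not how Spark implements persist().
The sketch counts how many times the source is read when both
transformrdd1 and transformrdd2 are followed by an action.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a recipe, computes only on an action."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning a list
        self._persist = False
        self._cache = None

    def map(self, fn):
        # Transformation: returns a new lazy dataset and runs nothing.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def persist(self):
        # Mark this dataset so its first computed result is kept.
        self._persist = True
        return self

    def _materialize(self):
        if self._persist:
            if self._cache is None:
                self._cache = self._compute()
            return self._cache
        return self._compute()

    def collect(self):
        # Action: evaluates the whole lineage behind this dataset.
        return self._materialize()


def run(persist_rdd1):
    source_reads = []

    def load():
        source_reads.append(1)  # count every read of the source
        return [1, 2, 3]

    rdd1 = LazyDataset(load).map(lambda x: x + 1)
    if persist_rdd1:
        rdd1.persist()
    transformrdd1 = rdd1.map(lambda x: x * 2)
    transformrdd2 = rdd1.map(lambda x: x * 3)
    out1 = transformrdd1.collect()  # first action
    out2 = transformrdd2.collect()  # second action
    return out1, out2, len(source_reads)


print(run(persist_rdd1=False))  # source is read twice
print(run(persist_rdd1=True))   # source is read once
```

In this model the results are identical either way; only the amount of
recomputation differs. With a single action the source is read once
regardless, which matches the point that persist() gains nothing there.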

Regards,
--
  Bedrytski Aliaksandr
  spark@bedryt.ski
