Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2016/09/29 05:09:22 UTC
spark persistence doubt
Hi
I have a flow like the one below:
1. rdd1 = some source.transform();
2. transformedrdd1 = rdd1.transform(..);
3. transformrdd2 = rdd1.transform(..);
4. transformedrdd1.action();
Do I need to persist rdd1 to optimise steps 2 and 3? Or, since there is
no break in the lineage, will it work without persist?
Thanks
Re: spark persistence doubt
Posted by Bedrytski Aliaksandr <sp...@bedryt.ski>.
Hi,
The 4th step should contain "transformrdd2", right?
Considering that transformations are chained and executed only when
an action is called (also known as lazy evaluation), I would say that
adding persist() at step 1 would not do any good (and may even be
harmful, as you may lose the optimisations gained by pipelining the 3
steps into one operation).
If a second action is executed on any of the transformations,
persisting the farthest common transformation would be a good idea.
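The reasoning above can be sketched with a small toy model. This is not
Spark code: LazyRDD, run and read_source are hypothetical names made up
for illustration, and the class only imitates lazy evaluation and
persist() in plain Python. It counts how often the source is read, under
the assumption that an unpersisted lineage is recomputed for every action.

```python
class LazyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._persisted = False
        self._cache = None

    def map(self, fn):
        # Transformation: only extends the lineage, nothing is computed yet.
        return LazyRDD(lambda: [fn(x) for x in self._materialize()])

    def persist(self):
        # Mark this node so its result is kept after the first computation.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted:
            if self._cache is None:
                self._cache = self._compute()
            return self._cache
        return self._compute()

    def collect(self):
        # Action: triggers evaluation of the whole lineage.
        return self._materialize()


def run(persist_rdd1, second_action):
    """Build the flow from the question; return how often the source was read."""
    source_reads = []

    def read_source():
        source_reads.append(1)       # record every real read of the source
        return [1, 2, 3]

    rdd1 = LazyRDD(read_source).map(lambda x: x + 1)   # step 1
    if persist_rdd1:
        rdd1.persist()
    transformedrdd1 = rdd1.map(lambda x: x * 10)       # step 2
    transformrdd2 = rdd1.map(lambda x: x * 100)        # step 3
    transformedrdd1.collect()                          # step 4 (single action)
    if second_action:
        transformrdd2.collect()                        # a second action
    return len(source_reads)
```

In this model, calling run(persist_rdd1=False, second_action=False) and
run(persist_rdd1=True, second_action=False) gives the same count, since a
single action walks the lineage once either way. The difference only
shows up with second_action=True, where the unpersisted rdd1 is rebuilt
for the second action.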
Regards,
--
Bedrytski Aliaksandr
spark@bedryt.ski