Posted to user@spark.apache.org by Benyi Wang <be...@gmail.com> on 2014/11/10 20:33:20 UTC

Custom persist or cache of RDD?

When I have a multi-step process flow like this:

A -> B -> C -> D -> E -> F

I need to store the results of B and D as Parquet files:

B.saveAsParquetFile
D.saveAsParquetFile

If I don't cache/persist any step, Spark might recompute A, B, C, D and E
if something goes wrong in F.

Of course, if I have enough memory I could cache every step to avoid this
recomputation, or persist the results to disk. But persisting B and D seems
redundant with saving B and D as Parquet files.

I'm wondering whether Spark can restore B and D from the Parquet files using
a customized persist-and-restore procedure?
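
For illustration, the combination described above (persisting B and also saving it as Parquet) might look roughly like the sketch below. It assumes a Spark 1.1-era SQLContext; the Record case class, the stand-in transformations, and the /tmp/pipeline path are hypothetical and not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

object PersistThenSave {
  // Hypothetical record type standing in for the real schema of B.
  case class Record(id: Int, value: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-then-save").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Implicit conversion RDD[Record] -> SchemaRDD (Spark 1.0 to 1.2).
    import sqlContext.createSchemaRDD

    val a = sc.parallelize(1 to 1000)        // stand-in for step A
    val b = a.map(i => Record(i, i * 0.5))   // stand-in for step B

    // Keep B's partitions around so the save and step C share one computation.
    b.persist(StorageLevel.MEMORY_AND_DISK)

    // First action: computes B, fills the cache, and writes the Parquet files.
    // The output path is hypothetical and assumed not to exist yet.
    b.saveAsParquetFile("/tmp/pipeline/b.parquet")

    // Step C reads B's cached partitions instead of recomputing A -> B.
    val c = b.map(r => r.value * 2)
    println(c.count())

    // Release the cached partitions once downstream steps no longer need them.
    b.unpersist()
    sc.stop()
  }
}
```

This is the duplication the question is about: B ends up both in Spark's block storage and in the Parquet files.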

Re: Custom persist or cache of RDD?

Posted by Daniel Siegmann <da...@velos.io>.
But that requires an (unnecessary) load from disk.

I have run into this same issue, where we want to save intermediate results
but continue processing. The cache / persist feature of Spark doesn't seem
designed for this case. Unfortunately I'm not aware of a better solution
with the current version of Spark.

On Mon, Nov 10, 2014 at 5:15 PM, Sean Owen <so...@cloudera.com> wrote:

> Well you can always create C by loading B from disk, and likewise for
> E / D. No need for any custom procedure.


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io

Re: Custom persist or cache of RDD?

Posted by Sean Owen <so...@cloudera.com>.
Well you can always create C by loading B from disk, and likewise for
E / D. No need for any custom procedure.
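
A minimal sketch of this suggestion, assuming the Spark 1.x-era SQLContext API that the saveAsParquetFile call in the question belongs to. The Record case class, the stand-in transformations, and the /tmp/pipeline path are hypothetical. Later Spark versions expose the same idea through write.parquet and read.parquet.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReloadFromParquet {
  // Hypothetical record type standing in for the real schema of B.
  case class Record(id: Int, value: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reload-from-parquet").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Implicit conversion RDD[Record] -> SchemaRDD (Spark 1.0 to 1.2).
    import sqlContext.createSchemaRDD

    val bPath = "/tmp/pipeline/b.parquet"    // hypothetical, assumed not to exist yet

    val a = sc.parallelize(1 to 1000)        // stand-in for step A
    val b = a.map(i => Record(i, i * 0.5))   // stand-in for step B
    b.saveAsParquetFile(bPath)

    // Build C from the saved files rather than from the in-memory B.
    // The lineage of C now starts at the Parquet files, so a failure
    // downstream re-reads them instead of recomputing A -> B.
    val bReloaded = sqlContext.parquetFile(bPath)
    val c = bReloaded.map(row => row.getDouble(1) * 2) // column 1 is "value"
    println(c.count())

    sc.stop()
  }
}
```

The same pattern would apply to D and E. The cost is the extra read of the Parquet files from disk.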

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org