Posted to user@spark.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2015/03/30 10:43:36 UTC

Spark caching

Hi all,

I am trying to understand how Spark's lazy evaluation works, and I need some
help. I have noticed that creating an RDD once and using it many times
won't trigger recomputation of it every time it gets used, whereas creating
a new RDD every time an operation is performed will trigger recomputation
of the whole RDD again.
I would have thought that both approaches behave similarly (i.e. no
caching) due to Spark's lazy evaluation strategy, but I guess Spark keeps
track of the RDDs used and of the partial results computed so far, so it
doesn't do unnecessary extra work. Could anybody point me to where Spark
decides what to cache, or how I can disable this behaviour?
Thanks in advance!


Renato M.

Approach 1 --> this doesn't trigger recomputation of the RDD in every iteration
=========
JavaRDD aggrRel = Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
for (int i = 0; i < NUM_RUNS; i++) {
   // doing some computation like aggrRel.count()
   . . .
}

Approach 2 --> this triggers recomputation of the RDD in every iteration
=========
for (int i = 0; i < NUM_RUNS; i++) {
   JavaRDD aggrRel = Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
   // doing some computation like aggrRel.count()
   . . .
}
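
If the goal is to make that reuse explicit rather than depending on whatever
Spark happens to keep around, the usual route is to persist the derived RDD
once and run every action against the cached copy. Below is a minimal,
self-contained sketch in the spirit of Approach 1, assuming the Spark 1.x
Java API; the input path, the group/map lambdas and the class name are
hypothetical stand-ins for Utils.readJavaRDD(...), groupFunction and
mapFunction.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExplicitCacheSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("explicit-cache-sketch").setMaster("local[*]"));

        // Stand-in for Utils.readJavaRDD(...): read lines from a hypothetical path.
        JavaRDD<String> input = sc.textFile("/path/to/input");

        // Build the derived RDD once and cache it explicitly, so the actions
        // below reuse the materialised partitions instead of depending on
        // implicit reuse of shuffle output.
        JavaRDD<String> aggrRel = input
                .groupBy(line -> line.split(",")[0])   // stand-in for groupFunction
                .map(group -> group._1());             // stand-in for mapFunction
        aggrRel.cache();

        final int NUM_RUNS = 5;
        for (int i = 0; i < NUM_RUNS; i++) {
            // The first count() computes and caches aggrRel; later ones hit the cache.
            System.out.println("run " + i + ": " + aggrRel.count());
        }

        aggrRel.unpersist();   // release the cached partitions when done
        sc.stop();
    }
}

cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY()); calling
unpersist() afterwards drops the cached blocks once they are no longer needed.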

Re: Spark caching

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Sean!
Do you know if there is a way (even manually) to delete these intermediate
shuffle results? I just want to test the "expected" behaviour. I know
that this re-caching is probably beneficial most of the time, but I want to
try it without it.


Renato M.
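
As far as I know there is no public API for deleting shuffle files on demand;
they are removed by Spark's ContextCleaner once the RDDs (more precisely, the
shuffle dependencies) that produced them are garbage-collected on the driver,
with reference tracking enabled by default. A hedged sketch of leaning on that
mechanism, in the spirit of Approach 2; the input path, lambdas and class name
are hypothetical stand-ins, and the sleep is only there because the cleanup
runs asynchronously:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DropShuffleOutputSketch {
    public static void main(String[] args) throws InterruptedException {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("drop-shuffle-sketch").setMaster("local[*]"));

        final int NUM_RUNS = 3;
        for (int i = 0; i < NUM_RUNS; i++) {
            // Rebuild the whole lineage every iteration, as in Approach 2.
            JavaRDD<String> aggrRel = sc.textFile("/path/to/input")   // stand-in for Utils.readJavaRDD(...)
                    .groupBy(line -> line.split(",")[0])              // stand-in for groupFunction
                    .map(group -> group._1());                        // stand-in for mapFunction
            System.out.println("run " + i + ": " + aggrRel.count());

            // Drop the only driver-side reference to the lineage and hint a GC,
            // so the ContextCleaner can notice the dead shuffle dependency and
            // remove its shuffle files. The cleanup runs asynchronously on the
            // executors, hence the (crude) sleep.
            aggrRel = null;
            System.gc();
            Thread.sleep(2000);
        }
        sc.stop();
    }
}

A blunter alternative for benchmarking is to stop and recreate the
SparkContext between runs, since a new application cannot reuse the previous
application's shuffle output anyway.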

2015-03-30 12:15 GMT+02:00 Sean Owen <so...@cloudera.com>:

> I think that you get a sort of "silent" caching after shuffles, in
> some cases, since the shuffle files are not immediately removed and
> can be reused.
>
> (This is the flip side to the frequent question/complaint that the
> shuffle files aren't removed straight away.)

Re: Spark caching

Posted by Sean Owen <so...@cloudera.com>.
I think that you get a sort of "silent" caching after shuffles, in
some cases, since the shuffle files are not immediately removed and
can be reused.

(This is the flip side to the frequent question/complaint that the
shuffle files aren't removed straight away.)
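
One way to see this effect, assuming the Spark 1.x Java API and a hypothetical
input path: run two actions on the same shuffled RDD without caching anything.
If the map output from the first job is still on disk, the second job can
reuse it, which typically shows up as a skipped stage in the web UI.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleReuseDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("shuffle-reuse-demo").setMaster("local[*]"));

        // A shuffle-producing RDD; the path and key function are hypothetical.
        JavaRDD<String> lines = sc.textFile("/path/to/input");
        JavaPairRDD<String, Iterable<String>> grouped =
                lines.groupBy(line -> line.split(",")[0]);

        // No cache()/persist() anywhere: the first action runs the full job,
        // including the shuffle map stage.
        System.out.println("first:  " + grouped.count());

        // A second action on the *same* RDD object can reuse the map output
        // still sitting on disk, so its map stage is typically reported as
        // skipped, which is the "silent" caching described above.
        System.out.println("second: " + grouped.count());

        sc.stop();
    }
}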

