Posted to user@spark.apache.org by Sourabh Chandak <so...@gmail.com> on 2016/01/23 03:10:36 UTC

Caching in Spark

Hi,

I have a Spark app that internally splits into two jobs, because we write to
two different Cassandra tables. The input data comes from the same Cassandra
table, so after reading the data from Cassandra and applying a few
transformations, I cache one of the RDDs and fork the program to compute both
metrics. Ideally I would expect the second job to skip all the transformations
prior to the cached RDD and start running from there. Instead, the entire
stage in which the caching was done is re-executed up to the caching
transformation, and only then does the job continue the way it is supposed to.
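For reference, here is a minimal sketch of the shape of the app (keyspace,
table, and column names are made up, only the structure matters):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("two-metrics")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read once from the source table and apply the shared transformations
// (hypothetical keyspace/table/column names).
val events = sc.cassandraTable("ks", "events")
  .map(row => (row.getString("key"), row.getLong("value")))
val lookup = sc.cassandraTable("ks", "lookup")
  .map(row => (row.getString("key"), row.getString("tag")))

// The RDD that is cached after the join and map (the green dot in the DAGs).
val shared = events.join(lookup)
  .map { case (key, (value, tag)) => (key, tag, value) }
  .cache()

// Fork: each saveToCassandra call triggers its own Spark job.
shared
  .saveToCassandra("ks", "metric_a", SomeColumns("key", "tag", "value")) // Job 0
shared
  .map { case (k, t, v) => (k, t, v * 2) }
  .saveToCassandra("ks", "metric_b", SomeColumns("key", "tag", "value")) // Job 1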

Here are 2 DAGs from the jobs:

Job 0:

[image: DAG for Job 0]

Job 1:

[image: DAG for Job 1]

The RDD in Job 0, Stage 4 is cached after the join and map, as indicated by
the green dot. The corresponding stage in Job 1 is Stage 9, which re-executes
the join and map before going on to the regular path.
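One thing that may help cross-check the DAGs: calling toDebugString on the
cached RDD prints the lineage together with the storage level and the number
of partitions that are actually materialized. A sketch, with "shared" standing
in for the RDD cached after the join and map:

// Lineage of the cached RDD; lines like "CachedPartitions: x/y" show how
// many partitions are actually in the block manager at this point.
println(shared.toDebugString)

// Storage level declared via cache()/persist().
println(shared.getStorageLevel)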

There is another interesting observation: the final RDD in Stage 0 is also a
cached RDD and is used in Stage 2. When we kick off the job, both Stage 0 and
Stage 2 are scheduled in parallel, but Stage 2 waits for a long time and only
does some computation after 70-80% of Stage 0 is done. I assume it is waiting
for the cache to be built by Stage 0, and the computation it does is the last
map transformation in Stage 2, which indicates that the cache does work within
a single job.

I want to understand this in depth, because re-computing an entire stage can
mean re-reading data from Cassandra or shuffling a lot of data.
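If it helps to reproduce, one experiment I can think of is to force the cached
RDD to be fully materialized with a cheap action before forking, and then check
the storage status before the second write kicks off. Again a sketch, using the
hypothetical "shared" RDD from above:

// Cheap action that materializes the cache before the two writes run.
shared.count()

// Check that all partitions are cached before the second job is submitted.
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}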

Thanks,
Sourabh