Posted to user@spark.apache.org by rapelly kartheek <ka...@gmail.com> on 2014/12/04 08:54:58 UTC

Necessity for rdd replication.

Hi,

I was thinking about when RDD replication is actually necessary. One case
could be a large number of threads needing the same RDD. Even though a
single RDD can be shared by multiple threads belonging to the same
application, I believe we could extract better parallelism if the RDD were
replicated. Am I right?

I am eager to know whether there are any real-life applications or other
scenarios that require an RDD to be replicated. Can someone please shed some
light on the necessity for RDD replication?

Thank you

Re: Necessity for rdd replication.

Posted by Sameer Farooqui <sa...@databricks.com>.
In general, most use cases don't need the RDD to be replicated in memory
multiple times. It would be a rare exception to do this. If it's really
expensive (time-consuming) to recompute a lost partition, or if the use
case is extremely time-sensitive, then maybe you could replicate it in
memory. But in general, you can safely rely on the RDD lineage graph to
re-create the lost partition if it gets discarded from memory.

As for extracting better parallelism if the RDD is replicated, that
really depends on what sort of transformations and operations you're
running against the RDD, but again, generally speaking, you shouldn't need
to replicate it.
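
To make the trade-off between lineage-based recovery and in-memory
replication concrete, here is a minimal sketch in plain Python. It is not
Spark code: ToyRDD, get, lose and computations_after_loss are hypothetical
names made up for this illustration, and the model ignores scheduling,
executors and everything else a real cluster does. It only counts how often
a partition has to be rebuilt after one cached copy is lost.

```python
from typing import Callable, Dict, List


class ToyRDD:
    """Toy model of a cached dataset (hypothetical; not Spark's RDD API)."""

    def __init__(self, source: List[List[int]],
                 transform: Callable[[int], int], replicas: int = 1) -> None:
        self.source = source          # durable input partitions
        self.transform = transform    # the "lineage": how to rebuild a partition
        # One cache per replica, mapping partition index -> computed data.
        self.caches: List[Dict[int, List[int]]] = [{} for _ in range(replicas)]
        self.computations = 0         # how many times a partition was (re)built

    def get(self, index: int) -> List[int]:
        # Serve from any replica that still holds the partition.
        for cache in self.caches:
            if index in cache:
                return cache[index]
        # No copy survives: rebuild from the source using the lineage.
        self.computations += 1
        data = [self.transform(x) for x in self.source[index]]
        for cache in self.caches:
            cache[index] = data
        return data

    def lose(self, index: int, replica: int = 0) -> None:
        # Simulate one replica's cached copy being evicted or lost.
        self.caches[replica].pop(index, None)


def computations_after_loss(replicas: int) -> int:
    rdd = ToyRDD([[1, 2], [3, 4]], lambda x: x * 10, replicas=replicas)
    rdd.get(0)     # first access computes and caches the partition
    rdd.lose(0)    # replica 0 loses its copy
    rdd.get(0)     # served from another replica, or recomputed
    return rdd.computations


unreplicated = computations_after_loss(replicas=1)  # 2: rebuilt from lineage
replicated = computations_after_loss(replicas=2)    # 1: second copy was reused
```

With a single copy the partition is simply rebuilt once more, which is cheap
when the transformation is cheap. The second copy only pays off when that
rebuild is expensive or the job cannot tolerate the delay. In Spark itself,
replication is requested through a storage level such as MEMORY_ONLY_2
passed to persist(), rather than by copying the RDD by hand.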
