Posted to user@spark.apache.org by "Lisonbee, Todd" <to...@intel.com> on 2014/04/26 14:59:22 UTC

is it okay to reuse objects across RDD's?

For example,

val originalRDD: RDD[SomeCaseClass] = ...

// Option 1: objects are copied, setting prop1 in the process
val transformedRDD = originalRDD.map( item => item.copy(prop1 = calculation()) )

// Option 2: objects are re-used and modified
val transformedRDD = originalRDD.map { item => item.prop1 = calculation(); item }

I did a couple of small tests with option 2 and noticed less time was spent in garbage collection.  It didn't add up to much but with a large enough data set it would make a difference.  Also, it seems that less memory would be used.

Potential gotchas:

- Objects in originalRDD are being modified, so you can't expect them to remain unchanged
- You also can't rely on objects in originalRDD having the new value, because originalRDD might be re-calculated
- If originalRDD was a PairRDD, and you modified the keys, it could cause issues
- more?

Other than the potential gotchas, is there any reason not to reuse objects across RDDs?  Is it a recommended practice for reducing memory usage and garbage collection, or not?

Is it safe to do this in code you expect to work on future versions of Spark?

Thanks in advance,

Todd

Re: is it okay to reuse objects across RDD's?

Posted by Tom Vacek <mi...@gmail.com>.
Ian, I tried playing with your suggestion, but I get a task not
serializable error (and some obvious things didn't fix it).  Can you get
that working?


On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek <mi...@gmail.com> wrote:

> As to your last line: I've used RDD zipping to avoid GC since MyBaseData
> is large and doesn't change.  I think this is a very good solution to what
> is being asked for.
>
>
> On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell <ia...@ianoconnell.com>wrote:
>
>> A mutable map in an object should do what your looking for then I
>> believe. You just reference the object as an object in your closure so it
>> won't be swept up when your closure is serialized and you can reference
>> variables of the object on the remote host then. e.g.:
>>
>> object MyObject {
>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>> }
>>
>> rdd.map { ele =>
>> MyObject.mmap.getOrElseUpdate(ele, 1L)
>> ...
>> }.map {ele =>
>> require(MyObject.mmap(ele) == 1L)
>>
>> }.count
>>
>> Along with the data loss just be careful with thread safety and multiple
>> threads/partitions on one host so the map should be viewed as shared
>> amongst a larger space.
>>
>>
>>
>> Also with your exact description it sounds like your data should be
>> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
>> LastIterationSideValues)]
>>
>>
>>
>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>> codedeft@cs.stanford.edu> wrote:
>>
>>> In our case, we'd like to keep memory content from one iteration to the
>>> next, and not just during a single mapPartition call because then we can do
>>> more efficient computations using the values from the previous iteration.
>>>
>>> So essentially, we need to declare objects outside the scope of the
>>> map/reduce calls (but residing in individual workers), then those can be
>>> accessed from the map/reduce calls.
>>>
>>> We'd be making some assumptions as you said, such as - RDD partition is
>>> statically located and can't move from worker to another worker unless the
>>> worker crashes.
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>> codedeft@cs.stanford.edu> wrote:
>>>>
>>>>> Actually, I do not know how to do something like this or whether this
>>>>> is possible - thus my suggestive statement.
>>>>>
>>>>> Can you already declare persistent memory objects per worker? I tried
>>>>> something like constructing a singleton object within map functions, but
>>>>> that didn't work as it seemed to actually serialize singletons and pass it
>>>>> back and forth in a weird manner.
>>>>>
>>>>>
>>>> Does it need to be persistent across operations, or just persist for
>>>> the lifetime of processing of one partition in one mapPartition? The latter
>>>> is quite easy and might give most of the speedup.
>>>>
>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>> times in a repeated iterative computation. It would certainly avoid
>>>> managing a lot of complexity in trying to keep that state alive remotely
>>>> across operations. I'd also be interested if there is any reliable way to
>>>> do that, though it seems hard since it means you embed assumptions about
>>>> where particular data is going to be processed.
>>>>
>>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Tom Vacek <mi...@gmail.com>.
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is
large and doesn't change.  I think this is a very good solution to what is
being asked for.


On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell <ia...@ianoconnell.com> wrote:

> A mutable map in an object should do what your looking for then I believe.
> You just reference the object as an object in your closure so it won't be
> swept up when your closure is serialized and you can reference variables of
> the object on the remote host then. e.g.:
>
> object MyObject {
>   val mmap = scala.collection.mutable.Map[Long, Long]()
> }
>
> rdd.map { ele =>
> MyObject.mmap.getOrElseUpdate(ele, 1L)
> ...
> }.map {ele =>
> require(MyObject.mmap(ele) == 1L)
>
> }.count
>
> Along with the data loss just be careful with thread safety and multiple
> threads/partitions on one host so the map should be viewed as shared
> amongst a larger space.
>
>
>
> Also with your exact description it sounds like your data should be
> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
> LastIterationSideValues)]
>
>
>
> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <codedeft@cs.stanford.edu
> > wrote:
>
>> In our case, we'd like to keep memory content from one iteration to the
>> next, and not just during a single mapPartition call because then we can do
>> more efficient computations using the values from the previous iteration.
>>
>> So essentially, we need to declare objects outside the scope of the
>> map/reduce calls (but residing in individual workers), then those can be
>> accessed from the map/reduce calls.
>>
>> We'd be making some assumptions as you said, such as - RDD partition is
>> statically located and can't move from worker to another worker unless the
>> worker crashes.
>>
>>
>>
>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>> codedeft@cs.stanford.edu> wrote:
>>>
>>>> Actually, I do not know how to do something like this or whether this
>>>> is possible - thus my suggestive statement.
>>>>
>>>> Can you already declare persistent memory objects per worker? I tried
>>>> something like constructing a singleton object within map functions, but
>>>> that didn't work as it seemed to actually serialize singletons and pass it
>>>> back and forth in a weird manner.
>>>>
>>>>
>>> Does it need to be persistent across operations, or just persist for the
>>> lifetime of processing of one partition in one mapPartition? The latter is
>>> quite easy and might give most of the speedup.
>>>
>>> Maybe that's 'enough', even if it means you re-cache values several
>>> times in a repeated iterative computation. It would certainly avoid
>>> managing a lot of complexity in trying to keep that state alive remotely
>>> across operations. I'd also be interested if there is any reliable way to
>>> do that, though it seems hard since it means you embed assumptions about
>>> where particular data is going to be processed.
>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Tom Vacek <mi...@gmail.com>.
If you create your auxiliary RDD as a map from the examples, the
partitioning will be inherited.


On Mon, Apr 28, 2014 at 12:38 PM, Sung Hwan Chung
<co...@cs.stanford.edu>wrote:

> That might be a good alternative to what we are looking for. But I wonder
> if this would be as efficient as we want to. For instance, will RDDs of the
> same size usually get partitioned to the same machines - thus not
> triggering any cross machine aligning, etc. We'll explore it, but I would
> still very much like to see more direct worker memory management besides
> RDDs.
>
>
> On Mon, Apr 28, 2014 at 10:26 AM, Tom Vacek <mi...@gmail.com>wrote:
>
>> Right---They are zipped at each iteration.
>>
>>
>> On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen <ch...@yahoo.com>wrote:
>>
>>> Tom,
>>>     Are you suggesting two RDDs, one with loss and another for the rest
>>> info, using zip to tie them together, but do update on loss RDD (copy) ?
>>>
>>> Chester
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 28, 2014, at 9:45 AM, Tom Vacek <mi...@gmail.com> wrote:
>>>
>>> I'm not sure what I said came through.  RDD zip is not hacky at all, as
>>> it only depends on a user not changing the partitioning.  Basically, you
>>> would keep your losses as an RDD[Double] and zip whose with the RDD of
>>> examples, and update the losses.  You're doing a copy (and GC) on the RDD
>>> of losses each time, but this is negligible.
>>>
>>>
>>> On Mon, Apr 28, 2014 at 11:33 AM, Sung Hwan Chung <
>>> codedeft@cs.stanford.edu> wrote:
>>>
>>>> Yes, this is what we've done as of now (if you read earlier threads).
>>>> And we were saying that we'd prefer if Spark supported persistent worker
>>>> memory management in a little bit less hacky way ;)
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <ia...@ianoconnell.com>wrote:
>>>>
>>>>> A mutable map in an object should do what your looking for then I
>>>>> believe. You just reference the object as an object in your closure so it
>>>>> won't be swept up when your closure is serialized and you can reference
>>>>> variables of the object on the remote host then. e.g.:
>>>>>
>>>>> object MyObject {
>>>>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>>>>> }
>>>>>
>>>>> rdd.map { ele =>
>>>>> MyObject.mmap.getOrElseUpdate(ele, 1L)
>>>>> ...
>>>>> }.map {ele =>
>>>>> require(MyObject.mmap(ele) == 1L)
>>>>>
>>>>> }.count
>>>>>
>>>>> Along with the data loss just be careful with thread safety and
>>>>> multiple threads/partitions on one host so the map should be viewed as
>>>>> shared amongst a larger space.
>>>>>
>>>>>
>>>>>
>>>>> Also with your exact description it sounds like your data should be
>>>>> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
>>>>> LastIterationSideValues)]
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>>>>> codedeft@cs.stanford.edu> wrote:
>>>>>
>>>>>> In our case, we'd like to keep memory content from one iteration to
>>>>>> the next, and not just during a single mapPartition call because then we
>>>>>> can do more efficient computations using the values from the previous
>>>>>> iteration.
>>>>>>
>>>>>> So essentially, we need to declare objects outside the scope of the
>>>>>> map/reduce calls (but residing in individual workers), then those can be
>>>>>> accessed from the map/reduce calls.
>>>>>>
>>>>>> We'd be making some assumptions as you said, such as - RDD partition
>>>>>> is statically located and can't move from worker to another worker unless
>>>>>> the worker crashes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com>wrote:
>>>>>>
>>>>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>>>>> codedeft@cs.stanford.edu> wrote:
>>>>>>>
>>>>>>>> Actually, I do not know how to do something like this or whether
>>>>>>>> this is possible - thus my suggestive statement.
>>>>>>>>
>>>>>>>> Can you already declare persistent memory objects per worker? I
>>>>>>>> tried something like constructing a singleton object within map functions,
>>>>>>>> but that didn't work as it seemed to actually serialize singletons and pass
>>>>>>>> it back and forth in a weird manner.
>>>>>>>>
>>>>>>>>
>>>>>>> Does it need to be persistent across operations, or just persist for
>>>>>>> the lifetime of processing of one partition in one mapPartition? The latter
>>>>>>> is quite easy and might give most of the speedup.
>>>>>>>
>>>>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>>>>> times in a repeated iterative computation. It would certainly avoid
>>>>>>> managing a lot of complexity in trying to keep that state alive remotely
>>>>>>> across operations. I'd also be interested if there is any reliable way to
>>>>>>> do that, though it seems hard since it means you embed assumptions about
>>>>>>> where particular data is going to be processed.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Sung Hwan Chung <co...@cs.stanford.edu>.
That might be a good alternative to what we are looking for. But I wonder
if this would be as efficient as we want it to be. For instance, will RDDs of
the same size usually get partitioned to the same machines, thus not
triggering any cross-machine alignment, etc.? We'll explore it, but I would
still very much like to see more direct worker memory management besides
RDDs.


On Mon, Apr 28, 2014 at 10:26 AM, Tom Vacek <mi...@gmail.com> wrote:

> Right---They are zipped at each iteration.
>
>
> On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen <ch...@yahoo.com>wrote:
>
>> Tom,
>>     Are you suggesting two RDDs, one with loss and another for the rest
>> info, using zip to tie them together, but do update on loss RDD (copy) ?
>>
>> Chester
>>
>> Sent from my iPhone
>>
>> On Apr 28, 2014, at 9:45 AM, Tom Vacek <mi...@gmail.com> wrote:
>>
>> I'm not sure what I said came through.  RDD zip is not hacky at all, as
>> it only depends on a user not changing the partitioning.  Basically, you
>> would keep your losses as an RDD[Double] and zip whose with the RDD of
>> examples, and update the losses.  You're doing a copy (and GC) on the RDD
>> of losses each time, but this is negligible.
>>
>>
>> On Mon, Apr 28, 2014 at 11:33 AM, Sung Hwan Chung <
>> codedeft@cs.stanford.edu> wrote:
>>
>>> Yes, this is what we've done as of now (if you read earlier threads).
>>> And we were saying that we'd prefer if Spark supported persistent worker
>>> memory management in a little bit less hacky way ;)
>>>
>>>
>>> On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <ia...@ianoconnell.com>wrote:
>>>
>>>> A mutable map in an object should do what your looking for then I
>>>> believe. You just reference the object as an object in your closure so it
>>>> won't be swept up when your closure is serialized and you can reference
>>>> variables of the object on the remote host then. e.g.:
>>>>
>>>> object MyObject {
>>>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>>>> }
>>>>
>>>> rdd.map { ele =>
>>>> MyObject.mmap.getOrElseUpdate(ele, 1L)
>>>> ...
>>>> }.map {ele =>
>>>> require(MyObject.mmap(ele) == 1L)
>>>>
>>>> }.count
>>>>
>>>> Along with the data loss just be careful with thread safety and
>>>> multiple threads/partitions on one host so the map should be viewed as
>>>> shared amongst a larger space.
>>>>
>>>>
>>>>
>>>> Also with your exact description it sounds like your data should be
>>>> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
>>>> LastIterationSideValues)]
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>>>> codedeft@cs.stanford.edu> wrote:
>>>>
>>>>> In our case, we'd like to keep memory content from one iteration to
>>>>> the next, and not just during a single mapPartition call because then we
>>>>> can do more efficient computations using the values from the previous
>>>>> iteration.
>>>>>
>>>>> So essentially, we need to declare objects outside the scope of the
>>>>> map/reduce calls (but residing in individual workers), then those can be
>>>>> accessed from the map/reduce calls.
>>>>>
>>>>> We'd be making some assumptions as you said, such as - RDD partition
>>>>> is statically located and can't move from worker to another worker unless
>>>>> the worker crashes.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>>>> codedeft@cs.stanford.edu> wrote:
>>>>>>
>>>>>>> Actually, I do not know how to do something like this or whether
>>>>>>> this is possible - thus my suggestive statement.
>>>>>>>
>>>>>>> Can you already declare persistent memory objects per worker? I
>>>>>>> tried something like constructing a singleton object within map functions,
>>>>>>> but that didn't work as it seemed to actually serialize singletons and pass
>>>>>>> it back and forth in a weird manner.
>>>>>>>
>>>>>>>
>>>>>> Does it need to be persistent across operations, or just persist for
>>>>>> the lifetime of processing of one partition in one mapPartition? The latter
>>>>>> is quite easy and might give most of the speedup.
>>>>>>
>>>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>>>> times in a repeated iterative computation. It would certainly avoid
>>>>>> managing a lot of complexity in trying to keep that state alive remotely
>>>>>> across operations. I'd also be interested if there is any reliable way to
>>>>>> do that, though it seems hard since it means you embed assumptions about
>>>>>> where particular data is going to be processed.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Tom Vacek <mi...@gmail.com>.
Right---They are zipped at each iteration.


On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen <ch...@yahoo.com>wrote:

> Tom,
>     Are you suggesting two RDDs, one with loss and another for the rest
> info, using zip to tie them together, but do update on loss RDD (copy) ?
>
> Chester
>
> Sent from my iPhone
>
> On Apr 28, 2014, at 9:45 AM, Tom Vacek <mi...@gmail.com> wrote:
>
> I'm not sure what I said came through.  RDD zip is not hacky at all, as it
> only depends on a user not changing the partitioning.  Basically, you would
> keep your losses as an RDD[Double] and zip whose with the RDD of examples,
> and update the losses.  You're doing a copy (and GC) on the RDD of losses
> each time, but this is negligible.
>
>
> On Mon, Apr 28, 2014 at 11:33 AM, Sung Hwan Chung <
> codedeft@cs.stanford.edu> wrote:
>
>> Yes, this is what we've done as of now (if you read earlier threads). And
>> we were saying that we'd prefer if Spark supported persistent worker memory
>> management in a little bit less hacky way ;)
>>
>>
>> On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <ia...@ianoconnell.com>wrote:
>>
>>> A mutable map in an object should do what your looking for then I
>>> believe. You just reference the object as an object in your closure so it
>>> won't be swept up when your closure is serialized and you can reference
>>> variables of the object on the remote host then. e.g.:
>>>
>>> object MyObject {
>>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>>> }
>>>
>>> rdd.map { ele =>
>>> MyObject.mmap.getOrElseUpdate(ele, 1L)
>>> ...
>>> }.map {ele =>
>>> require(MyObject.mmap(ele) == 1L)
>>>
>>> }.count
>>>
>>> Along with the data loss just be careful with thread safety and multiple
>>> threads/partitions on one host so the map should be viewed as shared
>>> amongst a larger space.
>>>
>>>
>>>
>>> Also with your exact description it sounds like your data should be
>>> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
>>> LastIterationSideValues)]
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>>> codedeft@cs.stanford.edu> wrote:
>>>
>>>> In our case, we'd like to keep memory content from one iteration to the
>>>> next, and not just during a single mapPartition call because then we can do
>>>> more efficient computations using the values from the previous iteration.
>>>>
>>>> So essentially, we need to declare objects outside the scope of the
>>>> map/reduce calls (but residing in individual workers), then those can be
>>>> accessed from the map/reduce calls.
>>>>
>>>> We'd be making some assumptions as you said, such as - RDD partition is
>>>> statically located and can't move from worker to another worker unless the
>>>> worker crashes.
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>>> codedeft@cs.stanford.edu> wrote:
>>>>>
>>>>>> Actually, I do not know how to do something like this or whether this
>>>>>> is possible - thus my suggestive statement.
>>>>>>
>>>>>> Can you already declare persistent memory objects per worker? I tried
>>>>>> something like constructing a singleton object within map functions, but
>>>>>> that didn't work as it seemed to actually serialize singletons and pass it
>>>>>> back and forth in a weird manner.
>>>>>>
>>>>>>
>>>>> Does it need to be persistent across operations, or just persist for
>>>>> the lifetime of processing of one partition in one mapPartition? The latter
>>>>> is quite easy and might give most of the speedup.
>>>>>
>>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>>> times in a repeated iterative computation. It would certainly avoid
>>>>> managing a lot of complexity in trying to keep that state alive remotely
>>>>> across operations. I'd also be interested if there is any reliable way to
>>>>> do that, though it seems hard since it means you embed assumptions about
>>>>> where particular data is going to be processed.
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Chester Chen <ch...@yahoo.com>.
Tom, 
    Are you suggesting two RDDs, one with the loss and another for the rest of the info, using zip to tie them together, but doing updates on the loss RDD (a copy)?

Chester

Sent from my iPhone

On Apr 28, 2014, at 9:45 AM, Tom Vacek <mi...@gmail.com> wrote:

> I'm not sure what I said came through.  RDD zip is not hacky at all, as it only depends on a user not changing the partitioning.  Basically, you would keep your losses as an RDD[Double] and zip whose with the RDD of examples, and update the losses.  You're doing a copy (and GC) on the RDD of losses each time, but this is negligible.
> 
> 
> On Mon, Apr 28, 2014 at 11:33 AM, Sung Hwan Chung <co...@cs.stanford.edu> wrote:
>> Yes, this is what we've done as of now (if you read earlier threads). And we were saying that we'd prefer if Spark supported persistent worker memory management in a little bit less hacky way ;)
>> 
>> 
>> On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <ia...@ianoconnell.com> wrote:
>>> A mutable map in an object should do what your looking for then I believe. You just reference the object as an object in your closure so it won't be swept up when your closure is serialized and you can reference variables of the object on the remote host then. e.g.:
>>> 
>>> object MyObject {
>>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>>> }
>>> 
>>> rdd.map { ele =>
>>> MyObject.mmap.getOrElseUpdate(ele, 1L)
>>> ...
>>> }.map {ele =>
>>> require(MyObject.mmap(ele) == 1L)
>>> 
>>> }.count
>>> 
>>> Along with the data loss just be careful with thread safety and multiple threads/partitions on one host so the map should be viewed as shared amongst a larger space. 
>>> 
>>> 
>>> 
>>> Also with your exact description it sounds like your data should be encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData, LastIterationSideValues)] 
>>> 
>>> 
>>> 
>>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <co...@cs.stanford.edu> wrote:
>>>> In our case, we'd like to keep memory content from one iteration to the next, and not just during a single mapPartition call because then we can do more efficient computations using the values from the previous iteration.
>>>> 
>>>> So essentially, we need to declare objects outside the scope of the map/reduce calls (but residing in individual workers), then those can be accessed from the map/reduce calls.
>>>> 
>>>> We'd be making some assumptions as you said, such as - RDD partition is statically located and can't move from worker to another worker unless the worker crashes.
>>>> 
>>>> 
>>>> 
>>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <co...@cs.stanford.edu> wrote:
>>>>>> Actually, I do not know how to do something like this or whether this is possible - thus my suggestive statement.
>>>>>> 
>>>>>> Can you already declare persistent memory objects per worker? I tried something like constructing a singleton object within map functions, but that didn't work as it seemed to actually serialize singletons and pass it back and forth in a weird manner.
>>>>> 
>>>>> Does it need to be persistent across operations, or just persist for the lifetime of processing of one partition in one mapPartition? The latter is quite easy and might give most of the speedup.
>>>>> 
>>>>> Maybe that's 'enough', even if it means you re-cache values several times in a repeated iterative computation. It would certainly avoid managing a lot of complexity in trying to keep that state alive remotely across operations. I'd also be interested if there is any reliable way to do that, though it seems hard since it means you embed assumptions about where particular data is going to be processed.
> 

Re: is it okay to reuse objects across RDD's?

Posted by Tom Vacek <mi...@gmail.com>.
I'm not sure what I said came through.  RDD zip is not hacky at all, as it
only depends on a user not changing the partitioning.  Basically, you would
keep your losses as an RDD[Double] and zip those with the RDD of examples,
and update the losses.  You're doing a copy (and GC) on the RDD of losses
each time, but this is negligible.
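
A minimal sketch of that zipping pattern, under stated assumptions: MyBaseData
and the loss update here are made-up placeholders (not code from this thread),
and sc is the usual SparkContext.

import org.apache.spark.rdd.RDD

case class MyBaseData(features: Array[Double])         // large and never changes

val examples: RDD[MyBaseData] =
  sc.parallelize(Seq.fill(1000)(MyBaseData(Array(1.0, 2.0)))).cache()

// small side RDD with the same partitioning as examples
var losses: RDD[Double] = examples.map(_ => 0.0)

for (iter <- 1 to 10) {
  // zip is valid because neither side has been repartitioned; only the small
  // losses RDD is copied (and garbage collected) on each iteration
  losses = examples.zip(losses).map { case (ex, prev) => prev + ex.features.sum }.cache()
}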


On Mon, Apr 28, 2014 at 11:33 AM, Sung Hwan Chung
<co...@cs.stanford.edu>wrote:

> Yes, this is what we've done as of now (if you read earlier threads). And
> we were saying that we'd prefer if Spark supported persistent worker memory
> management in a little bit less hacky way ;)
>
>
> On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <ia...@ianoconnell.com>wrote:
>
>> A mutable map in an object should do what your looking for then I
>> believe. You just reference the object as an object in your closure so it
>> won't be swept up when your closure is serialized and you can reference
>> variables of the object on the remote host then. e.g.:
>>
>> object MyObject {
>>   val mmap = scala.collection.mutable.Map[Long, Long]()
>> }
>>
>> rdd.map { ele =>
>> MyObject.mmap.getOrElseUpdate(ele, 1L)
>> ...
>> }.map {ele =>
>> require(MyObject.mmap(ele) == 1L)
>>
>> }.count
>>
>> Along with the data loss just be careful with thread safety and multiple
>> threads/partitions on one host so the map should be viewed as shared
>> amongst a larger space.
>>
>>
>>
>> Also with your exact description it sounds like your data should be
>> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
>> LastIterationSideValues)]
>>
>>
>>
>> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <
>> codedeft@cs.stanford.edu> wrote:
>>
>>> In our case, we'd like to keep memory content from one iteration to the
>>> next, and not just during a single mapPartition call because then we can do
>>> more efficient computations using the values from the previous iteration.
>>>
>>> So essentially, we need to declare objects outside the scope of the
>>> map/reduce calls (but residing in individual workers), then those can be
>>> accessed from the map/reduce calls.
>>>
>>> We'd be making some assumptions as you said, such as - RDD partition is
>>> statically located and can't move from worker to another worker unless the
>>> worker crashes.
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>>> codedeft@cs.stanford.edu> wrote:
>>>>
>>>>> Actually, I do not know how to do something like this or whether this
>>>>> is possible - thus my suggestive statement.
>>>>>
>>>>> Can you already declare persistent memory objects per worker? I tried
>>>>> something like constructing a singleton object within map functions, but
>>>>> that didn't work as it seemed to actually serialize singletons and pass it
>>>>> back and forth in a weird manner.
>>>>>
>>>>>
>>>> Does it need to be persistent across operations, or just persist for
>>>> the lifetime of processing of one partition in one mapPartition? The latter
>>>> is quite easy and might give most of the speedup.
>>>>
>>>> Maybe that's 'enough', even if it means you re-cache values several
>>>> times in a repeated iterative computation. It would certainly avoid
>>>> managing a lot of complexity in trying to keep that state alive remotely
>>>> across operations. I'd also be interested if there is any reliable way to
>>>> do that, though it seems hard since it means you embed assumptions about
>>>> where particular data is going to be processed.
>>>>
>>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Sung Hwan Chung <co...@cs.stanford.edu>.
Yes, this is what we've done as of now (if you read earlier threads). And
we were saying that we'd prefer if Spark supported persistent worker memory
management in a little bit less hacky way ;)


On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <ia...@ianoconnell.com> wrote:

> A mutable map in an object should do what your looking for then I believe.
> You just reference the object as an object in your closure so it won't be
> swept up when your closure is serialized and you can reference variables of
> the object on the remote host then. e.g.:
>
> object MyObject {
>   val mmap = scala.collection.mutable.Map[Long, Long]()
> }
>
> rdd.map { ele =>
> MyObject.mmap.getOrElseUpdate(ele, 1L)
> ...
> }.map {ele =>
> require(MyObject.mmap(ele) == 1L)
>
> }.count
>
> Along with the data loss just be careful with thread safety and multiple
> threads/partitions on one host so the map should be viewed as shared
> amongst a larger space.
>
>
>
> Also with your exact description it sounds like your data should be
> encoded into the RDD if its per-record/per-row:  RDD[(MyBaseData,
> LastIterationSideValues)]
>
>
>
> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <codedeft@cs.stanford.edu
> > wrote:
>
>> In our case, we'd like to keep memory content from one iteration to the
>> next, and not just during a single mapPartition call because then we can do
>> more efficient computations using the values from the previous iteration.
>>
>> So essentially, we need to declare objects outside the scope of the
>> map/reduce calls (but residing in individual workers), then those can be
>> accessed from the map/reduce calls.
>>
>> We'd be making some assumptions as you said, such as - RDD partition is
>> statically located and can't move from worker to another worker unless the
>> worker crashes.
>>
>>
>>
>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>>> codedeft@cs.stanford.edu> wrote:
>>>
>>>> Actually, I do not know how to do something like this or whether this
>>>> is possible - thus my suggestive statement.
>>>>
>>>> Can you already declare persistent memory objects per worker? I tried
>>>> something like constructing a singleton object within map functions, but
>>>> that didn't work as it seemed to actually serialize singletons and pass it
>>>> back and forth in a weird manner.
>>>>
>>>>
>>> Does it need to be persistent across operations, or just persist for the
>>> lifetime of processing of one partition in one mapPartition? The latter is
>>> quite easy and might give most of the speedup.
>>>
>>> Maybe that's 'enough', even if it means you re-cache values several
>>> times in a repeated iterative computation. It would certainly avoid
>>> managing a lot of complexity in trying to keep that state alive remotely
>>> across operations. I'd also be interested if there is any reliable way to
>>> do that, though it seems hard since it means you embed assumptions about
>>> where particular data is going to be processed.
>>>
>>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Ian O'Connell <ia...@ianoconnell.com>.
A mutable map in an object should do what you're looking for then, I believe.
You just reference the object as an object in your closure, so it won't be
swept up when your closure is serialized, and you can then reference variables
of the object on the remote host. e.g.:

object MyObject {
  val mmap = scala.collection.mutable.Map[Long, Long]()
}

rdd.map { ele =>
  MyObject.mmap.getOrElseUpdate(ele, 1L)
  ...
}.map { ele =>
  require(MyObject.mmap(ele) == 1L)
}.count

Along with the data loss, just be careful with thread safety: multiple
threads/partitions can run on one host, so the map should be viewed as
shared amongst a larger space.
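
For example, one way to make that shared map safe when several tasks run at
once in the same executor JVM is to use a concurrent map instead (a sketch,
assuming Long keys as above):

object MyObject {
  // TrieMap is a concurrent map; putIfAbsent is atomic, unlike
  // getOrElseUpdate on a plain scala.collection.mutable.Map
  val mmap = scala.collection.concurrent.TrieMap[Long, Long]()
}

rdd.map { ele =>
  MyObject.mmap.putIfAbsent(ele, 1L)
  ele
}.count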



Also, from your exact description it sounds like your data should be
encoded into the RDD if it's per-record/per-row:  RDD[(MyBaseData,
LastIterationSideValues)]



On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung
<co...@cs.stanford.edu>wrote:

> In our case, we'd like to keep memory content from one iteration to the
> next, and not just during a single mapPartition call because then we can do
> more efficient computations using the values from the previous iteration.
>
> So essentially, we need to declare objects outside the scope of the
> map/reduce calls (but residing in individual workers), then those can be
> accessed from the map/reduce calls.
>
> We'd be making some assumptions as you said, such as - RDD partition is
> statically located and can't move from worker to another worker unless the
> worker crashes.
>
>
>
> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>> codedeft@cs.stanford.edu> wrote:
>>
>>> Actually, I do not know how to do something like this or whether this is
>>> possible - thus my suggestive statement.
>>>
>>> Can you already declare persistent memory objects per worker? I tried
>>> something like constructing a singleton object within map functions, but
>>> that didn't work as it seemed to actually serialize singletons and pass it
>>> back and forth in a weird manner.
>>>
>>>
>> Does it need to be persistent across operations, or just persist for the
>> lifetime of processing of one partition in one mapPartition? The latter is
>> quite easy and might give most of the speedup.
>>
>> Maybe that's 'enough', even if it means you re-cache values several times
>> in a repeated iterative computation. It would certainly avoid managing a
>> lot of complexity in trying to keep that state alive remotely across
>> operations. I'd also be interested if there is any reliable way to do that,
>> though it seems hard since it means you embed assumptions about where
>> particular data is going to be processed.
>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by Sung Hwan Chung <co...@cs.stanford.edu>.
In our case, we'd like to keep memory content from one iteration to the
next, and not just during a single mapPartition call, because then we can do
more efficient computations using the values from the previous iteration.

So essentially, we need to declare objects outside the scope of the
map/reduce calls (but residing in individual workers), so that they can be
accessed from the map/reduce calls.

We'd be making some assumptions, as you said, such as: an RDD partition is
statically located and can't move from one worker to another unless the
worker crashes.



On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:

> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <codedeft@cs.stanford.edu
> > wrote:
>
>> Actually, I do not know how to do something like this or whether this is
>> possible - thus my suggestive statement.
>>
>> Can you already declare persistent memory objects per worker? I tried
>> something like constructing a singleton object within map functions, but
>> that didn't work as it seemed to actually serialize singletons and pass it
>> back and forth in a weird manner.
>>
>>
> Does it need to be persistent across operations, or just persist for the
> lifetime of processing of one partition in one mapPartition? The latter is
> quite easy and might give most of the speedup.
>
> Maybe that's 'enough', even if it means you re-cache values several times
> in a repeated iterative computation. It would certainly avoid managing a
> lot of complexity in trying to keep that state alive remotely across
> operations. I'd also be interested if there is any reliable way to do that,
> though it seems hard since it means you embed assumptions about where
> particular data is going to be processed.
>
>

Re: is it okay to reuse objects across RDD's?

Posted by Sean Owen <so...@cloudera.com>.
On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung
<co...@cs.stanford.edu>wrote:

> Actually, I do not know how to do something like this or whether this is
> possible - thus my suggestive statement.
>
> Can you already declare persistent memory objects per worker? I tried
> something like constructing a singleton object within map functions, but
> that didn't work as it seemed to actually serialize singletons and pass it
> back and forth in a weird manner.
>
>
Does it need to be persistent across operations, or just persist for the
lifetime of processing of one partition in one mapPartition? The latter is
quite easy and might give most of the speedup.

Maybe that's 'enough', even if it means you re-cache values several times
in a repeated iterative computation. It would certainly avoid managing a
lot of complexity in trying to keep that state alive remotely across
operations. I'd also be interested if there is any reliable way to do that,
though it seems hard since it means you embed assumptions about where
particular data is going to be processed.
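
For example, the "latter" case (state that only lives while one partition is
being processed) can be as simple as this sketch, where rdd and the lookup are
illustrative and sc is the usual SparkContext:

val rdd = sc.parallelize(Seq("a", "bb", "a"))

val result = rdd.mapPartitions { rows =>
  // built once per task and garbage collected when the partition is done
  val cache = scala.collection.mutable.Map[String, Double]()
  rows.map { row =>
    // compute once per distinct key within this partition, then reuse it
    cache.getOrElseUpdate(row, row.length.toDouble /* stand-in for real work */)
  }
}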

Re: is it okay to reuse objects across RDD's?

Posted by Sung Hwan Chung <co...@cs.stanford.edu>.
Actually, I do not know how to do something like this or whether this is
possible - thus my suggestive statement.

Can you already declare persistent memory objects per worker? I tried
something like constructing a singleton object within map functions, but
that didn't work, as it seemed to actually serialize the singletons and pass
them back and forth in a weird manner.


On Mon, Apr 28, 2014 at 1:23 AM, Sean Owen <so...@cloudera.com> wrote:

> On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung <codedeft@cs.stanford.edu
> > wrote:
>>
>> e.g. something like
>>
>> rdd.mapPartition((rows : Iterator[String]) => {
>>   var idx = 0
>>   rows.map((row: String) => {
>>     val valueMap = SparkWorker.getMemoryContent("valMap")
>>     val prevVal = valueMap(idx)
>>     idx += 1
>>     ...
>>   })
>>   ...
>> })
>>
>> The developer can implement their own fault recovery mechanism if the
>> worker has crashed and lost the memory content.
>>
>
> Yea you can always just declare your own per-partition data structures in
> a function block like that, right? valueMap can be initialized to an empty
> map, loaded from somewhere, or even a value that is broadcast from the
> driver.
>
> That's certainly better than tacking data onto RDDs.
>
> It's not restored if the computation is lost of course, but in this and
> many other cases, it's fine, as it is just for some cached intermediate
> results.
>
> This already works then or did I misunderstand the original use case?
>

Re: is it okay to reuse objects across RDD's?

Posted by Sean Owen <so...@cloudera.com>.
On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung
<co...@cs.stanford.edu>wrote:
>
> e.g. something like
>
> rdd.mapPartition((rows : Iterator[String]) => {
>   var idx = 0
>   rows.map((row: String) => {
>     val valueMap = SparkWorker.getMemoryContent("valMap")
>     val prevVal = valueMap(idx)
>     idx += 1
>     ...
>   })
>   ...
> })
>
> The developer can implement their own fault recovery mechanism if the
> worker has crashed and lost the memory content.
>

Yea you can always just declare your own per-partition data structures in a
function block like that, right? valueMap can be initialized to an empty
map, loaded from somewhere, or even a value that is broadcast from the
driver.

That's certainly better than tacking data onto RDDs.

It's not restored if the computation is lost of course, but in this and
many other cases, it's fine, as it is just for some cached intermediate
results.

This already works then or did I misunderstand the original use case?
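
A sketch of that last point, with the per-partition valueMap seeded from a
broadcast variable. The names and values are illustrative, and as noted the
map is rebuilt rather than restored if a task is re-run:

val initial = Map("a" -> 0.0, "b" -> 0.0)        // e.g. loaded on the driver
val bcast = sc.broadcast(initial)
val rdd = sc.parallelize(Seq("a", "b", "a"))

val updated = rdd.mapPartitions { rows =>
  // per-partition working copy, seeded from the broadcast value
  val valueMap = scala.collection.mutable.Map[String, Double]() ++ bcast.value
  rows.map { row =>
    val prevVal = valueMap.getOrElse(row, 0.0)
    valueMap(row) = prevVal + 1.0                // illustrative update
    (row, prevVal)
  }
}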

Re: is it okay to reuse objects across RDD's?

Posted by Sung Hwan Chung <co...@cs.stanford.edu>.
Yes, this is a useful trick we found that made our algorithm implementation
noticeably faster (btw, we'll send a pull request for this GLMNET
implementation, so interested people could look at it).

It would be nice if Spark supported something akin to this natively, as I
believe that many efficient algorithms could take advantage of this.

Basically, we don't even really need a mutable RDD. Instead, what we really
need is the ability to store/modify things in workers' memory and access
them in subsequent iterations.

e.g. something like

rdd.mapPartition((rows : Iterator[String]) => {
  var idx = 0
  rows.map((row: String) => {
    val valueMap = SparkWorker.getMemoryContent("valMap")
    val prevVal = valueMap(idx)
    idx += 1
    ...
  })
  ...
})

The developer can implement their own fault recovery mechanism if the
worker has crashed and lost the memory content.
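
The SparkWorker.getMemoryContent call above is the hypothetical API being
asked for. One rough approximation today, in the spirit of the singleton
object suggestion elsewhere in this thread, is to key worker-local state by
partition index in a plain Scala object. WorkerMemory and the types are
illustrative, the state is lost if the executor dies, and this ignores
speculative execution:

object WorkerMemory {
  // lives in the executor JVM for the lifetime of the application
  val valMap = scala.collection.concurrent.TrieMap[Int, scala.collection.mutable.Map[Long, Double]]()
}

val rdd = sc.parallelize(Seq(1L -> 1.0, 2L -> 2.0))

val withPrev = rdd.mapPartitionsWithIndex { (pid, rows) =>
  val state = WorkerMemory.valMap.getOrElseUpdate(pid, scala.collection.mutable.Map[Long, Double]())
  rows.map { case (id, value) =>
    val prevVal = state.getOrElse(id, Double.NaN)  // NaN: never computed, or lost
    state(id) = value                              // remember for the next job
    (id, value, prevVal)
  }
}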


On Sun, Apr 27, 2014 at 10:24 PM, DB Tsai <db...@stanford.edu> wrote:

> Hi Todd,
>
> As Patrick and you already pointed out, it's really dangerous to mutate
> the status of RDD. However, when we implement the glmnet in Spark, if we
> can reuse the residuals for each row in RDD computed from the previous
> step, it can speed up 4~5x.
>
> As a result, we add extra column in RDD for book-keeping the residual for
> each row, and initialize it as NaN first. When the next iteration step find
> that the residual for that row is NaN, it means that either the RDD is
> ended up in the disk or the job is failed, so we recompute the residuals
> for those rows. It solves the problem of fault tolerance and data splitting
> to disk.
>
> It will be nice to have an API that we can do this type of book-keeping
> with native support.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sat, Apr 26, 2014 at 11:22 PM, Patrick Wendell <pw...@gmail.com>wrote:
>
>> Hey Todd,
>>
>> This approach violates the normal semantics of RDD transformations as you
>> point out. I think you pointed out some issues already, and there are
>> others. For instance say you cache originalRDD and some of the partitions
>> end up in memory and others end up on disk. The ones that end up in memory
>> will be mutated in-place when you create trasnformedRDD, the ones that are
>> serialized disk won't actually be changed (because there will be a copy
>> into memory from the serialized on-disk data). So you could end up where
>> originalRDD is partially mutated.
>>
>> Also, in the case of failures your map might run twice (e.g. run
>> partially once, fail, then get re-run and succeed). So if your mutation
>> e.g. relied on the current state of the object, it could end up having
>> unexpected behavior.
>>
>> We'll probably never "disallow" this in Spark because we can't really
>> control what you do inside of the function. But I'd be careful using this
>> approach...
>>
>> - Patrick
>>
>>
>> On Sat, Apr 26, 2014 at 5:59 AM, Lisonbee, Todd <to...@intel.com>wrote:
>>
>>> For example,
>>>
>>> val originalRDD: RDD[SomeCaseClass] = ...
>>>
>>> // Option 1: objects are copied, setting prop1 in the process
>>> val transformedRDD = originalRDD.map( item => item.copy(prop1 =
>>> calculation() )
>>>
>>> // Option 2: objects are re-used and modified
>>> val tranformedRDD = originalRDD.map( item => item.prop1 = calculation() )
>>>
>>> I did a couple of small tests with option 2 and noticed less time was
>>> spent in garbage collection.  It didn't add up to much but with a large
>>> enough data set it would make a difference.  Also, it seems that less
>>> memory would be used.
>>>
>>> Potential gotchas:
>>>
>>> - Objects in originalRDD are being modified, so you can't expect them to
>>> have not changed
>>> - You also can't rely on objects in originalRDD having the new value
>>> because originalRDD might be re-caclulated
>>> - If originalRDD was a PairRDD, and you modified the keys, it could
>>> cause issues
>>> - more?
>>>
>>> Other than the potential gotchas, is there any reason not to reuse
>>> objects across RDD's?  Is it a recommended practice for reducing memory
>>> usage and garbage collection or not?
>>>
>>> Is it safe to do this in code you expect to work on future versions of
>>> Spark?
>>>
>>> Thanks in advance,
>>>
>>> Todd
>>>
>>
>>
>

Re: is it okay to reuse objects across RDD's?

Posted by DB Tsai <db...@stanford.edu>.
Hi Todd,

As Patrick and you already pointed out, it's really dangerous to mutate the
state of an RDD. However, when we implemented glmnet in Spark, we found that
if we can reuse the residuals for each row in the RDD computed in the
previous step, it can speed things up 4~5x.

As a result, we add an extra column to the RDD for book-keeping the residual
for each row, and initialize it to NaN first. When the next iteration step
finds that the residual for a row is NaN, it means that either the RDD ended
up on disk or the job failed, so we recompute the residuals for those rows.
This solves the problems of fault tolerance and data spilling to disk.

It would be nice to have an API with native support for this type of
book-keeping.
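
A rough sketch of that NaN book-keeping pattern, under stated assumptions:
Row, computeResidual, and the update rule are illustrative stand-ins, not the
actual glmnet code.

import org.apache.spark.rdd.RDD

case class Row(label: Double, features: Array[Double], var residual: Double = Double.NaN)

def computeResidual(label: Double, features: Array[Double]): Double =
  label - features.sum              // stand-in for the model's prediction error

def step(data: RDD[Row]): RDD[Row] = data.map { r =>
  // NaN marks a lost value: the partition spilled to disk or was recomputed
  if (r.residual.isNaN) r.residual = computeResidual(r.label, r.features)
  // otherwise the residual from the previous iteration is reused in place
  r
}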


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sat, Apr 26, 2014 at 11:22 PM, Patrick Wendell <pw...@gmail.com>wrote:

> Hey Todd,
>
> This approach violates the normal semantics of RDD transformations as you
> point out. I think you pointed out some issues already, and there are
> others. For instance say you cache originalRDD and some of the partitions
> end up in memory and others end up on disk. The ones that end up in memory
> will be mutated in-place when you create trasnformedRDD, the ones that are
> serialized disk won't actually be changed (because there will be a copy
> into memory from the serialized on-disk data). So you could end up where
> originalRDD is partially mutated.
>
> Also, in the case of failures your map might run twice (e.g. run partially
> once, fail, then get re-run and succeed). So if your mutation e.g. relied
> on the current state of the object, it could end up having unexpected
> behavior.
>
> We'll probably never "disallow" this in Spark because we can't really
> control what you do inside of the function. But I'd be careful using this
> approach...
>
> - Patrick
>
>
> On Sat, Apr 26, 2014 at 5:59 AM, Lisonbee, Todd <to...@intel.com>wrote:
>
>> For example,
>>
>> val originalRDD: RDD[SomeCaseClass] = ...
>>
>> // Option 1: objects are copied, setting prop1 in the process
>> val transformedRDD = originalRDD.map( item => item.copy(prop1 =
>> calculation() )
>>
>> // Option 2: objects are re-used and modified
>> val tranformedRDD = originalRDD.map( item => item.prop1 = calculation() )
>>
>> I did a couple of small tests with option 2 and noticed less time was
>> spent in garbage collection.  It didn't add up to much but with a large
>> enough data set it would make a difference.  Also, it seems that less
>> memory would be used.
>>
>> Potential gotchas:
>>
>> - Objects in originalRDD are being modified, so you can't expect them to
>> have not changed
>> - You also can't rely on objects in originalRDD having the new value
>> because originalRDD might be re-caclulated
>> - If originalRDD was a PairRDD, and you modified the keys, it could cause
>> issues
>> - more?
>>
>> Other than the potential gotchas, is there any reason not to reuse
>> objects across RDD's?  Is it a recommended practice for reducing memory
>> usage and garbage collection or not?
>>
>> Is it safe to do this in code you expect to work on future versions of
>> Spark?
>>
>> Thanks in advance,
>>
>> Todd
>>
>
>

Re: is it okay to reuse objects across RDD's?

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Todd,

This approach violates the normal semantics of RDD transformations, as you
point out. I think you pointed out some issues already, and there are others.
For instance, say you cache originalRDD and some of the partitions end up in
memory and others end up on disk. The ones that end up in memory will be
mutated in place when you create transformedRDD; the ones that are serialized
to disk won't actually be changed (because there will be a copy into memory
from the serialized on-disk data). So you could end up with originalRDD
partially mutated.

Also, in the case of failures your map might run twice (e.g. run partially
once, fail, then get re-run and succeed). So if your mutation relied on, say,
the current state of the object, it could end up having unexpected
behavior.
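
A small illustration of that hazard, with delta and f as placeholders and
prop1 assumed to be a var (as Option 2 requires):

// Depends on the current state of the object: if a failed attempt already
// bumped prop1 on the cached objects, the successful retry bumps it again.
val risky = originalRDD.map { item => item.prop1 += delta; item }

// Does not depend on prior mutations, so a retried task produces the same values.
val safer = originalRDD.map(item => item.copy(prop1 = f(item)))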

We'll probably never "disallow" this in Spark because we can't really
control what you do inside of the function. But I'd be careful using this
approach...

- Patrick


On Sat, Apr 26, 2014 at 5:59 AM, Lisonbee, Todd <to...@intel.com>wrote:

> For example,
>
> val originalRDD: RDD[SomeCaseClass] = ...
>
> // Option 1: objects are copied, setting prop1 in the process
> val transformedRDD = originalRDD.map( item => item.copy(prop1 =
> calculation() )
>
> // Option 2: objects are re-used and modified
> val tranformedRDD = originalRDD.map( item => item.prop1 = calculation() )
>
> I did a couple of small tests with option 2 and noticed less time was
> spent in garbage collection.  It didn't add up to much but with a large
> enough data set it would make a difference.  Also, it seems that less
> memory would be used.
>
> Potential gotchas:
>
> - Objects in originalRDD are being modified, so you can't expect them to
> have not changed
> - You also can't rely on objects in originalRDD having the new value
> because originalRDD might be re-caclulated
> - If originalRDD was a PairRDD, and you modified the keys, it could cause
> issues
> - more?
>
> Other than the potential gotchas, is there any reason not to reuse objects
> across RDD's?  Is it a recommended practice for reducing memory usage and
> garbage collection or not?
>
> Is it safe to do this in code you expect to work on future versions of
> Spark?
>
> Thanks in advance,
>
> Todd
>