Posted to user@hadoop.apache.org by Kyle Moses <km...@cs.duke.edu> on 2012/10/11 19:12:54 UTC

Distributed Cache For 100MB+ Data Structure

Problem Background:
I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide
auxiliary input during the reduce phase of the second job in its
workflow, but doesn't need the data at any other point.
It seems pretty straightforward to use the distributed cache to build
this data structure inside each reducer in the setup() method (roughly
sketched below).
This solution is functional, but it ends up using a large amount of
memory if I have 3 or more reducers running on the same node, and the
setup time of the radix tree is non-trivial.
Additionally, the IPv6 version of the structure is quite a bit larger in 
memory.
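
For reference, here is roughly what the current setup() approach looks
like. This is only a sketch: Ipv6RadixTree and its insert()/lookup()
methods stand in for my actual class, and "prefixes.txt" is a made-up
name for the cache file shipped with -files.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LookupReducer extends Reducer<Text, Text, Text, Text> {
  private Ipv6RadixTree tree;  // rebuilt from scratch in every reducer task

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Files shipped with -files are symlinked into the task working directory.
    tree = new Ipv6RadixTree();
    try (BufferedReader in = new BufferedReader(new FileReader("prefixes.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        tree.insert(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String match = tree.lookup(key.toString());  // auxiliary lookup per key
    for (Text value : values) {
      context.write(key, new Text(value + "\t" + match));
    }
  }
}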

Question:
Is there a "good" way to share this data structure across all reducers 
on the same node within the Hadoop framework?

Initial Thoughts:
It seems like this might be possible by altering the Task JVM Reuse
parameters, but from what I have read this would also affect map tasks,
and I'm concerned about drawbacks and side effects.
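
Concretely, the knob I mean is the MRv1 JVM reuse setting, combined with
a lazily built static field so that later tasks landing in the same JVM
can reuse the tree. This is just a sketch of the idea, continuing the
reducer sketched above, and buildTreeFromCacheFile() is a hypothetical
helper:

// In the driver (job is the org.apache.hadoop.mapreduce.Job being
// submitted): -1 lets a JVM be reused for an unlimited number of tasks
// from the same job (MRv1 / Hadoop 1.x setting).
job.getConfiguration().setInt("mapred.job.reuse.jvm.num.tasks", -1);

// In the reducer: build the tree once per JVM and keep it in a static
// field, so later tasks in a reused JVM skip the rebuild.
private static Ipv6RadixTree sharedTree;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  synchronized (LookupReducer.class) {
    if (sharedTree == null) {
      sharedTree = buildTreeFromCacheFile("prefixes.txt");
    }
  }
}

As far as I understand, concurrent reducers on a node still run in
separate JVMs even with reuse enabled, so this would only help tasks
that run back-to-back in the same slot.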

Thanks for your help!

Re: Distributed Cache For 100MB+ Data Structure

Posted by Michael Segel <mi...@hotmail.com>.
Build and store the tree in some sort of globally accessible space? 

Like HBase, or HDFS?
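
Something like this for the HBase route; the table and column names are
just placeholders, and note that a plain Get only gives exact-key
lookups, so longest-prefix matching would take more work on the key or
scan design:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Open the table once in setup(), then look values up per key in
// reduce() instead of holding the whole tree in each reducer's heap.
// ipv6Key is the String key being looked up.
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "ipv6_prefixes");
Result r = table.get(new Get(Bytes.toBytes(ipv6Key)));
byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"));
table.close();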

On Oct 13, 2012, at 9:46 AM, Kyle Moses <km...@cs.duke.edu> wrote:

> Chris,
> Thanks for the suggestion on serializing the radix tree and your thoughts on the memory issue.  I'm planning to test a few different solutions and will post another reply if the results prove interesting.
> 
> Kyle
> 
> On 10/11/2012 1:52 PM, Chris Nauroth wrote:
>> Hello Kyle,
>> 
>> Regarding the setup time of the radix tree, is it possible to precompute the radix tree before job submission time, then create a serialized representation (perhaps just Java object serialization), and send the serialized form through distributed cache?  Then, each reducer would just need to deserialize during setup() instead of recomputing the full radix tree for every reducer task.  That might save time.
>> 
>> Regarding the memory consumption, when I've run into a situation like this, I've generally solved it by caching the data in a separate process and using some kind of IPC from the reducers to access it.  memcache is one example, though that's probably not an ideal fit for this data structure.  I'm aware of no equivalent solution directly in Hadoop and would be curious to hear from others on the topic.
>> 
>> Thanks,
>> --Chris
>> 
>> On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <km...@cs.duke.edu> wrote:
>> Problem Background:
>> I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide auxiliary input during the reduce phase of the second job in its workflow, but doesn't need the data at any other point.
>> It seems pretty straightforward to use the distributed cache to build this data structure inside each reducer in the setup() method.
>> This solution is functional, but it ends up using a large amount of memory if I have 3 or more reducers running on the same node, and the setup time of the radix tree is non-trivial.
>> Additionally, the IPv6 version of the structure is quite a bit larger in memory.
>> 
>> Question:
>> Is there a "good" way to share this data structure across all reducers on the same node within the Hadoop framework?
>> 
>> Initial Thoughts:
>> It seems like this might be possible by altering the Task JVM Reuse parameters, but from what I have read this would also affect map tasks and I'm concerned about drawbacks/side-effects.
>> 
>> Thanks for your help!
>> 
> 


Re: Distributed Cache For 100MB+ Data Structure

Posted by Kyle Moses <km...@cs.duke.edu>.
Chris,
Thanks for the suggestion on serializing the radix tree and your 
thoughts on the memory issue.  I'm planning to test a few different 
solutions and will post another reply if the results prove interesting.

Kyle

On 10/11/2012 1:52 PM, Chris Nauroth wrote:
> Hello Kyle,
>
> Regarding the setup time of the radix tree, is it possible to 
> precompute the radix tree before job submission time, then create a 
> serialized representation (perhaps just Java object serialization), 
> and send the serialized form through distributed cache?  Then, each 
> reducer would just need to deserialize during setup() instead of 
> recomputing the full radix tree for every reducer task.  That might 
> save time.
>
> Regarding the memory consumption, when I've run into a situation like 
> this, I've generally solved it by caching the data in a separate 
> process and using some kind of IPC from the reducers to access it. 
>  memcache is one example, though that's probably not an ideal fit for 
> this data structure.  I'm aware of no equivalent solution directly in 
> Hadoop and would be curious to hear from others on the topic.
>
> Thanks,
> --Chris
>
> On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <kmoses@cs.duke.edu 
> <ma...@cs.duke.edu>> wrote:
>
>     Problem Background:
>     I have a Hadoop MapReduce program that uses an IPv6 radix tree to
>     provide auxiliary input during the reduce phase of the second job
>     in its workflow, but doesn't need the data at any other point.
>     It seems pretty straightforward to use the distributed cache to
>     build this data structure inside each reducer in the setup() method.
>     This solution is functional, but it ends up using a large amount of
>     memory if I have 3 or more reducers running on the same node, and
>     the setup time of the radix tree is non-trivial.
>     Additionally, the IPv6 version of the structure is quite a bit
>     larger in memory.
>
>     Question:
>     Is there a "good" way to share this data structure across all
>     reducers on the same node within the Hadoop framework?
>
>     Initial Thoughts:
>     It seems like this might be possible by altering the Task JVM
>     Reuse parameters, but from what I have read this would also affect
>     map tasks and I'm concerned about drawbacks/side-effects.
>
>     Thanks for your help!
>
>


Re: Distributed Cache For 100MB+ Data Structure

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Kyle,

Regarding the setup time of the radix tree, is it possible to precompute
the radix tree before job submission time, then create a serialized
representation (perhaps just Java object serialization), and send the
serialized form through distributed cache?  Then, each reducer would just
need to deserialize during setup() instead of recomputing the full radix
tree for every reducer task.  That might save time.
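
Roughly what I have in mind, assuming the tree class implements
java.io.Serializable; the class name, HDFS path, and "radix-tree.ser"
symlink name below are all placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// At job submission: build the tree once, serialize it to HDFS, and
// register it in the distributed cache under a symlink name.
void registerTree(Job job, Ipv6RadixTree tree) throws Exception {
  Path cached = new Path("/tmp/radix-tree.ser");
  FileSystem fs = FileSystem.get(job.getConfiguration());
  try (FSDataOutputStream out = fs.create(cached, true);
       ObjectOutputStream oos = new ObjectOutputStream(out)) {
    oos.writeObject(tree);
  }
  DistributedCache.addCacheFile(new URI(cached.toUri() + "#radix-tree.ser"),
                                job.getConfiguration());
  DistributedCache.createSymlink(job.getConfiguration());
}

// In Reducer.setup(): deserialize instead of rebuilding from raw input.
Ipv6RadixTree loadTree() throws IOException, ClassNotFoundException {
  try (ObjectInputStream in =
           new ObjectInputStream(new FileInputStream("radix-tree.ser"))) {
    return (Ipv6RadixTree) in.readObject();
  }
}

Whether this actually saves time depends on how deserialization compares
to rebuilding the tree, so it is worth measuring both.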

Regarding the memory consumption, when I've run into a situation like this,
I've generally solved it by caching the data in a separate process and
using some kind of IPC from the reducers to access it.  memcache is one
example, though that's probably not an ideal fit for this data structure.
 I'm aware of no equivalent solution directly in Hadoop and would be
curious to hear from others on the topic.
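
As a rough sketch of the general shape (the daemon itself, its port, and
the one-line-per-lookup protocol are entirely hypothetical):

import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// A node-local daemon owns the radix tree; each reducer opens a local
// socket to it and sends lookups, so the tree exists once per node.
public class RadixTreeClient implements Closeable {
  private final Socket socket;
  private final PrintWriter out;
  private final BufferedReader in;

  public RadixTreeClient(int port) throws IOException {
    socket = new Socket("127.0.0.1", port);
    out = new PrintWriter(socket.getOutputStream(), true);
    in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
  }

  // One request/response line per lookup, called from reduce().
  public String lookup(String ipv6Address) throws IOException {
    out.println(ipv6Address);
    return in.readLine();
  }

  @Override
  public void close() throws IOException {
    socket.close();
  }
}

The trade-off is an IPC round trip per lookup, so batching requests or
keeping a small in-reducer cache of recent answers may matter for
performance.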

Thanks,
--Chris

On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <km...@cs.duke.edu> wrote:

> Problem Background:
> I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide
> auxiliary input during the reduce phase of the second job in its workflow,
> but doesn't need the data at any other point.
> It seems pretty straightforward to use the distributed cache to build
> this data structure inside each reducer in the setup() method.
> This solution is functional, but it ends up using a large amount of memory
> if I have 3 or more reducers running on the same node, and the setup time
> of the radix tree is non-trivial.
> Additionally, the IPv6 version of the structure is quite a bit larger in
> memory.
>
> Question:
> Is there a "good" way to share this data structure across all reducers on
> the same node within the Hadoop framework?
>
> Initial Thoughts:
> It seems like this might be possible by altering the Task JVM Reuse
> parameters, but from what I have read this would also affect map tasks and
> I'm concerned about drawbacks/side-effects.
>
> Thanks for your help!
>
