You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-user@hadoop.apache.org by Adam Phelps <am...@opendns.com> on 2013/02/27 19:42:20 UTC

Large static structures in M/R heap

We have a job that uses a large lookup structure that gets created as a
static class during the map setup phase (and we have the JVM reused so
this only takes place once).  However of late this structure has grown
drastically (due to items beyond our control) and we've seen a
substantial increase in map time due to the lower available memory.

Are there any easy solutions to this sort of problem?  My first thought
was to see if it was possible to have all tasks for a job execute in
parallel within the same JVM, but I'm not seeing any setting that would
allow that.  Beyond that my only ideas are to move that data into an
external one-per-node key-value store like memcached, but I'm worried
the additional overhead of sending a query for each value being mapped
would also kill the job performance.

- Adam

Re: Large static structures in M/R heap

Posted by David Rosenstrauch <da...@darose.net>.

On 02/27/2013 01:42 PM, Adam Phelps wrote:
> We have a job that uses a large lookup structure that gets created as a
> static class during the map setup phase (and we have the JVM reused so
> this only takes place once).  However of late this structure has grown
> drastically (due to items beyond our control) and we've seen a
> substantial increase in map time due to the lower available memory.
>
> Are there any easy solutions to this sort of problem?  My first thought
> was to see if it was possible to have all tasks for a job execute in
> parallel within the same JVM, but I'm not seeing any setting that would
> allow that.  Beyond that my only ideas are to move that data into an
> external one-per-node key-value store like memcached, but I'm worried
> the additional overhead of sending a query for each value being mapped
> would also kill the job performance.
>
> - Adam
>

We use a similar solution to what you suggested to address this issue. 
Though, the in-memory app we run on each datanode is a proprietary one 
which allows for pipelineing of queries, and obviously helps optimize this.

Still, even using off-the-shelf memcached, and incurring the overhead of 
query-per-value, speed might work out to be more acceptable on this than 
you think.  Maybe give it a test in the small to benchmark first.

HTH,

DR

Re: Large static structures in M/R heap

Posted by Adam Phelps <am...@opendns.com>.

We actually use CDBs a good bit outside of M/R.  This is something worth
looking into, but the big structure we're currently using is a giant
tree-based lookup table whose access pattern is pretty random, so I
don't think caching would be of much use.  There is a lesser (but still
large) structure this might work for.

- Adam

On 2/27/13 10:56 AM, Robert Evans wrote:
> Have you looked at things like CDB http://cr.yp.to/cdb.html that would
> allow you to keep most of the file on disk and cache hot parts in memory.
> That really depends on your access pattern.
> 
> Alternatively you could give yourself more heap and take up two slots for
> your map task.
> 
> Also if it is big enough you might want to look at using a reduce to do
> the join instead of trying to do a map side join.
> 
> --Bobby
> 
> On 2/27/13 12:42 PM, "Adam Phelps" <am...@opendns.com> wrote:
> 
>> We have a job that uses a large lookup structure that gets created as a
>> static class during the map setup phase (and we have the JVM reused so
>> this only takes place once).  However of late this structure has grown
>> drastically (due to items beyond our control) and we've seen a
>> substantial increase in map time due to the lower available memory.
>>
>> Are there any easy solutions to this sort of problem?  My first thought
>> was to see if it was possible to have all tasks for a job execute in
>> parallel within the same JVM, but I'm not seeing any setting that would
>> allow that.  Beyond that my only ideas are to move that data into an
>> external one-per-node key-value store like memcached, but I'm worried
>> the additional overhead of sending a query for each value being mapped
>> would also kill the job performance.
>>
>> - Adam
>

Re: Large static structures in M/R heap

Posted by Robert Evans <ev...@yahoo-inc.com>.

Have you looked at things like CDB http://cr.yp.to/cdb.html that would
allow you to keep most of the file on disk and cache hot parts in memory.
That really depends on your access pattern.

Alternatively you could give yourself more heap and take up two slots for
your map task.

Also if it is big enough you might want to look at using a reduce to do
the join instead of trying to do a map side join.

--Bobby

On 2/27/13 12:42 PM, "Adam Phelps" <am...@opendns.com> wrote:

>We have a job that uses a large lookup structure that gets created as a
>static class during the map setup phase (and we have the JVM reused so
>this only takes place once).  However of late this structure has grown
>drastically (due to items beyond our control) and we've seen a
>substantial increase in map time due to the lower available memory.
>
>Are there any easy solutions to this sort of problem?  My first thought
>was to see if it was possible to have all tasks for a job execute in
>parallel within the same JVM, but I'm not seeing any setting that would
>allow that.  Beyond that my only ideas are to move that data into an
>external one-per-node key-value store like memcached, but I'm worried
>the additional overhead of sending a query for each value being mapped
>would also kill the job performance.
>
>- Adam