Posted to common-user@hadoop.apache.org by Leon Mergen <l....@solatis.com> on 2009/06/18 17:23:51 UTC

Practical limit on emitted map/reduce values

Hello,

I wasn't able to find this anywhere, so I'm sorry if this has been asked before.

I am wondering whether there is a practical limit on the number of bytes an emitted Map/Reduce value can be. Beyond the obvious drawbacks of emitting huge values, such as performance issues, I would like to know whether there are any hard constraints; I can imagine, for example, that a value can never be larger than dfs.block.size.

Does anyone have any idea, or can anyone provide me with some pointers on where to look?

Thanks in advance!

Regards,

Leon Mergen

RE: Practical limit on emitted map/reduce values

Posted by Leon Mergen <l....@solatis.com>.
Hello Jason,

> In general, if the values become very large, it becomes simpler to
> store them out-of-line in hdfs, and just pass the hdfs path for the
> item as the value in the map reduce task.
> This greatly reduces the amount of IO done, and doesn't blow up the
> sort space on the reducer.
> You lose the magic of data locality, but given the item size, you gain
> the IO back by not having to pass the full values to the reducer, or
> handle them when sorting the map outputs.

Ah, that actually sounds like a nice idea; instead of having the reducer emit the huge value, it can create a temporary file and emit the filename instead.
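
(A minimal sketch of that workaround, assuming the old 0.18-era mapred API; the 1 MB threshold, the side.file.dir property and the class name are hypothetical illustrations, not code from this thread:)

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SpillingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private static final int THRESHOLD = 1024 * 1024; // spill values above ~1 MB
      private FileSystem fs;
      private Path sideDir; // hypothetical side-file directory

      public void configure(JobConf job) {
        try {
          fs = FileSystem.get(job);
          sideDir = new Path(job.get("side.file.dir", "/tmp/side-files"));
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        Text outKey = new Text(Long.toString(key.get()));
        if (value.getLength() <= THRESHOLD) {
          out.collect(outKey, value); // small value: emit it inline
        } else {
          // Large value: write the bytes to a side file in HDFS and emit
          // the path instead. The reducer opens the path when it needs the
          // payload (and must be able to tell paths from inline values,
          // e.g. by a marker prefix -- omitted here for brevity).
          Path side = new Path(sideDir, "value-" + key.get());
          FSDataOutputStream os = fs.create(side);
          os.write(value.getBytes(), 0, value.getLength());
          os.close();
          out.collect(outKey, new Text(side.toString()));
        }
      }
    }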

I wasn't really planning on having huge values anyway (values above 1MB will be the exception rather than the rule), but since it's theoretically possible for our software to generate them, it seemed like a good idea to investigate any real constraints that we might run into.

Your idea sounds like a good workaround for this. Thanks!


Regards,

Leon Mergen

Re: Practical limit on emitted map/reduce values

Posted by jason hadoop <ja...@gmail.com>.
In general, if the values become very large, it becomes simpler to store them
out-of-line in hdfs, and just pass the hdfs path for the item as the value in
the map reduce task.
This greatly reduces the amount of IO done, and doesn't blow up the sort
space on the reducer.
You lose the magic of data locality, but given the item size, you gain the IO
back by not having to pass the full values to the reducer, or handle them
when sorting the map outputs.

On Thu, Jun 18, 2009 at 8:45 AM, Leon Mergen <l....@solatis.com> wrote:

> Hello Owen,
>
> > Keys and values can be large. They are certainly capped above by
> > Java's 2GB limit on byte arrays. More practically, you will run into
> > out-of-memory problems with keys or values of 100 MB. There is no
> > restriction that a key/value pair fits in a single hdfs block, but
> > performance would suffer. (In particular, the FileInputFormats split
> > at block-sized chunks, which means you will have maps that scan an
> > entire block without processing anything.)
>
> Thanks for the quick reply.
>
> Could you perhaps elaborate on that 100 MB limit? Is that due to a limit
> that is caused by the Java VM heap size? If so, could that, for example, be
> increased to 512MB by setting mapred.child.java.opts to '-Xmx512m'?
>
> Regards,
>
> Leon Mergen
>
>


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

RE: Practical limit on emitted map/reduce values

Posted by Leon Mergen <l....@solatis.com>.
Hello Owen,

> > Could you perhaps elaborate on that 100 MB limit? Is that due to a
> > limit that is caused by the Java VM heap size? If so, could that,
> > for example, be increased to 512MB by setting mapred.child.java.opts
> > to '-Xmx512m'?
> 
> A couple of points:
>    1. The 100MB was just for ballpark calculations. Of course if you
> have a large heap, you can fit larger values. When figuring out how
> big to make your heaps, don't forget that the framework allocates big
> chunks of the heap for its own buffers.
>    2. Having large keys is much harder than large values. When doing an
> N-way merge, the framework has N+1 keys and 1 value in memory at a
> time.

Ok, that makes sense. Thanks for the information!


Regards,

Leon Mergen



Re: Practical limit on emitted map/reduce values

Posted by Owen O'Malley <om...@apache.org>.
On Jun 18, 2009, at 8:45 AM, Leon Mergen wrote:

> Could you perhaps elaborate on that 100 MB limit? Is that due to a
> limit that is caused by the Java VM heap size? If so, could that,
> for example, be increased to 512MB by setting mapred.child.java.opts
> to '-Xmx512m'?

A couple of points:
   1. The 100MB was just for ballpark calculations. Of course if you
have a large heap, you can fit larger values. When figuring out how big
to make your heaps, don't forget that the framework allocates big
chunks of the heap for its own buffers.
   2. Having large keys is much harder than large values. When doing an
N-way merge, the framework has N+1 keys and 1 value in memory at a time.
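
(To make point 2 concrete, a toy sketch of an N-way merge over sorted runs; this illustrates the memory argument only and is not the framework's actual merge code:)

    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Each sorted run exposes its current key cheaply; a value is only
    // materialized for the run that wins the comparison. Hence the merge
    // holds about N+1 keys but only 1 value in memory at a time.
    public class NWayMerge {

      /** A sorted run of key/value records, read sequentially. */
      interface Run {
        boolean next();   // advance to the next record; false at end of run
        String key();     // current key (kept resident for comparisons)
        String value();   // current value (read lazily, one at a time)
      }

      static void merge(List<Run> runs) {
        PriorityQueue<Run> heads = new PriorityQueue<Run>(
            runs.size() + 1,
            new Comparator<Run>() {
              public int compare(Run a, Run b) {
                return a.key().compareTo(b.key());
              }
            });
        for (Run r : runs) {
          if (r.next()) {
            heads.add(r); // one head key per run stays resident
          }
        }
        while (!heads.isEmpty()) {
          Run smallest = heads.poll();
          // Only now is the winning record's value read.
          System.out.println(smallest.key() + "\t" + smallest.value());
          if (smallest.next()) {
            heads.add(smallest); // refresh this run's head key
          }
        }
      }
    }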

-- Owen

RE: Practical limit on emitted map/reduce values

Posted by Leon Mergen <l....@solatis.com>.
Hello Owen,

> Keys and values can be large. They are certainly capped above by
> Java's 2GB limit on byte arrays. More practically, you will run into
> out-of-memory problems with keys or values of 100 MB. There is no
> restriction that a key/value pair fits in a single hdfs block, but
> performance would suffer. (In particular, the FileInputFormats split
> at block-sized chunks, which means you will have maps that scan an
> entire block without processing anything.)

Thanks for the quick reply.

Could you perhaps elaborate on that 100 MB limit? Is that due to a limit that is caused by the Java VM heap size? If so, could that, for example, be increased to 512MB by setting mapred.child.java.opts to '-Xmx512m'?
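
(For what it's worth, a hedged sketch of setting that property from job-setup code with the old JobConf API; the same property can also be set in the site configuration, and 512m is just the example value above, not a recommendation:)

    import org.apache.hadoop.mapred.JobConf;

    public class HeapConfigExample {
      public static void main(String[] args) {
        // Give each map/reduce child task JVM a 512 MB heap.
        JobConf conf = new JobConf(HeapConfigExample.class);
        conf.set("mapred.child.java.opts", "-Xmx512m");
      }
    }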

Regards,

Leon Mergen


Re: Practical limit on emitted map/reduce values

Posted by Owen O'Malley <ow...@gmail.com>.
Keys and values can be large. They are certainly capped above by
Java's 2GB limit on byte arrays. More practically, you will run into
out-of-memory problems with keys or values of 100 MB. There is no
restriction that a key/value pair fits in a single hdfs block, but
performance would suffer. (In particular, the FileInputFormats split
at block-sized chunks, which means you will have maps that scan an
entire block without processing anything.)
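
(A hedged illustration of that hard ceiling: standard value types such as BytesWritable are backed by a single Java byte[], whose length is an int, so roughly 2 GB is the absolute cap; the 100 MB figure is this thread's ballpark number:)

    import org.apache.hadoop.io.BytesWritable;

    public class ValueSizeExample {
      public static void main(String[] args) {
        // BytesWritable wraps one byte[], so a value can never exceed
        // Integer.MAX_VALUE (~2 GB) bytes; even 100 MB already strains a
        // default-sized child task heap.
        BytesWritable value = new BytesWritable();
        value.setSize(100 * 1024 * 1024); // allocate a 100 MB buffer
        System.out.println("backing array length: " + value.getBytes().length);
      }
    }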

-- Owen