Posted to mapreduce-user@hadoop.apache.org by "Berry, Matt" <mw...@amazon.com> on 2012/07/02 18:00:04 UTC

RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Thanks everyone for the help. Emitting each record individually from the reducer is working well, and I can still aggregate the needed information as I go.
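
Roughly, that pattern looks like the sketch below (the Text key/value types and the counter names are placeholders, not the actual job): emit each value the moment it arrives and keep only a small running aggregate in memory.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: emit each record as it arrives and keep just a small running
// aggregate in memory, instead of collecting everything into a list first.
public class StreamingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long records = 0;               // O(1) memory per key
        for (Text value : values) {     // single pass over the streamed values
            context.write(key, value);  // emit immediately, never buffer a list
            records++;
        }
        // Publish the aggregate without holding any records in memory.
        context.getCounter("Aggregates", "Records").increment(records);
    }
}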

From: Harsh J [mailto:harsh@cloudera.com]
Sent: Friday, June 29, 2012 9:40 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Guojun is right, the reduce() inputs are buffered and read off of disk. You are in no danger there.
On Fri, Jun 29, 2012 at 11:02 PM, GUOJUN Zhu <gu...@freddiemac.com> wrote:

If you are referring to the Iterable in the reducer, it is special and not held in memory at all.  Once the iterator passes a value, it is gone and you cannot recover it.  There is no LinkedList behind it.
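
In code terms, a sketch (Text values assumed): the second loop below never executes, because there is no backing list to rewind.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: the Iterable passed to reduce() is single-pass.
public class OnePassReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            // First pass: values are streamed in from the sorted/merged
            // segments; the framework may even reuse the same Text object.
        }
        for (Text v : values) {
            // Nothing here runs: the single underlying iterator is already
            // exhausted and there is no backing collection to rewind.
        }
    }
}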

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_zhu@freddiemac.com
Financial Engineering
Freddie Mac

   "Berry, Matt" <mw...@amazon.com>>

   06/29/2012 01:06 PM
   Please respond to
mapreduce-user@hadoop.apache.org<ma...@hadoop.apache.org>


To

"mapreduce-user@hadoop.apache.org<ma...@hadoop.apache.org>" <ma...@hadoop.apache.org>>

cc

Subject

RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing







I was actually quite curious how Hadoop was managing to get all of the records into the Iterable in the first place. I thought it was using a very specialized object that implements Iterable, but a heap dump shows it's likely just using a LinkedList. All I was doing was duplicating that object. Supposing I do as you suggest, am I in danger of that list consuming all the memory if a user decides to log 2x or 3x as much as they did this time?

~Matt

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com<ma...@cloudera.com>]
Sent: Friday, June 29, 2012 6:52 AM
To: mapreduce-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Hey Matt,

As far as I can tell, Hadoop isn't truly at fault here.

If your issue is that you collect the records in a list before you store them, focus on that and avoid collecting them entirely. Why don't you serialize as you receive, if the incoming order is already taken care of? As far as I can tell, your AggregateRecords probably does nothing but serialize the stored LinkedList. So instead of using a LinkedList, or even a composed Writable such as AggregateRecords, just write the records out as you receive them via each .next(). Would this not work for you? You may batch a constant amount to gain some write performance, but at least you won't have to use up your memory.

You can serialize as you receive by following this:
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
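
Roughly, that side-file approach looks like the sketch below (new mapreduce API assumed; the class name, file name, and Text types are placeholders). Files created under the task's work output path are promoted with the rest of the task's output when the attempt commits.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: stream each value straight into a per-task side file under the
// task's work directory instead of collecting records in memory first.
public class SideFileReducer extends Reducer<Text, Text, Text, Text> {

    private FSDataOutputStream out;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        Path sideFile = new Path(workDir, "records-" + context.getTaskAttemptID());
        out = sideFile.getFileSystem(context.getConfiguration()).create(sideFile);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            out.writeBytes(value.toString() + "\n");  // serialize as you receive
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        out.close();  // the committer promotes the side file with the task output
    }
}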


--
Harsh J