Posted to common-user@hadoop.apache.org by Mike Forrest <mf...@trailfire.com> on 2008/01/10 23:51:05 UTC

problem with IdentityMapper

Hi,
I'm running into a problem where IdentityMapper seems to produce way too 
much data.  For example, I have a job that reads a sequence file using 
IdentityMapper and then uses IdentityReducer to write everything back 
out to another sequence file.  My input is a ~60MB sequence file and 
after the map phase has completed, the job tracker UI reports about 10GB 
for "Map output bytes".  It seems like the output collector does not get 
properly reset and so each map that gets emitted has the correct key but 
the value ends up being all the data you've encountered up to that 
point.  I think this is a known issue but I can't seem to find any 
discussion about it right now.  Has anyone else run into this, and if 
so, is there a solution?  I'm using the latest code in the 0.15 branch.
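
For reference, the job setup is roughly the following (0.15-style API; the
driver class, paths, and job name below are made up):

  JobConf conf = new JobConf(IdentityCopy.class);   // IdentityCopy: made-up driver class
  conf.setJobName("identity-copy");
  conf.setInputFormat(SequenceFileInputFormat.class);
  conf.setOutputFormat(SequenceFileOutputFormat.class);
  conf.setMapperClass(IdentityMapper.class);
  conf.setReducerClass(IdentityReducer.class);
  // ...plus setOutputKeyClass()/setOutputValueClass() set to the key and
  // value classes stored in the sequence file.
  conf.setInputPath(new Path("/made/up/input.seq"));   // made-up path
  conf.setOutputPath(new Path("/made/up/output"));     // made-up path
  JobClient.runJob(conf);
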
Thanks
Mike

RE: problem with IdentityMapper

Posted by Runping Qi <ru...@yahoo-inc.com>.
That explains it.
 
The key/value objects are reused across each cycle of RecordReader.next()
and mapper calls.
The MapWritable reader perhaps does not reset the MapWritable object
passed to it.
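
In other words, something like this toy sketch (plain Java, not the real
MapWritable code) reproduces the blow-up once a single value object is
reused for every record:

  import java.util.HashMap;
  import java.util.Map;

  // Toy sketch only: a map-backed value whose "read" forgets to clear
  // itself keeps every entry ever read when the same object is reused
  // for each record.
  public class GrowingValueDemo {
    static final Map<String, String> instance = new HashMap<String, String>();

    // Mimics a readFields() with no instance.clear() at the top.
    static void readRecord(Map<String, String> record) {
      instance.putAll(record);          // old entries are never dropped
    }

    public static void main(String[] args) {
      for (int i = 0; i < 3; i++) {     // three one-entry records
        Map<String, String> record = new HashMap<String, String>();
        record.put("key" + i, "value" + i);
        readRecord(record);             // same 'instance' reused each time
        System.out.println("after record " + i + ": " + instance.size() + " entries");
      }
      // Prints 1, 2, 3 entries, so writing 'instance' back out emits far
      // more data than was read, which matches the ~60MB -> ~10GB blow-up.
    }
  }
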

Runping
 

Re: problem with IdentityMapper

Posted by Mike Forrest <mf...@trailfire.com>.
You were exactly right.  Your simple patch has completely fixed my 
problem.  Thank you, Joydeep and Runping.



RE: problem with IdentityMapper

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Ouch. MapWritable does not reset its hash table on readFields(); the hash table just grows and grows, and the write() method dumps the entire hash out.
 
The patch is simple: just do an instance.clear() in the readFields() call. (But I haven't looked at the base class.)
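
Roughly this shape, shown here with a made-up map-valued Writable rather than the real MapWritable source:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.io.Writable;

  // Made-up map-valued Writable illustrating the fix: clear the backing
  // map at the top of readFields() so a reused instance only ever holds
  // the record it last read.
  public class ClearedMapValue implements Writable {
    private final Map<String, String> instance = new HashMap<String, String>();

    public void write(DataOutput out) throws IOException {
      out.writeInt(instance.size());
      for (Map.Entry<String, String> e : instance.entrySet()) {
        out.writeUTF(e.getKey());
        out.writeUTF(e.getValue());
      }
    }

    public void readFields(DataInput in) throws IOException {
      instance.clear();                 // the fix: drop the previous record's entries
      int entries = in.readInt();
      for (int i = 0; i < entries; i++) {
        instance.put(in.readUTF(), in.readUTF());
      }
    }
  }
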


Re: problem with IdentityMapper

Posted by Mike Forrest <mf...@trailfire.com>.
I'm using Text for the keys and MapWritable for the values.



RE: problem with IdentityMapper

Posted by Joydeep Sen Sarma <js...@facebook.com>.
What are the key/value types in the SequenceFile?
 
It seems that the MapRunner calls createKey and createValue just once, so if the value serializes out its entire accumulated contents (and not just what it last read), it would cause this problem.
 
(I have periodically shot myself in the foot with this bullet.)
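
The loop inside MapRunner.run() is roughly the following (paraphrased from memory, not the exact source):

  // One key object and one value object are created up front and then
  // refilled for every record the RecordReader delivers.
  WritableComparable key = input.createKey();   // created once
  Writable value = input.createValue();         // created once
  while (input.next(key, value)) {              // the same objects are reused each time
    mapper.map(key, value, output, reporter);
  }
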
