Posted to mapreduce-user@hadoop.apache.org by Rick Ross <ri...@semanticresearch.com> on 2011/09/05 02:41:32 UTC

I keep getting multiple values for unique reduce keys

Hi all, 

I have ensured that my mapper produces a unique key for every value it writes, and furthermore that each map() call writes only one value.  Note that the value is a custom class for which I implement the Writable interface methods.

I realize that it isn't very real world to have (well, want) no combining done prior to reducing, but I'm still getting my feet wet.  

When the reducer runs, I expected to see one reduce() call for every map() call, and I do.  However, the value I get is a composite of the values from all the reduce() calls that came before it.

So, for example, the mapper gets data like this :

   ID,     Name,          Type,          Other stuff...
   A000,   Cream,         Group,         ...
   B231,   Led Zeppelin,  Group,         ...
   A044,   Liberace,      Individual,    ...


ID is the external key from the source data and is guaranteed to be unique.

When I map it, I create a container holding the data from that row only, and emit that container with the ID field as the key.

Since the key is always unique I expected the sort/shuffle step to never coalesce any two values.    So I expected my reduce() method to be called once per mapped input row, and it is.    

The problem is, as each row is processed, the reducer sees a set of cumulative value data instead of a container with a row of data in it.  So the 'value' parameter to reduce always has the information from previous reduce steps.  

For example, given the data above : 

1st Reducer Call : 
   Key = A000 
   Value = 
       Container : 
          (object 1) : Name = Cream, Type = Group, MBID = A000, ... 

2nd Reducer Call : 
   Key = B231 
   Value = 
       Container : 
          (object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ... 
          (object 2) : Name = Cream, Type = Group, MBID = A000, ... 

So the second reduce call has data in it from the first reduce call.   Very strange!   At a guess I would say the reducer is re-using the object when it reads the objects back from the mapping step.  I dunno.. 

If anyone has any ideas, I'm open to suggestions.  I'm running Hadoop 0.20.2-cdh3u0.

Thanks!

R




Re: I keep getting multiple values for unique reduce keys

Posted by Sonal Goyal <so...@gmail.com>.
Could you share your mapper code and the container code? When your mapper
emits the keys and values, do you print them out to see that they are
correct, that is, the container only has data specific to that id?

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>





On Tue, Sep 6, 2011 at 10:41 AM, Rick Ross <ri...@semanticresearch.com> wrote:

> I'm still poking around on this and I was wondering if there is a way to
> see the intermediate files that the mapper writes and the ones that the
> reducer reads.    I might get some clues in there.
>
> Thanks
>
> R
>
> On Sep 4, 2011, at 10:14 PM, Rick Ross wrote:
>
> Thanks, but unless I misread you, that didn't do it.     Naturally the
> object that I am creating just has a couple of ArrayLists to gather up Name
> and Type objects.
>
> I suspect I need to extend ArrayWritable instead.   I'll try that next.
>
> Cheers.
>
> R
>
> On Sep 4, 2011, at 9:37 PM, Sudharsan Sampath wrote:
>
> Hi,
>
> I suspect it's something to do with your custom Writable. Do you have a
> clear method on your container? If so, that should be used before the obj is
> initialized every time to avoid retaining previous values due to object
> reuse during ser-de process.
>
> Thanks
> Sudhan S

Re: I keep getting multiple values for unique reduce keys

Posted by Sudharsan Sampath <su...@gmail.com>.
Hi Rick,

If possible, can you share the custom Writable that's configured as the value
type for the reducer?

Thanks
Sudhan S


Re: I keep getting multiple values for unique reduce keys

Posted by Rick Ross <ri...@semanticresearch.com>.
I'm still poking around on this and I was wondering if there is a way to see the intermediate files that the mapper writes and the ones that the reducer reads.    I might get some clues in there. 

Thanks

R
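
One way to get clues without digging through the intermediate files is to round-trip the value class locally and read several records back into a single reused instance, since that is roughly what the framework does between reduce() calls. This is only a sketch under stated assumptions: ReuseCheck, Record, LeakyRecord and replayInto are hypothetical names, Record is a local stand-in for org.apache.hadoop.io.Writable (same two methods) so the example runs without Hadoop on the classpath, and LeakyRecord is deliberately buggy to show what the symptom looks like.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReuseCheck {
    // Local stand-in for org.apache.hadoop.io.Writable (assumption: same two methods).
    interface Record {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Deliberately buggy: readFields() appends to the list without clearing it first.
    static class LeakyRecord implements Record {
        final List<String> names = new ArrayList<>();

        LeakyRecord(String... initial) {
            names.addAll(Arrays.asList(initial));
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(names.size());
            for (String name : names) {
                out.writeUTF(name);
            }
        }

        public void readFields(DataInput in) throws IOException {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                names.add(in.readUTF());
            }
        }

        @Override
        public String toString() {
            return names.toString();
        }
    }

    // Serializes each input, reads it back into the SAME instance,
    // and records what that instance looks like after every read.
    static List<String> replayInto(Record reused, List<? extends Record> inputs) {
        List<String> seen = new ArrayList<>();
        try {
            for (Record input : inputs) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                input.write(new DataOutputStream(buf));
                reused.readFields(new DataInputStream(
                        new ByteArrayInputStream(buf.toByteArray())));
                seen.add(reused.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<LeakyRecord> inputs = Arrays.asList(
                new LeakyRecord("Cream"), new LeakyRecord("Led Zeppelin"));
        // prints [[Cream], [Cream, Led Zeppelin]]
        System.out.println(replayInto(new LeakyRecord(), inputs));
    }
}
```

If the second snapshot still contains data from the first record, the class is carrying state across readFields() calls, and the mapper output itself is probably fine.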



Re: I keep getting multiple values for unique reduce keys

Posted by Rick Ross <ri...@semanticresearch.com>.
Thanks, but unless I misread you, that didn't do it.     Naturally the object that I am creating just has a couple of ArrayLists to gather up Name and Type objects.   

I suspect I need to extend ArrayWritable instead.   I'll try that next.  

Cheers.

R



Re: I keep getting multiple values for unique reduce keys

Posted by Sudharsan Sampath <su...@gmail.com>.
Hi,

I suspect it's something to do with your custom Writable. Do you have a
clear method on your container? If so, it should be called every time before
the object is populated, to avoid retaining previous values, since objects
are reused during the ser-de process.

Thanks
Sudhan S
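
The clear-before-populate advice above can be sketched as follows. This is a minimal illustration, not the poster's actual class: RowContainer, its two lists and the toBytes/fill helpers are hypothetical names, and the class omits "implements org.apache.hadoop.io.Writable" only so that it runs without Hadoop on the classpath (the write/readFields signatures match that interface). The point is the clear() call at the top of readFields(), which matters because the same value instance may be handed back for each record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container; in a real job it would implement
// org.apache.hadoop.io.Writable, which declares these same two methods.
class RowContainer {
    private final List<String> names = new ArrayList<>();
    private final List<String> types = new ArrayList<>();

    void add(String name, String type) {
        names.add(name);
        types.add(type);
    }

    int size() {
        return names.size();
    }

    String nameAt(int i) {
        return names.get(i);
    }

    // Drops whatever a previous record left behind.
    void clear() {
        names.clear();
        types.clear();
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(names.size());
        for (int i = 0; i < names.size(); i++) {
            out.writeUTF(names.get(i));
            out.writeUTF(types.get(i));
        }
    }

    public void readFields(DataInput in) throws IOException {
        clear(); // the same instance may be reused for the next record
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            String type = in.readUTF();
            add(name, type);
        }
    }

    // Serializes to a byte array so the class can be exercised outside Hadoop.
    static byte[] toBytes(RowContainer c) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            c.write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes into an existing instance, as happens when an object is reused.
    static void fill(RowContainer target, byte[] data) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RowContainer cream = new RowContainer();
        cream.add("Cream", "Group");
        RowContainer zeppelin = new RowContainer();
        zeppelin.add("Led Zeppelin", "Group");

        RowContainer reused = new RowContainer();
        fill(reused, toBytes(cream));
        fill(reused, toBytes(zeppelin));
        // prints "1 Led Zeppelin"
        System.out.println(reused.size() + " " + reused.nameAt(0));
    }
}
```

Without the clear() call, the second fill() would leave both rows in the container, which matches the cumulative values described in the original post.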


