
Posted to user@hadoop.apache.org by Alberto Cordioli <co...@gmail.com> on 2012/10/15 21:11:27 UTC

GroupingComparator

Hi all,

A very strange thing is happening with my Hadoop program.
My map simply emits tuples with a custom object as the key (which
implements WritableComparable).
The object is made of 2 fields, and I implemented my partitioner and
grouping comparator in such a way that only the first field is taken into
account.
The second field is just a tag and can be 1 or 2.

This is the reducer's snippet:

tag = key.getSecondField();
Iterator it1 = values.iterator();
while(it1.hasNext()){
        it1.next();
        collector.emit(new Text("dummy"), tag);
}

I would expect every line of my output to be:
dummy       1
...
dummy       1

but the value of tag actually changes over time, and I get this kind of output:

dummy    1
...
dummy    1
dummy    2
...
dummy    2


Could someone explain why, please?


Thanks.





-- 
Alberto Cordioli
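
The symptom in the reducer snippet above can be reproduced without a cluster. The sketch below is plain Java, not Hadoop code: IntBox, PairKey and GroupValues are hypothetical stand-ins for a mutable Writable, the custom key, and the framework's value iterator, on the assumption that the framework overwrites one key object in place as the values are consumed. It contrasts holding a reference to the key's field with copying the primitive before the loop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class KeyReuseDemo {
    // Hypothetical stand-in for a mutable Writable such as IntWritable.
    public static final class IntBox {
        private int value;
        public int get() { return value; }
        public void set(int v) { value = v; }
    }

    // Hypothetical composite key; only the tag field matters for this demo.
    public static final class PairKey {
        private final IntBox second = new IntBox();
        public IntBox getSecondField() { return second; }
    }

    // Assumed framework behaviour: one key object, overwritten on every next().
    static final class GroupValues implements Iterator<String> {
        private final PairKey key;
        private final int[] tags;
        private int pos = 0;

        GroupValues(PairKey key, int[] tags) {
            this.key = key;
            this.tags = tags;
            // Before the loop starts, the key already holds the first record's tag.
            if (tags.length > 0) {
                key.getSecondField().set(tags[0]);
            }
        }

        @Override
        public boolean hasNext() { return pos < tags.length; }

        @Override
        public String next() {
            key.getSecondField().set(tags[pos]); // the key moves with the value
            pos++;
            return "value";
        }
    }

    // Mirrors the snippet in the post: tag is a reference into the re-used key.
    public static List<Integer> emitWithReference(int[] tags) {
        PairKey key = new PairKey();
        Iterator<String> it = new GroupValues(key, tags);
        IntBox tag = key.getSecondField();
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) {
            it.next();
            out.add(tag.get());
        }
        return out;
    }

    // Copies the primitive before iterating, so later key updates are not seen.
    public static List<Integer> emitWithCopy(int[] tags) {
        PairKey key = new PairKey();
        Iterator<String> it = new GroupValues(key, tags);
        int tag = key.getSecondField().get();
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) {
            it.next();
            out.add(tag);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] tags = {1, 1, 2, 2};
        System.out.println(emitWithReference(tags)); // [1, 1, 2, 2]
        System.out.println(emitWithCopy(tags));      // [1, 1, 1, 1]
    }
}
```

In a real reducer the same idea would be to read the tag's value into a local int, or into a separate instance variable, before entering the loop.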

Re: GroupingComparator

Posted by Alberto Cordioli <co...@gmail.com>.

Yes, I know that keeping an in-memory collection isn't a good idea.
The problem is that I need to perform a join, so there is no other
possibility! :(

Cheers,
Alberto
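
For a join on a tagged key, it may be enough to buffer only one side. The sketch below is plain Java rather than Hadoop code, and it rests on an assumption the thread does not confirm: that within a group the records tagged 1 arrive before the records tagged 2 (the output shown in the original post suggests this ordering). Tagged and joinGroup are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedJoinSketch {
    // Hypothetical record: tag 1 marks the left input, tag 2 the right input.
    public static final class Tagged {
        public final int tag;
        public final String payload;

        public Tagged(int tag, String payload) {
            this.tag = tag;
            this.payload = payload;
        }
    }

    // Joins one reduce group, assuming all tag-1 records come before tag-2 records.
    public static List<String> joinGroup(Iterable<Tagged> group) {
        List<String> left = new ArrayList<>();   // only the tag-1 side is kept in memory
        List<String> joined = new ArrayList<>();
        for (Tagged rec : group) {
            if (rec.tag == 1) {
                // String is immutable, so storing it is safe here; a re-used
                // mutable object would have to be copied before being stored.
                left.add(rec.payload);
            } else {
                // Tag-2 records are streamed: each is paired and then dropped.
                for (String l : left) {
                    joined.add(l + "," + rec.payload);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Tagged> group = Arrays.asList(
                new Tagged(1, "a1"), new Tagged(1, "a2"),
                new Tagged(2, "b1"), new Tagged(2, "b2"));
        for (String row : joinGroup(group)) {
            System.out.println(row);
        }
    }
}
```

With this layout, memory use grows with the size of the tag-1 side of a group only, so it helps most when the smaller input is the one tagged 1.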

On 16 October 2012 11:08, Dave Beech <db...@apache.org> wrote:
> Great! Glad the problem is solved.
>
> You're right - the object returned by iterator.next() is re-used too.
> So yes, you would need to clone in this case and you'd have no choice
> but to create new objects.
>
> Please be sure though that you really do need to store values in a
> list to do what you're trying to do. Keeping an in-memory collection
> might not be very scalable. Obviously, if you've got loads of RAM or
> not a lot of data (or both), then that's fine! Just something else to
> think about...
>
> Cheers,
> Dave
>
> On 16 October 2012 09:42, Alberto Cordioli <co...@gmail.com> wrote:
>> Thanks Dave.
>> You solved my problem. Just a little question about your tip:
>> I suppose the value returned by iterator.next() is also re-used.
>> So if I want to store some values from the Iterable in the reducer, I
>> should create a List and put cloned objects inside it.
>> In this case there is no way to avoid the "new" operator, right?
>>
>>
>>
>> On 15 October 2012 22:49, Dave Beech <db...@apache.org> wrote:
>>> Well, if all you need is the tag (the 1 or 2), why not just use a Text
>>> or IntWritable instance variable. You wouldn't need to clone the whole
>>> key.
>>>
>>> Then, instead of tag = key.getSecondField() you'd say
>>> tag.set(key.getSecondField().get());
>>> I don't know what type of object tag is (if it's Text you'll say
>>> toString() rather than get()), but you see what I mean.
>>>
>>> Also - just a tip - try to avoid creating new objects wherever
>>> possible. You'll get better performance if you create one Text object
>>> as an instance variable and re-use it by setting the value instead of
>>> calling new Text("") on every output.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On 15 October 2012 21:39, Alberto Cordioli <co...@gmail.com> wrote:
>>>> Hi Dave,
>>>>
>>>> thanks for your reply. Now it's clearer; in fact the code that I
>>>> wrote was inspired by the old API, where the behavior is different.
>>>> So, how can I achieve the same behavior as the old API? I need the
>>>> second field of the first key object to stay the same across the
>>>> iterations, in order to compare it with other objects. Do I have to
>>>> clone the object?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> On 15 October 2012 21:27, Dave Beech <db...@apache.org> wrote:
>>>>> Hi Alberto
>>>>>
>>>>> The iterator you are looping over in your reduce method isn't a
>>>>> self-contained list of values. What's actually happening is that
>>>>> you're iterating through *part* of the sorted key/value set that was
>>>>> sent to that reduce node, and it is the grouping comparator that
>>>>> decides when to break that loop and call reduce again on the next key.
>>>>>
>>>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>>>> the values, what's actually happening is this pointer to the
>>>>> associated key data moves with it - and you're seeing it change.
>>>>>
>>>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>>>> API you get the first key, and it appears to stay the same during the
>>>>> loop.
>>>>>
>>>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>>>> don't act the same.
>>>>>
>>>>> Hope that helps,
>>>>> Dave
>>>>>
>>>>> On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> A very strange thing is happening with my Hadoop program.
>>>>>> My map simply emits tuples with a custom object as the key (which
>>>>>> implements WritableComparable).
>>>>>> The object is made of 2 fields, and I implemented my partitioner and
>>>>>> grouping comparator in such a way that only the first field is taken into
>>>>>> account.
>>>>>> The second field is just a tag and can be 1 or 2.
>>>>>>
>>>>>> This is the reducer's snippet:
>>>>>>
>>>>>> tag = key.getSecondField();
>>>>>> Iterator it1 = values.iterator();
>>>>>> while(it1.hasNext()){
>>>>>>         it1.next();
>>>>>>         collector.emit(new Text("dummy"), tag);
>>>>>> }
>>>>>>
>>>>>> I would expect every line of my output to be:
>>>>>> dummy       1
>>>>>> ...
>>>>>> dummy       1
>>>>>>
>>>>>> but the value of tag actually changes over time, and I get this kind of output:
>>>>>>
>>>>>> dummy    1
>>>>>> ...
>>>>>> dummy    1
>>>>>> dummy    2
>>>>>> ...
>>>>>> dummy    2
>>>>>>
>>>>>>
>>>>>> Could someone explain why, please?
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Alberto Cordioli
>>>>
>>>>
>>>>
>>>> --
>>>> Alberto Cordioli
>>
>>
>>
>> --
>> Alberto Cordioli



-- 
Alberto Cordioli

Re: GroupingComparator

Posted by Dave Beech <db...@apache.org>.

Great! Glad the problem is solved.

You're right - the object returned by iterator.next() is re-used too.
So yes, you would need to clone in this case and you'd have no choice
but to create new objects.

Please be sure though that you really do need to store values in a
list to do what you're trying to do. Keeping an in-memory collection
might not be very scalable. Obviously, if you've got loads of RAM or
not a lot of data (or both), then that's fine! Just something else to
think about...

Cheers,
Dave
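
The re-use of the object returned by iterator.next() can be illustrated in plain Java. This is a sketch, not Hadoop code: Box is a hypothetical mutable value standing in for a Writable such as Text, and reusingIterator assumes the framework hands back the same instance on every call. Storing the returned references leaves a list of identical entries, while storing copies keeps each value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ValueReuseDemo {
    // Hypothetical mutable value, standing in for a Writable such as Text.
    public static final class Box {
        private String s = "";

        public Box() { }

        // Copy constructor: this is the unavoidable "new" when values are stored.
        public Box(Box other) { this.s = other.s; }

        public void set(String s) { this.s = s; }

        @Override
        public String toString() { return s; }
    }

    // Assumed framework behaviour: the same Box instance is returned every time.
    static Iterator<Box> reusingIterator(String... data) {
        final Box shared = new Box();
        return new Iterator<Box>() {
            private int pos = 0;

            @Override
            public boolean hasNext() { return pos < data.length; }

            @Override
            public Box next() {
                shared.set(data[pos++]);
                return shared;
            }
        };
    }

    // Stores the references handed out by the iterator (all point to one object).
    public static List<String> storeReferences(String... data) {
        List<Box> kept = new ArrayList<>();
        Iterator<Box> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(it.next());
        }
        return render(kept);
    }

    // Stores an independent copy of each value.
    public static List<String> storeCopies(String... data) {
        List<Box> kept = new ArrayList<>();
        Iterator<Box> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(new Box(it.next()));
        }
        return render(kept);
    }

    private static List<String> render(List<Box> boxes) {
        List<String> out = new ArrayList<>();
        for (Box b : boxes) {
            out.add(b.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(storeReferences("a", "b", "c")); // [c, c, c]
        System.out.println(storeCopies("a", "b", "c"));     // [a, b, c]
    }
}
```

With real Hadoop types, the copy would typically come from the type's own copy constructor (for example new Text(value)) or from WritableUtils.clone(value, conf); which one fits depends on the value class in use.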

On 16 October 2012 09:42, Alberto Cordioli <co...@gmail.com> wrote:
> Thanks Dave.
> You solved my problem. Just a little question about your tip:
> I suppose the value returned by iterator.next() is also re-used.
> So if I want to store some values from the Iterable in the reducer, I
> should create a List and put cloned objects inside it.
> In this case there is no way to avoid the "new" operator, right?
>
>
>
> On 15 October 2012 22:49, Dave Beech <db...@apache.org> wrote:
>> Well, if all you need is the tag (the 1 or 2), why not just use a Text
>> or IntWritable instance variable. You wouldn't need to clone the whole
>> key.
>>
>> Then, instead of tag = key.getSecondField() you'd say
>> tag.set(key.getSecondField().get());
>> I don't know what type of object tag is (if it's Text you'll say
>> toString() rather than get()), but you see what I mean.
>>
>> Also - just a tip - try to avoid creating new objects wherever
>> possible. You'll get better performance if you create one Text object
>> as an instance variable and re-use it by setting the value instead of
>> calling new Text("") on every output.
>>
>> Thanks,
>> Dave
>>
>> On 15 October 2012 21:39, Alberto Cordioli <co...@gmail.com> wrote:
>>> Hi Dave,
>>>
>>> thanks for your reply. Now it's clearer; in fact the code that I
>>> wrote was inspired by the old API, where the behavior is different.
>>> So, how can I achieve the same behavior as the old API? I need the
>>> second field of the first key object to stay the same across the
>>> iterations, in order to compare it with other objects. Do I have to
>>> clone the object?
>>>
>>>
>>> Thanks.
>>>
>>> On 15 October 2012 21:27, Dave Beech <db...@apache.org> wrote:
>>>> Hi Alberto
>>>>
>>>> The iterator you are looping over in your reduce method isn't a
>>>> self-contained list of values. What's actually happening is that
>>>> you're iterating through *part* of the sorted key/value set that was
>>>> sent to that reduce node, and it is the grouping comparator that
>>>> decides when to break that loop and call reduce again on the next key.
>>>>
>>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>>> the values, what's actually happening is this pointer to the
>>>> associated key data moves with it - and you're seeing it change.
>>>>
>>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>>> API you get the first key, and it appears to stay the same during the
>>>> loop.
>>>>
>>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>>> don't act the same.
>>>>
>>>> Hope that helps,
>>>> Dave
>>>>
>>>> On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> A very strange thing is happening with my Hadoop program.
>>>>> My map simply emits tuples with a custom object as the key (which
>>>>> implements WritableComparable).
>>>>> The object is made of 2 fields, and I implemented my partitioner and
>>>>> grouping comparator in such a way that only the first field is taken into
>>>>> account.
>>>>> The second field is just a tag and can be 1 or 2.
>>>>>
>>>>> This is the reducer's snippet:
>>>>>
>>>>> tag = key.getSecondField();
>>>>> Iterator it1 = values.iterator();
>>>>> while(it1.hasNext()){
>>>>>         it1.next();
>>>>>         collector.emit(new Text("dummy"), tag);
>>>>> }
>>>>>
>>>>> I would expect every line of my output to be:
>>>>> dummy       1
>>>>> ...
>>>>> dummy       1
>>>>>
>>>>> but the value of tag actually changes over time, and I get this kind of output:
>>>>>
>>>>> dummy    1
>>>>> ...
>>>>> dummy    1
>>>>> dummy    2
>>>>> ...
>>>>> dummy    2
>>>>>
>>>>>
>>>>> Could someone explain why, please?
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Alberto Cordioli
>>>
>>>
>>>
>>> --
>>> Alberto Cordioli
>
>
>
> --
> Alberto Cordioli

Re: GroupingComparator

Posted by Dave Beech <db...@apache.org>.

Great! Glad the problem is solved.

You're right - the object returned by iterator.next() is re-used too.
So yes, you would need to clone in this case and you'd have no choice
but to create new objects.
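
A minimal sketch of that point, written without the Hadoop classes so
it runs on its own. MutableInt and ReusingIterator are hypothetical
stand-ins for a Writable value and for the values iterator, which hands
back one shared instance. In a real reducer the copy would more likely
be something like new IntWritable(value.get()) or
WritableUtils.clone(value, conf); that mapping is an assumption about
your value type.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a mutable Writable such as IntWritable. */
class MutableInt {
    private int value;
    MutableInt(int value) { this.value = value; }
    int get() { return value; }
    void set(int value) { this.value = value; }
}

/** Returns the same MutableInt instance on every call to next(). */
class ReusingIterator implements Iterator<MutableInt> {
    private final int[] data;
    private int pos = 0;
    private final MutableInt shared = new MutableInt(0);

    ReusingIterator(int... data) { this.data = data; }

    @Override
    public boolean hasNext() { return pos < data.length; }

    @Override
    public MutableInt next() {
        if (!hasNext()) throw new NoSuchElementException();
        shared.set(data[pos++]);   // overwrite the shared object in place
        return shared;
    }
}

class ValueCopyDemo {
    /** Stores the returned references: every element is the same object. */
    static List<Integer> storeReferences(int... data) {
        List<MutableInt> kept = new ArrayList<>();
        Iterator<MutableInt> it = new ReusingIterator(data);
        while (it.hasNext()) kept.add(it.next());
        return unwrap(kept);
    }

    /** Stores a fresh copy of each value, so "new" is unavoidable here. */
    static List<Integer> storeCopies(int... data) {
        List<MutableInt> kept = new ArrayList<>();
        Iterator<MutableInt> it = new ReusingIterator(data);
        while (it.hasNext()) kept.add(new MutableInt(it.next().get()));
        return unwrap(kept);
    }

    private static List<Integer> unwrap(List<MutableInt> boxes) {
        List<Integer> out = new ArrayList<>();
        for (MutableInt m : boxes) out.add(m.get());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(storeReferences(1, 2, 3)); // [3, 3, 3]
        System.out.println(storeCopies(1, 2, 3));     // [1, 2, 3]
    }
}
```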

Please be sure though that you really do need to store values in a
list to do what you're trying to do. Keeping an in-memory collection
might not be very scalable. Obviously, if you've got loads of RAM or
not a lot of data (or both), then that's fine! Just something else to
think about...

Cheers,
Dave

On 16 October 2012 09:42, Alberto Cordioli <co...@gmail.com> wrote:
> Thanks Dave.
> You solved my problem. Just a little question about your tip:
> I suppose also the value returned by iterator.next() is re-used.
> So if want to store some values of the Iterable list in the reducer, I
> should create a List and put cloned objects inside it.
> In this case there is no possibility to avoid the "new" operator, right?
>
>
>
> On 15 October 2012 22:49, Dave Beech <db...@apache.org> wrote:
>> Well, if all you need is the tag (the 1 or 2), why not just use a Text
>> or IntWritable instance variable. You wouldn't need to clone the whole
>> key.
>>
>> Then, instead of tag = key.getSecondField() you'd say
>> tag.set(key.getSecondField().get());
>> I don't know what type of object tag is (if it's Text you'll say
>> toString() rather than get()), but you see what I mean.
>>
>> Also - just a tip - try to avoid creating new objects wherever
>> possible. You'll get better performance if you create one Text object
>> as an instance variable and re-use it by setting the value instead of
>> calling new Text("") on every output.
>>
>> Thanks,
>> Dave
>>
>> On 15 October 2012 21:39, Alberto Cordioli <co...@gmail.com> wrote:
>>> Hi Dave,
>>>
>>> thanks for your reply. Now it's more clear; in fact the code that I
>>> wrote is inspired to the old api, where the behavior is another.
>>> So, how can I achieve the same behavior as the old api? I need the
>>> second field of the first key object to stay the same among the
>>> iterations, in order to compare it with other objects. Do I have to
>>> clone the object?
>>>
>>>
>>> Thanks.
>>>
>>> On 15 October 2012 21:27, Dave Beech <db...@apache.org> wrote:
>>>> Hi Alberto
>>>>
>>>> The iterator you are looping over in your reduce method isn't a
>>>> self-contained list of values. What's actually happening is that
>>>> you're iterating through *part* of the sorted key/value set that was
>>>> sent to that reduce node, and it is the grouping comparator that
>>>> decides when to break that loop and call reduce again on the next key.
>>>>
>>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>>> the values, what's actually happening is this pointer to the
>>>> associated key data moves with it - and you're seeing it change.
>>>>
>>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>>> API you get the first key, and it appears to stay the same during the
>>>> loop.
>>>>
>>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>>> don't act the same.
>>>>
>>>> Hope that helps,
>>>> Dave
>>>>
>>>> On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> a very strange thing is happening with my hadoop program.
>>>>> My map simply emits tuples with a custom object as key (which
>>>>> implement WritableComparable).
>>>>> The object is made of 2 fields, and I implement my partitioner and
>>>>> groupingclass in such a way that only the first field is taken into
>>>>> account.
>>>>> The second field is just a tag and could be 1 or 2.
>>>>>
>>>>> This is the reducer's snippet:
>>>>>
>>>>> tag = key.getSecondField();
>>>>> Iterator it1 = values.iterator();
>>>>> while(it1.hasNext()){
>>>>>         it1.next();
>>>>>         collector.emit(new Text("dummy"), tag);
>>>>> }
>>>>>
>>>>> I would expect in my output all the lines with:
>>>>> dummy       1
>>>>> ...
>>>>> dummy       1
>>>>>
>>>>> but actually the value of tag changes in time and I obtain this type of output:
>>>>>
>>>>> dummy    1
>>>>> ...
>>>>> dummy    1
>>>>> dummy    2
>>>>> ...
>>>>> dummy    2
>>>>>
>>>>>
>>>>> Someone could explain me way, please?
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Alberto Cordioli
>>>
>>>
>>>
>>> --
>>> Alberto Cordioli
>
>
>
> --
> Alberto Cordioli

Re: GroupingComparator

Posted by Alberto Cordioli <co...@gmail.com>.

Thanks Dave.
You solved my problem. Just a little question about your tip:
I suppose the value returned by iterator.next() is re-used as well.
So if I want to store some values from the Iterable in the reducer, I
should create a List and put cloned objects inside it.
In this case there is no way to avoid the "new" operator, right?



On 15 October 2012 22:49, Dave Beech <db...@apache.org> wrote:
> Well, if all you need is the tag (the 1 or 2), why not just use a Text
> or IntWritable instance variable. You wouldn't need to clone the whole
> key.
>
> Then, instead of tag = key.getSecondField() you'd say
> tag.set(key.getSecondField().get());
> I don't know what type of object tag is (if it's Text you'll say
> toString() rather than get()), but you see what I mean.
>
> Also - just a tip - try to avoid creating new objects wherever
> possible. You'll get better performance if you create one Text object
> as an instance variable and re-use it by setting the value instead of
> calling new Text("") on every output.
>
> Thanks,
> Dave
>
> On 15 October 2012 21:39, Alberto Cordioli <co...@gmail.com> wrote:
>> Hi Dave,
>>
>> thanks for your reply. Now it's more clear; in fact the code that I
>> wrote is inspired to the old api, where the behavior is another.
>> So, how can I achieve the same behavior as the old api? I need the
>> second field of the first key object to stay the same among the
>> iterations, in order to compare it with other objects. Do I have to
>> clone the object?
>>
>>
>> Thanks.
>>
>> On 15 October 2012 21:27, Dave Beech <db...@apache.org> wrote:
>>> Hi Alberto
>>>
>>> The iterator you are looping over in your reduce method isn't a
>>> self-contained list of values. What's actually happening is that
>>> you're iterating through *part* of the sorted key/value set that was
>>> sent to that reduce node, and it is the grouping comparator that
>>> decides when to break that loop and call reduce again on the next key.
>>>
>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>> the values, what's actually happening is this pointer to the
>>> associated key data moves with it - and you're seeing it change.
>>>
>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>> API you get the first key, and it appears to stay the same during the
>>> loop.
>>>
>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>> don't act the same.
>>>
>>> Hope that helps,
>>> Dave
>>>
>>> On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> a very strange thing is happening with my hadoop program.
>>>> My map simply emits tuples with a custom object as key (which
>>>> implement WritableComparable).
>>>> The object is made of 2 fields, and I implement my partitioner and
>>>> groupingclass in such a way that only the first field is taken into
>>>> account.
>>>> The second field is just a tag and could be 1 or 2.
>>>>
>>>> This is the reducer's snippet:
>>>>
>>>> tag = key.getSecondField();
>>>> Iterator it1 = values.iterator();
>>>> while(it1.hasNext()){
>>>>         it1.next();
>>>>         collector.emit(new Text("dummy"), tag);
>>>> }
>>>>
>>>> I would expect in my output all the lines with:
>>>> dummy       1
>>>> ...
>>>> dummy       1
>>>>
>>>> but actually the value of tag changes in time and I obtain this type of output:
>>>>
>>>> dummy    1
>>>> ...
>>>> dummy    1
>>>> dummy    2
>>>> ...
>>>> dummy    2
>>>>
>>>>
>>>> Someone could explain me way, please?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Alberto Cordioli
>>
>>
>>
>> --
>> Alberto Cordioli



-- 
Alberto Cordioli

Re: GroupingComparator

Posted by Dave Beech <db...@apache.org>.

Well, if all you need is the tag (the 1 or 2), why not just use a Text
or IntWritable instance variable? You wouldn't need to clone the whole
key.

Then, instead of tag = key.getSecondField() you'd say
tag.set(key.getSecondField().get());
I don't know what type of object tag is (if it's Text you'll say
toString() rather than get()), but you see what I mean.

Also - just a tip - try to avoid creating new objects wherever
possible. You'll get better performance if you create one Text object
as an instance variable and re-use it by setting the value instead of
calling new Text("") on every output.
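
A rough, Hadoop-free sketch of the difference between keeping a
reference into the key and copying the tag value out of it. IntBox and
PairKey are hypothetical stand-ins for IntWritable and the custom key,
and iterate() only imitates the framework refilling the same key object
for each record in the group; none of these names come from Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a mutable Writable such as IntWritable. */
class IntBox {
    private int value;
    int get() { return value; }
    void set(int value) { this.value = value; }
}

/** Hypothetical two-field key; only the tag field matters here. */
class PairKey {
    private final IntBox second = new IntBox();
    IntBox getSecondField() { return second; }
}

class TagCopyDemo {
    // Allocated once and re-used on every call instead of calling "new" each time.
    private final IntBox tag = new IntBox();

    /** Keeps a reference into the key, so the "tag" follows the key as it changes. */
    List<Integer> reduceAliasing(int... tagsInGroup) {
        PairKey key = new PairKey();
        key.getSecondField().set(tagsInGroup[0]);
        IntBox alias = key.getSecondField();   // like: tag = key.getSecondField()
        return iterate(key, alias, tagsInGroup);
    }

    /** Copies the primitive out of the key before iterating. */
    List<Integer> reduceCopying(int... tagsInGroup) {
        PairKey key = new PairKey();
        key.getSecondField().set(tagsInGroup[0]);
        tag.set(key.getSecondField().get());   // like: tag.set(key.getSecondField().get())
        return iterate(key, tag, tagsInGroup);
    }

    /** Imitates the framework overwriting the same key object for each record. */
    private static List<Integer> iterate(PairKey key, IntBox observed, int[] tagsInGroup) {
        List<Integer> emitted = new ArrayList<>();
        for (int t : tagsInGroup) {
            key.getSecondField().set(t);
            emitted.add(observed.get());
        }
        return emitted;
    }

    public static void main(String[] args) {
        TagCopyDemo demo = new TagCopyDemo();
        System.out.println(demo.reduceAliasing(1, 1, 2, 2)); // [1, 1, 2, 2]
        System.out.println(demo.reduceCopying(1, 1, 2, 2));  // [1, 1, 1, 1]
    }
}
```

The aliasing variant reproduces the 1...1, 2...2 output from the
original question, while the copying variant keeps the first tag for
the whole group.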

Thanks,
Dave

On 15 October 2012 21:39, Alberto Cordioli <co...@gmail.com> wrote:
> Hi Dave,
>
> thanks for your reply. Now it's more clear; in fact the code that I
> wrote is inspired to the old api, where the behavior is another.
> So, how can I achieve the same behavior as the old api? I need the
> second field of the first key object to stay the same among the
> iterations, in order to compare it with other objects. Do I have to
> clone the object?
>
>
> Thanks.
>
> On 15 October 2012 21:27, Dave Beech <db...@apache.org> wrote:
>> Hi Alberto
>>
>> The iterator you are looping over in your reduce method isn't a
>> self-contained list of values. What's actually happening is that
>> you're iterating through *part* of the sorted key/value set that was
>> sent to that reduce node, and it is the grouping comparator that
>> decides when to break that loop and call reduce again on the next key.
>>
>> Moreover, the "key" object is re-used. So, as you're iterating through
>> the values, what's actually happening is this pointer to the
>> associated key data moves with it - and you're seeing it change.
>>
>> This only happens in the new "mapreduce" API - in the older "mapred"
>> API you get the first key, and it appears to stay the same during the
>> loop.
>>
>> It's sometimes useful behaviour, but it's confusing how the two APIs
>> don't act the same.
>>
>> Hope that helps,
>> Dave
>>
>> On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
>>> Hi all,
>>>
>>> a very strange thing is happening with my hadoop program.
>>> My map simply emits tuples with a custom object as key (which
>>> implement WritableComparable).
>>> The object is made of 2 fields, and I implement my partitioner and
>>> groupingclass in such a way that only the first field is taken into
>>> account.
>>> The second field is just a tag and could be 1 or 2.
>>>
>>> This is the reducer's snippet:
>>>
>>> tag = key.getSecondField();
>>> Iterator it1 = values.iterator();
>>> while(it1.hasNext()){
>>>         it1.next();
>>>         collector.emit(new Text("dummy"), tag);
>>> }
>>>
>>> I would expect in my output all the lines with:
>>> dummy       1
>>> ...
>>> dummy       1
>>>
>>> but actually the value of tag changes in time and I obtain this type of output:
>>>
>>> dummy    1
>>> ...
>>> dummy    1
>>> dummy    2
>>> ...
>>> dummy    2
>>>
>>>
>>> Someone could explain me way, please?
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Alberto Cordioli
>
>
>
> --
> Alberto Cordioli

Re: GroupingComparator

Posted by Alberto Cordioli <co...@gmail.com>.

Hi Dave,

thanks for your reply. It's clearer now; in fact, the code I wrote
was modeled on the old API, where the behavior is different.
So, how can I achieve the same behavior as the old API? I need the
second field of the first key object to stay the same across
iterations, in order to compare it with other objects. Do I have to
clone the object?


Thanks.

On 15 October 2012 21:27, Dave Beech <db...@apache.org> wrote:
> Hi Alberto
>
> The iterator you are looping over in your reduce method isn't a
> self-contained list of values. What's actually happening is that
> you're iterating through *part* of the sorted key/value set that was
> sent to that reduce node, and it is the grouping comparator that
> decides when to break that loop and call reduce again on the next key.
>
> Moreover, the "key" object is re-used. So, as you're iterating through
> the values, what's actually happening is this pointer to the
> associated key data moves with it - and you're seeing it change.
>
> This only happens in the new "mapreduce" API - in the older "mapred"
> API you get the first key, and it appears to stay the same during the
> loop.
>
> It's sometimes useful behaviour, but it's confusing how the two APIs
> don't act the same.
>
> Hope that helps,
> Dave
>
> On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
>> Hi all,
>>
>> a very strange thing is happening with my hadoop program.
>> My map simply emits tuples with a custom object as key (which
>> implement WritableComparable).
>> The object is made of 2 fields, and I implement my partitioner and
>> groupingclass in such a way that only the first field is taken into
>> account.
>> The second field is just a tag and could be 1 or 2.
>>
>> This is the reducer's snippet:
>>
>> tag = key.getSecondField();
>> Iterator it1 = values.iterator();
>> while(it1.hasNext()){
>>         it1.next();
>>         collector.emit(new Text("dummy"), tag);
>> }
>>
>> I would expect in my output all the lines with:
>> dummy       1
>> ...
>> dummy       1
>>
>> but actually the value of tag changes in time and I obtain this type of output:
>>
>> dummy    1
>> ...
>> dummy    1
>> dummy    2
>> ...
>> dummy    2
>>
>>
>> Someone could explain me way, please?
>>
>>
>> Thanks.
>>
>>
>>
>>
>>
>> --
>> Alberto Cordioli



-- 
Alberto Cordioli

Re: GroupingComparator

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.

On Oct 15, 2012, at 12:27 PM, Dave Beech wrote:

> This only happens in the new "mapreduce" API - in the older "mapred"
> API you get the first key, and it appears to stay the same during the
> loop.
> 
> It's sometimes useful behaviour, but it's confusing how the two APIs
> don't act the same.

Yes, it is a bit counterintuitive. In the older API, there is no way to find out the current key corresponding to the current value when a grouping comparator/secondary sort is in use. The newer API gives you this facility, and that naturally changes the behavior of the key.
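
To make that facility concrete, here is a small plain-Java sketch (no Hadoop involved). TaggedRecord and the currentTag field are hypothetical stand-ins for a record's full key and for the re-used key's second field; the assignment inside the loop only imitates the framework refilling the key. Reading the key inside the loop tells you which input each value came from, which is what a tagged join needs:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record: a value plus the tag carried by its full (ungrouped) key. */
final class TaggedRecord {
    final int tag;
    final String value;

    TaggedRecord(int tag, String value) {
        this.tag = tag;
        this.value = value;
    }
}

final class CurrentKeyDemo {
    // Stands in for the second field of the re-used key object.
    private int currentTag;

    /** Processes one group; every record shares the first key field but may differ in tag. */
    List<String> reduce(List<TaggedRecord> group) {
        List<String> out = new ArrayList<>();
        for (TaggedRecord record : group) {
            currentTag = record.tag;  // imitates the framework refilling the key
            // Read the key per value, inside the loop, rather than once up front.
            out.add(record.value + " came from input " + currentTag);
        }
        return out;
    }
}
```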

FWIW, we can document this as part of the Reducer class' javadoc.

Thanks,
+Vinod

Re: GroupingComparator

Posted by Dave Beech <db...@apache.org>.

Hi Alberto

The iterator you are looping over in your reduce method isn't a
self-contained list of values. What's actually happening is that
you're iterating through *part* of the sorted key/value set that was
sent to that reduce node, and it is the grouping comparator that
decides when to break that loop and call reduce again on the next key.

Moreover, the "key" object is re-used. So, as you iterate through the
values, the same key object is refilled with the key data belonging to
the current value - and that is the change you're seeing.
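
The effect can be reproduced without Hadoop. The following plain-Java
sketch is only a simulation: IntBox and TaggedKey are hypothetical
stand-ins for a Writable and for the custom key, and the iterator
imitates the framework by refilling the single key object on every
next(). Holding a reference to the key's field, as the reducer in the
question does, then shows the tag changing mid-loop:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a mutable Writable such as IntWritable. */
final class IntBox {
    private int value;
    int get() { return value; }
    void set(int value) { this.value = value; }
}

/** Hypothetical composite key; only its tag matters for this demo. */
final class TaggedKey {
    private final IntBox tag = new IntBox();
    IntBox getSecondField() { return tag; }
}

final class KeyReuseDemo {
    /** Imitates the framework: every next() refills the one shared key object. */
    static Iterator<String> values(int[] tagsInSortOrder, TaggedKey key) {
        return new Iterator<String>() {
            private int index = 0;

            @Override public boolean hasNext() {
                return index < tagsInSortOrder.length;
            }

            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                key.getSecondField().set(tagsInSortOrder[index]);
                String value = "value" + index;
                index++;
                return value;
            }
        };
    }

    /** Same shape as the reducer in the question: grab the tag once, then loop. */
    static List<String> reduce(int[] tagsInSortOrder) {
        TaggedKey key = new TaggedKey();
        key.getSecondField().set(tagsInSortOrder[0]); // key starts at the first record
        IntBox tag = key.getSecondField();            // a reference, not a copy
        List<String> out = new ArrayList<>();
        Iterator<String> it = values(tagsInSortOrder, key);
        while (it.hasNext()) {
            it.next();
            out.add("dummy\t" + tag.get());
        }
        return out;
    }

    public static void main(String[] args) {
        // One group whose records carry tags 1, 1, 2, 2 in sort order.
        reduce(new int[] {1, 1, 2, 2}).forEach(System.out::println);
    }
}
```

With tags 1, 1, 2, 2 in one group, the output is two "dummy 1" lines
followed by two "dummy 2" lines (tab-separated), matching what was
reported.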

This only happens in the new "mapreduce" API - in the older "mapred"
API you get the first key, and it appears to stay the same during the
loop.

It's sometimes useful behaviour, but it's confusing how the two APIs
don't act the same.

Hope that helps,
Dave

On 15 October 2012 20:11, Alberto Cordioli <co...@gmail.com> wrote:
> Hi all,
>
> a very strange thing is happening with my hadoop program.
> My map simply emits tuples with a custom object as key (which
> implement WritableComparable).
> The object is made of 2 fields, and I implement my partitioner and
> groupingclass in such a way that only the first field is taken into
> account.
> The second field is just a tag and could be 1 or 2.
>
> This is the reducer's snippet:
>
> tag = key.getSecondField();
> Iterator it1 = values.iterator();
> while(it1.hasNext()){
>         it1.next();
>         collector.emit(new Text("dummy"), tag);
> }
>
> I would expect in my output all the lines with:
> dummy       1
> ...
> dummy       1
>
> but actually the value of tag changes in time and I obtain this type of output:
>
> dummy    1
> ...
> dummy    1
> dummy    2
> ...
> dummy    2
>
>
> Someone could explain me way, please?
>
>
> Thanks.
>
>
>
>
>
> --
> Alberto Cordioli
