Posted to hdfs-user@hadoop.apache.org by Eugene Morozov <em...@griddynamics.com> on 2013/08/23 17:19:17 UTC

Partitioner vs GroupComparator

Hello,

I have two different types of keys emitted from Map and processed by
Reduce. These keys have some part in common, and I'd like keys that share
this part to end up in the same reducer. For that purpose I used a
Partitioner that partitions everything by this common part. That seems to
work fine, but MRUnit doesn't seem to know anything about Partitioners.
This is where GroupComparator comes into play. MRUnit seems to be well
aware of it, but it surprises me: it looks like Partitioner and
GroupComparator are doing exactly the same thing, since they both somehow
group keys so that they land in one reducer.
Could you shed some light on this, please?
--

RE: Partitioner vs GroupComparator

Posted by java8964 java8964 <ja...@hotmail.com>.
As Harsh said, sometimes you want to do a secondary sort, but MR can only sort by key, not by value.
Often you want the reducer output sorted by some field, but only within a group, kind of like a 'windowing sort' in relational DB SQL. For example, if you have data about all the employees, you may want the MR job to sort the employees by salary, but within each department.
So what do you choose as the key to emit from the Mapper? Department_id? If so, it is hard to get the result sorted by salary. If you use "Department_id + salary", then by default you cannot put all the data from one department into one reducer.
In this case, you separate the way keys are composed from the way they are grouped. You still use 'Department_id + salary' as the key, but you override the GroupComparator to group ONLY by "Department_id", while the data is still sorted on both 'Department_id + salary'. The final goal is to make sure that all the data for the same department arrives in the same reducer and, when it arrives, is already sorted by salary, by making use of MR's built-in sort/shuffle ability.
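To make the department example concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API; it only simulates what the framework does when it is given one order for sorting and a different one for grouping. The names SecondarySortSketch, EmpKey, SORT, GROUP and reduceCalls are made up for illustration, and the sample salaries are invented. In a real job the two orders would be registered on the Job via setSortComparatorClass and setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of secondary sort; none of these names are Hadoop classes.
final class SecondarySortSketch {

    // Hypothetical composite map output key: (departmentId, salary).
    static final class EmpKey {
        final int departmentId;
        final int salary;

        EmpKey(int departmentId, int salary) {
            this.departmentId = departmentId;
            this.salary = salary;
        }
    }

    // Sort order: department first, then salary (the role of the sort comparator).
    static final Comparator<EmpKey> SORT = (a, b) -> {
        int byDepartment = Integer.compare(a.departmentId, b.departmentId);
        return byDepartment != 0 ? byDepartment : Integer.compare(a.salary, b.salary);
    };

    // Grouping order: department only (the role of the grouping comparator).
    static final Comparator<EmpKey> GROUP =
            (a, b) -> Integer.compare(a.departmentId, b.departmentId);

    // Sorts the keys, then starts a new "reduce call" whenever GROUP reports a change.
    static Map<Integer, List<Integer>> reduceCalls(List<EmpKey> mapOutput) {
        List<EmpKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<Integer, List<Integer>> calls = new LinkedHashMap<>();
        EmpKey previous = null;
        for (EmpKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                calls.put(key.departmentId, new ArrayList<>());
            }
            calls.get(key.departmentId).add(key.salary);
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<EmpKey> mapOutput = Arrays.asList(
                new EmpKey(20, 5000), new EmpKey(10, 7000),
                new EmpKey(10, 3000), new EmpKey(20, 4000));
        // prints {10=[3000, 7000], 20=[4000, 5000]}
        System.out.println(reduceCalls(mapOutput));
    }
}
```

Each map entry stands for one reduce call: one call per department, with the salaries already in ascending order when the call starts.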
Yong

Date: Fri, 23 Aug 2013 13:06:01 -0400
Subject: Re: Partitioner vs GroupComparator
From: shahab.yunus@gmail.com
To: user@hadoop.apache.org

@Jan, why not avoid sending the 'hidden' part of the key as a value at all? You could pass the value as null, or use it for some other part of the record. Then there is no duplication, and on the reducer side you can extract the 'hidden' part from the key yourself (which should be possible, since you will be encapsulating it in some class/object model anyway).

Regards,
Shahab

On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský <ja...@firma.seznam.cz> wrote:

Hi all,

While we are on this topic, has anyone ever measured how much more data needs to be transferred over the network when using a GroupingComparator the way Harsh suggests? What I mean is that when you use the GroupingComparator, it hides from you the real key that you emitted from the Mapper. You see only the first key in the reduce group, and any data that was carried in the key needs to be duplicated in the value in order to be accessible on the reduce end.

Let's say you have a key consisting of two parts (base, extension), you partition by the 'base' part, and you use a GroupingComparator to group keys with the same base part. Then you have no other choice than to emit from the Mapper something like (key: (base, extension), value: extension), which means the 'extension' part is duplicated in the data that has to be transferred over the network. This overhead can be reduced by using compression between the map and reduce sides, but I believe that in some cases it can be significant.
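A rough way to picture the size of that duplication is to count bytes for a single record. The sketch below is plain Java and only an approximation: it assumes both key parts are strings written with DataOutputStream.writeUTF (a 2-byte length prefix plus the encoded characters), which is not how Hadoop's own Writable types are laid out. The class name DuplicatedExtensionSketch, the method serializedSize and the sample strings are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Byte-count illustration only; real Hadoop serialization will give different numbers.
final class DuplicatedExtensionSketch {

    // Serializes key (base, extension) and, optionally, the extension again as the value.
    static int serializedSize(String base, String extension, boolean repeatExtensionInValue) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(base);       // key, part 1
            out.writeUTF(extension);  // key, part 2
            if (repeatExtensionInValue) {
                out.writeUTF(extension);  // value: the same extension once more
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        // prints 32
        System.out.println(serializedSize("user42", "2013-08-23", true));
        // prints 20
        System.out.println(serializedSize("user42", "2013-08-23", false));
    }
}
```

With these made-up strings the record grows from 20 to 32 bytes before compression, so the relative cost depends on how large the extension is compared with the rest of the record.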




It would be nice if the API allowed access to the 'real' key for each value, not only the first key of the reduce group. The only way to get rid of this overhead now is to not use the GroupingComparator and instead keep some internal state in the Reducer class that persists across multiple calls to the reduce() method, which in my opinion makes using the GroupingComparator this way a less 'preferred' way of doing secondary sort.
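The stateful alternative described above could look roughly like the following. This is a plain-Java sketch, not a Hadoop Reducer: it assumes reduce() is invoked once per distinct (base, extension) key in sorted order, and that a cleanup step runs after the last call (Hadoop's Reducer has a cleanup hook that could play this role). The class StatefulReducerSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates a reducer that groups by 'base' itself instead of using a grouping comparator.
final class StatefulReducerSketch {

    private String currentBase;                                  // survives across reduce() calls
    private final List<String> extensions = new ArrayList<>();   // extensions seen for currentBase
    private final List<String> output = new ArrayList<>();

    // Called once per distinct (base, extension) key, in sorted key order.
    void reduce(String base, String extension) {
        if (currentBase != null && !currentBase.equals(base)) {
            flush();  // the base changed, so the previous group is complete
        }
        currentBase = base;
        extensions.add(extension);
    }

    // Called once after the last reduce() call to emit the final group.
    void cleanup() {
        if (currentBase != null) {
            flush();
        }
    }

    private void flush() {
        output.add(currentBase + " -> " + extensions);
        extensions.clear();
        currentBase = null;
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        StatefulReducerSketch reducer = new StatefulReducerSketch();
        reducer.reduce("a", "1");
        reducer.reduce("a", "2");
        reducer.reduce("b", "1");
        reducer.cleanup();
        // prints [a -> [1, 2], b -> [1]]
        System.out.println(reducer.output());
    }
}
```

Here the extension travels only in the key, at the price of the reducer having to detect group boundaries and flush the last group by hand.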




Does anyone have any experience with this overhead?

Jan

On 08/23/2013 06:05 PM, Harsh J wrote:

The partitioner runs on the map-end. It assigns a partition ID
(reducer ID) to each key.
The grouping comparator runs on the reduce-end. It helps reducers,
which read off a merge-sorted single file, to understand how to break
the sequential file into reduce calls of <key, values[]>.

Typically one never overrides the GroupingComparator, and it is
usually the same as the SortComparator. But if you wish to do things
such as Secondary Sort, then overriding this comes in useful, because
you may want to sort over two parts of a key object, but only group by
one part, etc.
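The two roles can be put side by side in a small plain-Java simulation. It is only a sketch and uses no Hadoop classes: PartitionVsGroupSketch, Key, shuffle and reduceCalls are hypothetical names, and the hash-modulo formula in partition() is just one common way to pick a reducer, applied here to the 'base' part of the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation: the partitioner chooses a reducer, the grouping comparator splits reduce calls.
final class PartitionVsGroupSketch {

    // Hypothetical composite key (base, extension).
    static final class Key {
        final String base;
        final String extension;

        Key(String base, String extension) {
            this.base = base;
            this.extension = extension;
        }

        @Override
        public String toString() {
            return base + "/" + extension;
        }
    }

    // Map side: the reducer is chosen from the base part only.
    static int partition(Key key, int numReducers) {
        return (key.base.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort order uses both parts; grouping order uses the base part only.
    static final Comparator<Key> SORT = (a, b) -> {
        int byBase = a.base.compareTo(b.base);
        return byBase != 0 ? byBase : a.extension.compareTo(b.extension);
    };
    static final Comparator<Key> GROUP = (a, b) -> a.base.compareTo(b.base);

    // Map side: distribute the keys over the reducers.
    static Map<Integer, List<Key>> shuffle(List<Key> mapOutput, int numReducers) {
        Map<Integer, List<Key>> partitions = new TreeMap<>();
        for (Key key : mapOutput) {
            partitions.computeIfAbsent(partition(key, numReducers), p -> new ArrayList<>()).add(key);
        }
        return partitions;
    }

    // Reduce side: sort one partition and break it into reduce calls.
    static List<List<Key>> reduceCalls(List<Key> partition) {
        List<Key> sorted = new ArrayList<>(partition);
        sorted.sort(SORT);
        List<List<Key>> calls = new ArrayList<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key);
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Key> mapOutput = Arrays.asList(new Key("a", "2"), new Key("b", "1"), new Key("a", "1"));
        // One reducer: every key lands in partition 0, yet there are still two reduce calls.
        // prints [[a/1, a/2], [b/1]]
        System.out.println(reduceCalls(shuffle(mapOutput, 1).get(0)));
    }
}
```

With a single reducer the partitioner has nothing to decide, but the grouping order still produces one reduce call per base. That is the sense in which the two are not doing the same job: one picks the reducer, the other decides where one reduce call ends and the next begins inside that reducer.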




Re: Partitioner vs GroupComparator

Posted by Shahab Yunus <sh...@gmail.com>.
@Jan, why not avoid sending the 'hidden' part of the key as a value at
all? You could pass the value as null, or use it for some other part of
the record. Then there is no duplication, and on the reducer side you can
extract the 'hidden' part from the key yourself (which should be possible,
since you will be encapsulating it in some class/object model anyway).

Regards,
Shahab





Re: Partitioner vs GroupComparator

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi all,

when speaking about this, has anyone ever measured how much more data
needs to be transferred over the network when using GroupingComparator
the way Harsh suggests? What I mean is, when you use the
GroupingComparator, it hides from you the real key that you have emitted
from Mapper. You just see the first key in the reduce group, and any data
that was carried in the key needs to be duplicated in the value in order
to be accessible on the reduce end.

Let's say you have a key consisting of two parts (base, extension), you
partition by the 'base' part and use GroupingComparator to group keys
with the same base part. Then you have no other choice than to emit from
Mapper something like this - (key: (base, extension), value: extension),
which means the 'extension' part is duplicated in the data that has to
be transferred over the network. This overhead can be diminished by
using compression between map and reduce side, but I believe that in
some cases it can be significant.
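
To put a rough number on that duplication, here is a minimal sketch. It is
plain Java with java.io only, not Hadoop's Writable serialization, so the
byte counts are illustrative and not what a real job would shuffle. The
class name, the recordSize() helper and the sample strings are all
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration: plain java.io, no Hadoop classes involved.
class DuplicationOverheadSketch {

    // Serialized size in bytes of one record: key (base, extension) plus value.
    // A null value means nothing is written for the value part.
    static int recordSize(String base, String extension, String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(base);       // key part used for partitioning and grouping
            out.writeUTF(extension);  // key part used only for sorting
            if (value != null) {
                out.writeUTF(value);  // the extension repeated in the value
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int duplicated = recordSize("dept42", "salary=0005000", "salary=0005000");
        int keyOnly = recordSize("dept42", "salary=0005000", null);
        System.out.println("extension in key and value: " + duplicated + " bytes");
        System.out.println("extension in key only:      " + keyOnly + " bytes");
    }
}
```

In this toy encoding the duplicated extension costs as many extra bytes per
record as the extension itself takes in the key, so the relative overhead
depends on how large the extension is compared to the rest of the record.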

It would be nice if the API allowed access to the 'real' key for each
value, not only the first key of the reduce group. The only way to get
rid of this overhead now is by not using the GroupingComparator and
instead storing some internal state in the Reducer class that is
persisted across multiple calls to the reduce() method, which in my
opinion makes using GroupingComparator this way a less 'preferred' way
of doing secondary sort.
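
The stateful alternative could look roughly like the sketch below. It is a
plain-Java simulation and not a Hadoop Reducer: the class, its reduce() and
cleanup() methods and the run() driver are hypothetical stand-ins, assuming
the framework calls reduce() once per distinct full key in sorted order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Reducer that keeps state between reduce() calls.
class StatefulReducerSketch {
    private String currentBase = null;  // base part of the group being collected
    private final List<String> extensions = new ArrayList<String>();
    final List<String> output = new ArrayList<String>();

    // Called once per distinct full key (base, extension), in sorted key order.
    void reduce(String base, String extension) {
        if (currentBase != null && !currentBase.equals(base)) {
            flush();  // base changed, so the previous logical group is complete
        }
        currentBase = base;
        extensions.add(extension);  // read from the key, no copy in the value
    }

    // Must run after the last reduce() call to emit the final group.
    void cleanup() {
        if (currentBase != null) {
            flush();
        }
    }

    private void flush() {
        output.add(currentBase + " -> " + extensions);
        extensions.clear();
        currentBase = null;
    }

    // Drives the sketch the way a framework would: keys arrive already sorted.
    static List<String> run(String[][] sortedKeys) {
        StatefulReducerSketch reducer = new StatefulReducerSketch();
        for (String[] key : sortedKeys) {
            reducer.reduce(key[0], key[1]);
        }
        reducer.cleanup();
        return reducer.output;
    }
}
```

The price is the extra bookkeeping: the last group is only emitted in
cleanup(), and the logic relies on keys with the same base arriving
back to back.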

Does anyone have any experience with this overhead?

Jan

On 08/23/2013 06:05 PM, Harsh J wrote:
> The partitioner runs on the map-end. It assigns a partition ID
> (reducer ID) to each key.
> The grouping comparator runs on the reduce-end. It helps reducers,
> which read off a merge-sorted single file, to understand how to break
> the sequential file into reduce calls of <key, values[]>.
>
> Typically one never overrides the GroupingComparator, and it is
> usually the same as the SortComparator. But if you wish to do things
> such as Secondary Sort, then overriding this comes useful - cause you
> may want to sort over two parts of a key object, but only group by one
> part, etc..
>
> On Fri, Aug 23, 2013 at 8:49 PM, Eugene Morozov
> <em...@griddynamics.com> wrote:
>> Hello,
>>
>> I have two different types of keys emerged from Map and processed by Reduce.
>> These keys have some part in common. And I'd like to have similar keys in
>> one reducer. For that purpose I used Partitioner and partition everything
>> gets in by this common part. It seems to be fine, but MRUnit seems doesn't
>> know anything about Partitioners. So, here is where GroupComparator comes
>> into play. It seems that MRUnit well aware of the guy, but it surprises me:
>> it looks like Partitioner and GroupComparator are actually doing exactly
>> same - they both somehow group keys to have them in one reducer.
>> Could you shed some light on it, please.
>> --
>>
>
>

Re: Partitioner vs GroupComparator

Posted by Harsh J <ha...@cloudera.com>.
The partitioner runs on the map-end. It assigns a partition ID
(reducer ID) to each key.
The grouping comparator runs on the reduce-end. It helps reducers,
which read off a merge-sorted single file, to understand how to break
the sequential file into reduce calls of <key, values[]>.

Typically one never overrides the GroupingComparator, and it is
usually the same as the SortComparator. But if you wish to do things
such as Secondary Sort, then overriding it comes in useful, because you
may want to sort over two parts of a key object but only group by one
part.
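
A minimal sketch of the three roles, as a plain-Java simulation rather than
the Hadoop API: the Key class, partition(), the SORT and GROUP comparators
and reduceCalls() are hypothetical names chosen for illustration. The
partition formula follows the usual hash-modulo idea; a real job would
plug in its own Partitioner and comparator classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of partitioning, sorting and grouping.
class PartitionVsGroupSketch {

    static final class Key {
        final String base;
        final int extension;
        Key(String base, int extension) { this.base = base; this.extension = extension; }
        @Override public String toString() { return base + ":" + extension; }
    }

    // Map side: chooses the reducer for a key, looking only at the base part.
    static int partition(Key key, int numReducers) {
        return (key.base.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort order inside one reducer: base first, then extension.
    static final Comparator<Key> SORT = new Comparator<Key>() {
        @Override public int compare(Key a, Key b) {
            int byBase = a.base.compareTo(b.base);
            return byBase != 0 ? byBase : Integer.compare(a.extension, b.extension);
        }
    };

    // Grouping: two keys belong to the same reduce call when their bases match.
    static final Comparator<Key> GROUP = new Comparator<Key>() {
        @Override public int compare(Key a, Key b) {
            return a.base.compareTo(b.base);
        }
    };

    // Reduce side: walk one reducer's sorted keys and start a new reduce call
    // wherever GROUP says the current key differs from the previous one.
    static List<List<Key>> reduceCalls(List<Key> keysForOneReducer) {
        List<Key> sorted = new ArrayList<Key>(keysForOneReducer);
        Collections.sort(sorted, SORT);
        List<List<Key>> calls = new ArrayList<List<Key>>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                calls.add(new ArrayList<Key>());
            }
            calls.get(calls.size() - 1).add(key);
            previous = key;
        }
        return calls;
    }
}
```

Here partition() only decides which reducer receives a key, while GROUP
only decides where one reduce call ends and the next begins inside that
reducer. Several groups can still share a reducer.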

On Fri, Aug 23, 2013 at 8:49 PM, Eugene Morozov
<em...@griddynamics.com> wrote:
> Hello,
>
> I have two different types of keys emerged from Map and processed by Reduce.
> These keys have some part in common. And I'd like to have similar keys in
> one reducer. For that purpose I used Partitioner and partition everything
> gets in by this common part. It seems to be fine, but MRUnit seems doesn't
> know anything about Partitioners. So, here is where GroupComparator comes
> into play. It seems that MRUnit well aware of the guy, but it surprises me:
> it looks like Partitioner and GroupComparator are actually doing exactly
> same - they both somehow group keys to have them in one reducer.
> Could you shed some light on it, please.
> --
>



-- 
Harsh J
