Posted to user@hadoop.apache.org by Jan Lukavský <ja...@firma.seznam.cz> on 2013/09/02 15:29:40 UTC
M/R API and Writable semantics in reducer
Hi all,
some time ago I posted a note to this list saying it would be nice to
be able to get the *real* key emitted from the mapper in the reducer
when using a GroupingComparator. The answer I got was that this is
already possible because of the Writable semantics, and that the
following currently holds:
  @Override
  protected void reduce(Key key, Iterable<Value> values, Context context)
  {
    for (Value v : values) {
      // The key MIGHT change its value in this cycle, because
      // readFields() will be called on it.
      // When using a GroupingComparator that groups only by some part of
      // the key, many different keys might be considered a single group,
      // so the *real* data matters.
    }
  }
When you use a GroupingComparator, the contents of the key can matter:
if you cannot access them, you have to duplicate the data in the value
(which means more network traffic in the shuffle phase, and more I/O
generally).
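The key reuse described above can be illustrated with a small plain-Java simulation. This is only a sketch with no Hadoop dependency: CompositeKey, Rec, ReusedKeyGroup and KeyReuseDemo are hypothetical names, and the iterator merely imitates a framework overwriting one shared key object as the values are consumed. It is not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical composite key: grouping looks only at 'group', sorting also at 'order'. */
final class CompositeKey {
    String group;
    int order;
}

/** One record of the simulated, already sorted reducer input (hypothetical). */
final class Rec {
    final String group;
    final int order;
    final String value;

    Rec(String group, int order, String value) {
        this.group = group;
        this.order = order;
        this.value = value;
    }
}

/** Simulates a framework that reuses a single key object for a whole group. */
final class ReusedKeyGroup implements Iterable<String> {
    final CompositeKey key = new CompositeKey();
    private final List<Rec> records;

    ReusedKeyGroup(List<Rec> records) {
        this.records = records;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < records.size();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Rec r = records.get(pos++);
                // Stand-in for readFields(): overwrite the shared key in place.
                key.group = r.group;
                key.order = r.order;
                return r.value;
            }
        };
    }
}

final class KeyReuseDemo {
    /** Collects the key's 'order' field as observed while iterating the values. */
    static List<Integer> ordersSeen(ReusedKeyGroup group) {
        List<Integer> seen = new ArrayList<>();
        for (String value : group) {
            seen.add(group.key.order);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        records.add(new Rec("a", 1, "x"));
        records.add(new Rec("a", 2, "y"));
        records.add(new Rec("a", 3, "z"));
        System.out.println(ordersSeen(new ReusedKeyGroup(records)));
    }
}
```

Running main prints [1, 2, 3]: the same key reference shows a different 'order' for every value, which is exactly the behavior whose reliability is in question here.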
Now the question is to what extent this is a reliable part of the API,
and how likely it is that relying on this behavior will break in future
versions. To me it seems more like a side effect that is not guaranteed
to be maintained. There is already a hint that it is fragile: MRUnit
does not seem to update the key during the iteration.
Does anyone have a suggested workaround? Is the 'official' preferred
way of accessing the original key to call context.getCurrentKey()? Isn't
that the same case? Wouldn't it be nice if the API itself gave some
guarantees, or at least guidance, on how this works? I can imagine a
modified reduce() method with a signature like

  protected void reduce(Key key, Iterable<Pair<Key, Value>> keyValues,
      Context context);

This seems easily transformable to the old call (which could be the
default implementation of this method).
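A rough sketch of how such a method could delegate to the old-style call by default, written in plain Java without Hadoop types. Everything here is an assumption made for illustration: PairReducerSketch and CollectingReducer are hypothetical classes, Map.Entry stands in for Pair, and the Context parameter is left out. Note that two reduce() overloads differing only in the Iterable's type argument would have the same erasure and could not coexist in one class, so the sketch uses the distinct name reduceWithKeys.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical base class showing the proposed method next to the old-style one. */
abstract class PairReducerSketch<K, V> {

    /** Proposed variant: every value arrives together with the key it was emitted with. */
    protected void reduceWithKeys(K key, Iterable<Map.Entry<K, V>> keyValues) {
        // Default implementation: hide the per-record keys and call the old-style method.
        Iterable<V> valuesOnly = () -> new Iterator<V>() {
            private final Iterator<Map.Entry<K, V>> it = keyValues.iterator();

            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public V next() {
                return it.next().getValue();
            }
        };
        reduce(key, valuesOnly);
    }

    /** Stand-in for the existing reduce(key, values, context); does nothing by default. */
    protected void reduce(K key, Iterable<V> values) {
    }
}

/** Old-style reducer that only overrides reduce() and keeps working unchanged. */
final class CollectingReducer extends PairReducerSketch<String, Integer> {
    final List<Integer> seen = new ArrayList<>();

    @Override
    protected void reduce(String key, Iterable<Integer> values) {
        for (Integer v : values) {
            seen.add(v);
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = new ArrayList<>();
        input.add(new SimpleImmutableEntry<>("a#1", 10));
        input.add(new SimpleImmutableEntry<>("a#2", 20));
        CollectingReducer reducer = new CollectingReducer();
        reducer.reduceWithKeys("a#1", input);
        System.out.println(reducer.seen);
    }
}
```

A reducer that needs the real keys would override reduceWithKeys instead and read each entry's key explicitly, without depending on any in-place update of a shared key object.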
Any opinion on this?
Thanks,
Jan
Re: M/R API and Writable semantics in reducer
Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi,
is anyone interested in this topic? Basically, what I'm trying to
find out is whether it is 'safe' to rely on the side effect of the key
being updated while iterating over the values. I believe there must be
someone else interested in this, since the secondary sort pattern is
very common (at least in our jobs). So far, we have been emulating the
GroupingComparator by holding state in the Reducer class, which lets us
keep track of 'groups' of keys across several calls to the reduce()
method. This approach seems quite safe as far as the API goes, but the
code is not as pretty (and it is vulnerable to ugly bugs, for instance
if you forget to reset the state correctly).
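The stateful approach described above might look roughly like the following plain-Java sketch. The names are hypothetical (StatefulGroupingReducer is not a Hadoop class), the "group#order" string key format is an assumption made for the example, and cleanup() only mirrors the idea of a final hook that runs after the last reduce() call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reducer that tracks groups itself instead of using a GroupingComparator. */
final class StatefulGroupingReducer {
    private String currentGroup;                       // null until the first call
    private final List<String> buffer = new ArrayList<>();
    final List<String> output = new ArrayList<>();

    /**
     * Called once per distinct full key, with keys arriving in sorted order.
     * Assumes keys of the form "group#order"; the real key is always available here.
     */
    void reduce(String fullKey, Iterable<String> values) {
        String group = fullKey.substring(0, fullKey.indexOf('#'));
        if (currentGroup != null && !currentGroup.equals(group)) {
            flush();                                   // group boundary detected
        }
        currentGroup = group;
        for (String v : values) {
            buffer.add(v);
        }
    }

    /** The last group has no following key, so it must be emitted here. */
    void cleanup() {
        if (currentGroup != null) {
            flush();
        }
    }

    private void flush() {
        output.add(currentGroup + "=" + String.join(",", buffer));
        buffer.clear();                                // omitting this reset leaks values into the next group
        currentGroup = null;
    }

    public static void main(String[] args) {
        StatefulGroupingReducer reducer = new StatefulGroupingReducer();
        reducer.reduce("a#1", Arrays.asList("x"));
        reducer.reduce("a#2", Arrays.asList("y"));
        reducer.reduce("b#1", Arrays.asList("z"));
        reducer.cleanup();
        System.out.println(reducer.output);
    }
}
```

Running main prints [a=x,y, b=z]. The two easy mistakes are visible in the sketch: skipping the reset in flush(), and forgetting the final flush in cleanup().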
On the other hand, if the way the key gets updated while iterating over
the values is to be considered part of the contract of the MapReduce
API, I think it should be implemented in MRUnit (otherwise you basically
cannot use MRUnit to unit test your job), and if it isn't, then that is
probably a bug. If this is internal behavior that might change at any
time, then keeping the state in the Reducer clearly seems to be the only
option.
Does anyone else have similar concerns? How do others implement
secondary sort?
Thanks,
Jan