You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-dev@hadoop.apache.org by Aaron Kimball <aa...@cloudera.com> on 2009/07/23 05:21:21 UTC

Re: Testing Mappers in Hadoop 0.20.0

er, +CC mapreduce-dev...
- A

On Wed, Jul 22, 2009 at 8:17 PM, Aaron Kimball<aa...@cloudera.com> wrote:
> +CC mapred-dev
>
> Hm.. Making this change is actually really difficult.
>
> After changing Mapper.java, I understand why this was made a
> non-static member.  By making Context non-static, it can inherit from
> MapContext<W, X, Y, Z> and bind to the type qualifiers already
> specified in the class definition. So you can't make Context static
> because it needs to inherit the type parameters from the Mapper class.
> (Since Mapper has type qualifiers, it actually can't have static inner
> classes anyway.) But why even have Context in there anyway, given that
> it does nothing beyond MapContext's implementation?
>
> We can still change the map method's signature to use MapContext.
>
> As it is, you have to write:
> class MyMapper extends Mapper<W, X, Y, Z> {
>  void map(W k, X v, Context c) {
>     ....
>  }
> }
>
> To make the change, you would now need to write:
>
> class MyMapper extends Mapper<W, X, Y, Z> {
>  void map(W k, X v, MapContext<W,X,Y,Z> c) {
>     ....
>  }
> }
>
> So I think the primary reason for the non-static inner member is to
> save you some typing. That having been said, it makes adding mock
> implementations of Context really difficult.
>
> The real reason this is a tricky change, though, is that this is
> (surprisingly, to me) an incompatible change -- even though Context
> subclasses MapContext, so it's a type-safe API change to widen the set
> of inputs we accept in map(), anyone who specified @Override on the
> method and uses Context as an argument will get a compiler error :\
> (@Override, it seems, uses the literal names even if there's a
> type-safe promotion to a method in a superclass.)
>
> Can we still make incompatible API changes on the new API, or is it
> officially frozen? If incompatible changes are allowed, I'd like to
> see this in. I think that in the interest of better
> mockability/extensibility, it'd be cleaner to ditch the inner Context
> class in favor of explicit use of MapContext. We know Java's type
> system is verbose, but that doesn't mean we should try to hack around
> it, if that means losing functionality. (I think I know how to get
> around this in MRUnit, but it'd be cleaner to not have to.)
>
> - Aaron
>
>
>
> On Wed, Jul 22, 2009 at 7:22 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>> Both of those are good points. I'll submit a patch.
>> - Aaron
>>
>> On Wed, Jul 22, 2009 at 6:24 PM, Ted Dunning<te...@gmail.com> wrote:
>>> To amplify David's point, why is the argument a Mapper.Context rather than
>>> MapContext?
>>>
>>> Also, why is the Mapper.Context not static?
>>>
>>> On Wed, Jul 22, 2009 at 5:29 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>
>>>> This is nice, but doesn't it suffer from the same problem? MRUnit uses
>>>> the mapred API, which is deprecated, and the new API doesn't use
>>>> OutputCollector, but a non-static inner class.
>>>>
>>>> -- David
>>>>
>>>> On Wed, Jul 22, 2009 at 4:52 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>> > Hi David,
>>>> >
>>>> > I wrote a contrib module called MRUnit
>>>> > (http://issues.apache.org/jira/browse/hadoop-5518) designed to allow
>>>> > unit tests for mappers/reducers more easily. It's slated for inclusion
>>>> > in 0.21, not 0.20 unfortunately, but you can download the patch above
>>>> > as well as MAPREDUCE-680 and build it against any earlier version of
>>>> > Hadoop. Unfortunately, it doesn't currently support the new APIs
>>>> > (e.g., with Context objects), but I imagine this could be added with
>>>> > little difficulty. I just haven't had time to do it myself ;) If you'd
>>>> > like to take a stab at it, I'd love some help!
>>>> >
>>>> > More info is at www.cloudera.com/hadoop-mrunit
>>>> > Cheers,
>>>> > - Aaron
>>>> >
>>>> > On Wed, Jul 22, 2009 at 2:49 PM, David Hall<dl...@cs.stanford.edu> wrote:
>>>> >> Hi,
>>>> >>
>>>> >> I'm a student working with Apache Mahout for the Google Summer of
>>>> >> Code. We recently moved to 0.20.0, and I was porting my code to the
>>>> >> new API. Unfortunately, I (and the whole project team) seem to have
>>>> >> run into a problem when it comes to testing them.
>>>> >>
>>>> >> Historically, we would create a Mapper in a unit test, and a special
>>>> >> "DummyOutputCollector", which was essentially a multimap dressed up to
>>>> >> conform to OutputCollector. In Hadoop 0.20.0, this isn't possible
>>>> >> anymore, because Mappers take an instance of an inner class.
>>>> >>
>>>> >> It's of course possible to dress up the Context in something else
>>>> >> (say, something just like an OutputCollector), and to specify that
>>>> >> Mahout Mappers should just delegate to a method that takes an
>>>> >> OutputCollector. But, this seems to not be very idiomatic.
>>>> >>
>>>> >> All this goes to say, what would be a "best practice" for testing
>>>> >> Mappers and Reducers in 0.20.0?
>>>> >>
>>>> >> Thanks,
>>>> >> David Hall
>>>> >>
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>

Re: Testing Mappers in Hadoop 0.20.0

Posted by Aaron Kimball <aa...@cloudera.com>.
I hacked around this in MRUnit last night. MRUnit now has support for
the new API -- See MAPREDUCE-800.

You can, in fact, subclass Mapper.Context and Reducer.Context, since
they don't actually share any state with the outer class
Mapper/Reducer implementation, just the type signatures. But doing
this requires making a dummy "Mapper" subclass to wrap around your own
Context subclass, which is somewhat unsightly. C'est la vie.

- Aaron

On Wed, Jul 22, 2009 at 10:44 PM, David Hall<dl...@cs.stanford.edu> wrote:
> For what it's worth, we ended up solving this problem (today) by using
> EasyMock with ClassExtension. It's an awful lot of magic, but it seems
> to work just fine for our purposes.  It would be great if doing
> bytecode weaving under the hood weren't necessary just to write test
> code, though.
>
> -- David
>
> On Wed, Jul 22, 2009 at 8:21 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>> er, +CC mapreduce-dev...
>> - A
>>
>> On Wed, Jul 22, 2009 at 8:17 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>> +CC mapred-dev
>>>
>>> Hm.. Making this change is actually really difficult.
>>>
>>> After changing Mapper.java, I understand why this was made a
>>> non-static member.  By making Context non-static, it can inherit from
>>> MapContext<W, X, Y, Z> and bind to the type qualifiers already
>>> specified in the class definition. So you can't make Context static
>>> because it needs to inherit the type parameters from the Mapper class.
>>> (Since Mapper has type qualifiers, it actually can't have static inner
>>> classes anyway.) But why even have Context in there anyway, given that
>>> it does nothing beyond MapContext's implementation?
>>>
>>> We can still change the map method's signature to use MapContext.
>>>
>>> As it is, you have to write:
>>> class MyMapper extends Mapper<W, X, Y, Z> {
>>>  void map(W k, X v, Context c) {
>>>     ....
>>>  }
>>> }
>>>
>>> To make the change, you would now need to write:
>>>
>>> class MyMapper extends Mapper<W, X, Y, Z> {
>>>  void map(W k, X v, MapContext<W,X,Y,Z> c) {
>>>     ....
>>>  }
>>> }
>>>
>>> So I think the primary reason for the non-static inner member is to
>>> save you some typing. That having been said, it makes adding mock
>>> implementations of Context really difficult.
>>>
>>> The real reason this is a tricky change, though, is that this is
>>> (surprisingly, to me) an incompatible change -- even though Context
>>> subclasses MapContext, so it's a type-safe API change to widen the set
>>> of inputs we accept in map(), anyone who specified @Override on the
>>> method and uses Context as an argument will get a compiler error :\
>>> (@Override, it seems, uses the literal names even if there's a
>>> type-safe promotion to a method in a superclass.)
>>>
>>> Can we still make incompatible API changes on the new API, or is it
>>> officially frozen? If incompatible changes are allowed, I'd like to
>>> see this in. I think that in the interest of better
>>> mockability/extensibility, it'd be cleaner to ditch the inner Context
>>> class in favor of explicit use of MapContext. We know Java's type
>>> system is verbose, but that doesn't mean we should try to hack around
>>> it, if that means losing functionality. (I think I know how to get
>>> around this in MRUnit, but it'd be cleaner to not have to.)
>>>
>>> - Aaron
>>>
>>>
>>>
>>> On Wed, Jul 22, 2009 at 7:22 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>> Both of those are good points. I'll submit a patch.
>>>> - Aaron
>>>>
>>>> On Wed, Jul 22, 2009 at 6:24 PM, Ted Dunning<te...@gmail.com> wrote:
>>>>> To amplify David's point, why is the argument a Mapper.Context rather than
>>>>> MapContext?
>>>>>
>>>>> Also, why is the Mapper.Context not static?
>>>>>
>>>>> On Wed, Jul 22, 2009 at 5:29 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>>>
>>>>>> This is nice, but doesn't it suffer from the same problem? MRUnit uses
>>>>>> the mapred API, which is deprecated, and the new API doesn't use
>>>>>> OutputCollector, but a non-static inner class.
>>>>>>
>>>>>> -- David
>>>>>>
>>>>>> On Wed, Jul 22, 2009 at 4:52 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>>>> > Hi David,
>>>>>> >
>>>>>> > I wrote a contrib module called MRUnit
>>>>>> > (http://issues.apache.org/jira/browse/hadoop-5518) designed to allow
>>>>>> > unit tests for mappers/reducers more easily. It's slated for inclusion
>>>>>> > in 0.21, not 0.20 unfortunately, but you can download the patch above
>>>>>> > as well as MAPREDUCE-680 and build it against any earlier version of
>>>>>> > Hadoop. Unfortunately, it doesn't currently support the new APIs
>>>>>> > (e.g., with Context objects), but I imagine this could be added with
>>>>>> > little difficulty. I just haven't had time to do it myself ;) If you'd
>>>>>> > like to take a stab at it, I'd love some help!
>>>>>> >
>>>>>> > More info is at www.cloudera.com/hadoop-mrunit
>>>>>> > Cheers,
>>>>>> > - Aaron
>>>>>> >
>>>>>> > On Wed, Jul 22, 2009 at 2:49 PM, David Hall<dl...@cs.stanford.edu> wrote:
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >> I'm a student working with Apache Mahout for the Google Summer of
>>>>>> >> Code. We recently moved to 0.20.0, and I was porting my code to the
>>>>>> >> new API. Unfortunately, I (and the whole project team) seem to have
>>>>>> >> run into a problem when it comes to testing them.
>>>>>> >>
>>>>>> >> Historically, we would create a Mapper in a unit test, and a special
>>>>>> >> "DummyOutputCollector", which was essentially a multimap dressed up to
>>>>>> >> conform to OutputCollector. In Hadoop 0.20.0, this isn't possible
>>>>>> >> anymore, because Mappers take an instance of an inner class.
>>>>>> >>
>>>>>> >> It's of course possible to dress up the Context in something else
>>>>>> >> (say, something just like an OutputCollector), and to specify that
>>>>>> >> Mahout Mappers should just delegate to a method that takes an
>>>>>> >> OutputCollector. But, this seems to not be very idiomatic.
>>>>>> >>
>>>>>> >> All this goes to say, what would be a "best practice" for testing
>>>>>> >> Mappers and Reducers in 0.20.0?
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> David Hall
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve
>>>>>
>>>>
>>>
>>
>

Re: Testing Mappers in Hadoop 0.20.0

Posted by Aaron Kimball <aa...@cloudera.com>.
I hacked around this in MRUnit last night. MRUnit now has support for
the new API -- See MAPREDUCE-800.

You can, in fact, subclass Mapper.Context and Reducer.Context, since
they don't actually share any state with the outer class
Mapper/Reducer implementation, just the type signatures. But doing
this requires making a dummy "Mapper" subclass to wrap around your own
Context subclass, which is somewhat unsightly. C'est la vie.

- Aaron

On Wed, Jul 22, 2009 at 10:44 PM, David Hall<dl...@cs.stanford.edu> wrote:
> For what it's worth, we ended up solving this problem (today) by using
> EasyMock with ClassExtension. It's an awful lot of magic, but it seems
> to work just fine for our purposes.  It would be great if doing
> bytecode weaving under the hood weren't necessary just to write test
> code, though.
>
> -- David
>
> On Wed, Jul 22, 2009 at 8:21 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>> er, +CC mapreduce-dev...
>> - A
>>
>> On Wed, Jul 22, 2009 at 8:17 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>> +CC mapred-dev
>>>
>>> Hm.. Making this change is actually really difficult.
>>>
>>> After changing Mapper.java, I understand why this was made a
>>> non-static member.  By making Context non-static, it can inherit from
>>> MapContext<W, X, Y, Z> and bind to the type qualifiers already
>>> specified in the class definition. So you can't make Context static
>>> because it needs to inherit the type parameters from the Mapper class.
>>> (Since Mapper has type qualifiers, it actually can't have static inner
>>> classes anyway.) But why even have Context in there anyway, given that
>>> it does nothing beyond MapContext's implementation?
>>>
>>> We can still change the map method's signature to use MapContext.
>>>
>>> As it is, you have to write:
>>> class MyMapper extends Mapper<W, X, Y, Z> {
>>>  void map(W k, X v, Context c) {
>>>     ....
>>>  }
>>> }
>>>
>>> To make the change, you would now need to write:
>>>
>>> class MyMapper extends Mapper<W, X, Y, Z> {
>>>  void map(W k, X v, MapContext<W,X,Y,Z> c) {
>>>     ....
>>>  }
>>> }
>>>
>>> So I think the primary reason for the non-static inner member is to
>>> save you some typing. That having been said, it makes adding mock
>>> implementations of Context really difficult.
>>>
>>> The real reason this is a tricky change, though, is that this is
>>> (surprisingly, to me) an incompatible change -- even though Context
>>> subclasses MapContext, so it's a type-safe API change to widen the set
>>> of inputs we accept in map(), anyone who specified @Override on the
>>> method and uses Context as an argument will get a compiler error :\
>>> (@Override, it seems, uses the literal names even if there's a
>>> type-safe promotion to a method in a superclass.)
>>>
>>> Can we still make incompatible API changes on the new API, or is it
>>> officially frozen? If incompatible changes are allowed, I'd like to
>>> see this in. I think that in the interest of better
>>> mockability/extensibility, it'd be cleaner to ditch the inner Context
>>> class in favor of explicit use of MapContext. We know Java's type
>>> system is verbose, but that doesn't mean we should try to hack around
>>> it, if that means losing functionality. (I think I know how to get
>>> around this in MRUnit, but it'd be cleaner to not have to.)
>>>
>>> - Aaron
>>>
>>>
>>>
>>> On Wed, Jul 22, 2009 at 7:22 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>> Both of those are good points. I'll submit a patch.
>>>> - Aaron
>>>>
>>>> On Wed, Jul 22, 2009 at 6:24 PM, Ted Dunning<te...@gmail.com> wrote:
>>>>> To amplify David's point, why is the argument a Mapper.Context rather than
>>>>> MapContext?
>>>>>
>>>>> Also, why is the Mapper.Context not static?
>>>>>
>>>>> On Wed, Jul 22, 2009 at 5:29 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>>>
>>>>>> This is nice, but doesn't it suffer from the same problem? MRUnit uses
>>>>>> the mapred API, which is deprecated, and the new API doesn't use
>>>>>> OutputCollector, but a non-static inner class.
>>>>>>
>>>>>> -- David
>>>>>>
>>>>>> On Wed, Jul 22, 2009 at 4:52 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>>>> > Hi David,
>>>>>> >
>>>>>> > I wrote a contrib module called MRUnit
>>>>>> > (http://issues.apache.org/jira/browse/hadoop-5518) designed to allow
>>>>>> > unit tests for mappers/reducers more easily. It's slated for inclusion
>>>>>> > in 0.21, not 0.20 unfortunately, but you can download the patch above
>>>>>> > as well as MAPREDUCE-680 and build it against any earlier version of
>>>>>> > Hadoop. Unfortunately, it doesn't currently support the new APIs
>>>>>> > (e.g., with Context objects), but I imagine this could be added with
>>>>>> > little difficulty. I just haven't had time to do it myself ;) If you'd
>>>>>> > like to take a stab at it, I'd love some help!
>>>>>> >
>>>>>> > More info is at www.cloudera.com/hadoop-mrunit
>>>>>> > Cheers,
>>>>>> > - Aaron
>>>>>> >
>>>>>> > On Wed, Jul 22, 2009 at 2:49 PM, David Hall<dl...@cs.stanford.edu> wrote:
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >> I'm a student working with Apache Mahout for the Google Summer of
>>>>>> >> Code. We recently moved to 0.20.0, and I was porting my code to the
>>>>>> >> new API. Unfortunately, I (and the whole project team) seem to have
>>>>>> >> run into a problem when it comes to testing them.
>>>>>> >>
>>>>>> >> Historically, we would create a Mapper in a unit test, and a special
>>>>>> >> "DummyOutputCollector", which was essentially a multimap dressed up to
>>>>>> >> conform to OutputCollector. In Hadoop 0.20.0, this isn't possible
>>>>>> >> anymore, because Mappers take an instance of an inner class.
>>>>>> >>
>>>>>> >> It's of course possible to dress up the Context in something else
>>>>>> >> (say, something just like an OutputCollector), and to specify that
>>>>>> >> Mahout Mappers should just delegate to a method that takes an
>>>>>> >> OutputCollector. But, this seems to not be very idiomatic.
>>>>>> >>
>>>>>> >> All this goes to say, what would be a "best practice" for testing
>>>>>> >> Mappers and Reducers in 0.20.0?
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> David Hall
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve
>>>>>
>>>>
>>>
>>
>

Re: Testing Mappers in Hadoop 0.20.0

Posted by David Hall <dl...@cs.stanford.edu>.
For what it's worth, we ended up solving this problem (today) by using
EasyMock with ClassExtension. It's an awful lot of magic, but it seems
to work just fine for our purposes.  It would be great if doing
bytecode weaving under the hood weren't necessary just to write test
code, though.

-- David

On Wed, Jul 22, 2009 at 8:21 PM, Aaron Kimball<aa...@cloudera.com> wrote:
> er, +CC mapreduce-dev...
> - A
>
> On Wed, Jul 22, 2009 at 8:17 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>> +CC mapred-dev
>>
>> Hm.. Making this change is actually really difficult.
>>
>> After changing Mapper.java, I understand why this was made a
>> non-static member.  By making Context non-static, it can inherit from
>> MapContext<W, X, Y, Z> and bind to the type qualifiers already
>> specified in the class definition. So you can't make Context static
>> because it needs to inherit the type parameters from the Mapper class.
>> (Since Mapper has type qualifiers, it actually can't have static inner
>> classes anyway.) But why even have Context in there anyway, given that
>> it does nothing beyond MapContext's implementation?
>>
>> We can still change the map method's signature to use MapContext.
>>
>> As it is, you have to write:
>> class MyMapper extends Mapper<W, X, Y, Z> {
>>  void map(W k, X v, Context c) {
>>     ....
>>  }
>> }
>>
>> To make the change, you would now need to write:
>>
>> class MyMapper extends Mapper<W, X, Y, Z> {
>>  void map(W k, X v, MapContext<W,X,Y,Z> c) {
>>     ....
>>  }
>> }
>>
>> So I think the primary reason for the non-static inner member is to
>> save you some typing. That having been said, it makes adding mock
>> implementations of Context really difficult.
>>
>> The real reason this is a tricky change, though, is that this is
>> (surprisingly, to me) an incompatible change -- even though Context
>> subclasses MapContext, so it's a type-safe API change to widen the set
>> of inputs we accept in map(), anyone who specified @Override on the
>> method and uses Context as an argument will get a compiler error :\
>> (@Override, it seems, uses the literal names even if there's a
>> type-safe promotion to a method in a superclass.)
>>
>> Can we still make incompatible API changes on the new API, or is it
>> officially frozen? If incompatible changes are allowed, I'd like to
>> see this in. I think that in the interest of better
>> mockability/extensibility, it'd be cleaner to ditch the inner Context
>> class in favor of explicit use of MapContext. We know Java's type
>> system is verbose, but that doesn't mean we should try to hack around
>> it, if that means losing functionality. (I think I know how to get
>> around this in MRUnit, but it'd be cleaner to not have to.)
>>
>> - Aaron
>>
>>
>>
>> On Wed, Jul 22, 2009 at 7:22 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>> Both of those are good points. I'll submit a patch.
>>> - Aaron
>>>
>>> On Wed, Jul 22, 2009 at 6:24 PM, Ted Dunning<te...@gmail.com> wrote:
>>>> To amplify David's point, why is the argument a Mapper.Context rather than
>>>> MapContext?
>>>>
>>>> Also, why is the Mapper.Context not static?
>>>>
>>>> On Wed, Jul 22, 2009 at 5:29 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>>
>>>>> This is nice, but doesn't it suffer from the same problem? MRUnit uses
>>>>> the mapred API, which is deprecated, and the new API doesn't use
>>>>> OutputCollector, but a non-static inner class.
>>>>>
>>>>> -- David
>>>>>
>>>>> On Wed, Jul 22, 2009 at 4:52 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>>> > Hi David,
>>>>> >
>>>>> > I wrote a contrib module called MRUnit
>>>>> > (http://issues.apache.org/jira/browse/hadoop-5518) designed to allow
>>>>> > unit tests for mappers/reducers more easily. It's slated for inclusion
>>>>> > in 0.21, not 0.20 unfortunately, but you can download the patch above
>>>>> > as well as MAPREDUCE-680 and build it against any earlier version of
>>>>> > Hadoop. Unfortunately, it doesn't currently support the new APIs
>>>>> > (e.g., with Context objects), but I imagine this could be added with
>>>>> > little difficulty. I just haven't had time to do it myself ;) If you'd
>>>>> > like to take a stab at it, I'd love some help!
>>>>> >
>>>>> > More info is at www.cloudera.com/hadoop-mrunit
>>>>> > Cheers,
>>>>> > - Aaron
>>>>> >
>>>>> > On Wed, Jul 22, 2009 at 2:49 PM, David Hall<dl...@cs.stanford.edu> wrote:
>>>>> >> Hi,
>>>>> >>
>>>>> >> I'm a student working with Apache Mahout for the Google Summer of
>>>>> >> Code. We recently moved to 0.20.0, and I was porting my code to the
>>>>> >> new API. Unfortunately, I (and the whole project team) seem to have
>>>>> >> run into a problem when it comes to testing them.
>>>>> >>
>>>>> >> Historically, we would create a Mapper in a unit test, and a special
>>>>> >> "DummyOutputCollector", which was essentially a multimap dressed up to
>>>>> >> conform to OutputCollector. In Hadoop 0.20.0, this isn't possible
>>>>> >> anymore, because Mappers take an instance of an inner class.
>>>>> >>
>>>>> >> It's of course possible to dress up the Context in something else
>>>>> >> (say, something just like an OutputCollector), and to specify that
>>>>> >> Mahout Mappers should just delegate to a method that takes an
>>>>> >> OutputCollector. But, this seems to not be very idiomatic.
>>>>> >>
>>>>> >> All this goes to say, what would be a "best practice" for testing
>>>>> >> Mappers and Reducers in 0.20.0?
>>>>> >>
>>>>> >> Thanks,
>>>>> >> David Hall
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>>
>>>
>>
>

Re: Testing Mappers in Hadoop 0.20.0

Posted by David Hall <dl...@cs.stanford.edu>.
For what it's worth, we ended up solving this problem (today) by using
EasyMock with ClassExtension. It's an awful lot of magic, but it seems
to work just fine for our purposes.  It would be great if doing
bytecode weaving under the hood weren't necessary just to write test
code, though.

-- David

On Wed, Jul 22, 2009 at 8:21 PM, Aaron Kimball<aa...@cloudera.com> wrote:
> er, +CC mapreduce-dev...
> - A
>
> On Wed, Jul 22, 2009 at 8:17 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>> +CC mapred-dev
>>
>> Hm.. Making this change is actually really difficult.
>>
>> After changing Mapper.java, I understand why this was made a
>> non-static member.  By making Context non-static, it can inherit from
>> MapContext<W, X, Y, Z> and bind to the type qualifiers already
>> specified in the class definition. So you can't make Context static
>> because it needs to inherit the type parameters from the Mapper class.
>> (Since Mapper has type qualifiers, it actually can't have static inner
>> classes anyway.) But why even have Context in there anyway, given that
>> it does nothing beyond MapContext's implementation?
>>
>> We can still change the map method's signature to use MapContext.
>>
>> As it is, you have to write:
>> class MyMapper extends Mapper<W, X, Y, Z> {
>>  void map(W k, X v, Context c) {
>>     ....
>>  }
>> }
>>
>> To make the change, you would now need to write:
>>
>> class MyMapper extends Mapper<W, X, Y, Z> {
>>  void map(W k, X v, MapContext<W,X,Y,Z> c) {
>>     ....
>>  }
>> }
>>
>> So I think the primary reason for the non-static inner member is to
>> save you some typing. That having been said, it makes adding mock
>> implementations of Context really difficult.
>>
>> The real reason this is a tricky change, though, is that this is
>> (surprisingly, to me) an incompatible change -- even though Context
>> subclasses MapContext, so it's a type-safe API change to widen the set
>> of inputs we accept in map(), anyone who specified @Override on the
>> method and uses Context as an argument will get a compiler error :\
>> (@Override, it seems, uses the literal names even if there's a
>> type-safe promotion to a method in a superclass.)
>>
>> Can we still make incompatible API changes on the new API, or is it
>> officially frozen? If incompatible changes are allowed, I'd like to
>> see this in. I think that in the interest of better
>> mockability/extensibility, it'd be cleaner to ditch the inner Context
>> class in favor of explicit use of MapContext. We know Java's type
>> system is verbose, but that doesn't mean we should try to hack around
>> it, if that means losing functionality. (I think I know how to get
>> around this in MRUnit, but it'd be cleaner to not have to.)
>>
>> - Aaron
>>
>>
>>
>> On Wed, Jul 22, 2009 at 7:22 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>> Both of those are good points. I'll submit a patch.
>>> - Aaron
>>>
>>> On Wed, Jul 22, 2009 at 6:24 PM, Ted Dunning<te...@gmail.com> wrote:
>>>> To amplify David's point, why is the argument a Mapper.Context rather than
>>>> MapContext?
>>>>
>>>> Also, why is the Mapper.Context not static?
>>>>
>>>> On Wed, Jul 22, 2009 at 5:29 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>>
>>>>> This is nice, but doesn't it suffer from the same problem? MRUnit uses
>>>>> the mapred API, which is deprecated, and the new API doesn't use
>>>>> OutputCollector, but a non-static inner class.
>>>>>
>>>>> -- David
>>>>>
>>>>> On Wed, Jul 22, 2009 at 4:52 PM, Aaron Kimball<aa...@cloudera.com> wrote:
>>>>> > Hi David,
>>>>> >
>>>>> > I wrote a contrib module called MRUnit
>>>>> > (http://issues.apache.org/jira/browse/hadoop-5518) designed to allow
>>>>> > unit tests for mappers/reducers more easily. It's slated for inclusion
>>>>> > in 0.21, not 0.20 unfortunately, but you can download the patch above
>>>>> > as well as MAPREDUCE-680 and build it against any earlier version of
>>>>> > Hadoop. Unfortunately, it doesn't currently support the new APIs
>>>>> > (e.g., with Context objects), but I imagine this could be added with
>>>>> > little difficulty. I just haven't had time to do it myself ;) If you'd
>>>>> > like to take a stab at it, I'd love some help!
>>>>> >
>>>>> > More info is at www.cloudera.com/hadoop-mrunit
>>>>> > Cheers,
>>>>> > - Aaron
>>>>> >
>>>>> > On Wed, Jul 22, 2009 at 2:49 PM, David Hall<dl...@cs.stanford.edu> wrote:
>>>>> >> Hi,
>>>>> >>
>>>>> >> I'm a student working with Apache Mahout for the Google Summer of
>>>>> >> Code. We recently moved to 0.20.0, and I was porting my code to the
>>>>> >> new API. Unfortunately, I (and the whole project team) seem to have
>>>>> >> run into a problem when it comes to testing them.
>>>>> >>
>>>>> >> Historically, we would create a Mapper in a unit test, and a special
>>>>> >> "DummyOutputCollector", which was essentially a multimap dressed up to
>>>>> >> conform to OutputCollector. In Hadoop 0.20.0, this isn't possible
>>>>> >> anymore, because Mappers take an instance of an inner class.
>>>>> >>
>>>>> >> It's of course possible to dress up the Context in something else
>>>>> >> (say, something just like an OutputCollector), and to specify that
>>>>> >> Mahout Mappers should just delegate to a method that takes an
>>>>> >> OutputCollector. But, this seems to not be very idiomatic.
>>>>> >>
>>>>> >> All this goes to say, what would be a "best practice" for testing
>>>>> >> Mappers and Reducers in 0.20.0?
>>>>> >>
>>>>> >> Thanks,
>>>>> >> David Hall
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>>
>>>
>>
>