You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by tom pierce <tc...@apache.org> on 2012/03/08 02:24:21 UTC

Review Request: MAHOUT-822: Mahout needs to be made compatible with Hadoop .23 releases

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4237/
-----------------------------------------------------------

Review request for mahout and Dmitriy Lyubimov.


Summary
-------

This is the current patch for MAHOUT-822 (as posted by Bilung Lee).


This addresses bug MAHOUT-822.
    https://issues.apache.org/jira/browse/MAHOUT-822


Diffs
-----

  trunk/core/pom.xml 1296318 
  trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/MockContext.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialSequentialBuilder.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1MapperTest.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/common/DummyCounter.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/common/DummyRecordWriter.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/common/DummyStatusReporter.java 1296318 
  trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1296318 
  trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java 1296318 
  trunk/pom.xml 1296318 

Diff: https://reviews.apache.org/r/4237/diff


Testing
-------

Passes unit tests under default config as well as under hadoop 0.23.1.


Thanks,

tom


Re: Review Request: MAHOUT-822: Mahout needs to be made compatible with Hadoop .23 releases

Posted by tom pierce <tc...@apache.org>.

> On 2012-03-08 23:42:23, Dmitriy Lyubimov wrote:
> > trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java, line 247
> > <https://reviews.apache.org/r/4237/diff/1/?file=88913#file88913line247>
> >
> >     is this really an issue with Hadoop 0.20.203? i don't think we care about classic hadoop 0.20.2 anymore.
> 
> tom pierce wrote:
>     Hmm, I am not sure, but I think this is a comment we can omit.  Without the new listStatus methods (but with the changes to other files), there is at least one test that won't compile.  I'm willing to figure out who added this to the patch and ping them about it, but before I do, what should I ask them to clarify?  Exactly which version of Hadoop they had in mind when making the changes to HadoopUtil?

We've misinterpreted the comment - an empty array is returned to emulate 0.20.x behavior under 0.23.1.  I applied the patch, reverted HadoopUtil and TestDistributedRowMatrix (which is the only class using these new methods), and that test still passes under 0.20.x but not under 0.23.1.  From the stacktrace it seems fs.listStatus will now throws an exception rather than returning an empty array when there are no matches. 

I can see this comment causing the same confusion in the future; I will plan to remove the "retain compatibility" comment from HadoopUtil (it appears 2x) before committing.


- tom


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4237/#review5754
-----------------------------------------------------------


On 2012-03-08 01:24:21, tom pierce wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4237/
> -----------------------------------------------------------
> 
> (Updated 2012-03-08 01:24:21)
> 
> 
> Review request for mahout and Dmitriy Lyubimov.
> 
> 
> Summary
> -------
> 
> This is the current patch for MAHOUT-822 (as posted by Bilung Lee).
> 
> 
> This addresses bug MAHOUT-822.
>     https://issues.apache.org/jira/browse/MAHOUT-822
> 
> 
> Diffs
> -----
> 
>   trunk/core/pom.xml 1296318 
>   trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/MockContext.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialSequentialBuilder.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1MapperTest.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyCounter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyRecordWriter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyStatusReporter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1296318 
>   trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java 1296318 
>   trunk/pom.xml 1296318 
> 
> Diff: https://reviews.apache.org/r/4237/diff
> 
> 
> Testing
> -------
> 
> Passes unit tests under default config as well as under hadoop 0.23.1.
> 
> 
> Thanks,
> 
> tom
> 
>


Re: Review Request: MAHOUT-822: Mahout needs to be made compatible with Hadoop .23 releases

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
no

On Fri, Mar 9, 2012 at 3:01 PM, tom pierce <tc...@apache.org> wrote:
> There is MAHOUT-950, which cleans up some of the mixed-usage tricks for
> using MultipleOutputs.  It does not pass tests under 0.20.203, though.
>
> Under various 23.x-SNAPSHOTS, -822 required -950 to pass all tests, but I
> guess before 0.23.1 was released, the earlier behavior was duplicated.
>
> Jeff did +1 the clustering changes in -822, and the HadoopUtil changes are
> fixing the build under 0.23.1, not an old 0.20.x.  Any objections to this
> going in with the whitespace correction and comment removal mentioned on
> reviewboard?
>
> -tom
>
>
> On 03/09/2012 01:52 PM, Dmitriy Lyubimov wrote:
>>
>> On another note, i was primarily interested to see if this branch
>> tacles old/new api multiple outputs anywhere, it doesn't look like it
>> does. I think 0.20.203 already have support for new api multiple
>> outputs, i guess i can take care of it on some other branch. in fact
>> the methods i care about initially had been doing it 100% with new api
>> found in Cloudera distros and were hacked back to support 0.20.2 which
>> Mahout was shipped with. so i think i'll take a look at it at some
>> other issue separately.
>>
>> On Fri, Mar 9, 2012 at 10:50 AM, Dmitriy Lyubimov<dl...@apache.org>
>>  wrote:
>>>
>>> On Thu, Mar 8, 2012 at 8:43 PM, tom pierce<tc...@apache.org>  wrote:
>>>>
>>>> Is there someone you'd nominate as an additional reviewer?
>>>
>>> This clustering stuff in Mahout is i think something Jeff is very
>>> knowledgeable of. Unless he already was positive in the jira
>>> discussion, maybe he can review?
>
>

Re: Review Request: MAHOUT-822: Mahout needs to be made compatible with Hadoop .23 releases

Posted by tom pierce <tc...@apache.org>.
There is MAHOUT-950, which cleans up some of the mixed-usage tricks for 
using MultipleOutputs.  It does not pass tests under 0.20.203, though.

Under various 23.x-SNAPSHOTS, -822 required -950 to pass all tests, but 
I guess before 0.23.1 was released, the earlier behavior was duplicated.

Jeff did +1 the clustering changes in -822, and the HadoopUtil changes 
are fixing the build under 0.23.1, not an old 0.20.x.  Any objections to 
this going in with the whitespace correction and comment removal 
mentioned on reviewboard?

-tom

On 03/09/2012 01:52 PM, Dmitriy Lyubimov wrote:
> On another note, i was primarily interested to see if this branch
> tacles old/new api multiple outputs anywhere, it doesn't look like it
> does. I think 0.20.203 already have support for new api multiple
> outputs, i guess i can take care of it on some other branch. in fact
> the methods i care about initially had been doing it 100% with new api
> found in Cloudera distros and were hacked back to support 0.20.2 which
> Mahout was shipped with. so i think i'll take a look at it at some
> other issue separately.
>
> On Fri, Mar 9, 2012 at 10:50 AM, Dmitriy Lyubimov<dl...@apache.org>  wrote:
>> On Thu, Mar 8, 2012 at 8:43 PM, tom pierce<tc...@apache.org>  wrote:
>>> Is there someone you'd nominate as an additional reviewer?
>> This clustering stuff in Mahout is i think something Jeff is very
>> knowledgeable of. Unless he already was positive in the jira
>> discussion, maybe he can review?


Re: Review Request: MAHOUT-822: Mahout needs to be made compatible with Hadoop .23 releases

Posted by tom pierce <tc...@apache.org>.

> On 2012-03-08 23:42:23, Dmitriy Lyubimov wrote:
> > trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java, line 242
> > <https://reviews.apache.org/r/4237/diff/1/?file=88913#file88913line242>
> >
> >     Extra white spaces generate style warnings but are notoriously hard to eliminate. I have no problems with them but it was a subject of review notes from other committers before, I am just reproducing collective mind here.

Good point - I'll address this before committing.


> On 2012-03-08 23:42:23, Dmitriy Lyubimov wrote:
> > trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java, line 247
> > <https://reviews.apache.org/r/4237/diff/1/?file=88913#file88913line247>
> >
> >     is this really an issue with Hadoop 0.20.203? i don't think we care about classic hadoop 0.20.2 anymore.

Hmm, I am not sure, but I think this is a comment we can omit.  Without the new listStatus methods (but with the changes to other files), there is at least one test that won't compile.  I'm willing to figure out who added this to the patch and ping them about it, but before I do, what should I ask them to clarify?  Exactly which version of Hadoop they had in mind when making the changes to HadoopUtil?  


> On 2012-03-08 23:42:23, Dmitriy Lyubimov wrote:
> > trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java, line 356
> > <https://reviews.apache.org/r/4237/diff/1/?file=88920#file88920line356>
> >
> >     This and some other changes look suspiciously like a functional change rather than a hadoop compatibility change. I think it would help somebody else more familiar with this code to assure it is indeed benign..

The changes in tests were needed because some tests were requiring outputs to appear in a certain order.  That is to say, we'd need to output "cluster 1 contains items a, b and d; cluster 2 contains c, e and f"; finding the same set of clusters but calling {c,e,f} "cluster 1" would not do.  Under Hadoop 0.23.1, sometimes these outputs come in a different order.  Most of the changes to tests simply allow permutations in the expected output.

Mean shift was it's own special case - it is very sensitive to these ordering effects, and it is an iterative process.  I didn't dig into the implementation enough to convince myself it was possible to devise a new test that would be immune to ordering (though it's likely this is true); I just found an input permutation that gave a consistent answer between 0.20.203 and 0.23.1.  In the other cases I think I made the tests a little better, but in this case I made the test pass.

Is there someone you'd nominate as an additional reviewer?


- tom


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4237/#review5754
-----------------------------------------------------------


On 2012-03-08 01:24:21, tom pierce wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4237/
> -----------------------------------------------------------
> 
> (Updated 2012-03-08 01:24:21)
> 
> 
> Review request for mahout and Dmitriy Lyubimov.
> 
> 
> Summary
> -------
> 
> This is the current patch for MAHOUT-822 (as posted by Bilung Lee).
> 
> 
> This addresses bug MAHOUT-822.
>     https://issues.apache.org/jira/browse/MAHOUT-822
> 
> 
> Diffs
> -----
> 
>   trunk/core/pom.xml 1296318 
>   trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/MockContext.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialSequentialBuilder.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1MapperTest.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyCounter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyRecordWriter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyStatusReporter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1296318 
>   trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java 1296318 
>   trunk/pom.xml 1296318 
> 
> Diff: https://reviews.apache.org/r/4237/diff
> 
> 
> Testing
> -------
> 
> Passes unit tests under default config as well as under hadoop 0.23.1.
> 
> 
> Thanks,
> 
> tom
> 
>


Re: Review Request: MAHOUT-822: Mahout needs to be made compatible with Hadoop .23 releases

Posted by Dmitriy Lyubimov <dl...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4237/#review5754
-----------------------------------------------------------



trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java
<https://reviews.apache.org/r/4237/#comment12516>

    Extra white spaces generate style warnings but are notoriously hard to eliminate. I have no problems with them but it was a subject of review notes from other committers before, I am just reproducing collective mind here.



trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java
<https://reviews.apache.org/r/4237/#comment12518>

    is this really an issue with Hadoop 0.20.203? i don't think we care about classic hadoop 0.20.2 anymore.



trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
<https://reviews.apache.org/r/4237/#comment12552>

    This and some other changes look suspiciously like a functional change rather than a hadoop compatibility change. I think it would help somebody else more familiar with this code to assure it is indeed benign..


- Dmitriy


On 2012-03-08 01:24:21, tom pierce wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4237/
> -----------------------------------------------------------
> 
> (Updated 2012-03-08 01:24:21)
> 
> 
> Review request for mahout and Dmitriy Lyubimov.
> 
> 
> Summary
> -------
> 
> This is the current patch for MAHOUT-822 (as posted by Bilung Lee).
> 
> 
> This addresses bug MAHOUT-822.
>     https://issues.apache.org/jira/browse/MAHOUT-822
> 
> 
> Diffs
> -----
> 
>   trunk/core/pom.xml 1296318 
>   trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/MockContext.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialSequentialBuilder.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1MapperTest.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyCounter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyRecordWriter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/common/DummyStatusReporter.java 1296318 
>   trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1296318 
>   trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java 1296318 
>   trunk/pom.xml 1296318 
> 
> Diff: https://reviews.apache.org/r/4237/diff
> 
> 
> Testing
> -------
> 
> Passes unit tests under default config as well as under hadoop 0.23.1.
> 
> 
> Thanks,
> 
> tom
> 
>