You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Chris Wailes <ch...@gmail.com> on 2010/08/13 00:08:07 UTC

MeanShift Clustering Patch

I started using Mahout a couple of weeks ago, and I'm pleased to say it's
working out great.  However, to get to this point I have had to dig around
the source code quite a bit.  While doing so I decided to add some
additional documentation based on what I found out, re-ordered some of the
functions so they could be found easier (static and instance separated out),
and re-formatted some of the code for readability (added spaces between code
blocks, broke long lines up into multiple lines, and added 'this' to
variables and function calls to make context explicit).

Lastly, I made an API change so that the MeanShiftClusterer behaved in a
more OO fashion.  Now, instead of having a static method
MeanShiftClusterer.clusterPoints() that then creates a MeanShiftClusterer
object, there is an instance method called cluster().  It uses the same
code, but makes re-use a lot easier if you want to cluster several groups of
points using the same parameters.

Going forward I would like to contribute more to the project, specifically
to the clustering code, as that is what I'm using for my own project.  I'd
be willing to add more documentation to the other clustering algorithms, as
well as work on the distinction between the interface, abstract class, and
concrete classes that represent the clusters (Cluster, ClusterBase,
AbstractCluster, MeanShiftCanopy, et cetera).  Also, I would like to look
into making an interface and abstract class for all of the different
clustering algorithms.

- Chris Wailes

P.S. This (https://cwiki.apache.org/confluence/display/MAHOUT/Issue+Tracker)
said that people new to the project should discuss issues on the mailing
list before opening issues, so...

Re: MeanShift Clustering Patch

Posted by Chris Wailes <ch...@gmail.com>.

>
> Also, speaking as a good hypocrite
> should (see MAHOUT-228), it would be good to have multiple small patches
> rather than a large omnibus patch since we can dispose of the small ones in
> less aggregate time than the large one.
>
> In the future I will try to keep patches small and concise :-)
Unfortunately all the work has been done and I didn't have the foresight to
create smaller patches along the way.

Some of the things that you suggest aren't what I tend to do, but in style
> issues, I value the experience of the naive reader over my own prejudices.
>  If you were confused when you read the code, I think we should improve the
> readability.  If it is something that you do "just because", then we should
> probably both defer to the standard Mahout style rather than changing it.
>
> As for the formatting work I have done I tend to think I have reasons for
it all, but in the end they just boil down to personal value-judgments.  I
find having the functions arranged as I have done makes it easier to find
the function I'm looking for when I'm using simpler text editors.  Also, I
tend to put whitespace lines between logically seperated blocks of code to
make that distinction clearer on first glance.  That way, when reading the
code (especially for the first time) you don't have to read every line to
see where one step begins and another ends.

One thing that I particularly want to have at the end of that exercise is to
> have clusters and classification models be unified.  It should not matter
> (much) where a model came from, you should be able to classify new examples
> using it.  You should also be able to save and restore the model in a
> pretty
> uniform way.
>
> This also implies that we need a consistent way to represent examples to be
> classified.
>
> In regards to the unified models for clustering and classification, I agree
that that would be nice.  However, moving in that direction all at once
seems like it would be hard.  With such big differences between the
individual clustering algorithm's models it would be hard to create a brand
new class/interface in one fell swoop that would have all of the features
that they all needed.  What I had thought of doing was to pull each of the
clusterers into this new model one at a time.  As much functionality as
possible would be moved into the abstract BaseCluster (or whatever), and a
common interface could be constructed.  After that, a similar process could
be done for the classifiers and then the overlap between the two models
could be moved into a new, unifying class.

What thoughts do you have on larger scale design issues?  What would you
> like to see?  Can you share some user stories about how you would like to
> use clustering?
>
> What I would really like to see, as a user, for the clustering API is this
clustering abstraction.  I would really like to be able to change my
clusterer simply by changing the class that is created, and not having to
re-factor all of my code from using MeanShiftCanopys to DirichletClusters.
I have only been using Mahout for a week or so on my local machine and
haven't reached a point in the project where I need to scale it up using
Hadoop yet, so I can't really comment on those issues.  This will be
something I'll be doing in the near future though.

Another change that I made and forgot to mention is the setT1() and setT2()
methods.  In my application I'll be mostly using the same clusterer over and
over again, but occasionally I'd like to let to user change these values to
get more or fewer clusters.  This gets back to making the clustering
algorithms behave more like objects instead of library functions.

- Chris Wailes

On Thu, Aug 12, 2010 at 4:34 PM, Ted Dunning <te...@gmail.com> wrote:

> This is a great thing in general, and we were just discussing how the
> clustering and classification API's need to be made more coherent.
>
> One thing that I particularly want to have at the end of that exercise is
> to
> have clusters and classification models be unified.  It should not matter
> (much) where a model came from, you should be able to classify new examples
> using it.  You should also be able to save and restore the model in a
> pretty
> uniform way.
>
> This also implies that we need a consistent way to represent examples to be
> classified.
>
> What you are talking about so far is to make the construction of clustering
> models more consistent which is really, really good and important, but it
> needs to be in the large context of making clustering and classification
> coherent as well.
>
> What thoughts do you have on larger scale design issues?  What would you
> like to see?  Can you share some user stories about how you would like to
> use clustering?
>
> On Thu, Aug 12, 2010 at 3:08 PM, Chris Wailes <chris.wailes@gmail.com
> >wrote:
>
> > Lastly, I made an API change so that the MeanShiftClusterer behaved in a
> > more OO fashion.  Now, instead of having a static method
> > MeanShiftClusterer.clusterPoints() that then creates a MeanShiftClusterer
> > object, there is an instance method called cluster().  It uses the same
> > code, but makes re-use a lot easier if you want to cluster several groups
> of
> > points using the same parameters.
> >
>

Re: MeanShift Clustering Patch

Posted by Ted Dunning <te...@gmail.com>.

This is a great thing in general, and we were just discussing how the
clustering and classification API's need to be made more coherent.

One thing that I particularly want to have at the end of that exercise is to
have clusters and classification models be unified.  It should not matter
(much) where a model came from, you should be able to classify new examples
using it.  You should also be able to save and restore the model in a pretty
uniform way.

This also implies that we need a consistent way to represent examples to be
classified.

What you are talking about so far is to make the construction of clustering
models more consistent which is really, really good and important, but it
needs to be in the large context of making clustering and classification
coherent as well.

What thoughts do you have on larger scale design issues?  What would you
like to see?  Can you share some user stories about how you would like to
use clustering?

On Thu, Aug 12, 2010 at 3:08 PM, Chris Wailes <ch...@gmail.com>wrote:

> Lastly, I made an API change so that the MeanShiftClusterer behaved in a
> more OO fashion.  Now, instead of having a static method
> MeanShiftClusterer.clusterPoints() that then creates a MeanShiftClusterer
> object, there is an instance method called cluster().  It uses the same
> code, but makes re-use a lot easier if you want to cluster several groups of
> points using the same parameters.
>

Re: MeanShift Clustering Patch

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  +1 on the documentation but the patch is too big for me to assess just 
by reading it. It would be nice to separate the comments and reordering 
of methods from more substantive refactorings and I'm all in favor of 
making the APIs more uniform. We've been making headway in that area 
recently. As an Eclipse user I'm insulated a bit from the textual 
ordering of methods but I also like keeping statics, private and public 
methods grouped in logical ways. Some of the current state is a 
consequence of historical evolution and can be made more user friendly. 
I will dig deeper into the patch this week.

I also suggest opening a JIRA (if you haven't already and I missed it) 
so we can channel the discussion.

Jeff

On 8/12/10 3:30 PM, Ted Dunning wrote:
> You are an ACE.
>
> The documentation is something that we unqualifiedly need.
>
> The code rearrangement sounds good in the abstract, but I would want to talk
> about the details a bit.  Opening a JIRA and posting a patch is a great way
> to stimulate that discussion.  Also, speaking as a good hypocrite
> should (see MAHOUT-228), it would be good to have multiple small patches
> rather than a large omnibus patch since we can dispose of the small ones in
> less aggregate time than the large one.
>
> Some of the things that you suggest aren't what I tend to do, but in style
> issues, I value the experience of the naive reader over my own prejudices.
>   If you were confused when you read the code, I think we should improve the
> readability.  If it is something that you do "just because", then we should
> probably both defer to the standard Mahout style rather than changing it.
>
> On Thu, Aug 12, 2010 at 3:08 PM, Chris Wailes<ch...@gmail.com>wrote:
>
>> While doing so I decided to add some additional documentation based on what
>> I found out, re-ordered some of the functions so they could be found easier
>> (static and instance separated out), and re-formatted some of the code for
>> readability (added spaces between code blocks, broke long lines up into
>> multiple lines, and added 'this' to variables and function calls to make
>> context explicit).

Re: MeanShift Clustering Patch

Posted by Ted Dunning <te...@gmail.com>.

You are an ACE.

The documentation is something that we unqualifiedly need.

The code rearrangement sounds good in the abstract, but I would want to talk
about the details a bit.  Opening a JIRA and posting a patch is a great way
to stimulate that discussion.  Also, speaking as a good hypocrite
should (see MAHOUT-228), it would be good to have multiple small patches
rather than a large omnibus patch since we can dispose of the small ones in
less aggregate time than the large one.

Some of the things that you suggest aren't what I tend to do, but in style
issues, I value the experience of the naive reader over my own prejudices.
 If you were confused when you read the code, I think we should improve the
readability.  If it is something that you do "just because", then we should
probably both defer to the standard Mahout style rather than changing it.

On Thu, Aug 12, 2010 at 3:08 PM, Chris Wailes <ch...@gmail.com>wrote:

> While doing so I decided to add some additional documentation based on what
> I found out, re-ordered some of the functions so they could be found easier
> (static and instance separated out), and re-formatted some of the code for
> readability (added spaces between code blocks, broke long lines up into
> multiple lines, and added 'this' to variables and function calls to make
> context explicit).