You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2008/03/12 02:00:48 UTC

[jira] Created: (MAHOUT-15) Investigate Mean Shift Clustering

Investigate Mean Shift Clustering
---------------------------------

                 Key: MAHOUT-15
                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman


"The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 

Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

RE: [jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by Jeff Eastman <je...@windwardsolutions.com>.

Hi Matt,

No problem, the more I study the papers the less convinced I am that this is
an instance of mean shift. It is something related but at this point I don't
know how to categorize it. 

I hope your job hunting is going well,
Jeff

> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Thursday, March 13, 2008 12:24 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: [jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering
> 
> Hey Jeff-
> 
> Sorry I've been so quiet on this stuff lately -- I've been traveling for
> job
> interviews. I haven't had a chance to look through your code thoroughly,
> but
> I'm hoping to get to it by the weekend. Will get back to you soon...
> 
> Matt
> 
> On Wed, Mar 12, 2008 at 3:10 PM, Jeff Eastman (JIRA) <ji...@apache.org>
> wrote:
> 
> >
> >     [
> > https://issues.apache.org/jira/browse/MAHOUT-
> 15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
> >
> > Jeff Eastman updated MAHOUT-15:
> > -------------------------------
> >
> >     Attachment: MAHOUT-15b.patch
> >
> > I noticed that the mergeCanopy method was only adding the new canopy
> > center to the existing canopies, but not adding the existing canopy
> centers
> > to itself. This produced an assymetry defect that, I think, also exists
> in
> > canopy clustering and kmeans. Correcting this problem leads the
> algorithm to
> > now converge exactly upon the input image after 40-odd iterations:
> >
> > ABBBBBBBBB
> > BABBBBBBBB
> > BBABBBBBBB
> > BBBABBBBBB
> > BBBBABBBBB
> > BBBBBABBBB
> > BBBBBBABBB
> > BBBBBBBABB
> > BBBBBBBBAB
> > BBBBBBBBBA
> >
> > > Investigate Mean Shift Clustering
> > > ---------------------------------
> > >
> > >                 Key: MAHOUT-15
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
> > >             Project: Mahout
> > >          Issue Type: New Feature
> > >          Components: Clustering
> > >            Reporter: Jeff Eastman
> > >            Assignee: Jeff Eastman
> > >         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch
> > >
> > >
> > > "The mean shift algorithm is a nonparametric clustering technique
> which
> > does not require prior knowledge of the number of clusters, and does not
> > constrain the shape of the clusters."
> >
> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.p
> df
> > > Investigate implementing mean shift clustering using Hadoop
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >

Re: [jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by Matthew Riley <mr...@gmail.com>.

Hey Jeff-

Sorry I've been so quiet on this stuff lately -- I've been traveling for job
interviews. I haven't had a chance to look through your code thoroughly, but
I'm hoping to get to it by the weekend. Will get back to you soon...

Matt

On Wed, Mar 12, 2008 at 3:10 PM, Jeff Eastman (JIRA) <ji...@apache.org>
wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Jeff Eastman updated MAHOUT-15:
> -------------------------------
>
>     Attachment: MAHOUT-15b.patch
>
> I noticed that the mergeCanopy method was only adding the new canopy
> center to the existing canopies, but not adding the existing canopy centers
> to itself. This produced an assymetry defect that, I think, also exists in
> canopy clustering and kmeans. Correcting this problem leads the algorithm to
> now converge exactly upon the input image after 40-odd iterations:
>
> ABBBBBBBBB
> BABBBBBBBB
> BBABBBBBBB
> BBBABBBBBB
> BBBBABBBBB
> BBBBBABBBB
> BBBBBBABBB
> BBBBBBBABB
> BBBBBBBBAB
> BBBBBBBBBA
>
> > Investigate Mean Shift Clustering
> > ---------------------------------
> >
> >                 Key: MAHOUT-15
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Clustering
> >            Reporter: Jeff Eastman
> >            Assignee: Jeff Eastman
> >         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch
> >
> >
> > "The mean shift algorithm is a nonparametric clustering technique which
> does not require prior knowledge of the number of clusters, and does not
> constrain the shape of the clusters."
> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
> > Investigate implementing mean shift clustering using Hadoop
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

[jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-15:
-------------------------------

    Attachment: MAHOUT-15a.patch

I've implemented a minimal, non-MR version of the algorithm below to see how it would behave. The operant code is in TestMeanShift.testMeanShift() and MeanShiftCanopy.mergeCanopy(). The rest of the MR classes are stuff I copied from Canopy so you can ignore them.

The TestMeanShift.setUp() method builds a 100x3 point matrix that represents a 10x10 image with the diagonal intensified (i.e. a '\' character mask). Then testMeanShift()creates an initial set of 100 canopies from it and iterates over the canopies merging their centroids into a new set of canopies until the canopy centers do not change any more (< delta). Finally it prints out the canopies that were found for each cell in the original image.

Every time two canopies come within T2 distance of each other they merge, reducing the number of canopies. The original points that were bound to each canopy are also merged so that, at the end of the iteration, the original
points are available in their respective canopies.

Depending upon the values chosen for T1 and T2, the process either converges quickly or slowly. The loop seems to erminate consistently and it does seem to cluster the input coherently.

I hesitate to call this MeanShift but it is something similar that follows the same general algorithm, as I  understand it at least. I hope you find it interesting.


> -----Original Message-----
> From: Jeff Eastman [mailto:jeff@windwardsolutions.com]
> Sent: Monday, March 10, 2008 9:09 PM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Google Summer of Code[esp. More Clustering]
> 
> Hi Matthew,
> 
> I'd like to pursue that canopy thought a little further and mix it in with
> your sub sampling idea. Optimizing can come later, once we figure out how
> to
> do mean-shift in M/R at all. How about this?
> 
> 1. Each mean-shift iteration consists of a canopy clustering of all the
> points, with T1 set to the desired sampling resolution (h?) and T2 set to
> 1.
> This will create one canopy centered on each point in the input set which
> contains all of its neighbors that are close enough to influence its next
> position in its trajectory (the window?).
> 
> 2. We then calculate the centroid of each canopy (that's actually done
> already by the canopy cluster reducer). Is this centroid not also the
> weighted average you desire for the next location of the point at its
> center?
> 
> 3. As the computation proceeds, the canopies will collapse together as
> their
> various centroids move inside the T2=1 radius. At the point when all
> points
> have converged, the remaining canopies will be the mean-shift clusters
> (modes?) of the dataset and their contents will be the migrated points in
> each cluster.
> 
> 4. If each original point is duplicated as its own payload, then the
> iterations will produce clusters of migrated points whose payloads are the
> final contents of each cluster.
> 
> Can you wrap your mind around this enough to validate my assumptions?
> 
> Jeff
> 
> > -----Original Message-----
> > From: Matthew Riley [mailto:mriley@gmail.com]
> > Sent: Monday, March 10, 2008 5:58 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: Google Summer of Code[esp. More Clustering]
> >
> > Hi Jeff-
> >
> > I think your "basin of attraction" understanding is right on. I also
> like
> > your ideas for distributing the mean-shift iterations by following a
> > canopy-style method. My intuition was a little different, and I would
> like
> > to hear your ideas on it:
> >
> > Just to make sure we're on the same page....
> > Say we have 1 million point in our original dataset, and we want to
> > cluster
> > by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
> > points from the original dataset and follow the gradient of those points
> > to
> > the region of highest density (and as we saw from the paper, rather than
> > calculate the gradient itself we can equivalently compute the weighted
> > average of our subsampled points and move the centroid to that point).
> > This
> > part seems fairly straightforward to distribute - we just send a
> different
> > subsampled set to each processor and each processor returns the final
> > centroid for that set.
> >
> > The problem I see is that 10,000 points (or whatever value we choose),
> may
> > be too much for a single processor if we have to compute the distance to
> > every single point when we compute the weighted mean. My thought here
> was
> > to
> > exploit the fact that we're using a kernel function (gaussian, uniform,
> > etc.) in the weighted mean calculation and that kernel will have a set
> > radius. Because the radius is static, it may be easy to (quickly)
> identify
> > the points that we must consider in the calculation (i.e. those within
> the
> > radius) by using a locality sensitive hashing scheme, tuned to that
> > particular radius. Of course, the degree of advantage we get from this
> > method will depend on the data itself, but intuitively I think we will
> > usually see a dramatic improvement.
> >
> > Honestly, I should do more background work developing this idea, and
> > possibly try a matlab implementation to test the feasibility. This
> sounds
> > more like a research paper than something we should dive into
> immediately,
> > but I wanted to share the idea and get some feedback if anyone has
> > thoughts...
> >
> > Matt
> >
> >
> > On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <je...@collab.net>
> > wrote:
> >
> > > Hi Matthew,
> > >
> > > I've been looking over the mean-shift papers for the last several
> days.
> > > While the details of the math are still sinking in, it looks like the
> > > basic algorithm might be summarized thusly:
> > >
> > > Points in an n-d feature space are migrated iteratively in the
> direction
> > > of maxima in their local density functions. Points within a "basin of
> > > attraction" all converge to the same maxima and thus belong to the
> same
> > > cluster.
> > >
> > > A physical analogy might be(?):
> > >
> > > Gas particles in 3-space, operating with gravitational attraction but
> > > without momentum, would tend to cluster similarly.
> > >
> > > The algorithm seems to require that each point be compared with every
> > > other point. This might be taken to require each mapper to see all of
> > > the points, thus frustrating scalability. OTOH, Canopy clustering
> avoids
> > > this by clustering the clusters produced by the subsets of points seen
> > > by each mapper. K-means has the requirement that each point needs to
> be
> > > compared with all of the cluster centers, not points. It has a similar
> > > iterative structure over clusters (a much smaller, constant number)
> that
> > > might be employed.
> > >
> > > There is a lot of locality in the local density function window, and
> > > this could perhaps be exploited. If points could be pre-clustered (as
> > > canopy is often used to prime the k-means iterations), parallelization
> > > might be feasible.
> > >
> > > Are these observations within a "basin of attraction" to your
> > > understanding of mean-shift <grin>?
> > >
> > > Jeff
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Matthew Riley [mailto:mriley@gmail.com]
> > > Sent: Thursday, March 06, 2008 11:46 AM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: Google Summer of Code[esp. More Clustering]
> > >
> > > Hey Jeff-
> > >
> > > I'm certainly willing to put some energy into developing
> implementations
> > > of
> > > these algorithms, and it's good to hear that you may be interested in
> > > guiding us in the right direction.
> > >
> > > Here are the references I learned the algorithms from- some are more
> > > detailed than others:
> > >
> > > Mean-Shift clustering was introduced here and this paper is a thorough
> > > reference:
> > > Mean-Shift: A Robust Approach to Feature Space Analysis
> > > http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
> > >
> > > And here's a PDF with just guts of the algorithm outlined:
> > > homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
> > >
> > > It looks like there isn't a definitive reference for the k-means
> > > approximation with randomized k-d trees, but there are promising
> results
> > > introduced here:
> > >
> > > Object retrieval with large vocabularies and fast spatial matching:
> > >
> >
> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf*<http://
> > www.robots.ox.ac.uk/%7Evgg/publications/papers/philbin07.pdf*>
> > > *
> > > And a deeper explanation of the technique here:
> > >
> > > Randomized KD-Trees for Real-Time Keypoint Detection:
> > > ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
> > >
> > > Let me know what you think.
> > >
> > > Matt
> > >
> > > On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
> > > wrote:
> > >
> > > > Hi Matthew,
> > > >
> > > > As with most open source projects, "interest" is mainly a function
> of
> > > > the willingness of somebody to contribute their energy. Clustering
> is
> > > > certainly within the scope of the project. I'd be interested in
> > > > exploring additional clustering algorithms with you and your
> > > colleague.
> > > > I'm a complete noob in this area and it is always enlightening to
> work
> > > > with students who have more current theoretical exposures.
> > > >
> > > > Do you have some links on these approaches that you find
> particularly
> > > > helpful?
> > > >
> > > > Jeff
> > > >
> > > > -----Original Message-----
> > > > From: Matthew Riley [mailto:mriley@gmail.com]
> > > > Sent: Wednesday, March 05, 2008 11:11 PM
> > > > To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> > > > Subject: Re: Google Summer of Code
> > > >
> > > > Hey everyone-
> > > >
> > > > I've been watching the mailing list for a little while now, hoping
> to
> > > > contribute once I became more familiar, but I wanted to jump in here
> > > now
> > > > and
> > > > express my interest in the Summer of Code project. I'm currently a
> > > > graduate
> > > > student in electrical engineering at UT-Austin working in computer
> > > > vision,
> > > > which is closely tied to many of the problems Mahout is addressing
> > > > (especially in my area of content-based retrieval).
> > > >
> > > > What can I do to help out?
> > > >
> > > > I've discussed some potential Mahout projects with another student
> > > > recently-
> > > > mostly focused around approximate k-means algorithms (since that's a
> > > > problem
> > > > I've been working on lately). It sounds like you guys are already
> > > > implementing canopy clustering for k-means- Is there any interest in
> > > > developing another approximation algorithm based on randomized
> > > kd-trees
> > > > for
> > > > high dimensional data? What about mean-shift clustering?
> > > >
> > > > Again, I would be glad to help in any way I can.
> > > >
> > > > Matt
> > > >
> > > > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <mainec@isabel-
> drost.de>
> > > > wrote:
> > > >
> > > > > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > > > > Also, any thoughts on what we might want someone to do?  I think
> > > it
> > > > > > would be great to have someone implement one of the algorithms
> on
> > > > our
> > > > > > wiki.
> > > > >
> > > > > Just as a general note, the deadline for applications:
> > > > >
> > > > > March 12: Mentoring organization application deadline (12 noon
> > > > PDT/19:00
> > > > > UTC).
> > > > >
> > > > > I suppose we should identify interesing tasks until that deadline.
> > > As
> > > > a
> > > > > general guideline for mentors and for project proposals:
> > > > >
> > > > > http://code.google.com/p/google-summer-of-
> code/wiki/AdviceforMentors
> > > > >
> > > > > Isabel
> > > > >
> > > > > --
> > > > > Better late than never.         -- Titus Livius (Livy)
> > > > >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> > > > >  /,`.-'`'    -.  ;-;;,_
> > > > >  |,4-  ) )-,_..;\ (  `'-'
> > > > > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> > > > >
> > > >
> > >
> 



> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-15a.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-15:
------------------------------

    Fix Version/s: 0.1

> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.1
>
>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch, MAHOUT-15c.patch, MAHOUT-15d.patch, MAHOUT-15e.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-15.
--------------------------------

    Resolution: Fixed

r648085 committed the latest patch, which additionally adds a method to the DistanceMeasures.

> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch, MAHOUT-15c.patch, MAHOUT-15d.patch, MAHOUT-15e.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-15:
-------------------------------

    Attachment: MAHOUT-15d.patch

This patch implements the mapper, combiner and reducer classes and two new unit tests. The driver class needs completion and the code needs some cleanup. Currently, I call it "Mean Shift Canopy Clustering". After looking at the Cluster Analysis section of the statsoft.com reference that Isabel posted, it has the following properties which cause me to believe it is aptly named:

- It is an iterative, amalgamating canopy clustering algorithm
- It allows the use of arbitrary distance measures, though Euclidean is used in the tests
- It uses a weighted pair-group centroid (mean) to determine the next, shifted position of each canopy
- Input is canopies containing a single input data point
- During each iteration, every canopy within T1 affects the centroid calculation of its neighbors
- During each iteration, canopies within T2 are merged and the contained points are accumulated
- At the end of each iteration, each remaining canopy is shifted to its new mean position
- If the shift is less than a configurable delta parameter, the canopy has converged
- Once all canopies have converged, the computation terminates

For very large numbers of input points, multiple mapper/combiner pairs will each iterate once over their canopies before passing all resulting canopies to a single reducer, which does another iteration to combine them. Multiple map/combine/reduce passes will be orchestrated by the driver, until all canopies have converged. At that point, each canopy contains a set of original input points that belong to that cluster.

> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch, MAHOUT-15c.patch, MAHOUT-15d.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-15:
-------------------------------

    Attachment: MAHOUT-15b.patch

I noticed that the mergeCanopy method was only adding the new canopy center to the existing canopies, but not adding the existing canopy centers to itself. This produced an assymetry defect that, I think, also exists in canopy clustering and kmeans. Correcting this problem leads the algorithm to now converge exactly upon the input image after 40-odd iterations:

ABBBBBBBBB
BABBBBBBBB
BBABBBBBBB
BBBABBBBBB
BBBBABBBBB
BBBBBABBBB
BBBBBBABBB
BBBBBBBABB
BBBBBBBBAB
BBBBBBBBBA

> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Friday 14 March 2008, Ted Dunning wrote:
> Clustering implementations are notoriously hard to debug, if only because
> they are relatively robust so that broken implementations will often
> produce plausible results.

I can only second that experience. I think one way of making sanity tests 
could be to additionally implement the algorithm with a tool that is closer 
to the mathematical formulas than say Java and compare the results of each 
step.

Isabel

-- 
BOFH excuse #245:The Borg tried to assimilate your system. Resistance is 
futile.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: [jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by Ted Dunning <td...@veoh.com>.

Clustering implementations are notoriously hard to debug, if only because
they are relatively robust so that broken implementations will often produce
plausible results.


On 3/14/08 12:58 PM, "Jeff Eastman (JIRA)" <ji...@apache.org> wrote:

> 
>      [ 
> https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin
> .system.issuetabpanels:all-tabpanel ]
> 
> Jeff Eastman updated MAHOUT-15:
> -------------------------------
> 
>     Attachment: MAHOUT-15c.patch
> 
> Found another defect in the iteration loop. The order of the done test (done =
> done && migrate(0.5)) was omitting the canopy migrations once the first one
> reported not done. I reversed the elements and now the algorithm converges in
> 4 iterations vs 44. I also tweaked the actual migration routine to merge with
> only the closest canopy vs. the first one encountered. Finally, I added
> another set of values ('/') to the initial image data set and the algorithm
> clustered it correctly too:
> 
> ABBBBBBBBC
> BABBBBBBCB
> BBABBBBCBB
> BBBABBCBBB
> BBBBACBBBB
> BBBBCABBBB
> BBBCBBABBB
> BBCBBBBABB
> BCBBBBBBAB
> CBBBBBBBBA
> 
> Note: The values I added had a z=4 value and were clustered separately (C).
> When I changed their z value to 9, there were only two remaining canopies (A,
> B):
> 
> ABBBBBBBBA
> BABBBBBBAB
> BBABBBBABB
> BBBABBABBB
> BBBBAABBBB
> BBBBAABBBB
> BBBABBABBB
> BBABBBBABB
> BABBBBBBAB
> ABBBBBBBBA
> 
> I still do not know what to call this algorithm, perhaps 'colliding canopies'
> or 'coalescing canopies'? Though it has some similarity to mean shift I'd be
> surprised if the term applies.
> 
>> Investigate Mean Shift Clustering
>> ---------------------------------
>> 
>>                 Key: MAHOUT-15
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Clustering
>>            Reporter: Jeff Eastman
>>            Assignee: Jeff Eastman
>>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch, MAHOUT-15c.patch
>> 
>> 
>> "The mean shift algorithm is a nonparametric clustering technique which does
>> not require prior knowledge of the number of clusters, and does not constrain
>> the shape of the clusters."
>> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>> Investigate implementing mean shift clustering using Hadoop

[jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-15:
-------------------------------

    Attachment: MAHOUT-15c.patch

Found another defect in the iteration loop. The order of the done test (done = done && migrate(0.5)) was omitting the canopy migrations once the first one reported not done. I reversed the elements and now the algorithm converges in 4 iterations vs 44. I also tweaked the actual migration routine to merge with only the closest canopy vs. the first one encountered. Finally, I added another set of values ('/') to the initial image data set and the algorithm clustered it correctly too:

ABBBBBBBBC
BABBBBBBCB
BBABBBBCBB
BBBABBCBBB
BBBBACBBBB
BBBBCABBBB
BBBCBBABBB
BBCBBBBABB
BCBBBBBBAB
CBBBBBBBBA

Note: The values I added had a z=4 value and were clustered separately (C). When I changed their z value to 9, there were only two remaining canopies (A, B):

ABBBBBBBBA
BABBBBBBAB
BBABBBBABB
BBBABBABBB
BBBBAABBBB
BBBBAABBBB
BBBABBABBB
BBABBBBABB
BABBBBBBAB
ABBBBBBBBA

I still do not know what to call this algorithm, perhaps 'colliding canopies' or 'coalescing canopies'? Though it has some similarity to mean shift I'd be surprised if the term applies.

> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch, MAHOUT-15c.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-15) Investigate Mean Shift Clustering

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-15:
-------------------------------

    Attachment: MAHOUT-15e.patch

This patch has improved javadoc comments and removes some debugging code. It includes an extension to DistanceMeasure to introduce measures over Vector instances and deprecates the Float[] method. Finally, it includes a toString() implementation in DenseVector for use in outputting formatted data from the MeanShiftCanopy implementation. Longer term, I think the Vector and Matrix interfaces need an asFormatString() method in addition to the asWritableComparable().

All unit tests run. Once my credentials arrive I plan to commit this. Here's a link to the original paper (http://www.cse.msu.edu/~cse902/S03/MeanShif.pdf), I'd love go get some review to see if anybody else thinks this is an implementation of that algorithm.

> Investigate Mean Shift Clustering
> ---------------------------------
>
>                 Key: MAHOUT-15
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-15
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-15a.patch, MAHOUT-15b.patch, MAHOUT-15c.patch, MAHOUT-15d.patch, MAHOUT-15e.patch
>
>
> "The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters." http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf 
> Investigate implementing mean shift clustering using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.