Posted to dev@mahout.apache.org by Matthew Riley <mr...@gmail.com> on 2008/03/06 20:45:31 UTC

Re: Google Summer of Code [esp. More Clustering]

Hey Jeff-

I'm certainly willing to put some energy into developing implementations of
these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from; some are more
detailed than others:

Mean-Shift clustering was introduced in this paper, which is also a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just the guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net> wrote:

> Hi Matthew,
>
> As with most open source projects, "interest" is mainly a function of
> the willingness of somebody to contribute their energy. Clustering is
> certainly within the scope of the project. I'd be interested in
> exploring additional clustering algorithms with you and your colleague.
> I'm a complete noob in this area and it is always enlightening to work
> with students who have more current theoretical exposure.
>
> Do you have some links on these approaches that you find particularly
> helpful?
>
> Jeff
>
> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Wednesday, March 05, 2008 11:11 PM
> To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> Subject: Re: Google Summer of Code
>
> Hey everyone-
>
> I've been watching the mailing list for a little while now, hoping to
> contribute once I became more familiar, but I wanted to jump in here
> now and express my interest in the Summer of Code project. I'm
> currently a graduate student in electrical engineering at UT-Austin
> working in computer vision, which is closely tied to many of the
> problems Mahout is addressing (especially in my area of content-based
> retrieval).
>
> What can I do to help out?
>
> I've discussed some potential Mahout projects with another student
> recently, mostly focused around approximate k-means algorithms (since
> that's a problem I've been working on lately). It sounds like you guys
> are already implementing canopy clustering for k-means. Is there any
> interest in developing another approximation algorithm based on
> randomized kd-trees for high-dimensional data? What about mean-shift
> clustering?
>
> Again, I would be glad to help in any way I can.
>
> Matt
>
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> wrote:
>
> > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > Also, any thoughts on what we might want someone to do? I think it
> > > would be great to have someone implement one of the algorithms on
> > > our wiki.
> >
> > Just as a general note, the deadline for applications:
> >
> > March 12: Mentoring organization application deadline (12 noon
> > PDT / 19:00 UTC).
> >
> > I suppose we should identify interesting tasks before that deadline.
> > As a general guideline for mentors and for project proposals:
> >
> > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >
> > Isabel
> >
> > --
> > Better late than never.         -- Titus Livius (Livy)
> >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >  /,`.-'`'    -.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> >
>

Re: Google Summer of Code [esp. More Clustering]

Posted by Matthew Riley <mr...@gmail.com>.
Hey Grant-

I believe scaling Mean-Shift clustering using M/R will be pretty
straightforward. I'm not as sure about K-Means using KD-Trees, since I
haven't personally implemented that algorithm, but it follows K-Means
fairly closely, so I imagine it is possible.

I'll get to work on a proposal with some of my ideas, and hopefully get some
feedback from you guys during the process.

Thanks for all the responses so far.

Matt

On Thu, Mar 6, 2008 at 3:25 PM, Grant Ingersoll <gs...@apache.org> wrote:


Re: Google Summer of Code [esp. More Clustering]

Posted by Grant Ingersoll <gs...@apache.org>.
I haven't read the papers, but the big question is: do you think they
can scale using M/R or some other distributed techniques?

If so, feel free to write up a bit of a proposal using the info at:
http://wiki.apache.org/general/SummerOfCode2008
If you are unsure, that is fine too. We could start with a simpler
implementation, and then look to distribute it.


On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Google Summer of Code [esp. More Clustering]

Posted by Ted Dunning <td...@veoh.com>.
Shooting somewhat in the dark here, but isn't this susceptible to the
mean-field techniques that are used for variational integration of Bayesian
probabilistic models (except that you are starting with a field
representation rather than introducing it)? I can't say what the details
are, but this does seem plausible.

Also, wouldn't a sampling of the points give a good approximation of the
field? If so, then building an n-centroid approximation of the field should
be possible for a sampled subset. Then, given that approximation, you can
perform the mean shift very much in parallel, at which point you can repeat
the approximation and be ready to roll for the next shift. Moreover, since
the fields are additive, you can compute the field on all of the sub-sample
due to a sub-sub-sample at each of the map nodes and then combine them in a
reduce step. That would allow you to simplify what is likely the most
expensive step of the approximation.
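
To make the additivity concrete, here is a minimal Java sketch with no
Hadoop wiring; the Gaussian kernel and all names are assumptions for
illustration. The field at a query point over the whole sample equals the
sum of the partial fields over disjoint partitions, so each map node can
emit a partial sum and a reduce step can simply add them.

    public class AdditiveField {

        // Gaussian kernel contribution of sample point x to the field at query q.
        static double kernel(double[] q, double[] x, double h) {
            double d2 = 0.0;
            for (int i = 0; i < q.length; i++) {
                double d = q[i] - x[i];
                d2 += d * d;
            }
            return Math.exp(-d2 / (2.0 * h * h));
        }

        // Partial field at q due to one partition of the sample (the "map" step).
        static double partialField(double[] q, double[][] partition, double h) {
            double sum = 0.0;
            for (double[] x : partition) {
                sum += kernel(q, x, h);
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] q = {1.5, 1.5};
            double h = 1.0;
            double[][] part1 = {{0, 0}, {1, 1}};
            double[][] part2 = {{2, 2}, {3, 3}};
            double[][] whole = {{0, 0}, {1, 1}, {2, 2}, {3, 3}};

            // The "reduce" step is a plain sum of the partial fields.
            double combined = partialField(q, part1, h) + partialField(q, part2, h);
            double direct = partialField(q, whole, h);
            System.out.printf("combined=%.6f direct=%.6f%n", combined, direct);
        }
    }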

The k-means pre-step approach is also likely to be helpful, at least in
relatively low dimensional problems. In high dimensional spaces, however,
everything outside a cluster is approximately equidistant, so using the
k-means approximation for the field simply reduces to a very expensive form
of k-means clustering, or a variant on mean-shift with a useless k-means
add-on.


On 3/10/08 5:57 PM, "Matthew Riley" <mr...@gmail.com> wrote:

RE: Google Summer of Code [esp. More Clustering]

Posted by Jeff Eastman <je...@windwardsolutions.com>.
Hi Matthew,

I've implemented a minimal, non-MR version of the algorithm below to see how
it would behave. The operative code is in TestMeanShift.testMeanShift() and
MeanShiftCanopy.mergeCanopy(). The rest of the MR classes are stuff I copied
from Canopy, so you can ignore them.

The TestMeanShift.setUp() method builds a 100x3 point matrix that represents
a 10x10 image with the diagonal intensified (i.e. a '\' character mask).
Then testMeanShift() creates an initial set of 100 canopies from it and
iterates over the canopies, merging their centroids into a new set of
canopies, until the canopy list size does not change any more. Finally it
prints out the canopies that were found for each cell in the original image.

Every time two canopies come within T2 distance of each other they merge,
reducing the number of canopies. The original points that were bound to each
canopy are also merged so that, at the end of the iteration, the original
points are available in their respective canopies.
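
A rough, non-MR Java sketch of that merge loop, under the same reading of
the algorithm (the class and method names here are illustrative, not the
actual TestMeanShift/MeanShiftCanopy code):

    import java.util.ArrayList;
    import java.util.List;

    class Canopy {
        double[] centroid;
        List<double[]> boundPoints = new ArrayList<>();

        Canopy(double[] point) {
            centroid = point.clone();
            boundPoints.add(point);
        }

        // Merge other into this canopy: move the centroid to the weighted
        // average and pool the original bound points.
        void merge(Canopy other) {
            int n = boundPoints.size(), m = other.boundPoints.size();
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] = (centroid[i] * n + other.centroid[i] * m) / (n + m);
            }
            boundPoints.addAll(other.boundPoints);
        }

        static double distance(double[] a, double[] b) {
            double d2 = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                d2 += d * d;
            }
            return Math.sqrt(d2);
        }

        // Whenever two canopies come within t2 of each other they merge;
        // stop when the canopy list size no longer shrinks.
        static List<Canopy> cluster(List<double[]> points, double t2) {
            List<Canopy> canopies = new ArrayList<>();
            for (double[] p : points) {
                canopies.add(new Canopy(p));
            }
            int previousSize;
            do {
                previousSize = canopies.size();
                List<Canopy> merged = new ArrayList<>();
                for (Canopy c : canopies) {
                    Canopy target = null;
                    for (Canopy t : merged) {
                        if (distance(t.centroid, c.centroid) < t2) {
                            target = t;
                            break;
                        }
                    }
                    if (target == null) {
                        merged.add(c);
                    } else {
                        target.merge(c);
                    }
                }
                canopies = merged;
            } while (canopies.size() < previousSize);
            return canopies;
        }
    }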

Depending upon the values chosen for T1 and T2, the process either converges
quickly or slowly. The loop terminates before actual convergence is
achieved, but it does seem to cluster the input coherently.

I hesitate to call this MeanShift but it is something similar that follows
the same general algorithm, as I understand it at least. I hope you find it
interesting.

Jeff


> -----Original Message-----
> From: Jeff Eastman [mailto:jeff@windwardsolutions.com]
> Sent: Monday, March 10, 2008 9:09 PM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Google Summer of Code[esp. More Clustering]
> 

RE: Google Summer of Code [esp. More Clustering]

Posted by Jeff Eastman <je...@windwardsolutions.com>.
Hi Matthew,

I'd like to pursue that canopy thought a little further and mix it in with
your subsampling idea. Optimizing can come later, once we figure out how to
do mean-shift in M/R at all. How about this?

1. Each mean-shift iteration consists of a canopy clustering of all the
points, with T1 set to the desired sampling resolution (h?) and T2 set to 1.
This will create, for each point in the input set, one canopy centered on
that point and containing all of its neighbors that are close enough to
influence its next position in its trajectory (the window?).

2. We then calculate the centroid of each canopy (that's actually done
already by the canopy cluster reducer). Is this centroid not also the
weighted average you desire for the next location of the point at its
center?

3. As the computation proceeds, the canopies will collapse together as their
various centroids move inside the T2=1 radius. At the point when all points
have converged, the remaining canopies will be the mean-shift clusters
(modes?) of the dataset and their contents will be the migrated points in
each cluster.

4. If each original point is duplicated as its own payload, then the
iterations will produce clusters of migrated points whose payloads are the
final contents of each cluster.
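
As a concrete reading of step 1, here is a minimal Java sketch assuming
Euclidean distance and plain collections (illustrative names, not the
actual canopy code):

    import java.util.ArrayList;
    import java.util.List;

    public class CanopyWindows {

        // Step 1: one canopy per input point, holding every neighbor within t1.
        // The centroid of each window (step 2) is then the weighted average
        // that moves the center point on the next iteration.
        static List<List<double[]>> buildWindows(List<double[]> points, double t1) {
            List<List<double[]>> windows = new ArrayList<>();
            for (double[] center : points) {
                List<double[]> window = new ArrayList<>();
                for (double[] p : points) {
                    if (distance(center, p) <= t1) {
                        window.add(p);
                    }
                }
                windows.add(window);
            }
            return windows;
        }

        static double distance(double[] a, double[] b) {
            double d2 = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                d2 += d * d;
            }
            return Math.sqrt(d2);
        }
    }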

Can you wrap your mind around this enough to validate my assumptions? 

Jeff

> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Monday, March 10, 2008 5:58 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code[esp. More Clustering]
> 

Re: Google Summer of Code [esp. More Clustering]

Posted by Matthew Riley <mr...@gmail.com>.
Hi Jeff-

I think your "basin of attraction" understanding is right on. I also like
your ideas for distributing the mean-shift iterations by following a
canopy-style method. My intuition was a little different, and I would like
to hear your ideas on it:

Just to make sure we're on the same page....
Say we have 1 million points in our original dataset, and we want to cluster
by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
points from the original dataset and follow the gradient of those points to
the region of highest density (and as we saw from the paper, rather than
calculate the gradient itself we can equivalently compute the weighted
average of our subsampled points and move the centroid to that point). This
part seems fairly straightforward to distribute - we just send a different
subsampled set to each processor and each processor returns the final
centroid for that set.
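
Here is one such update as a minimal Java sketch, assuming a Gaussian kernel
with bandwidth h (the names are illustrative): the centroid moves to the
kernel-weighted average of the subsample, which is the gradient-following
step described above.

    public class ShiftStep {

        // Move a centroid to the kernel-weighted average of the subsampled
        // points; repeating this follows the density gradient to a mode.
        static double[] shift(double[] centroid, double[][] subsample, double h) {
            int dim = centroid.length;
            double[] weightedSum = new double[dim];
            double totalWeight = 0.0;
            for (double[] p : subsample) {
                double d2 = 0.0;
                for (int i = 0; i < dim; i++) {
                    double d = p[i] - centroid[i];
                    d2 += d * d;
                }
                double w = Math.exp(-d2 / (2.0 * h * h)); // Gaussian kernel weight
                totalWeight += w;
                for (int i = 0; i < dim; i++) {
                    weightedSum[i] += w * p[i];
                }
            }
            for (int i = 0; i < dim; i++) {
                weightedSum[i] /= totalWeight;
            }
            return weightedSum; // the centroid's next location
        }
    }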

The problem I see is that 10,000 points (or whatever value we choose) may
be too much for a single processor if we have to compute the distance to
every single point when we compute the weighted mean. My thought here was to
exploit the fact that we're using a kernel function (Gaussian, uniform,
etc.) in the weighted mean calculation and that kernel will have a set
radius. Because the radius is static, it may be easy to (quickly) identify
the points that we must consider in the calculation (i.e. those within the
radius) by using a locality sensitive hashing scheme, tuned to that
particular radius. Of course, the degree of advantage we get from this
method will depend on the data itself, but intuitively I think we will
usually see a dramatic improvement.
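
A sketch of that hashing idea using p-stable (random-projection) LSH, where
points within the kernel radius tend to share a bucket; the bucket width
below is an illustrative guess, not a tuned parameter:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class RadiusLsh {

        final double[][] projections; // one random Gaussian direction per hash
        final double[] offsets;       // random offsets in [0, w)
        final double w;               // bucket width, tied to the kernel radius

        RadiusLsh(int numHashes, int dim, double radius, Random rng) {
            this.w = 4.0 * radius; // illustrative: buckets a few radii wide
            this.projections = new double[numHashes][dim];
            this.offsets = new double[numHashes];
            for (int k = 0; k < numHashes; k++) {
                for (int i = 0; i < dim; i++) {
                    projections[k][i] = rng.nextGaussian();
                }
                offsets[k] = rng.nextDouble() * w;
            }
        }

        // Concatenated bucket key; points within the radius collide with
        // high probability, distant points rarely do.
        String key(double[] x) {
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < projections.length; k++) {
                double dot = offsets[k];
                for (int i = 0; i < x.length; i++) {
                    dot += projections[k][i] * x[i];
                }
                sb.append((long) Math.floor(dot / w)).append(':');
            }
            return sb.toString();
        }

        // Index points by bucket; the weighted-mean step then scans only the
        // centroid's own bucket instead of the whole subsample.
        Map<String, List<double[]>> index(double[][] points) {
            Map<String, List<double[]>> buckets = new HashMap<>();
            for (double[] p : points) {
                buckets.computeIfAbsent(key(p), k -> new ArrayList<>()).add(p);
            }
            return buckets;
        }
    }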

Honestly, I should do more background work developing this idea, and
possibly try a MATLAB implementation to test the feasibility. This sounds
more like a research paper than something we should dive into immediately,
but I wanted to share the idea and get some feedback if anyone has
thoughts...

Matt


On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <je...@collab.net> wrote:


RE: Google Summer of Code [esp. More Clustering]

Posted by Jeff Eastman <je...@collab.net>.
Hi Matthew,

I've been looking over the mean-shift papers for the last several days.
While the details of the math are still sinking in, it looks like the
basic algorithm might be summarized thusly:

Points in an n-d feature space are migrated iteratively in the direction
of maxima in their local density functions. Points within a "basin of
attraction" all converge to the same maxima and thus belong to the same
cluster.
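
For reference, the update rule from the mean-shift paper cited earlier
(Comaniciu and Meer): with kernel profile g and bandwidth h, each point x is
displaced by the mean shift vector

    m_h(x) = \frac{\sum_{i=1}^{n} x_i \, g(\|(x - x_i)/h\|^2)}
                  {\sum_{i=1}^{n} g(\|(x - x_i)/h\|^2)} - x

and points whose trajectories end at the same mode form one cluster, which
is exactly the "basin of attraction" picture above.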

A physical analogy might be(?):

Gas particles in 3-space, operating with gravitational attraction but
without momentum, would tend to cluster similarly.

The algorithm seems to require that each point be compared with every
other point. This might be taken to require each mapper to see all of
the points, thus frustrating scalability. OTOH, Canopy clustering avoids
this by clustering the clusters produced by the subsets of points seen
by each mapper. K-means has the requirement that each point needs to be
compared with all of the cluster centers, not points. It has a similar
iterative structure over clusters (a much smaller, constant number) that
might be employed.

There is a lot of locality in the local density function window, and
this could perhaps be exploited. If points could be pre-clustered (as
canopy is often used to prime the k-means iterations), parallelization
might be feasible.

Are these observations within a "basin of attraction" to your
understanding of mean-shift <grin>? 

Jeff

 

-----Original Message-----
From: Matthew Riley [mailto:mriley@gmail.com] 
Sent: Thursday, March 06, 2008 11:46 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Google Summer of Code[esp. More Clustering]
