Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/03/01 03:33:58 UTC

Google Summer of Code

Hi Gang,

I think we should put in for this: http://wiki.apache.org/general/SummerOfCode2008

I would bet there are some students interested in doing ML on Hadoop.
Any of the other committers willing to mentor?  I am, but would also
like some others to help out if you have the time.  See
http://wiki.apache.org/general/SummerOfCodeMentor.


Thanks,
Grant



Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
Well, here's your chance.  Make a proposal of something you would like  
to work on that fits with what we are doing and we'll discuss it and  
possibly put it up as a project.

I think it would be great if someone took on something like an M/R SVM
implementation, or one of the other algorithms that is not already under way.

-Grant

On Mar 1, 2008, at 2:08 AM, jaideep@research.iiit.ac.in wrote:

>> Hi Gang,
>>
>> I think we should put in for this:
>> http://wiki.apache.org/general/SummerOfCode2008
>>
>> I would bet there are some students interested in doing ML on Hadoop.
> Yes. I would be happy to work :) Didn't know that Mahout is also
> participating in SoC.
>> Any of the other committers willing to mentor?  I am, but would also
>> like some others to help out if you have the time.  See
>> http://wiki.apache.org/general/SummerOfCodeMentor
>> .
>>
>>
>> Thanks,
>> Grant
>>
>>
>>
>>
>
>
>



Re: Google Summer of Code

Posted by ja...@research.iiit.ac.in.
> Hi Gang,
>
> I think we should put in for this:
> http://wiki.apache.org/general/SummerOfCode2008
>
> I would bet there are some students interested in doing ML on Hadoop.
Yes. I would be happy to work :) Didn't know that Mahout is also
participating in SoC.
> Any of the other committers willing to mentor?  I am, but would also
> like some others to help out if you have the time.  See
> http://wiki.apache.org/general/SummerOfCodeMentor
> .
>
>
> Thanks,
> Grant
>
>
>
>




Re: Google Summer of Code

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> What about encouraging your students to submit their work to Mahout? Just a 
> naive thought of mine.

Those students I'm in charge of already have their area of interest defined -- 
too late to change it. It's a good idea for the future, though; I have actually 
been thinking about it.

D.

Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Thursday 06 March 2008, Dawid Weiss wrote:
> > five hours per student per week in summer should be doable, at least for
> > me.
>
> I'm out, unfortunately -- my academic job already comes with a bunch of
> students to take care of. It's fun, but it's time-consuming.

What about encouraging your students to submit their work to Mahout? Just a 
naive thought of mine.

Isabel

-- 
Immature poets imitate, mature poets steal.		-- T. S. Eliot, "Philip 
Massinger"
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Anush Shetty <an...@gmail.com>.
On Mon, Mar 10, 2008 at 4:50 PM, Grant Ingersoll <gs...@apache.org>
wrote:

> Wow, maybe w/ all of our mentors we could get 2 students...
>

neat ++ :)



-- 
((Anush Shetty)) ((mail AT anushshetty DOT com))

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
Wow, maybe w/ all of our mentors we could get 2 students...


On Mar 10, 2008, at 3:42 AM, Isabel Drost wrote:

> On Saturday 08 March 2008, Grant Ingersoll wrote:
>> Please feel free to add your name to the list of
>> mentors if you can.
>
> I have added my name and created an account at the Google SoC web
> application. Anything else - apart from reading the GSoC documentation -
> we should not forget?
>
> Isabel
>

Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Saturday 08 March 2008, Grant Ingersoll wrote:
> Please feel free to add your name to the list of
> mentors if you can.

I have added my name and created an account at the Google SoC web application. 
Anything else - apart from reading the GSoC documentation - we should not 
forget?

Isabel

-- 
A woman forgives the audacity of which her beauty has prompted us to be 
guilty.		-- LeSage
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
You have to create an ID and log in...

On Mar 9, 2008, at 7:58 PM, Jeff Eastman wrote:

> I'd be willing to contribute. The page shows as immutable to me, so
> perhaps you could add my name next time you are there.
>
> Jeff
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Saturday, March 08, 2008 1:42 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code
>
> Note, the deadline for project proposals is March 12.
>
> I put an item up for us at:
> http://wiki.apache.org/general/SummerOfCode2008
>    I think it is probably general enough to cover all of the bases
> discussed here.  Please feel free to add your name to the list of
> mentors if you can.  Perhaps we can share duties.
>
> -Grant
>
>
>
> On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote:
>
>> On Friday 07 March 2008, Grant Ingersoll wrote:
>>> Sounds good.  I should also note that all mentoring should (barring
>>> personal conversation) take place on the dev list.  That is, decisions
>>> and discussions on what to do should happen on the list so that we all
>>> benefit from the understanding.  Not that you were suggesting otherwise!
>>
>> Sure, after all, GSoC is about integrating students into free software
>> projects - and making decisions offline is certainly not the way Apache
>> projects work. Thanks for pointing that out.
>>
>> Isabel
>

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






RE: Google Summer of Code

Posted by Jeff Eastman <je...@collab.net>.
I'd be willing to contribute. The page shows as immutable to me, so
perhaps you could add my name next time you are there.

Jeff

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Saturday, March 08, 2008 1:42 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Google Summer of Code

Note, the deadline for project proposals is March 12.

I put an item up for us at:
http://wiki.apache.org/general/SummerOfCode2008 
    I think it is probably general enough to cover all of the bases  
discussed here.  Please feel free to add your name to the list of  
mentors if you can.  Perhaps we can share duties.

-Grant



On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote:

> On Friday 07 March 2008, Grant Ingersoll wrote:
>> Sounds good.  I should also note that all mentoring should (barring
>> personal conversation) take place on the dev list.  That is, decisions
>> and discussions on what to do should happen on the list so that we all
>> benefit from the understanding.  Not that you were suggesting otherwise!
>
> Sure, after all, GSoC is about integrating students into free software
> projects - and making decisions offline is certainly not the way Apache
> projects work. Thanks for pointing that out.
>
> Isabel


Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
Note, the deadline for project proposals is March 12.

I put an item up for us at: http://wiki.apache.org/general/SummerOfCode2008 
    I think it is probably general enough to cover all of the bases  
discussed here.  Please feel free to add your name to the list of  
mentors if you can.  Perhaps we can share duties.

-Grant



On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote:

> On Friday 07 March 2008, Grant Ingersoll wrote:
>> Sounds good.  I should also note that all mentoring should (barring
>> personal conversation) take place on the dev list.  That is, decisions
>> and discussions on what to do should happen on the list so that we all
>> benefit from the understanding.  Not that you were suggesting otherwise!
>
> Sure, after all, GSoC is about integrating students into free software
> projects - and making decisions offline is certainly not the way Apache
> projects work. Thanks for pointing that out.
>
> Isabel


Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Friday 07 March 2008, Grant Ingersoll wrote:
> Sounds good.  I should also note that all mentoring should (barring
> personal conversation) take place on the dev list.  That is, decisions
> and discussions on what to do should happen on the list so that we all
> benefit from the understanding.  Not that you were suggesting otherwise!

Sure, after all, GSoC is about integrating students into free software 
projects - and making decisions offline is certainly not the way Apache 
projects work. Thanks for pointing that out.

Isabel


-- 
Never pay a compliment as if expecting a receipt.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 7, 2008, at 3:08 AM, Isabel Drost wrote:

> On Thursday 06 March 2008, Grant Ingersoll wrote:
>> I think we can split the duties a bit, too.
>
> I think the Apache FAQ also said that - in accordance with the usual
> Apache way of doing things - it would be OK for the GSoC students to
> receive help from all community members. So the actual time spent per
> mentor could very well drop to about 3h per week.
>
> Still I would not rely on that when accepting the duty to become a  
> mentor -
> after all, at least officially it is the mentor who is responsible for
> encouraging the student.

Sounds good.  I should also note that all mentoring should (barring 
personal conversation) take place on the dev list.  That is, decisions 
and discussions on what to do should happen on the list so that we all 
benefit from the understanding.  Not that you were suggesting otherwise!

-Grant


Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Thursday 06 March 2008, Grant Ingersoll wrote:
> I think we can split the duties a bit, too. 

I think the Apache FAQ also said that - in accordance with the usual Apache way 
of doing things - it would be OK for the GSoC students to receive help from 
all community members. So the actual time spent per mentor could very 
well drop to about 3h per week.

Still I would not rely on that when accepting the duty to become a mentor - 
after all, at least officially it is the mentor who is responsible for 
encouraging the student.

Isabel



-- 
The bug stops here.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
I think we can split the duties a bit, too.  Simon, can you share your 
experience, since you did a GSoC project for Lucene a few years back?  I seem 
to recall there being a couple of Lucene mentors.


On Mar 6, 2008, at 1:47 AM, Isabel Drost wrote:

> Finally found a general answer to my question:
>
> | While the answer to this question will vary widely depending on  
> the number
> | of students a mentor works with, the difficulty of the proposals  
> and the
> | skill level of the students, most mentors have let us know that they
> | underestimated the amount of time they would need to invest in  
> GSoC. Five
> | hours per student per week is a reasonable estimate.
>
> Sounds like doing it part time after work is going to be a bit tough  
> - but
> five hours per student per week in summer should be doable, at least  
> for me.
>
> As these are summer projects and some of us are certainly going to be on
> vacation during that time, we should plan for each mentor to have a backup
> who goes on vacation at a different time.
>
>
> Isabel
>
>
> -- 
> You were s'posed to laugh!
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
> |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>



Re: Google Summer of Code

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> five hours per student per week in summer should be doable, at least for me.

I'm out, unfortunately -- my academic job already comes with a bunch of students 
to take care of. It's fun, but it's time-consuming.

D.

Re: Google Summer of Code

Posted by Isabel Drost <ma...@isabel-drost.de>.
Finally found a general answer to my question:

| While the answer to this question will vary widely depending on the number
| of students a mentor works with, the difficulty of the proposals and the
| skill level of the students, most mentors have let us know that they
| underestimated the amount of time they would need to invest in GSoC. Five
| hours per student per week is a reasonable estimate.    

Sounds like doing it part time after work is going to be a bit tough - but 
five hours per student per week in summer should be doable, at least for me.

As these are summer projects and some of us are certainly going to be on 
vacation during that time, we should plan for each mentor to have a backup 
who goes on vacation at a different time.


Isabel


-- 
You were s'posed to laugh!
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Ian Holsman <li...@holsman.net>.
Hi Grant.
I'll be happy to mentor someone for this project.

regards
Ian
>>
>>
>> | A person or group responsible for review and ranking of student
>> | applications,
>>
>> I'd be happy to help out here. Anyone else?
>
> Cool
>


Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 6, 2008, at 1:32 AM, Isabel Drost wrote:

> On Wednesday 05 March 2008, Simon Willnauer wrote:
>> You could have a look at the FAQ or the GSoC pages
>> http://code.google.com/soc/2008/ and
>> http://code.google.com/soc/2008/faqs.html respectively.
>
> Hmm, there is little about what mentors are expected to do apart from the
> following rather general question, is there?
>
> | 2. What is the role of a mentoring organization?
>
> If we want to take part in GSoC, judging from that question, I guess we
> need a little more than just mentors:
>
> | A pool of project ideas for students to choose from.
>
> Grant already asked for ideas.
>
> | An organization administrator to act as the project's main point  
> of contact
> | for Google;
>
> Any volunteers?

I think this is covered by the ASF.

>
>
> | A person or group responsible for review and ranking of student
> | applications,
>
> I'd be happy to help out here. Anyone else?

Cool

Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Wednesday 05 March 2008, Simon Willnauer wrote:
> You could have a look at the FAQ or the GSoC pages
> http://code.google.com/soc/2008/ and
> http://code.google.com/soc/2008/faqs.html respectively.

Hmm, there is little about what mentors are expected to do apart from the 
following rather general question, is there?

| 2. What is the role of a mentoring organization?

If we want to take part in GSoC, judging from that question, I guess we need a 
little more than just mentors:

| A pool of project ideas for students to choose from.

Grant already asked for ideas.

| An organization administrator to act as the project's main point of contact
| for Google;  

Any volunteers?

| A person or group responsible for review and ranking of student
| applications,

I'd be happy to help out here. Anyone else?


| A person or group of people responsible for monitoring the progress of each
| accepted student and to mentor her/him as the project progresses;  + backup

That would be the mentors Grant already mentioned.


| A written evaluation of each student participant, including how s/he worked
| with the group, whether s/he should be invited back should we do another
| Google Summer of Code, etc.   

I guess this could be done by each member but should be reviewed by more than 
one person, as it looks like the evaluations are going to be highly 
subjective.


> Or join the #gsoc IRC channel on freenode.

Sorry, but as I work on Mahout only after work, I usually do not have the time 
to follow IRC channels :(

Is there anyone here who already took part in GSoC and could give us a little 
summary of their experiences? Is it possible to do the mentoring job in one's 
free time after work, or should one plan for more time than that?

Isabel


-- 
"Remember kids, if there's a loaded gun in the room, be sure that you're the 
one holding it"		-- Captain Combat
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Simon Willnauer <si...@googlemail.com>.
On Wed, Mar 5, 2008 at 8:52 PM, Isabel Drost
<ap...@isabel-drost.de> wrote:
> On Saturday 01 March 2008, Grant Ingersoll wrote:
>  > > Any of the other committers willing to mentor?
>
>  Could you please clarify - or point to a page that does so - what it
>  means to become a mentor? Does anyone have any experience being a mentor?
>  I would be happy to help - but I would rather learn a bit more about the
>  mentor side of GSoC.

You could have a look at the FAQ or the GSoC pages
http://code.google.com/soc/2008/ and
http://code.google.com/soc/2008/faqs.html respectively.
Or join the #gsoc IRC channel on freenode.

You could also contact some of the Google folks; they are very helpful
if you have questions beyond the FAQ (watch out for "lh" in the IRC
channel).

best regards,

simon
>
>
>  > Also, any thoughts on what we might want someone to do?  I think it
>  > would be great to have someone implement one of the algorithms on our
>  > wiki.
>
>  I think just implementing one of the algorithms might help Mahout but it might
>  be a bit hard to attract students to do that without some real task at hand.
>
>  What about putting up tasks that solve problems, e.g. from this year's KDD
>  Cup or the Web Spam Challenge? Then the benefit for participants would be
>  twofold - first they would help Mahout, and second they could compete with
>  others in the field.
>
>  Isabel
>
>  --
>  A man's best friend is his dogma.
>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>   /,`.-'`'    -.  ;-;;,_
>   |,4-  ) )-,_..;\ (  `'-'
>  '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>

Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Any of the other committers willing to mentor? 

Could you please clarify - or point to a page that does so - what it means to 
become a mentor? Does anyone have any experience being a mentor? I would be 
happy to help - but I would rather learn a bit more about the mentor side 
of GSoC.

> Also, any thoughts on what we might want someone to do?  I think it
> would be great to have someone implement one of the algorithms on our
> wiki.

I think just implementing one of the algorithms might help Mahout, but it might 
be a bit hard to attract students to do that without some real task at hand.

What about putting up tasks that solve problems, e.g. from this year's KDD Cup 
or the Web Spam Challenge? Then the benefit for participants would be twofold - 
first they would help Mahout, and second they could compete with others 
in the field.

Isabel

-- 
A man's best friend is his dogma.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
I think the Mentoring Org is already set up.  After March 3, mentors 
can register.  See http://wiki.apache.org/general/SummerOfCode2008.   
I'm willing to mentor, but would like to share the load a bit too.

-Grant


On Mar 6, 2008, at 1:56 AM, Isabel Drost wrote:

> On Saturday 01 March 2008, Grant Ingersoll wrote:
>> Also, any thoughts on what we might want someone to do?  I think it
>> would be great to have someone implement one of the algorithms on our
>> wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon PDT/ 
> 19:00 UTC).
>
> I suppose we should identify interesting tasks before that deadline. As a
> general guideline for mentors and for project proposals:
>
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
>
> -- 
> Better late than never.		-- Titus Livius (Livy)
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
> |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>



Re: Google Summer of Code[esp. More Clustering]

Posted by Matthew Riley <mr...@gmail.com>.
Hey Grant-

I believe scaling Mean-Shift clustering using M/R will be pretty
straightforward. I'm not as sure about K-Means using KD-Trees, since I
haven't personally implemented that algorithm, but since it follows K-Means
fairly closely I imagine it is possible.

I'll get to work on a proposal with some of my ideas, and hopefully get some
feedback from you guys during the process.

Thanks for all the responses so far.

Matt

On Thu, Mar 6, 2008 at 3:25 PM, Grant Ingersoll <gs...@apache.org> wrote:

> I haven't read the papers, but the big question is do you think they
> can scale using M/R or some other distributed techniques?
>
> If so, feel free to write up a bit of a proposal using the info at:
> http://wiki.apache.org/general/SummerOfCode2008
>   If you are unsure, that is fine too.  We could start with a simpler
> implementation, and then look to distribute it.
>
>
> On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:
>
> > Hey Jeff-
> >
> > I'm certainly willing to put some energy into developing
> > implementations of
> > these algorithms, and it's good to hear that you may be interested in
> > guiding us in the right direction.
> >
> > Here are the references I learned the algorithms from- some are more
> > detailed than others:
> >
> > Mean-Shift clustering was introduced here and this paper is a thorough
> > reference:
> > Mean-Shift: A Robust Approach to Feature Space Analysis
> > http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
> >
> > And here's a PDF with just the guts of the algorithm outlined:
> > homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
> >
> > It looks like there isn't a definitive reference for the k-means
> > approximation with randomized k-d trees, but there are promising
> > results
> > introduced here:
> >
> > Object retrieval with large vocabularies and fast spatial matching:
> > http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
> > And a deeper explanation of the technique here:
> >
> > Randomized KD-Trees for Real-Time Keypoint Detection:
> > ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
> >
> > Let me know what you think.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
> > wrote:
> >
> >> Hi Matthew,
> >>
> >> As with most open source projects, "interest" is mainly a function of
> >> the willingness of somebody to contribute their energy. Clustering is
> >> certainly within the scope of the project. I'd be interested in
> >> exploring additional clustering algorithms with you and your
> >> colleague.
> >> I'm a complete noob in this area and it is always enlightening to
> >> work
> >> with students who have more current theoretical exposures.
> >>
> >> Do you have some links on these approaches that you find particularly
> >> helpful?
> >>
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: Matthew Riley [mailto:mriley@gmail.com]
> >> Sent: Wednesday, March 05, 2008 11:11 PM
> >> To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> >> Subject: Re: Google Summer of Code
> >>
> >> Hey everyone-
> >>
> >> I've been watching the mailing list for a little while now, hoping to
> >> contribute once I became more familiar, but I wanted to jump in
> >> here now
> >> and
> >> express my interest in the Summer of Code project. I'm currently a
> >> graduate
> >> student in electrical engineering at UT-Austin working in computer
> >> vision,
> >> which is closely tied to many of the problems Mahout is addressing
> >> (especially in my area of content-based retrieval).
> >>
> >> What can I do to help out?
> >>
> >> I've discussed some potential Mahout projects with another student
> >> recently-
> >> mostly focused around approximate k-means algorithms (since that's a
> >> problem
> >> I've been working on lately). It sounds like you guys are already
> >> implementing canopy clustering for k-means- Is there any interest in
> >> developing another approximation algorithm based on randomized kd-
> >> trees
> >> for
> >> high dimensional data? What about mean-shift clustering?
> >>
> >> Again, I would be glad to help in any way I can.
> >>
> >> Matt
> >>
> >> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <mainec@isabel-
> >> drost.de>
> >> wrote:
> >>
> >>> On Saturday 01 March 2008, Grant Ingersoll wrote:
> >>>> Also, any thoughts on what we might want someone to do?  I think it
> >>>> would be great to have someone implement one of the algorithms on
> >> our
> >>>> wiki.
> >>>
> >>> Just as a general note, the deadline for applications:
> >>>
> >>> March 12: Mentoring organization application deadline (12 noon
> >> PDT/19:00
> >>> UTC).
> >>>
> >>> I suppose we should identify interesting tasks before that deadline. As a
> >>> general guideline for mentors and for project proposals:
> >>>
> >>> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >>>
> >>> Isabel
> >>>
> >>> --
> >>> Better late than never.         -- Titus Livius (Livy)
> >>>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >>> /,`.-'`'    -.  ;-;;,_
> >>> |,4-  ) )-,_..;\ (  `'-'
> >>> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> >>>
> >>
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>

Re: Google Summer of Code[esp. More Clustering]

Posted by Grant Ingersoll <gs...@apache.org>.
I haven't read the papers, but the big question is do you think they  
can scale using M/R or some other distributed techniques?

If so, feel free to write up a bit of a proposal using the info at: http://wiki.apache.org/general/SummerOfCode2008 
   If you are unsure, that is fine too.  We could start with a simpler  
implementation, and then look to distribute it.


On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:

> Hey Jeff-
>
> I'm certainly willing to put some energy into developing  
> implementations of
> these algorithms, and it's good to hear that you may be interested in
> guiding us in the right direction.
>
> Here are the references I learned the algorithms from- some are more
> detailed than others:
>
> Mean-Shift clustering was introduced here and this paper is a thorough
> reference:
> Mean-Shift: A Robust Approach to Feature Space Analysis
> http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
>
> And here's a PDF with just the guts of the algorithm outlined:
> homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>
> It looks like there isn't a definitive reference for the k-means
> approximation with randomized k-d trees, but there are promising  
> results
> introduced here:
>
> Object retrieval with large vocabularies and fast spatial matching:
> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
> And a deeper explanation of the technique here:
>
> Randomized KD-Trees for Real-Time Keypoint Detection:
> ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
>
> Let me know what you think.
>
> Matt
>
> On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>  
> wrote:
>
>> Hi Matthew,
>>
>> As with most open source projects, "interest" is mainly a function of
>> the willingness of somebody to contribute their energy. Clustering is
>> certainly within the scope of the project. I'd be interested in
>> exploring additional clustering algorithms with you and your  
>> colleague.
>> I'm a complete noob in this area and it is always enlightening to  
>> work
>> with students who have more current theoretical exposures.
>>
>> Do you have some links on these approaches that you find particularly
>> helpful?
>>
>> Jeff
>>
>> -----Original Message-----
>> From: Matthew Riley [mailto:mriley@gmail.com]
>> Sent: Wednesday, March 05, 2008 11:11 PM
>> To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
>> Subject: Re: Google Summer of Code
>>
>> Hey everyone-
>>
>> I've been watching the mailing list for a little while now, hoping to
>> contribute once I became more familiar, but I wanted to jump in  
>> here now
>> and
>> express my interest in the Summer of Code project. I'm currently a
>> graduate
>> student in electrical engineering at UT-Austin working in computer
>> vision,
>> which is closely tied to many of the problems Mahout is addressing
>> (especially in my area of content-based retrieval).
>>
>> What can I do to help out?
>>
>> I've discussed some potential Mahout projects with another student
>> recently-
>> mostly focused around approximate k-means algorithms (since that's a
>> problem
>> I've been working on lately). It sounds like you guys are already
>> implementing canopy clustering for k-means- Is there any interest in
>> developing another approximation algorithm based on randomized kd- 
>> trees
>> for
>> high dimensional data? What about mean-shift clustering?
>>
>> Again, I would be glad to help in any way I can.
>>
>> Matt
>>
>> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <mainec@isabel- 
>> drost.de>
>> wrote:
>>
>>> On Saturday 01 March 2008, Grant Ingersoll wrote:
>>>> Also, any thoughts on what we might want someone to do?  I think it
>>>> would be great to have someone implement one of the algorithms on
>> our
>>>> wiki.
>>>
>>> Just as a general note, the deadline for applications:
>>>
>>> March 12: Mentoring organization application deadline (12 noon
>> PDT/19:00
>>> UTC).
>>>
>>> I suppose we should identify interesting tasks before that deadline. As a
>>> general guideline for mentors and for project proposals:
>>>
>>> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>>>
>>> Isabel
>>>
>>> --
>>> Better late than never.         -- Titus Livius (Livy)
>>>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>>> /,`.-'`'    -.  ;-;;,_
>>> |,4-  ) )-,_..;\ (  `'-'
>>> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: Google Summer of Code[esp. More Clustering]

Posted by Ted Dunning <td...@veoh.com>.
Shooting somewhat in the dark here, but isn't this susceptible to the
mean-field techniques that are used for variational integration of Bayesian
probabilistic techniques (except that you are starting with a field
representation rather than introducing it)?  I can't say what the details
are, but this does seem plausible.

Also, wouldn't a sampling of the points give a good approximation of the
field?  If so, then building an n-centroid approximation of the field should
be possible for a sampled subset.  Then given that approximation, you can
perform the mean shift very much in parallel at which point you can repeat
the approximation and be ready to roll for the next shift.  Moreover, since
the fields are additive, you can compute the field on all of the sub-sample
due to a sub-sub-sample at each of the map nodes and then combine them in a
reduce step.  That would allow you to simplify what is likely the most
expensive step of the approximation.

The k-means pre-step approach is also likely to be helpful, at least in
relatively low dimensional problems.  In high dimensional spaces, however,
everything outside a cluster is approximately equidistant so using the
k-means approximation for the field simply reduces to a very expensive form
of k-means clustering, or a variant on mean-shift with a useless k-means
add-on.
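
To make the additivity point concrete, here is a rough standalone Java sketch
of the idea - my own illustration, not Mahout code, and the Gaussian kernel,
the plain double[] points and the FieldSketch name are all assumptions. Each
map node would compute the partial field induced by its chunk of the sample,
and a reduce step would simply add the partial fields together:

import java.util.List;

public class FieldSketch {

  // Gaussian kernel contribution of one sample point at a query point.
  static double kernel(double[] query, double[] sample, double h) {
    double d2 = 0.0;
    for (int i = 0; i < query.length; i++) {
      double diff = query[i] - sample[i];
      d2 += diff * diff;
    }
    return Math.exp(-d2 / (2.0 * h * h));
  }

  // "Map" side: the field at each query point due to one partition of the sample.
  static double[] partialField(double[][] queries, List<double[]> partition, double h) {
    double[] field = new double[queries.length];
    for (int q = 0; q < queries.length; q++) {
      for (double[] p : partition) {
        field[q] += kernel(queries[q], p, h);
      }
    }
    return field;
  }

  // "Reduce" side: the field is additive, so partial fields just sum up.
  static double[] combine(List<double[]> partials) {
    double[] total = new double[partials.get(0).length];
    for (double[] partial : partials) {
      for (int i = 0; i < total.length; i++) {
        total[i] += partial[i];
      }
    }
    return total;
  }
}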


On 3/10/08 5:57 PM, "Matthew Riley" <mr...@gmail.com> wrote:

> Hi Jeff-
> 
> I think your "basin of attraction" understanding is right on. I also like
> your ideas for distributing the mean-shift iterations by following a
> canopy-style method. My intuition was a little different, and I would like
> to hear your ideas on it:
> 
> Just to make sure we're on the same page....
> Say we have 1 million points in our original dataset, and we want to cluster
> by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
> points from the original dataset and follow the gradient of those points to
> the region of highest density (and as we saw from the paper, rather than
> calculate the gradient itself we can equivalently compute the weighted
> average of our subsampled points and move the centroid to that point). This
> part seems fairly straightforward to distribute - we just send a different
> subsampled set to each processor and each processor returns the final
> centroid for that set.
> 
> The problem I see is that 10,000 points (or whatever value we choose), may
> be too much for a single processor if we have to compute the distance to
> every single point when we compute the weighted mean. My thought here was to
> exploit the fact that we're using a kernel function (gaussian, uniform,
> etc.) in the weighted mean calculation and that kernel will have a set
> radius. Because the radius is static, it may be easy to (quickly) identify
> the points that we must consider in the calculation (i.e. those within the
> radius) by using a locality sensitive hashing scheme, tuned to that
> particular radius. Of course, the degree of advantage we get from this
> method will depend on the data itself, but intuitively I think we will
> usually see a dramatic improvement.
> 
> Honestly, I should do more background work developing this idea, and
> possibly try a matlab implementation to test the feasibility. This sounds
> more like a research paper than something we should dive into immediately,
> but I wanted to share the idea and get some feedback if anyone has
> thoughts...
> 
> Matt
> 
> 
> On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <je...@collab.net> wrote:
> 
>> Hi Matthew,
>> 
>> I've been looking over the mean-shift papers for the last several days.
>> While the details of the math are still sinking in, it looks like the
>> basic algorithm might be summarized thusly:
>> 
>> Points in an n-d feature space are migrated iteratively in the direction
>> of maxima in their local density functions. Points within a "basin of
>> attraction" all converge to the same maxima and thus belong to the same
>> cluster.
>> 
>> A physical analogy might be(?):
>> 
>> Gas particles in 3-space, operating with gravitational attraction but
>> without momentum, would tend to cluster similarly.
>> 
>> The algorithm seems to require that each point be compared with every
>> other point. This might be taken to require each mapper to see all of
>> the points, thus frustrating scalability. OTOH, Canopy clustering avoids
>> this by clustering the clusters produced by the subsets of points seen
>> by each mapper. K-means has the requirement that each point needs to be
>> compared with all of the cluster centers, not points. It has a similar
>> iterative structure over clusters (a much smaller, constant number) that
>> might be employed.
>> 
>> There is a lot of locality in the local density function window, and
>> this could perhaps be exploited. If points could be pre-clustered (as
>> canopy is often used to prime the k-means iterations), parallelization
>> might be feasible.
>> 
>> Are these observations within a "basin of attraction" to your
>> understanding of mean-shift <grin>?
>> 
>> Jeff
>> 
>> 
>> 
>> -----Original Message-----
>> From: Matthew Riley [mailto:mriley@gmail.com]
>> Sent: Thursday, March 06, 2008 11:46 AM
>> To: mahout-dev@lucene.apache.org
>> Subject: Re: Google Summer of Code[esp. More Clustering]
>> 
>> Hey Jeff-
>> 
>> I'm certainly willing to put some energy into developing implementations
>> of
>> these algorithms, and it's good to hear that you may be interested in
>> guiding us in the right direction.
>> 
>> Here are the references I learned the algorithms from- some are more
>> detailed than others:
>> 
>> Mean-Shift clustering was introduced here and this paper is a thorough
>> reference:
>> Mean-Shift: A Robust Approach to Feature Space Analysis
>> http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
>> 
>> And here's a PDF with just guts of the algorithm outlined:
>> homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>> 
>> It looks like there isn't a definitive reference for the k-means
>> approximation with randomized k-d trees, but there are promising results
>> introduced here:
>> 
>> Object retrieval with large vocabularies and fast spatial matching:
>> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
>> And a deeper explanation of the technique here:
>> 
>> Randomized KD-Trees for Real-Time Keypoint Detection:
>> ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
>> 
>> Let me know what you think.
>> 
>> Matt
>> 
>> On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
>> wrote:
>> 
>>> Hi Matthew,
>>> 
>>> As with most open source projects, "interest" is mainly a function of
>>> the willingness of somebody to contribute their energy. Clustering is
>>> certainly within the scope of the project. I'd be interested in
>>> exploring additional clustering algorithms with you and your
>> colleague.
>>> I'm a complete noob in this area and it is always enlightening to work
>>> with students who have more current theoretical exposures.
>>> 
>>> Do you have some links on these approaches that you find particularly
>>> helpful?
>>> 
>>> Jeff
>>> 
>>> -----Original Message-----
>>> From: Matthew Riley [mailto:mriley@gmail.com]
>>> Sent: Wednesday, March 05, 2008 11:11 PM
>>> To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
>>> Subject: Re: Google Summer of Code
>>> 
>>> Hey everyone-
>>> 
>>> I've been watching the mailing list for a little while now, hoping to
>>> contribute once I became more familiar, but I wanted to jump in here
>> now
>>> and
>>> express my interest in the Summer of Code project. I'm currently a
>>> graduate
>>> student in electrical engineering at UT-Austin working in computer
>>> vision,
>>> which is closely tied to many of the problems Mahout is addressing
>>> (especially in my area of content-based retrieval).
>>> 
>>> What can I do to help out?
>>> 
>>> I've discussed some potential Mahout projects with another student
>>> recently-
>>> mostly focused around approximate k-means algorithms (since that's a
>>> problem
>>> I've been working on lately). It sounds like you guys are already
>>> implementing canopy clustering for k-means- Is there any interest in
>>> developing another approximation algorithm based on randomized
>> kd-trees
>>> for
>>> high dimensional data? What about mean-shift clustering?
>>> 
>>> Again, I would be glad to help in any way I can.
>>> 
>>> Matt
>>> 
>>> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
>>> wrote:
>>> 
>>>> On Saturday 01 March 2008, Grant Ingersoll wrote:
>>>>> Also, any thoughts on what we might want someone to do?  I think
>> it
>>>>> would be great to have someone implement one of the algorithms on
>>> our
>>>>> wiki.
>>>> 
>>>> Just as a general note, the deadline for applications:
>>>> 
>>>> March 12: Mentoring organization application deadline (12 noon
>>> PDT/19:00
>>>> UTC).
>>>> 
>>>> I suppose we should identify interesting tasks before that deadline. As a
>>>> general guideline for mentors and for project proposals:
>>>> 
>>>> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>>>> 
>>>> Isabel
>>>> 
>>>> --
>>>> Better late than never.         -- Titus Livius (Livy)
>>>>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>>>>  /,`.-'`'    -.  ;-;;,_
>>>>  |,4-  ) )-,_..;\ (  `'-'
>>>> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>>>> 
>>> 
>> 


RE: Google Summer of Code[esp. More Clustering]

Posted by Jeff Eastman <je...@windwardsolutions.com>.
Hi Matthew,

I've implemented a minimal, non-MR version of the algorithm below to see how
it would behave. The operative code is in TestMeanShift.testMeanShift() and
MeanShiftCanopy.mergeCanopy(). The rest of the MR classes are stuff I copied
from Canopy so you can ignore them.

The TestMeanShift.setUp() method builds a 100x3 point matrix that represents
a 10x10 image with the diagonal intensified (i.e. a '\' character mask).
Then testMeanShift() creates an initial set of 100 canopies from it and
iterates over the canopies merging their centroids into a new set of
canopies until the canopy list size does not change any more. Finally it
prints out the canopies that were found for each cell in the original image.

Every time two canopies come within T2 distance of each other they merge,
reducing the number of canopies. The original points that were bound to each
canopy are also merged so that, at the end of the iteration, the original
points are available in their respective canopies.

Depending upon the values chosen for T1 and T2, the process either converges
quickly or slowly. The loop terminates before actual convergence is
achieved, but it does seem to cluster the input coherently.

I hesitate to call this MeanShift but it is something similar that follows
the same general algorithm, as I understand it at least. I hope you find it
interesting.
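
In case anyone wants to play with the idea without applying the patch, the
merge loop boils down to something like the following standalone sketch. The
plain double[] centroids, the Euclidean distance and the class and method
names are my own simplifications rather than the actual MeanShiftCanopy code,
and the bookkeeping of the bound points is omitted:

import java.util.ArrayList;
import java.util.List;

public class MeanShiftSketch {

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Merge a centroid into the canopy list: if it falls within t2 of an existing
  // canopy, average it in; otherwise it starts a new canopy.
  static void mergeCanopy(double[] centroid, List<double[]> canopies, double t2) {
    for (double[] c : canopies) {
      if (distance(centroid, c) < t2) {
        for (int i = 0; i < c.length; i++) {
          c[i] = (c[i] + centroid[i]) / 2.0;
        }
        return;
      }
    }
    canopies.add(centroid.clone());
  }

  // Start with one canopy per point and iterate, merging centroids into a new
  // canopy list, until the list size no longer changes.
  static List<double[]> cluster(List<double[]> points, double t2) {
    List<double[]> canopies = new ArrayList<>();
    for (double[] p : points) {
      canopies.add(p.clone());
    }
    int previousSize;
    do {
      previousSize = canopies.size();
      List<double[]> next = new ArrayList<>();
      for (double[] c : canopies) {
        mergeCanopy(c, next, t2);
      }
      canopies = next;
    } while (canopies.size() != previousSize);
    return canopies;
  }
}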

Jeff


> -----Original Message-----
> From: Jeff Eastman [mailto:jeff@windwardsolutions.com]
> Sent: Monday, March 10, 2008 9:09 PM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Google Summer of Code[esp. More Clustering]
> 
> Hi Matthew,
> 
> I'd like to pursue that canopy thought a little further and mix it in with
> your sub sampling idea. Optimizing can come later, once we figure out how
> to
> do mean-shift in M/R at all. How about this?
> 
> 1. Each mean-shift iteration consists of a canopy clustering of all the
> points, with T1 set to the desired sampling resolution (h?) and T2 set to
> 1.
> This will create one canopy centered on each point in the input set which
> contains all of its neighbors that are close enough to influence its next
> position in its trajectory (the window?).
> 
> 2. We then calculate the centroid of each canopy (that's actually done
> already by the canopy cluster reducer). Is this centroid not also the
> weighted average you desire for the next location of the point at its
> center?
> 
> 3. As the computation proceeds, the canopies will collapse together as
> their
> various centroids move inside the T2=1 radius. At the point when all
> points
> have converged, the remaining canopies will be the mean-shift clusters
> (modes?) of the dataset and their contents will be the migrated points in
> each cluster.
> 
> 4. If each original point is duplicated as its own payload, then the
> iterations will produce clusters of migrated points whose payloads are the
> final contents of each cluster.
> 
> Can you wrap your mind around this enough to validate my assumptions?
> 
> Jeff
> 
> > -----Original Message-----
> > From: Matthew Riley [mailto:mriley@gmail.com]
> > Sent: Monday, March 10, 2008 5:58 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: Google Summer of Code[esp. More Clustering]
> >
> > Hi Jeff-
> >
> > I think your "basin of attraction" understanding is right on. I also
> like
> > your ideas for distributing the mean-shift iterations by following a
> > canopy-style method. My intuition was a little different, and I would
> like
> > to hear your ideas on it:
> >
> > Just to make sure we're on the same page....
> > Say we have 1 million points in our original dataset, and we want to
> > cluster
> > by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
> > points from the original dataset and follow the gradient of those points
> > to
> > the region of highest density (and as we saw from the paper, rather than
> > calculate the gradient itself we can equivalently compute the weighted
> > average of our subsampled points and move the centroid to that point).
> > This
> > part seems fairly straightforward to distribute - we just send a
> different
> > subsampled set to each processor and each processor returns the final
> > centroid for that set.
> >
> > The problem I see is that 10,000 points (or whatever value we choose),
> may
> > be too much for a single processor if we have to compute the distance to
> > every single point when we compute the weighted mean. My thought here
> was
> > to
> > exploit the fact that we're using a kernel function (gaussian, uniform,
> > etc.) in the weighted mean calculation and that kernel will have a set
> > radius. Because the radius is static, it may be easy to (quickly)
> identify
> > the points that we must consider in the calculation (i.e. those within
> the
> > radius) by using a locality sensitive hashing scheme, tuned to that
> > particular radius. Of course, the degree of advantage we get from this
> > method will depend on the data itself, but intuitively I think we will
> > usually see a dramatic improvement.
> >
> > Honestly, I should do more background work developing this idea, and
> > possibly try a matlab implementation to test the feasibility. This
> sounds
> > more like a research paper than something we should dive into
> immediately,
> > but I wanted to share the idea and get some feedback if anyone has
> > thoughts...
> >
> > Matt
> >
> >
> > On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <je...@collab.net>
> > wrote:
> >
> > > Hi Matthew,
> > >
> > > I've been looking over the mean-shift papers for the last several
> days.
> > > While the details of the math are still sinking in, it looks like the
> > > basic algorithm might be summarized thusly:
> > >
> > > Points in an n-d feature space are migrated iteratively in the
> direction
> > > of maxima in their local density functions. Points within a "basin of
> > > attraction" all converge to the same maxima and thus belong to the
> same
> > > cluster.
> > >
> > > A physical analogy might be(?):
> > >
> > > Gas particles in 3-space, operating with gravitational attraction but
> > > without momentum, would tend to cluster similarly.
> > >
> > > The algorithm seems to require that each point be compared with every
> > > other point. This might be taken to require each mapper to see all of
> > > the points, thus frustrating scalability. OTOH, Canopy clustering
> avoids
> > > this by clustering the clusters produced by the subsets of points seen
> > > by each mapper. K-means has the requirement that each point needs to
> be
> > > compared with all of the cluster centers, not points. It has a similar
> > > iterative structure over clusters (a much smaller, constant number)
> that
> > > might be employed.
> > >
> > > There is a lot of locality in the local density function window, and
> > > this could perhaps be exploited. If points could be pre-clustered (as
> > > canopy is often used to prime the k-means iterations), parallelization
> > > might be feasible.
> > >
> > > Are these observations within a "basin of attraction" to your
> > > understanding of mean-shift <grin>?
> > >
> > > Jeff
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Matthew Riley [mailto:mriley@gmail.com]
> > > Sent: Thursday, March 06, 2008 11:46 AM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: Google Summer of Code[esp. More Clustering]
> > >
> > > Hey Jeff-
> > >
> > > I'm certainly willing to put some energy into developing
> implementations
> > > of
> > > these algorithms, and it's good to hear that you may be interested in
> > > guiding us in the right direction.
> > >
> > > Here are the references I learned the algorithms from- some are more
> > > detailed than others:
> > >
> > > Mean-Shift clustering was introduced here and this paper is a thorough
> > > reference:
> > > Mean-Shift: A Robust Approach to Feature Space Analysis
> > > http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
> > >
> > > And here's a PDF with just guts of the algorithm outlined:
> > > homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
> > >
> > > It looks like there isn't a definitive reference for the k-means
> > > approximation with randomized k-d trees, but there are promising
> results
> > > introduced here:
> > >
> > > Object retrieval with large vocabularies and fast spatial matching:
> > > http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
> > > And a deeper explanation of the technique here:
> > >
> > > Randomized KD-Trees for Real-Time Keypoint Detection:
> > > ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
> > >
> > > Let me know what you think.
> > >
> > > Matt
> > >
> > > On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
> > > wrote:
> > >
> > > > Hi Matthew,
> > > >
> > > > As with most open source projects, "interest" is mainly a function
> of
> > > > the willingness of somebody to contribute their energy. Clustering
> is
> > > > certainly within the scope of the project. I'd be interested in
> > > > exploring additional clustering algorithms with you and your
> > > colleague.
> > > > I'm a complete noob in this area and it is always enlightening to
> work
> > > > with students who have more current theoretical exposures.
> > > >
> > > > Do you have some links on these approaches that you find
> particularly
> > > > helpful?
> > > >
> > > > Jeff
> > > >
> > > > -----Original Message-----
> > > > From: Matthew Riley [mailto:mriley@gmail.com]
> > > > Sent: Wednesday, March 05, 2008 11:11 PM
> > > > To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> > > > Subject: Re: Google Summer of Code
> > > >
> > > > Hey everyone-
> > > >
> > > > I've been watching the mailing list for a little while now, hoping
> to
> > > > contribute once I became more familiar, but I wanted to jump in here
> > > now
> > > > and
> > > > express my interest in the Summer of Code project. I'm currently a
> > > > graduate
> > > > student in electrical engineering at UT-Austin working in computer
> > > > vision,
> > > > which is closely tied to many of the problems Mahout is addressing
> > > > (especially in my area of content-based retrieval).
> > > >
> > > > What can I do to help out?
> > > >
> > > > I've discussed some potential Mahout projects with another student
> > > > recently-
> > > > mostly focused around approximate k-means algorithms (since that's a
> > > > problem
> > > > I've been working on lately). It sounds like you guys are already
> > > > implementing canopy clustering for k-means- Is there any interest in
> > > > developing another approximation algorithm based on randomized
> > > kd-trees
> > > > for
> > > > high dimensional data? What about mean-shift clustering?
> > > >
> > > > Again, I would be glad to help in any way I can.
> > > >
> > > > Matt
> > > >
> > > > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <mainec@isabel-
> drost.de>
> > > > wrote:
> > > >
> > > > > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > > > > Also, any thoughts on what we might want someone to do?  I think
> > > it
> > > > > > would be great to have someone implement one of the algorithms
> on
> > > > our
> > > > > > wiki.
> > > > >
> > > > > Just as a general note, the deadline for applications:
> > > > >
> > > > > March 12: Mentoring organization application deadline (12 noon
> > > > PDT/19:00
> > > > > UTC).
> > > > >
> > > > > I suppose we should identify interesting tasks before that deadline. As a
> > > > > general guideline for mentors and for project proposals:
> > > > >
> > > > > http://code.google.com/p/google-summer-of-
> code/wiki/AdviceforMentors
> > > > >
> > > > > Isabel
> > > > >
> > > > > --
> > > > > Better late than never.         -- Titus Livius (Livy)
> > > > >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> > > > >  /,`.-'`'    -.  ;-;;,_
> > > > >  |,4-  ) )-,_..;\ (  `'-'
> > > > > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> > > > >
> > > >
> > >
> 


RE: Google Summer of Code[esp. More Clustering]

Posted by Jeff Eastman <je...@windwardsolutions.com>.
Hi Matthew,

I'd like to pursue that canopy thought a little further and mix it in with
your subsampling idea. Optimizing can come later, once we figure out how to
do mean-shift in M/R at all. How about this?

1. Each mean-shift iteration consists of a canopy clustering of all the
points, with T1 set to the desired sampling resolution (h?) and T2 set to 1.
This will create one canopy centered on each point in the input set which
contains all of its neighbors that are close enough to influence its next
position in its trajectory (the window?).

2. We then calculate the centroid of each canopy (that's actually done
already by the canopy cluster reducer). Is this centroid not also the
weighted average you desire for the next location of the point at its
center?

3. As the computation proceeds, the canopies will collapse together as their
various centroids move inside the T2=1 radius. At the point when all points
have converged, the remaining canopies will be the mean-shift clusters
(modes?) of the dataset and their contents will be the migrated points in
each cluster.

4. If each original point is duplicated as its own payload, then the
iterations will produce clusters of migrated points whose payloads are the
final contents of each cluster.
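
To make the assumptions concrete, steps 1-3 might look roughly like the
following single-machine sketch. The plain double[] points, the Euclidean
distance and the class name are assumptions of mine, the payload handling of
step 4 is left out, and this is of course not working M/R code - just the
shape of the computation each iteration would perform:

import java.util.ArrayList;
import java.util.List;

public class MeanShiftIterationSketch {

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Steps 1 and 2: each point's next position is the centroid of all neighbors
  // within t1 of it (the point itself included, so count is never zero).
  static List<double[]> shift(List<double[]> points, double t1) {
    List<double[]> shifted = new ArrayList<>();
    for (double[] p : points) {
      double[] centroid = new double[p.length];
      int count = 0;
      for (double[] q : points) {
        if (distance(p, q) <= t1) {
          for (int i = 0; i < p.length; i++) {
            centroid[i] += q[i];
          }
          count++;
        }
      }
      for (int i = 0; i < centroid.length; i++) {
        centroid[i] /= count;
      }
      shifted.add(centroid);
    }
    return shifted;
  }

  // Step 3: positions that have moved within t2 of one another collapse into
  // the same canopy; what remains after convergence are the cluster modes.
  static List<double[]> collapse(List<double[]> positions, double t2) {
    List<double[]> modes = new ArrayList<>();
    for (double[] p : positions) {
      boolean merged = false;
      for (double[] m : modes) {
        if (distance(p, m) < t2) {
          merged = true;
          break;
        }
      }
      if (!merged) {
        modes.add(p);
      }
    }
    return modes;
  }
}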

Can you wrap your mind around this enough to validate my assumptions? 

Jeff

> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Monday, March 10, 2008 5:58 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code[esp. More Clustering]
> 
> Hi Jeff-
> 
> I think your "basin of attraction" understanding is right on. I also like
> your ideas for distributing the mean-shift iterations by following a
> canopy-style method. My intuition was a little different, and I would like
> to hear your ideas on it:
> 
> Just to make sure we're on the same page....
> Say we have 1 million points in our original dataset, and we want to
> cluster
> by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
> points from the original dataset and follow the gradient of those points
> to
> the region of highest density (and as we saw from the paper, rather than
> calculate the gradient itself we can equivalently compute the weighted
> average of our subsampled points and move the centroid to that point).
> This
> part seems fairly straightforward to distribute - we just send a different
> subsampled set to each processor and each processor returns the final
> centroid for that set.
> 
> The problem I see is that 10,000 points (or whatever value we choose), may
> be too much for a single processor if we have to compute the distance to
> every single point when we compute the weighted mean. My thought here was
> to
> exploit the fact that we're using a kernel function (gaussian, uniform,
> etc.) in the weighted mean calculation and that kernel will have a set
> radius. Because the radius is static, it may be easy to (quickly) identify
> the points that we must consider in the calculation (i.e. those within the
> radius) by using a locality sensitive hashing scheme, tuned to that
> particular radius. Of course, the degree of advantage we get from this
> method will depend on the data itself, but intuitively I think we will
> usually see a dramatic improvement.
> 
> Honestly, I should do more background work developing this idea, and
> possibly try a matlab implementation to test the feasibility. This sounds
> more like a research paper than something we should dive into immediately,
> but I wanted to share the idea and get some feedback if anyone has
> thoughts...
> 
> Matt
> 
> 
> On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <je...@collab.net>
> wrote:
> 
> > Hi Matthew,
> >
> > I've been looking over the mean-shift papers for the last several days.
> > While the details of the math are still sinking in, it looks like the
> > basic algorithm might be summarized thusly:
> >
> > Points in an n-d feature space are migrated iteratively in the direction
> > of maxima in their local density functions. Points within a "basin of
> > attraction" all converge to the same maxima and thus belong to the same
> > cluster.
> >
> > A physical analogy might be(?):
> >
> > Gas particles in 3-space, operating with gravitational attraction but
> > without momentum, would tend to cluster similarly.
> >
> > The algorithm seems to require that each point be compared with every
> > other point. This might be taken to require each mapper to see all of
> > the points, thus frustrating scalability. OTOH, Canopy clustering avoids
> > this by clustering the clusters produced by the subsets of points seen
> > by each mapper. K-means has the requirement that each point needs to be
> > compared with all of the cluster centers, not points. It has a similar
> > iterative structure over clusters (a much smaller, constant number) that
> > might be employed.
> >
> > There is a lot of locality in the local density function window, and
> > this could perhaps be exploited. If points could be pre-clustered (as
> > canopy is often used to prime the k-means iterations), parallelization
> > might be feasible.
> >
> > Are these observations within a "basin of attraction" to your
> > understanding of mean-shift <grin>?
> >
> > Jeff
> >
> >
> >
> > -----Original Message-----
> > From: Matthew Riley [mailto:mriley@gmail.com]
> > Sent: Thursday, March 06, 2008 11:46 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: Google Summer of Code[esp. More Clustering]
> >
> > Hey Jeff-
> >
> > I'm certainly willing to put some energy into developing implementations
> > of
> > these algorithms, and it's good to hear that you may be interested in
> > guiding us in the right direction.
> >
> > Here are the references I learned the algorithms from- some are more
> > detailed than others:
> >
> > Mean-Shift clustering was introduced here and this paper is a thorough
> > reference:
> > Mean-Shift: A Robust Approach to Feature Space Analysis
> > http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
> >
> > And here's a PDF with just guts of the algorithm outlined:
> > homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
> >
> > It looks like there isn't a definitive reference for the k-means
> > approximation with randomized k-d trees, but there are promising results
> > introduced here:
> >
> > Object retrieval with large vocabularies and fast spatial matching:
> >
> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf*<http://
> www.robots.ox.ac.uk/%7Evgg/publications/papers/philbin07.pdf*>
> > *
> > And a deeper explanation of the technique here:
> >
> > Randomized KD-Trees for Real-Time Keypoint Detection:
> > ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
> >
> > Let me know what you think.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
> > wrote:
> >
> > > Hi Matthew,
> > >
> > > As with most open source projects, "interest" is mainly a function of
> > > the willingness of somebody to contribute their energy. Clustering is
> > > certainly within the scope of the project. I'd be interested in
> > > exploring additional clustering algorithms with you and your
> > colleague.
> > > I'm a complete noob in this area and it is always enlightening to work
> > > with students who have more current theoretical exposures.
> > >
> > > Do you have some links on these approaches that you find particularly
> > > helpful?
> > >
> > > Jeff
> > >
> > > -----Original Message-----
> > > From: Matthew Riley [mailto:mriley@gmail.com]
> > > Sent: Wednesday, March 05, 2008 11:11 PM
> > > To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> > > Subject: Re: Google Summer of Code
> > >
> > > Hey everyone-
> > >
> > > I've been watching the mailing list for a little while now, hoping to
> > > contribute once I became more familiar, but I wanted to jump in here
> > now
> > > and
> > > express my interest in the Summer of Code project. I'm currently a
> > > graduate
> > > student in electrical engineering at UT-Austin working in computer
> > > vision,
> > > which is closely tied to many of the problems Mahout is addressing
> > > (especially in my area of content-based retrieval).
> > >
> > > What can I do to help out?
> > >
> > > I've discussed some potential Mahout projects with another student
> > > recently-
> > > mostly focused around approximate k-means algorithms (since that's a
> > > problem
> > > I've been working on lately). It sounds like you guys are already
> > > implementing canopy clustering for k-means- Is there any interest in
> > > developing another approximation algorithm based on randomized
> > kd-trees
> > > for
> > > high dimensional data? What about mean-shift clustering?
> > >
> > > Again, I would be glad to help in any way I can.
> > >
> > > Matt
> > >
> > > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> > > wrote:
> > >
> > > > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > > > Also, any thoughts on what we might want someone to do?  I think
> > it
> > > > > would be great to have someone implement one of the algorithms on
> > > our
> > > > > wiki.
> > > >
> > > > Just as a general note, the deadline for applications:
> > > >
> > > > March 12: Mentoring organization application deadline (12 noon
> > > PDT/19:00
> > > > UTC).
> > > >
> > > > I suppose we should identify interesing tasks until that deadline.
> > As
> > > a
> > > > general guideline for mentors and for project proposals:
> > > >
> > > > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> > > >
> > > > Isabel
> > > >
> > > > --
> > > > Better late than never.         -- Titus Livius (Livy)
> > > >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> > > >  /,`.-'`'    -.  ;-;;,_
> > > >  |,4-  ) )-,_..;\ (  `'-'
> > > > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> > > >
> > >
> >



Re: Google Summer of Code[esp. More Clustering]

Posted by Matthew Riley <mr...@gmail.com>.
Hi Jeff-

I think your "basin of attraction" understanding is right on. I also like
your ideas for distributing the mean-shift iterations by following a
canopy-style method. My intuition was a little different, and I would like
to hear your ideas on it:

Just to make sure we're on the same page....
Say we have 1 million points in our original dataset, and we want to cluster
by mean-shift. At each iteration of mean-shift we subsample (say) 10,000
points from the original dataset and follow the gradient of those points to
the region of highest density (and as we saw from the paper, rather than
calculate the gradient itself we can equivalently compute the weighted
average of our subsampled points and move the centroid to that point). This
part seems fairly straightforward to distribute - we just send a different
subsampled set to each processor and each processor returns the final
centroid for that set.

The problem I see is that 10,000 points (or whatever value we choose) may
be too much for a single processor if we have to compute the distance to
every single point when we compute the weighted mean. My thought here was to
exploit the fact that we're using a kernel function (Gaussian, uniform,
etc.) in the weighted mean calculation and that kernel will have a set
radius. Because the radius is static, it may be easy to (quickly) identify
the points that we must consider in the calculation (i.e. those within the
radius) by using a locality-sensitive hashing scheme, tuned to that
particular radius. Of course, the degree of advantage we get from this
method will depend on the data itself, but intuitively I think we will
usually see a dramatic improvement.
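
To sketch the idea (all names below are made up, and a simple grid hash
stands in for a proper LSH scheme), the weighted-mean step only ever needs
to look at candidates pulled from a handful of hash buckets around the
current centroid:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RadiusLimitedShift {

  private final double radius;
  private final Map<Long, List<double[]>> grid = new HashMap<Long, List<double[]>>();

  public RadiusLimitedShift(double radius) {
    this.radius = radius;
  }

  // One cell per radius-sized square; 2-D only, just to show the idea.
  private Long cellKey(double x, double y) {
    long cx = (long) Math.floor(x / radius);
    long cy = (long) Math.floor(y / radius);
    return Long.valueOf(cx * 1000003L + cy);
  }

  public void add(double[] p) {
    Long key = cellKey(p[0], p[1]);
    List<double[]> cell = grid.get(key);
    if (cell == null) {
      cell = new ArrayList<double[]>();
      grid.put(key, cell);
    }
    cell.add(p);
  }

  // Candidates = points in the 3x3 block of cells around the query point.
  private List<double[]> candidates(double[] q) {
    List<double[]> result = new ArrayList<double[]>();
    for (int dx = -1; dx <= 1; dx++) {
      for (int dy = -1; dy <= 1; dy++) {
        List<double[]> cell = grid.get(cellKey(q[0] + dx * radius, q[1] + dy * radius));
        if (cell != null) {
          result.addAll(cell);
        }
      }
    }
    return result;
  }

  // One shift of a centroid: Gaussian-weighted mean over nearby points only.
  public double[] shift(double[] centroid) {
    double wx = 0, wy = 0, wsum = 0;
    for (double[] p : candidates(centroid)) {
      double dx = p[0] - centroid[0];
      double dy = p[1] - centroid[1];
      double d2 = dx * dx + dy * dy;
      if (d2 > radius * radius) {
        continue; // outside the kernel window, weight is effectively zero
      }
      double w = Math.exp(-d2 / (2 * radius * radius));
      wx += w * p[0];
      wy += w * p[1];
      wsum += w;
    }
    return wsum == 0 ? centroid : new double[] { wx / wsum, wy / wsum };
  }

  public static void main(String[] args) {
    RadiusLimitedShift index = new RadiusLimitedShift(2.0);
    index.add(new double[] { 1.0, 1.0 });
    index.add(new double[] { 1.4, 0.8 });
    index.add(new double[] { 9.0, 9.0 }); // far away, never touched below
    double[] moved = index.shift(new double[] { 1.2, 1.1 });
    System.out.println(moved[0] + ", " + moved[1]);
  }
}

A real LSH implementation would replace the grid with several randomized
hash tables, but the point is the same: the cost of one shift no longer
depends on the full dataset size.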

Honestly, I should do more background work developing this idea, and
possibly try a MATLAB implementation to test the feasibility. This sounds
more like a research paper than something we should dive into immediately,
but I wanted to share the idea and get some feedback if anyone has
thoughts...

Matt


On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman <je...@collab.net> wrote:

> Hi Matthew,
>
> I've been looking over the mean-shift papers for the last several days.
> While the details of the math are still sinking in, it looks like the
> basic algorithm might be summarized thusly:
>
> Points in an n-d feature space are migrated iteratively in the direction
> of maxima in their local density functions. Points within a "basin of
> attraction" all converge to the same maxima and thus belong to the same
> cluster.
>
> A physical analogy might be(?):
>
> Gas particles in 3-space, operating with gravitational attraction but
> without momentum, would tend to cluster similarly.
>
> The algorithm seems to require that each point be compared with every
> other point. This might be taken to require each mapper to see all of
> the points, thus frustrating scalability. OTOH, Canopy clustering avoids
> this by clustering the clusters produced by the subsets of points seen
> by each mapper. K-means has the requirement that each point needs to be
> compared with all of the cluster centers, not points. It has a similar
> iterative structure over clusters (a much smaller, constant number) that
> might be employed.
>
> There is a lot of locality in the local density function window, and
> this could perhaps be exploited. If points could be pre-clustered (as
> canopy is often used to prime the k-means iterations), parallelization
> might be feasible.
>
> Are these observations within a "basin of attraction" to your
> understanding of mean-shift <grin>?
>
> Jeff
>
>
>
> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Thursday, March 06, 2008 11:46 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Google Summer of Code[esp. More Clustering]
>
> Hey Jeff-
>
> I'm certainly willing to put some energy into developing implementations
> of
> these algorithms, and it's good to hear that you may be interested in
> guiding us in the right direction.
>
> Here are the references I learned the algorithms from- some are more
> detailed than others:
>
> Mean-Shift clustering was introduced here and this paper is a thorough
> reference:
> Mean-Shift: A Robust Approach to Feature Space Analysis
> http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
>
> And here's a PDF with just guts of the algorithm outlined:
> homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>
> It looks like there isn't a definitive reference for the k-means
> approximation with randomized k-d trees, but there are promising results
> introduced here:
>
> Object retrieval with large vocabularies and fast spatial matching:
> http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf*<http://www.robots.ox.ac.uk/%7Evgg/publications/papers/philbin07.pdf*>
> *
> And a deeper explanation of the technique here:
>
> Randomized KD-Trees for Real-Time Keypoint Detection:
> ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521
>
> Let me know what you think.
>
> Matt
>
> On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
> wrote:
>
> > Hi Matthew,
> >
> > As with most open source projects, "interest" is mainly a function of
> > the willingness of somebody to contribute their energy. Clustering is
> > certainly within the scope of the project. I'd be interested in
> > exploring additional clustering algorithms with you and your
> colleague.
> > I'm a complete noob in this area and it is always enlightening to work
> > with students who have more current theoretical exposures.
> >
> > Do you have some links on these approaches that you find particularly
> > helpful?
> >
> > Jeff
> >
> > -----Original Message-----
> > From: Matthew Riley [mailto:mriley@gmail.com]
> > Sent: Wednesday, March 05, 2008 11:11 PM
> > To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> > Subject: Re: Google Summer of Code
> >
> > Hey everyone-
> >
> > I've been watching the mailing list for a little while now, hoping to
> > contribute once I became more familiar, but I wanted to jump in here
> now
> > and
> > express my interest in the Summer of Code project. I'm currently a
> > graduate
> > student in electrical engineering at UT-Austin working in computer
> > vision,
> > which is closely tied to many of the problems Mahout is addressing
> > (especially in my area of content-based retrieval).
> >
> > What can I do to help out?
> >
> > I've discussed some potential Mahout projects with another student
> > recently-
> > mostly focused around approximate k-means algorithms (since that's a
> > problem
> > I've been working on lately). It sounds like you guys are already
> > implementing canopy clustering for k-means- Is there any interest in
> > developing another approximation algorithm based on randomized
> kd-trees
> > for
> > high dimensional data? What about mean-shift clustering?
> >
> > Again, I would be glad to help in any way I can.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> > wrote:
> >
> > > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > > Also, any thoughts on what we might want someone to do?  I think
> it
> > > > would be great to have someone implement one of the algorithms on
> > our
> > > > wiki.
> > >
> > > Just as a general note, the deadline for applications:
> > >
> > > March 12: Mentoring organization application deadline (12 noon
> > PDT/19:00
> > > UTC).
> > >
> > > I suppose we should identify interesing tasks until that deadline.
> As
> > a
> > > general guideline for mentors and for project proposals:
> > >
> > > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> > >
> > > Isabel
> > >
> > > --
> > > Better late than never.         -- Titus Livius (Livy)
> > >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> > >  /,`.-'`'    -.  ;-;;,_
> > >  |,4-  ) )-,_..;\ (  `'-'
> > > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> > >
> >
>

RE: Google Summer of Code[esp. More Clustering]

Posted by Jeff Eastman <je...@collab.net>.
Hi Matthew,

I've been looking over the mean-shift papers for the last several days.
While the details of the math are still sinking in, it looks like the
basic algorithm might be summarized thusly:

Points in an n-d feature space are migrated iteratively in the direction
of maxima in their local density functions. Points within a "basin of
attraction" all converge to the same maxima and thus belong to the same
cluster.

A physical analogy might be(?):

Gas particles in 3-space, operating with gravitational attraction but
without momentum, would tend to cluster similarly.

The algorithm seems to require that each point be compared with every
other point. This might be taken to require each mapper to see all of
the points, thus frustrating scalability. OTOH, Canopy clustering avoids
this by clustering the clusters produced by the subsets of points seen
by each mapper. K-means only requires that each point be compared with all
of the cluster centers rather than with all other points. It has a similar
iterative structure over clusters (a much smaller, constant number) that
might be employed.
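
To make that cost difference explicit (hypothetical code, nothing
Mahout-specific): the k-means assignment step below does k distance
computations per point, whereas a naive mean-shift step would do n.

public class KMeansAssignSketch {

  // Assign a point to its nearest center: k distance computations per point,
  // independent of the total number of points n.
  static int nearestCenter(double[] point, double[][] centers) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centers.length; c++) {
      double sum = 0;
      for (int d = 0; d < point.length; d++) {
        double diff = point[d] - centers[c][d];
        sum += diff * diff;
      }
      if (sum < bestDist) {
        bestDist = sum;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[][] centers = { { 0, 0 }, { 10, 10 } };
    System.out.println(nearestCenter(new double[] { 9, 8 }, centers)); // prints 1
  }
}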

There is a lot of locality in the local density function window, and
this could perhaps be exploited. If points could be pre-clustered (as
canopy is often used to prime the k-means iterations), parallelization
might be feasible.

Are these observations within a "basin of attraction" to your
understanding of mean-shift <grin>? 

Jeff

 

-----Original Message-----
From: Matthew Riley [mailto:mriley@gmail.com] 
Sent: Thursday, March 06, 2008 11:46 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Google Summer of Code[esp. More Clustering]

Hey Jeff-

I'm certainly willing to put some energy into developing implementations
of
these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from- some are more
detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf*
*
And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net>
wrote:

> Hi Matthew,
>
> As with most open source projects, "interest" is mainly a function of
> the willingness of somebody to contribute their energy. Clustering is
> certainly within the scope of the project. I'd be interested in
> exploring additional clustering algorithms with you and your
colleague.
> I'm a complete noob in this area and it is always enlightening to work
> with students who have more current theoretical exposures.
>
> Do you have some links on these approaches that you find particularly
> helpful?
>
> Jeff
>
> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Wednesday, March 05, 2008 11:11 PM
> To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> Subject: Re: Google Summer of Code
>
> Hey everyone-
>
> I've been watching the mailing list for a little while now, hoping to
> contribute once I became more familiar, but I wanted to jump in here
now
> and
> express my interest in the Summer of Code project. I'm currently a
> graduate
> student in electrical engineering at UT-Austin working in computer
> vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).
>
> What can I do to help out?
>
> I've discussed some potential Mahout projects with another student
> recently-
> mostly focused around approximate k-means algorithms (since that's a
> problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means- Is there any interest in
> developing another approximation algorithm based on randomized
kd-trees
> for
> high dimensional data? What about mean-shift clustering?
>
> Again, I would be glad to help in any way I can.
>
> Matt
>
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> wrote:
>
> > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > Also, any thoughts on what we might want someone to do?  I think
it
> > > would be great to have someone implement one of the algorithms on
> our
> > > wiki.
> >
> > Just as a general note, the deadline for applications:
> >
> > March 12: Mentoring organization application deadline (12 noon
> PDT/19:00
> > UTC).
> >
> > I suppose we should identify interesing tasks until that deadline.
As
> a
> > general guideline for mentors and for project proposals:
> >
> > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >
> > Isabel
> >
> > --
> > Better late than never.         -- Titus Livius (Livy)
> >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >  /,`.-'`'    -.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> >
>

Re: Google Summer of Code[esp. More Clustering]

Posted by Matthew Riley <mr...@gmail.com>.
Hey Jeff-

I'm certainly willing to put some energy into developing implementations of
these algorithms, and it's good to hear that you may be interested in
guiding us in the right direction.

Here are the references I learned the algorithms from- some are more
detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough
reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just the guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means
approximation with randomized k-d trees, but there are promising results
introduced here:

Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
And a deeper explanation of the technique here:

Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman <je...@collab.net> wrote:

> Hi Matthew,
>
> As with most open source projects, "interest" is mainly a function of
> the willingness of somebody to contribute their energy. Clustering is
> certainly within the scope of the project. I'd be interested in
> exploring additional clustering algorithms with you and your colleague.
> I'm a complete noob in this area and it is always enlightening to work
> with students who have more current theoretical exposures.
>
> Do you have some links on these approaches that you find particularly
> helpful?
>
> Jeff
>
> -----Original Message-----
> From: Matthew Riley [mailto:mriley@gmail.com]
> Sent: Wednesday, March 05, 2008 11:11 PM
> To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
> Subject: Re: Google Summer of Code
>
> Hey everyone-
>
> I've been watching the mailing list for a little while now, hoping to
> contribute once I became more familiar, but I wanted to jump in here now
> and
> express my interest in the Summer of Code project. I'm currently a
> graduate
> student in electrical engineering at UT-Austin working in computer
> vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).
>
> What can I do to help out?
>
> I've discussed some potential Mahout projects with another student
> recently-
> mostly focused around approximate k-means algorithms (since that's a
> problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means- Is there any interest in
> developing another approximation algorithm based on randomized kd-trees
> for
> high dimensional data? What about mean-shift clustering?
>
> Again, I would be glad to help in any way I can.
>
> Matt
>
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> wrote:
>
> > On Saturday 01 March 2008, Grant Ingersoll wrote:
> > > Also, any thoughts on what we might want someone to do?  I think it
> > > would be great to have someone implement one of the algorithms on
> our
> > > wiki.
> >
> > Just as a general note, the deadline for applications:
> >
> > March 12: Mentoring organization application deadline (12 noon
> PDT/19:00
> > UTC).
> >
> > I suppose we should identify interesing tasks until that deadline. As
> a
> > general guideline for mentors and for project proposals:
> >
> > http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >
> > Isabel
> >
> > --
> > Better late than never.         -- Titus Livius (Livy)
> >   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >  /,`.-'`'    -.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> >
>

RE: Google Summer of Code[esp. More Clustering]

Posted by Jeff Eastman <je...@collab.net>.
Hi Matthew,

As with most open source projects, "interest" is mainly a function of
the willingness of somebody to contribute their energy. Clustering is
certainly within the scope of the project. I'd be interested in
exploring additional clustering algorithms with you and your colleague.
I'm a complete noob in this area and it is always enlightening to work
with students who have more current theoretical exposures.

Do you have some links on these approaches that you find particularly
helpful?

Jeff

-----Original Message-----
From: Matthew Riley [mailto:mriley@gmail.com] 
Sent: Wednesday, March 05, 2008 11:11 PM
To: mahout-dev@lucene.apache.org; mainec@isabel-drost.de
Subject: Re: Google Summer of Code

Hey everyone-

I've been watching the mailing list for a little while now, hoping to
contribute once I became more familiar, but I wanted to jump in here now
and
express my interest in the Summer of Code project. I'm currently a
graduate
student in electrical engineering at UT-Austin working in computer
vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).

What can I do to help out?

I've discussed some potential Mahout projects with another student
recently-
mostly focused around approximate k-means algorithms (since that's a
problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means- Is there any interest in
developing another approximation algorithm based on randomized kd-trees
for
high dimensional data? What about mean-shift clustering?

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
wrote:

> On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Also, any thoughts on what we might want someone to do?  I think it
> > would be great to have someone implement one of the algorithms on
our
> > wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon
PDT/19:00
> UTC).
>
> I suppose we should identify interesing tasks until that deadline. As
a
> general guideline for mentors and for project proposals:
>
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
>
> --
> Better late than never.         -- Titus Livius (Livy)
>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>

Re: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Thursday 06 March 2008, Matthew Riley wrote:
> I would basically be interested in doing anything that fits in well with
> the overall goals of the Mahout project. Whether that is implementing well
> known algorithms within the Hadoop framework or working on some novel idea
> is up to the mentors, I presume. 

I would be happy with both options: Working on well-known algorithms within 
the Hadoop framework certainly is one of our main goals. But I personally am 
also interested in providing space for novel ideas. I consider it really 
important for researchers to not only publish the data they experimented on 
but also the implementation used. If working on the latter within Mahout 
helps to maybe focus a little more than usual on scalability and 
maintainability - great.

So if you have an idea that fits well with your day-to-day work as well as 
with the overall goals of Mahout, that would be fine. I would guess this 
makes it easier to find some spare time to work on the project ;)

Isabel 

-- 
Each new user of a new system uncovers a new class of bugs.		-- Kernighan
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 6, 2008, at 4:36 PM, Matthew Riley wrote:

> I would basically be interested in doing anything that fits in well  
> with the
> overall goals of the Mahout project. Whether that is implementing  
> well known
> algorithms within the Hadoop framework or working on some novel idea  
> is up
> to the mentors, I presume. Personally, if I'm going to be working on
> something novel, I would like to relate it to my current research  
> work...
> and I'm happy to discuss that with anyone on the list who is  
> interested.
>

Please do share your research work.  As for novel work versus work within
the goals, we like both.  I think at this stage, however, we do want to
focus on those approaches that have stood the test of time (as short  
as that is) to some extent.  Personally, I would love to see someone  
take on SVM on Hadoop, but I am open to pretty much anything, so...

-Grant

Re: Google Summer of Code

Posted by Matthew Riley <mr...@gmail.com>.
Hey Dawid,

Is it information retrieval from visual data you're working on? We have
> recently
> had a presentation about a guy who implemented motion detection on GPUs
> with
> very impressive speedups (orders of magnitude compared to normal CPUs).
> I'm
> wondering if your expertise here could be used to implement map-reduce
> distributed jobs for running multiple GPUs in parallel. I know this sounds
> a bit
> crazy, but I've heard of bio-engineering companies doing just that --
> running a
> cluster of GPUs to speed up their computations. Just a wild thought. Back
> to
> your proposal though.


Yes, it is basically information retrieval that I'm performing on sets of
images; in fact, a lot of the best algorithms employed today for object
detection, object retrieval, etc. are adaptations of basic text-retrieval
approaches (e.g. tf-idf-weighted vector space models). I've personally never
worked with GPUs for image processing, but I imagine the vector processing
abilities would be useful at almost every stage of the indexing and
retrieval processes. I would be interested in looking into those
possibilities in more detail.
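
To be concrete about what I mean by that (this is just a toy sketch with
made-up names, not anything Mahout has today): once image descriptors are
quantized into "visual words", the familiar tf-idf weighting and cosine
scoring apply exactly as they do for text.

import java.util.HashMap;
import java.util.Map;

public class TfIdfSketch {

  // Weight raw term (or visual-word) frequencies by inverse document frequency.
  static Map<String, Double> tfidf(Map<String, Integer> termFreqs,
                                   Map<String, Integer> docFreqs, int numDocs) {
    Map<String, Double> weights = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      Integer df = docFreqs.get(e.getKey());
      double idf = Math.log((double) numDocs / (df == null ? 1 : df.intValue()));
      weights.put(e.getKey(), e.getValue() * idf);
    }
    return weights;
  }

  // Cosine similarity between two sparse weighted vectors.
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
      normA += e.getValue() * e.getValue();
    }
    for (Double v : b.values()) {
      normB += v * v;
    }
    return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
  }

  public static void main(String[] args) {
    Map<String, Integer> doc = new HashMap<String, Integer>();
    doc.put("wheel", 3);
    doc.put("sky", 1);
    Map<String, Integer> dfs = new HashMap<String, Integer>();
    dfs.put("wheel", 2);
    dfs.put("sky", 9);
    Map<String, Double> w = tfidf(doc, dfs, 10);
    System.out.println(cosine(w, w)); // ~1.0 for identical vectors
  }
}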


> > mostly focused around approximate k-means algorithms (since that's a
> problem
> > I've been working on lately). It sounds like you guys are already
> > implementing canopy clustering for k-means- Is there any interest in
> > developing another approximation algorithm based on randomized kd-trees
> for
> > high dimensional data? What about mean-shift clustering?
>
>  From my experience the largest challenge in data clustering is not
> figuring out
> a new clustering methodology, but finding the right existing one to tackle
> a
> particular problem. Isabel mentioned web spam detection challenge --  this
> is a
> good example of a multi-feature classification problem and I know people
> have
> tried clustering the host graph to come up with more coarse-grained
> features for
> hosts. From my own interest, a very interesting challenge is doing
> something
> like Google News does (event aggregation). This is less trivial than you
> might
> think at first -- most news are very similar to each other (copy/paste and
> editing changes), so it's trivial to find small clusters of near-clones.
> Then
> the problem becomes more difficult because all news speak about pretty
> much the
> same people/ events (take presidential election in the U.S.). I think the
> problems you could state here are:
>
> 1) approximating optimal clustering granularity (call it the number of
> clusters
> if you wish, although I think clustering should be driven by other factors
> rather than just the number of clusters),
>
> 2) coming up with clusters of news items _other_ than keyword-based
> similarity.
> One example here is grouping news by region (geolocation), sentiment
> (positive/
> negative news), people-related news, etc.
>
> 3) multilingual news matching and clustering.
>
> All the above issues are on the border of different domains -- NLP,
> clustering,
> classification. The tricky part is being able to put them together. What
> would
> be of interest to you?


These are all interesting problems, actually. I've done some research into
sentiment analysis, as you mentioned in (2), and I think it's still a wide
open problem. Oren Etzioni at UWash does some interesting related work:
www.cs.washington.edu/homes/etzioni/.

I would basically be interested in doing anything that fits in well with the
overall goals of the Mahout project. Whether that is implementing well known
algorithms within the Hadoop framework or working on some novel idea is up
to the mentors, I presume. Personally, if I'm going to be working on
something novel, I would like to relate it to my current research work...
and I'm happy to discuss that with anyone on the list who is interested.

Matt


>
>
> D.
>
> >
> > Again, I would be glad to help in any way I can.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> > wrote:
> >
> >> On Saturday 01 March 2008, Grant Ingersoll wrote:
> >>> Also, any thoughts on what we might want someone to do?  I think it
> >>> would be great to have someone implement one of the algorithms on our
> >>> wiki.
> >> Just as a general note, the deadline for applications:
> >>
> >> March 12: Mentoring organization application deadline (12 noon
> PDT/19:00
> >> UTC).
> >>
> >> I suppose we should identify interesing tasks until that deadline. As a
> >> general guideline for mentors and for project proposals:
> >>
> >> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >>
> >> Isabel
> >>
> >> --
> >> Better late than never.         -- Titus Livius (Livy)
> >>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >>  /,`.-'`'    -.  ;-;;,_
> >>  |,4-  ) )-,_..;\ (  `'-'
> >> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
> >>
> >
>

Re: Google Summer of Code

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Matthew,

> student in electrical engineering at UT-Austin working in computer vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).

Is it information retrieval from visual data you're working on? We have recently 
had a presentation about a guy who implemented motion detection on GPUs with 
very impressive speedups (orders of magnitude compared to normal CPUs). I'm 
wondering if your expertise here could be used to implement map-reduce 
distributed jobs for running multiple GPUs in parallel. I know this sounds a bit 
crazy, but I've heard of bio-engineering companies doing just that -- running a 
cluster of GPUs to speed up their computations. Just a wild thought. Back to 
your proposal though.

> mostly focused around approximate k-means algorithms (since that's a problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means- Is there any interest in
> developing another approximation algorithm based on randomized kd-trees for
> high dimensional data? What about mean-shift clustering?

From my experience the largest challenge in data clustering is not figuring out 
a new clustering methodology, but finding the right existing one to tackle a 
particular problem. Isabel mentioned web spam detection challenge --  this is a 
good example of a multi-feature classification problem and I know people have 
tried clustering the host graph to come up with more coarse-grained features for 
hosts. From my own interest, a very interesting challenge is doing something 
like Google News does (event aggregation). This is less trivial than you might 
think at first -- most news are very similar to each other (copy/paste and 
editing changes), so it's trivial to find small clusters of near-clones. Then 
the problem becomes more difficult because all the news speaks about pretty much the 
same people/events (take the presidential election in the U.S.). I think the 
problems you could state here are:

1) approximating optimal clustering granularity (call it the number of clusters 
if you wish, although I think clustering should be driven by other factors 
rather than just the number of clusters),

2) coming up with clusters of news items _other_ than keyword-based similarity. 
One example here is grouping news by region (geolocation), sentiment (positive/ 
negative news), people-related news, etc.

3) multilingual news matching and clustering.

All the above issues are on the border of different domains -- NLP, clustering, 
classification. The tricky part is being able to put them together. What would 
be of interest to you?

D.

> 
> Again, I would be glad to help in any way I can.
> 
> Matt
> 
> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
> wrote:
> 
>> On Saturday 01 March 2008, Grant Ingersoll wrote:
>>> Also, any thoughts on what we might want someone to do?  I think it
>>> would be great to have someone implement one of the algorithms on our
>>> wiki.
>> Just as a general note, the deadline for applications:
>>
>> March 12: Mentoring organization application deadline (12 noon PDT/19:00
>> UTC).
>>
>> I suppose we should identify interesing tasks until that deadline. As a
>> general guideline for mentors and for project proposals:
>>
>> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>>
>> Isabel
>>
>> --
>> Better late than never.         -- Titus Livius (Livy)
>>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>>  /,`.-'`'    -.  ;-;;,_
>>  |,4-  ) )-,_..;\ (  `'-'
>> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>>
> 

Re: Google Summer of Code

Posted by Matthew Riley <mr...@gmail.com>.
Hey everyone-

I've been watching the mailing list for a little while now, hoping to
contribute once I became more familiar, but I wanted to jump in here now and
express my interest in the Summer of Code project. I'm currently a graduate
student in electrical engineering at UT-Austin working in computer vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).

What can I do to help out?

I've discussed some potential Mahout projects with another student recently-
mostly focused around approximate k-means algorithms (since that's a problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means- Is there any interest in
developing another approximation algorithm based on randomized kd-trees for
high dimensional data? What about mean-shift clustering?

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <ma...@isabel-drost.de>
wrote:

> On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Also, any thoughts on what we might want someone to do?  I think it
> > would be great to have someone implement one of the algorithms on our
> > wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon PDT/19:00
> UTC).
>
> I suppose we should identify interesing tasks until that deadline. As a
> general guideline for mentors and for project proposals:
>
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
>
> --
> Better late than never.         -- Titus Livius (Livy)
>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>

Re: Google Summer of Code

Posted by Isabel Drost <ma...@isabel-drost.de>.
On Saturday 01 March 2008, Grant Ingersoll wrote:
> Also, any thoughts on what we might want someone to do?  I think it
> would be great to have someone implement one of the algorithms on our
> wiki.

Just as a general note, the deadline for applications:

March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC).

I suppose we should identify interesting tasks before that deadline. As a 
general guideline for mentors and for project proposals:

http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

-- 
Better late than never.		-- Titus Livius (Livy)
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.
Also, any thoughts on what we might want someone to do?  I think it  
would be great to have someone implement one of the algorithms on our  
wiki.

-Grant

On Feb 29, 2008, at 9:33 PM, Grant Ingersoll wrote:

> Hi Gang,
>
> I think we should put in for this: http://wiki.apache.org/general/SummerOfCode2008
>
> I would be there are some students interested in doing ML on  
> Hadoop.  Any of the other committers willing to mentor?  I am, but  
> would also like some others to help out if you have the time.  See http://wiki.apache.org/general/SummerOfCodeMentor 
> .
>
>
> Thanks,
> Grant
>
>