Posted to dev@mahout.apache.org by Marko Novakovic <at...@yahoo.com> on 2008/03/27 00:02:30 UTC

kMeans

Is it a good idea to propose a project that integrates the kMeans
algorithm for clustering web pages?


      ____________________________________________________________________________________
Never miss a thing.  Make Yahoo your home page. 
http://www.yahoo.com/r/hs

Re: kMeans

Posted by Karl Wettin <ka...@gmail.com>.

Just as I sent my reply, all the other replies arrived, which I 
had not read yet :)

Re: kMeans

Posted by Karl Wettin <ka...@gmail.com>.

Marko Novakovic wrote:
> Is it a good idea to propose a project that integrates the kMeans
> algorithm for clustering web pages?

(Your question is better suited to the users forum than to the dev forum.)

It depends on your needs, so you need to be more specific about how you 
plan to use the results in order to get a good answer.

But choosing an algorithm to extract clusters is only half of your 
problem. You need to transform the web pages to instance data accepted 
by the clusterer. How much effort do you want to put into that?
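
As a rough illustration of that transformation step, the sketch below turns one HTML page into a bag-of-words term count using only the Python standard library. The names `TextExtractor` and `page_to_term_counts`, and the choice of plain lowercase word counts, are assumptions made for this example and were not proposed in the thread; a real pipeline would probably add stop-word removal and TF-IDF weighting.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_to_term_counts(html):
    """Return a Counter mapping each lowercase word to its frequency in the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return Counter(words)
```

Each page then becomes one sparse vector (term -> count), which is the kind of instance data a clusterer can consume.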


     karl

Index clustering (was: kMeans)

Posted by Karl Wettin <ka...@gmail.com>.

Khalil Honsali wrote:
> Hello,

Hi Khalil,

> Are there any relevant papers/work about index clustering (not search-results
> clustering)? I wonder whether it will affect queries if the index is clustered
> and distributed somehow.

LUCENE-1025 is a hierarchical clusterer that I later refactored to 
persist the tree in a BDB so I could build a cluster of a complete index 
that could come up with "more like this" suggestions in an instant. It 
was sort of slow, but the results were not too bad. I never compared it 
with anything else, though. It never became more than a proof of concept.

I'm looking at reimplementing this for Mahout, but I have a hard time 
figuring out whether building the tree is something one wants to do (or 
even can do) using map reduce. The more I think about it, the more I want 
to solve it with a grid.



     karl

Re: kMeans

Posted by Marko Novakovic <at...@yahoo.com>.

Try looking at IEEE Computer from 2007. It covers
a lot of phenomena related to indexing results. Maybe you
could find a good reference there about
index clustering.

--- Khalil Honsali <k....@gmail.com> wrote:

> Hello,
>
> Are there any relevant papers/work about index clustering (not search-results
> clustering)? I wonder whether it will affect queries if the index is clustered
> and distributed somehow.
>
> K. Honsali
>
> On 27/03/2008, Ted Dunning <td...@veoh.com> wrote:
> >
> > Kmeans can be used to cluster web-sites if you use a cosine measure of
> > similarity based on content.
> >
> > You can also use the first few eigenvectors of the linkage graph to do
> > spectral clustering (this will essentially be a strongly connected
> > component analysis).
> >
> > Using browse logs can also give you clusters if you look at common viewing
> > of pages during particular sessions.  This should mostly replicate the
> > linkage graph analysis.
> >
> > On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
> >
> > > Is it a good idea to propose a project that integrates the kMeans
> > > algorithm for clustering web pages?




Re: kMeans

Posted by Khalil Honsali <k....@gmail.com>.

Hello,

Are there any relevant papers/work about index clustering (not search-results
clustering)? I wonder whether it will affect queries if the index is clustered
and distributed somehow.

K. Honsali

On 27/03/2008, Ted Dunning <td...@veoh.com> wrote:
>
>
> Kmeans can be used to cluster web-sites if you use a cosine measure of
> similarity based on content.
>
> You can also use the first few eigenvectors of the linkage graph to do
> spectral clustering (this will essentially be a strongly connected component
> analysis).
>
> Using browse logs can also give you clusters if you look at common viewing
> of pages during particular sessions.  This should mostly replicate the
> linkage graph analysis.
>
>
>
> On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
>
> > Is it a good idea to propose a project that integrates the kMeans
> > algorithm for clustering web pages?

Re: kMeans

Posted by Marko Novakovic <at...@yahoo.com>.

OK, thanks for the information.

I wrote an application on this topic.
I want to know whether this topic is acceptable for Google
Summer of Code.

--- Ted Dunning <td...@veoh.com> wrote:

>
> Kmeans can be used to cluster web-sites if you use a cosine measure of
> similarity based on content.
>
> You can also use the first few eigenvectors of the linkage graph to do
> spectral clustering (this will essentially be a strongly connected component
> analysis).
>
> Using browse logs can also give you clusters if you look at common viewing
> of pages during particular sessions.  This should mostly replicate the
> linkage graph analysis.
>
> On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
>
> > Is it a good idea to propose a project that integrates the kMeans
> > algorithm for clustering web pages?




Re: kMeans

Posted by Ted Dunning <td...@veoh.com>.

Kmeans can be used to cluster web-sites if you use a cosine measure of
similarity based on content.
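
A minimal sketch of what k-means with a cosine measure could look like, assuming each page has already been reduced to a dense term-count vector. The function names, the naive "first k points" initialisation, and the fixed iteration count are assumptions chosen to keep the example short; this is not Mahout's implementation.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=20):
    """Assign each vector to one of k clusters by maximum cosine similarity."""
    points = [normalize(v) for v in vectors]
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for every point.
        assignment = [
            max(range(k), key=lambda c: cosine(p, centroids[c])) for p in points
        ]
        # Update step: the new centroid is the normalized sum of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return assignment


# Term counts over a made-up vocabulary: [sports, football, java, code]
pages = [[3, 2, 0, 0], [0, 0, 4, 1], [2, 3, 0, 1], [0, 1, 3, 3]]
labels = spherical_kmeans(pages, k=2)
```

Normalizing both points and centroids is what makes the distance depend only on the direction of the term vector, so long and short pages about the same subject end up close together.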

You can also use the first few eigenvectors of the linkage graph to do
spectral clustering (this will essentially be a strongly connected component
analysis).
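
A small sketch of the eigenvector idea, under several assumptions: links are treated as undirected, only the second-smallest eigenvector of the graph Laplacian (the Fiedler vector) is used rather than "the first few", and it is found with plain power iteration so that no linear algebra library is needed. The function name and the toy graph are hypothetical.

```python
import math


def spectral_bipartition(n, edges, iterations=200):
    """Split an undirected graph in two using the sign of the Fiedler vector."""
    adjacency = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        adjacency[a][b] = adjacency[b][a] = 1.0
    degree = [sum(row) for row in adjacency]
    # All Laplacian eigenvalues are at most 2 * max degree, so (shift*I - L)
    # has non-negative eigenvalues and its largest ones are L's smallest.
    shift = 2.0 * max(degree)
    v = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Remove the constant component (the trivial eigenvector of L).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # w = (shift*I - L) v, with L = D - A
        w = [
            (shift - degree[i]) * v[i]
            + sum(adjacency[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [1 if x > 0 else 0 for x in v]


# Two tightly linked groups of pages joined by a single link (2 -- 3).
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups = spectral_bipartition(6, links)
```

On this toy graph the split falls on the single bridging link, which is the behaviour the connected-component remark above points at.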

Using browse logs can also give you clusters if you look at common viewing
of pages during particular sessions.  This should mostly replicate the
linkage graph analysis.
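
One way to read "common viewing of pages during particular sessions" is as a co-occurrence count per page pair. The sketch below assumes the browse log has already been split into sessions, each a list of page identifiers; the function name and the sample data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count, for every pair of pages, the number of sessions containing both."""
    counts = Counter()
    for pages in sessions:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1
    return counts


sessions = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
pairs = co_view_counts(sessions)
```

The resulting counts can serve as edge weights of a page-to-page similarity graph, which can then be clustered in the same way as the linkage graph.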

On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:

> Is it a good idea to propose a project that integrates the kMeans
> algorithm for clustering web pages?

Re: kMeans

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Hi Marko,

> Is it an acceptable solution for Google Summer of Code?

I don't think it's an acceptable project for Mahout -- Mahout's goals lie in large 
data set processing, supported by Map-Reduce. Clustering search results is 
usually in-memory, on-line clustering with few information sources (titles, 
snippets) and the resulting high noise.

That said, what I envisage could be done is to work on data structures that 
could _support_ sensible on-line faceting/clustering of search results, 
similarly to what Google supposedly does behind the scenes to reorder search 
results (similar concept clustering). Building semantic relationships between 
terms or detecting frequently recurring phrases with significantly different 
meanings is definitely interesting and challenging (if not done naively), 
especially at large scale.
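
To make the "frequently recurring phrases" part concrete, here is a deliberately naive sketch that counts in how many documents each word n-gram occurs. It does not attempt the harder part of telling different meanings apart, and the function name, tokenisation, and thresholds are assumptions for illustration only.

```python
import re
from collections import Counter


def recurring_phrases(documents, n=2, min_docs=2):
    """Return the n-word phrases that occur in at least min_docs documents."""
    doc_freq = Counter()
    for text in documents:
        words = re.findall(r"[a-z0-9]+", text.lower())
        # A set, so a phrase repeated inside one document counts only once.
        ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(ngrams)
    return {phrase: count for phrase, count in doc_freq.items() if count >= min_docs}


docs = ["Apple pie recipe", "apple pie with cream", "Apple computer news"]
phrases = recurring_phrases(docs)
```

At large scale the per-document n-gram extraction is independent for each document and the counting is a sum per phrase, which is why this kind of job maps naturally onto Map-Reduce.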

Dawid

Re: kMeans

Posted by Marko Novakovic <at...@yahoo.com>.

Is it an acceptable solution for Google Summer of Code?

--- Dawid Weiss <da...@cs.put.poznan.pl> wrote:

>
> Carrot2 is for clustering web search results -- it's not exactly the same
> thing.
>
> D.
>
> shunkai.fu wrote:
> > There is one project called Carrot2 focusing on this topic already.
> >
> > -----Original Message-----
> > From: Marko Novakovic [mailto:atisha34@yahoo.com]
> > Sent: March 27, 2008 7:03
> > To: mahout-dev@lucene.apache.org
> > Subject: kMeans
> >
> > Is it a good idea to propose a project that integrates the kMeans
> > algorithm for clustering web pages?




Re: [?? Probable Spam] Re: kMeans

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Carrot2 is for clustering web search results -- it's not exactly the same thing.

D.

shunkai.fu wrote:
> There is one project called Carrot2 focusing on this topic already.
> 
> -----Original Message-----
> From: Marko Novakovic [mailto:atisha34@yahoo.com]
> Sent: March 27, 2008 7:03
> To: mahout-dev@lucene.apache.org
> Subject: kMeans
>
> Is it a good idea to propose a project that integrates the kMeans
> algorithm for clustering web pages?

Re: kMeans

Posted by "shunkai.fu" <sh...@roboo.com>.

There is one project called Carrot2 focusing on this topic already.

-----Original Message-----
From: Marko Novakovic [mailto:atisha34@yahoo.com]
Sent: March 27, 2008 7:03
To: mahout-dev@lucene.apache.org
Subject: kMeans

Is it a good idea to propose a project that integrates the kMeans
algorithm for clustering web pages?