Posted to dev@mahout.apache.org by Marko Novakovic <at...@yahoo.com> on 2008/03/27 00:02:30 UTC
kMeans
Is it a good idea to propose a project that integrates the kMeans
algorithm for clustering web pages?
____________________________________________________________________________________
Never miss a thing. Make Yahoo your home page.
http://www.yahoo.com/r/hs
Re: kMeans
Posted by Karl Wettin <ka...@gmail.com>.
Just as I sent my reply, I received all the other replies, which I had
not read yet :)
Re: kMeans
Posted by Karl Wettin <ka...@gmail.com>.
Marko Novakovic wrote:
> Is it a good idea to propose a project that integrates the kMeans
> algorithm for clustering web pages?
(Your question is better suited to the users forum than the dev forum.)
It depends on your needs, so you need to be more specific about how you
plan to use the results in order to get a good answer.
But choosing an algorithm to extract clusters is only half of your
problem. You need to transform the web pages to instance data accepted
by the clusterer. How much effort do you want to put into that?
karl
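To illustrate the transformation step mentioned above, here is a minimal sketch of turning raw pages into sparse term vectors that a clusterer could accept. It is not taken from Mahout or from this thread: the function names (page_to_terms, pages_to_vectors), the crude tokenizer, and the tf-idf style weighting are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_terms(html):
    """Strip markup and return lowercase alphanumeric tokens (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def pages_to_vectors(pages):
    """Turn {url: html} into {url: {term: weight}} sparse vectors.

    Assumed weighting: tf * (1 + log(n_docs / df)), so terms that occur in
    every page keep their raw count and rarer terms are boosted.
    """
    term_counts = {url: Counter(page_to_terms(html)) for url, html in pages.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(pages)
    return {
        url: {
            term: tf * (1.0 + math.log(n_docs / doc_freq[term]))
            for term, tf in counts.items()
        }
        for url, counts in term_counts.items()
    }
```

Real pages would need more care (stop words, stemming, boilerplate removal, encodings), which is where most of the effort asked about above tends to go.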
Index clustering (was: kMeans)
Posted by Karl Wettin <ka...@gmail.com>.
Khalil Honsali wrote:
> Hello,
Hi Khalil,
> Are there any relevant papers or other work on index clustering (not
> search-results clustering)? I wonder whether queries would be affected if
> the index were clustered and distributed somehow.
LUCENE-1025 is a hierarchical clusterer that I later refactored to
persist the tree in a BDB, so I could build a cluster of a complete index
that could come up with "more like this" suggestions in an instant. It
was sort of slow, but the results were not too bad. I never compared it
with anything else, though. It never became more than a proof of concept.
I'm looking at reimplementing this for Mahout, but I have a hard time
figuring out whether building the tree is something one wants to do (or
even can do) using map reduce. The more I think about it, the more I want
to solve it with a grid.
karl
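For readers unfamiliar with the idea, the following is a small in-memory sketch of a cluster tree used for "more like this" lookups. It is not the LUCENE-1025 code: the centroid-style merging, the nested-tuple tree, and the names build_tree and more_like_this are assumptions chosen for brevity, and nothing is persisted to a BDB here.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_tree(docs):
    """Agglomerative clustering over a non-empty {doc_id: vector} mapping.

    Leaves are doc ids; inner nodes are (left, right) tuples. Each step
    merges the two most similar clusters, comparing summed vectors.
    """
    clusters = [(doc_id, dict(vec)) for doc_id, vec in sorted(docs.items())]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cosine(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        (left, left_vec), (right, right_vec) = clusters[i], clusters[j]
        merged = dict(left_vec)
        for term, weight in right_vec.items():
            merged[term] = merged.get(term, 0.0) + weight
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(((left, right), merged))
    return clusters[0][0]


def leaves(tree):
    """All doc ids under a node, left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def more_like_this(tree, doc_id):
    """Other leaves of the smallest subtree that contains doc_id."""
    if not isinstance(tree, tuple):
        return []
    for child in tree:
        if doc_id in leaves(child):
            deeper = more_like_this(child, doc_id)
            return deeper if deeper else [d for d in leaves(tree) if d != doc_id]
    return []
```

Once the tree exists, a lookup only walks it, which matches the "in an instant" behaviour described above. Building it this naive way compares every pair of clusters at every step, so it only suits tiny collections.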
Re: kMeans
Posted by Marko Novakovic <at...@yahoo.com>.
Try looking through IEEE Computer from 2007. It covers a
lot of phenomena related to indexing results. Maybe you
could find a good reference there on
index clustering.
--- Khalil Honsali <k....@gmail.com> wrote:
> Hello,
>
> Are there any relevant papers or other work on index clustering (not
> search-results clustering)? I wonder whether queries would be affected if
> the index were clustered and distributed somehow.
>
> K. Honsali
>
> On 27/03/2008, Ted Dunning <td...@veoh.com> wrote:
> >
> > Kmeans can be used to cluster web-sites if you use a cosine measure of
> > similarity based on content.
> >
> > You can also use the first few eigenvectors of the linkage graph to do
> > spectral clustering (this will essentially be a strongly connected
> > component analysis).
> >
> > Using browse logs can also give you clusters if you look at common viewing
> > of pages during particular sessions. This should mostly replicate the
> > linkage graph analysis.
> >
> > On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
> >
> > > Is it a good idea to propose a project that integrates the kMeans
> > > algorithm for clustering web pages?
Re: kMeans
Posted by Khalil Honsali <k....@gmail.com>.
Hello,
Are there any relevant papers or other work on index clustering (not
search-results clustering)? I wonder whether queries would be affected if
the index were clustered and distributed somehow.
K. Honsali
On 27/03/2008, Ted Dunning <td...@veoh.com> wrote:
>
>
> Kmeans can be used to cluster web-sites if you use a cosine measure of
> similarity based on content.
>
> You can also use the first few eigenvectors of the linkage graph to do
> spectral clustering (this will essentially be a strongly connected
> component
> analysis).
>
> Using browse logs can also give you clusters if you look at common viewing
> of pages during particular sessions. This should mostly replicate the
> linkage graph analysis.
>
>
>
> On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
>
> > Is it a good idea to propose a project that integrates the kMeans
> > algorithm for clustering web pages?
Re: kMeans
Posted by Marko Novakovic <at...@yahoo.com>.
OK, thanks for the information.
I wrote an application on this topic.
I want to know whether this topic is acceptable for Google
Summer of Code.
--- Ted Dunning <td...@veoh.com> wrote:
>
> Kmeans can be used to cluster web-sites if you use a cosine measure of
> similarity based on content.
>
> You can also use the first few eigenvectors of the linkage graph to do
> spectral clustering (this will essentially be a strongly connected component
> analysis).
>
> Using browse logs can also give you clusters if you look at common viewing
> of pages during particular sessions. This should mostly replicate the
> linkage graph analysis.
>
> On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
>
> > Is it a good idea to propose a project that integrates the kMeans
> > algorithm for clustering web pages?
Re: kMeans
Posted by Ted Dunning <td...@veoh.com>.
Kmeans can be used to cluster web-sites if you use a cosine measure of
similarity based on content.
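As a rough illustration of k-means with a cosine measure, here is a sketch that works on sparse {term: weight} content vectors. It is not Mahout's implementation: the function names, the choice of the first k documents as seeds, and the fixed iteration cap are assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a sparse vector to unit length (empty if it has no weight)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine_kmeans(vectors, k, iterations=20):
    """Return one cluster id per input vector, using cosine similarity."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    unit = [normalize(v) for v in vectors]
    # Assumption: seed with the first k vectors; real code would choose
    # seeds more carefully (for example at random, with restarts).
    centroids = [dict(u) for u in unit[:k]]
    assignment = [-1] * len(unit)
    for _ in range(iterations):
        # Assignment step: each vector goes to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(u, centroids[c])) for u in unit
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: a centroid is the normalized sum of its members.
        for c in range(k):
            total = {}
            for u, a in zip(unit, assignment):
                if a == c:
                    for t, w in u.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if the cluster emptied out
                centroids[c] = normalize(total)
    return assignment
```

Normalizing every vector and every centroid to unit length is what makes the dot product equal to the cosine, so long and short pages on the same subject end up close together.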
You can also use the first few eigenvectors of the linkage graph to do
spectral clustering (this will essentially be a strongly connected component
analysis).
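The spectral idea can be sketched in its simplest form, a two-way split driven by a single eigenvector. This is a simplification of "the first few eigenvectors": it treats links as undirected, finds the second eigenvector of the random-walk matrix by power iteration, and splits nodes by sign. The function name and all of these choices are assumptions for illustration.

```python
import math


def spectral_bipartition(edges, iterations=200):
    """Split the nodes of an undirected link graph into two groups (0 and 1)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    if not nodes:
        return {}
    degree = {n: len(neighbors[n]) for n in nodes}
    total_degree = sum(degree.values())
    # Deterministic, non-constant start vector.
    x = {n: float(i) for i, n in enumerate(nodes)}
    for _ in range(iterations):
        # Remove the trivial constant eigenvector (degree-weighted mean).
        mean = sum(degree[n] * x[n] for n in nodes) / total_degree
        x = {n: x[n] - mean for n in nodes}
        # One step of the lazy random walk: half stay, half average neighbors.
        x = {
            n: 0.5 * x[n] + 0.5 * sum(x[m] for m in neighbors[n]) / degree[n]
            for n in nodes
        }
        norm = math.sqrt(sum(v * v for v in x.values()))
        if norm == 0.0:
            break
        x = {n: v / norm for n, v in x.items()}
    # The sign of each entry says which side of the cut the node is on.
    return {n: 0 if x[n] < 0 else 1 for n in nodes}
```

Using more eigenvectors and then running k-means on the resulting coordinates would give more than two clusters; at web scale the eigenvectors would come from a sparse solver rather than a Python loop.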
Using browse logs can also give you clusters if you look at common viewing
of pages during particular sessions. This should mostly replicate the
linkage graph analysis.
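A small sketch of the browse-log idea: count how often two pages are viewed in the same session, then group pages connected by frequently co-viewed pairs. The session format (a list of page lists), the min_sessions threshold, and the function names are assumptions for the example, not anything specified in this thread.

```python
from collections import Counter
from itertools import combinations


def coview_counts(sessions):
    """Count the sessions in which each unordered pair of pages was viewed."""
    counts = Counter()
    for pages in sessions:
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1
    return counts


def coview_clusters(sessions, min_sessions=2):
    """Group pages linked by pairs co-viewed in at least min_sessions sessions."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pages in sessions:
        for page in pages:
            find(page)  # register every page, even ones never paired
    for (a, b), n in coview_counts(sessions).items():
        if n >= min_sessions:
            parent[find(a)] = find(b)
    groups = {}
    for page in list(parent):
        groups.setdefault(find(page), set()).add(page)
    return sorted(sorted(group) for group in groups.values())
```

For example, with sessions [["/a", "/b"], ["/a", "/b", "/c"], ["/c", "/d"], ["/d", "/c"], ["/e"]] and the default threshold, "/a" and "/b" form one group, "/c" and "/d" another, and "/e" stays alone.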
On 3/26/08 4:02 PM, "Marko Novakovic" <at...@yahoo.com> wrote:
> Is it a good idea to propose a project that integrates the kMeans
> algorithm for clustering web pages?
Re: kMeans
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Marko,
> Is it an acceptable solution for Google Summer of Code?
I don't think it's an acceptable project for Mahout -- Mahout's goals lie in large
data set processing, supported by Map-Reduce. Clustering search results is
usually in-memory, on-line clustering with few information sources (titles,
snippets) and the resulting high noise.
That said, what I envisage could be done is to work on data structures that
could _support_ sensible on-line faceting/clustering of search results,
similarly to what Google supposedly does behind the scenes to reorder search
results (similar concept clustering). Building semantic relationships between
terms or detecting frequently recurring phrases with significantly different
meanings is definitely interesting and challenging (if not done naively),
especially at large scale.
Dawid
Re: kMeans
Posted by Marko Novakovic <at...@yahoo.com>.
Is it an acceptable solution for Google Summer of Code?
--- Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
> Carrot2 is for clustering web search results -- it's not exactly the same
> thing.
>
> D.
>
> shunkai.fu wrote:
> > There is one project called Carrot2 focusing on this topic already.
> >
> > -----Original Message-----
> > From: Marko Novakovic [mailto:atisha34@yahoo.com]
> > Sent: March 27, 2008 7:03
> > To: mahout-dev@lucene.apache.org
> > Subject: kMeans
> >
> > Is it a good idea to propose a project that integrates the kMeans
> > algorithm for clustering web pages?
Re: [?? Probable Spam] RE: kMeans
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Carrot2 is for clustering web search results -- it's not exactly the same thing.
D.
shunkai.fu wrote:
> There is one project called Carrot2 focusing on this topic already.
>
> -----Original Message-----
> From: Marko Novakovic [mailto:atisha34@yahoo.com]
> Sent: March 27, 2008 7:03
> To: mahout-dev@lucene.apache.org
> Subject: kMeans
>
> Is it a good idea to propose a project that integrates the kMeans
> algorithm for clustering web pages?
RE: kMeans
Posted by "shunkai.fu" <sh...@roboo.com>.
There is one project called Carrot2 focusing on this topic already.
-----Original Message-----
From: Marko Novakovic [mailto:atisha34@yahoo.com]
Sent: March 27, 2008 7:03
To: mahout-dev@lucene.apache.org
Subject: kMeans
Is it a good idea to propose a project that integrates the kMeans
algorithm for clustering web pages?