Posted to user@mahout.apache.org by Gökhan Çapan <gk...@gmail.com> on 2010/01/13 10:33:20 UTC

Some newbie questions- Mahout clustering

Hi,
We run a local news aggregation (and news search engine) web site, which
shows news stories as clusters (a cluster is a group of news articles from
different news sites that are about the same, or sometimes just a very
similar, story).
For clustering the news from the last crawl (the news items themselves, not
search results), we use Carrot2, and it works pretty well.

However, we sometimes need to publish a summary of the week/month/year.

I am not experienced with clustering, and from what I have read about it on
this mailing list, I guess that applying kmeans to the data, after
intelligently selecting the initial clusters with canopy, will fulfill our
needs.
I have some questions on this topic:
-Could anyone who is experienced with clustering suggest the best way to
detect news stories? Does the method I mentioned above seem reasonable?
-Do I need to do some initial work before clustering? Should I partition the
data into daily groups before clustering, for example?
(Again, in our case, a news story is an aggregated view of similar (nearly
identical) stories from different sources.)

Finally, our search engine is built on Lucene/Solr. I've read on the Wiki
pages that our index can easily be converted to the Mahout vector format by
the Lucene driver.

-Are the documents about clustering jobs on the Wiki pages applicable to
"trunk"? If they are out of date, is there anywhere I can find
documentation for trunk?

Thanks.
-- 
Gökhan Çapan

Re: Some newbie questions- Mahout clustering

Posted by Ted Dunning <te...@gmail.com>.
I would suggest using the carrot2 daily clusters as seeds.
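
A minimal sketch of that idea, assuming each daily cluster is available as a
list of document vectors: take the centroid of every daily cluster and use
those centroids as the initial centers for a k-means run over the longer
period. This is plain Python for illustration, not Mahout or Carrot2 code;
the names `centroid`, `assign`, `kmeans`, `daily_clusters` and `week_docs`
and all the numbers are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def assign(points, centers):
    """Group points by the index of their nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        groups[nearest].append(p)
    return groups


def kmeans(points, seeds, max_iter=20):
    """Plain Lloyd's k-means that starts from caller-supplied seed centers."""
    centers = [list(s) for s in seeds]
    for _ in range(max_iter):
        groups = assign(points, centers)
        # An empty group keeps its previous center.
        updated = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


# Hypothetical daily clusters (for example, exported from the daily run),
# each given as a list of document vectors. The numbers are made up.
daily_clusters = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
seeds = [centroid(c) for c in daily_clusters]

# Documents of the longer period that should be re-clustered.
week_docs = [[1.0, 0.1], [0.8, 0.0], [0.0, 0.8], [0.2, 1.0]]
centers, groups = kmeans(week_docs, seeds)
```

Seeding this way fixes the number of clusters to the number of daily
clusters supplied, so stories that appear in no daily cluster would be
absorbed into the nearest existing one.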

On Wed, Jan 13, 2010 at 4:02 AM, Grant Ingersoll <gs...@apache.org> wrote:

> > I am not experienced about clustering, and from what I read about
> > clustering in this mailing list, I guess applying kmeans to data after
> > intelligently selecting initial clusters with canopy will fulfill our
> > needs.
>
> You can also try randomly selecting initial seeds or see some other threads
> about "kmeans++"




-- 
Ted Dunning, CTO
DeepDyve

Re: Some newbie questions- Mahout clustering

Posted by Gökhan Çapan <gk...@gmail.com>.
Thanks for the advice, Grant.



-- 
Gökhan Çapan

Re: Some newbie questions- Mahout clustering

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 13, 2010, at 4:33 AM, Gökhan Çapan wrote:

> Hi,
> We have a local news aggregation(and news search engine) web site, which
> show news stories within a cluster (a cluster of news articles from
> different news sites that are about the same(sometimes just very similar)
> story).
> For clustering the news of the last crawl(not results of search, news
> themselves), we use Carrot2, and it works pretty good.
> 
> However, we sometimes need to publish summary of the week/month/year.
> 
> I am not experienced about clustering, and from what I read about clustering
> in this mailing list, I guess applying kmeans to data after intelligently
> selecting initial clusters with canopy will fulfill our needs.

You can also try randomly selecting the initial seeds, or see some other threads about "kmeans++".
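
For reference, a minimal sketch of k-means++ style seeding in plain Python:
the first seed is picked uniformly at random, and each further seed is
sampled with probability proportional to its squared distance from the
nearest seed chosen so far. This only illustrates the technique and is not
Mahout code; the function name `kmeans_pp_seeds` and the sample points are
hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers using k-means++ (D^2-weighted sampling)."""
    if rng is None:
        rng = random.Random()
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its closest chosen seed.
        weights = [min(dist(p, s) ** 2 for s in seeds) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


# Made-up document vectors; a fixed seed keeps the run repeatable.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
seeds = kmeans_pp_seeds(points, 3, random.Random(0))
```

Points that are already seeds get weight zero, so with distinct input points
the same point is never picked twice.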

> I have some questions about topic:
> -Could anyone who is experienced about clustering stuff suggest me the
> rightest way to detect news stories? Does the method I mentioned above seem
> reasonable ?

I think that way sounds reasonable, although publishing a summary may be tricky, depending on your needs.  I would try out the various clustering algorithms and see which one gives you the best performance.  Getting labels for the cluster is one thing, but a summary may be a whole other case.

> -Do I need some initial work before clustering? Should I partition the data
> into daily groups before clustering, for example?

I would partition it by the length of time you want the results for (week/month/year).
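
A minimal sketch of such a partitioning step, assuming every document carries
a publication date: documents are bucketed by ISO week, month, or year, and
each bucket can then be clustered on its own. The helper names `period_key`
and `partition` and the sample documents are hypothetical.

```python
from collections import defaultdict
from datetime import date


def period_key(day, granularity):
    """Map a date to the key of the period it falls into."""
    if granularity == "week":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # (ISO year, ISO week number)
    if granularity == "month":
        return (day.year, day.month)
    if granularity == "year":
        return (day.year,)
    raise ValueError("granularity must be 'week', 'month' or 'year'")


def partition(docs, granularity):
    """Group (doc_id, publication_date) pairs by period."""
    buckets = defaultdict(list)
    for doc_id, day in docs:
        buckets[period_key(day, granularity)].append(doc_id)
    return dict(buckets)


# Made-up documents: (id, publication date).
docs = [
    ("a", date(2010, 1, 11)),
    ("b", date(2010, 1, 13)),
    ("c", date(2010, 1, 20)),
    ("d", date(2010, 2, 1)),
]
by_week = partition(docs, "week")
by_month = partition(docs, "month")
```

Using the ISO year together with the ISO week number keeps weeks that span a
year boundary in a single bucket.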

> (Again, in our case; a news story is an aggregated view of the
> similar(nearly same) stories from different sources.)
> 
> Finally, our search engine is built on Lucene/Solr. I've read our index may
> be easily converted to Mahout vector format by lucene driver on Wiki pages.

If you have stored term vectors for the documents in question, then yes.  Otherwise, no, you will not be able to.
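
The reason is that a stored term vector provides the per-document term counts
that a document vector is built from. As a rough illustration of that last
step only, here is a sketch that turns per-document term counts into sparse
TF-IDF vectors. It is plain Python, not the Mahout Lucene driver, and it does
not read a Lucene index; the function name `tfidf_vectors`, the weighting
formula, and the sample counts are assumptions.

```python
from math import log


def tfidf_vectors(term_freqs):
    """Turn {doc_id: {term: count}} into sparse {doc_id: {index: weight}}."""
    n_docs = len(term_freqs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = {}
    for counts in term_freqs.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Dictionary mapping every term to a fixed vector index.
    dictionary = {term: i for i, term in enumerate(sorted(doc_freq))}

    vectors = {}
    for doc_id, counts in term_freqs.items():
        vectors[doc_id] = {
            dictionary[term]: count * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return dictionary, vectors


# Made-up term counts, standing in for what stored term vectors would give.
term_freqs = {
    "d1": {"election": 2, "vote": 1},
    "d2": {"election": 1, "match": 3},
}
dictionary, vectors = tfidf_vectors(term_freqs)
```

With this weighting, a term that occurs in every document gets weight zero,
so it does not influence the distance between documents.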

> 
> -Are the documents about clustering jobs in Wiki pages  applicable with
> "trunk"? If they are out of date, is there anywhere that I can reach the
> documents about trunk?

I believe they are pretty stable at this point, but I haven't reviewed every last one.  Probably the best way to see the inputs to the Driver is to run the command with --help.

-Grant