Posted to solr-user@lucene.apache.org by Jorge Luis Betancourt Gonzalez <jl...@uci.cu> on 2012/11/30 16:44:18 UTC

News clustering

Hi all:

I'm thinking of using Nutch combined with Solr to index some news sites on an intranet, and I was wondering how effective the clustering component would be for clustering the search results. Are there any success stories of using the Solr clustering component for news clustering? Is there any existing solution for clustering/classification at index time?

Greetings! 
10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: News clustering

Posted by Otis Gospodnetic <ot...@gmail.com>.
If you're talking about Carrot2 - I used it yeeears ago and it worked well
for clustering results against a giant blog search engine ... back when
Technorati was big.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 30, 2012 10:45 AM, "Jorge Luis Betancourt Gonzalez" <
jlbetancourt@uci.cu> wrote:

> Hi all:
>
> I'm thinking of using Nutch combined with Solr to index some news sites on
> an intranet, and I was wondering how effective the clustering component
> would be for clustering the search results. Are there any success stories
> of using the Solr clustering component for news clustering? Is there any
> existing solution for clustering/classification at index time?
>
> Greetings!

Re: News clustering

Posted by Iwan Hanjoyo <ih...@gmail.com>.
Hi Stanislaw Osinski,


On Mon, Dec 3, 2012 at 6:13 PM, Stanislaw Osinski <st...@osinski.name> wrote:

> One of our clients uses Solr's search results clustering for grouping news.
> Instead of the default Carrot2 algorithm that ships with Solr they use a
> commercial one, but Carrot2 should give you decent clusters too. Here's an
> example clustering result:
>
> http://imagebin.org/238001
>
> Staszek
>
> --
> Stanislaw Osinski
> http://carrotsearch.com
>
> On Fri, Nov 30, 2012 at 4:44 PM, Jorge Luis Betancourt Gonzalez <
> jlbetancourt@uci.cu> wrote:
>
> > Hi all:
> >
> > I'm thinking of using Nutch combined with Solr to index some news sites
> > on an intranet, and I was wondering how effective the clustering
> > component would be for clustering the search results. Are there any
> > success stories of using the Solr clustering component for news
> > clustering? Is there any existing solution for clustering/classification
> > at index time?
> >
> > Greetings!

Re: News clustering

Posted by Iwan Hanjoyo <ih...@gmail.com>.
Hi Stanislaw,

I see. Thank you for the reference.

Kind regards,

Hanjoyo

On Tue, Dec 4, 2012 at 12:37 AM, Stanislaw Osinski <st...@osinski.name> wrote:

> > I mean measuring the similarity between the documents within each cluster,
> > and also the difference between the documents in one cluster and those
> > in another.
> >
> > I saw the sample code ClusteringQualityBenchmark.java.
> > However, I do not know how to use it to assess my Solr clustering
> > performance.
> >
>
> You'd need to write your own code for this. Here are the most common
> clustering quality measures of the kind you mentioned:
>
> http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results
>
> These are meant for the general case (numeric attributes); to apply them to
> texts, you'd need to use the vector representation of the documents.
>
> On a more general note, synthetic measures test only the document-cluster
> assignments, but none take the quality of labels into account (this is
> really hard to measure objectively).
>
> Staszek
>

Re: News clustering

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
I'm trying to use it to search through news websites, but I'm interested in classification at index time. Is there any available solution for this?

Greetings!

On Dec 3, 2012, at 12:37 PM, Stanislaw Osinski <st...@osinski.name> wrote:

>> I mean measuring the similarity between the documents within each cluster,
>> and also the difference between the documents in one cluster and those
>> in another.
>>
>> I saw the sample code ClusteringQualityBenchmark.java.
>> However, I do not know how to use it to assess my Solr clustering
>> performance.
>>
>
> You'd need to write your own code for this. Here are the most common
> clustering quality measures of the kind you mentioned:
>
> http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results
>
> These are meant for the general case (numeric attributes); to apply them to
> texts, you'd need to use the vector representation of the documents.
>
> On a more general note, synthetic measures test only the document-cluster
> assignments, but none take the quality of labels into account (this is
> really hard to measure objectively).
> 
> Staszek

Re: News clustering

Posted by Stanislaw Osinski <st...@osinski.name>.
> I mean measuring the similarity between the documents within each cluster,
> and also the difference between the documents in one cluster and those
> in another.
>
> I saw the sample code ClusteringQualityBenchmark.java.
> However, I do not know how to use it to assess my Solr clustering
> performance.
>

You'd need to write your own code for this. Here are the most common
clustering quality measures of the kind you mentioned:

http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results

These are meant for the general case (numeric attributes); to apply them to
texts, you'd need to use the vector representation of the documents.

On a more general note, synthetic measures test only the document-cluster
assignments, but none take the quality of labels into account (this is
really hard to measure objectively).

Staszek
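
As a rough illustration of the kind of code meant above, here is a minimal
Python sketch that turns documents into term-frequency vectors and computes
two simple measures: the mean pairwise cosine similarity inside a cluster
(cohesion) and the cosine similarity between two cluster centroids
(separation). Everything here is an assumption made for illustration: the
function names, the whitespace tokenizer, and the choice of raw term counts
over TF-IDF. It is not part of Solr or Carrot2.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Term-frequency vector of a text, using a naive whitespace tokenizer."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_cluster_similarity(vectors):
    """Mean pairwise cosine similarity of the documents in one cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:  # a single-document cluster is trivially cohesive
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def centroid(vectors):
    """Sum of the vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def inter_cluster_similarity(cluster_a, cluster_b):
    """Cosine similarity between the centroids of two clusters."""
    return cosine(centroid(cluster_a), centroid(cluster_b))


# Made-up example documents, grouped as a clustering run might group them.
sports = [tf_vector("team wins football match"),
          tf_vector("football team loses match")]
politics = [tf_vector("parliament passes budget law"),
            tf_vector("budget law debated in parliament")]

print(intra_cluster_similarity(sports))
print(inter_cluster_similarity(sports, politics))
```

A good clustering should show high values for the first measure and low
values for the second. As noted above, neither number says anything about
whether the cluster labels are any good.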

Re: News clustering

Posted by Iwan Hanjoyo <ih...@gmail.com>.
Hi Stanislaw,

I mean measuring the similarity between the documents within each cluster,
and also the difference between the documents in one cluster and those
in another.

I saw the sample code ClusteringQualityBenchmark.java.
However, I do not know how to use it to assess my Solr clustering
performance.

Kind regards,

Hanjoyo

On Mon, Dec 3, 2012 at 8:11 PM, Stanislaw Osinski <st...@osinski.name> wrote:

> > Was the picture generated using the Lingo 3G algorithm?
> > I saw some sub-clusters inside it.
> > Nice pic :)
> >
>
> That is correct.
>
> > I am interested in learning it.
> > How long is the Lingo 3G trial period?
> >
>
> I'll send you the details in a private e-mail in a second.
>
> > Is there any way to programmatically measure the performance of the
> > Carrot2 clustering algorithm?
> >
>
> I'm not sure what you mean by performance. Measuring clustering time is
> pretty straightforward; measuring the quality of clusters is not, as a lot
> depends on your specific data and application.
>
> Staszek
>

Re: News clustering

Posted by Stanislaw Osinski <st...@osinski.name>.
> Was the picture generated using the Lingo 3G algorithm?
> I saw some sub-clusters inside it.
> Nice pic :)
>

That is correct.

> I am interested in learning it.
> How long is the Lingo 3G trial period?
>

I'll send you the details in a private e-mail in a second.

> Is there any way to programmatically measure the performance of the Carrot2
> clustering algorithm?
>

I'm not sure what you mean by performance. Measuring clustering time is
pretty straightforward; measuring the quality of clusters is not, as a lot
depends on your specific data and application.

Staszek
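
For the timing half of the question, a minimal Python sketch might look like
the following. It wraps any clustering callable and reports the median
wall-clock time over a few runs. The names time_clustering and
group_by_first_word are hypothetical; the latter is only a stand-in for a
real clustering call (for example, a request to Solr), not a real algorithm.

```python
import statistics
import time


def time_clustering(cluster_fn, documents, repeats=5):
    """Return the median wall-clock seconds that cluster_fn(documents) takes."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        cluster_fn(documents)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def group_by_first_word(documents):
    """Stand-in clusterer (an assumption): groups documents by first word."""
    groups = {}
    for doc in documents:
        groups.setdefault(doc.split()[0], []).append(doc)
    return groups


docs = ["budget vote delayed", "budget law passed", "football final tonight"]
seconds = time_clustering(group_by_first_word, docs)
print(f"median clustering time: {seconds:.6f} s")
```

Taking the median of several runs rather than a single measurement reduces
the influence of warm-up effects and one-off pauses.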

Re: News clustering

Posted by Iwan Hanjoyo <ih...@gmail.com>.
Hi Stanislaw Osinski,

Was the picture generated using the Lingo 3G algorithm?
I saw some sub-clusters inside it.
Nice pic :)

I am interested in learning it.
How long is the Lingo 3G trial period?

Is there any way to programmatically measure the performance of the Carrot2
clustering algorithm?
Thanks.

cheers

Hanjoyo

On Mon, Dec 3, 2012 at 6:13 PM, Stanislaw Osinski <st...@osinski.name> wrote:

> One of our clients uses Solr's search results clustering for grouping news.
> Instead of the default Carrot2 algorithm that ships with Solr they use a
> commercial one, but Carrot2 should give you decent clusters too. Here's an
> example clustering result:
>
> http://imagebin.org/238001
>
> Staszek
>
> --
> Stanislaw Osinski
> http://carrotsearch.com
>
> On Fri, Nov 30, 2012 at 4:44 PM, Jorge Luis Betancourt Gonzalez <
> jlbetancourt@uci.cu> wrote:
>
> > Hi all:
> >
> > I'm thinking of using Nutch combined with Solr to index some news sites
> > on an intranet, and I was wondering how effective the clustering
> > component would be for clustering the search results. Are there any
> > success stories of using the Solr clustering component for news
> > clustering? Is there any existing solution for clustering/classification
> > at index time?
> >
> > Greetings!

Re: News clustering

Posted by Stanislaw Osinski <st...@osinski.name>.
One of our clients uses Solr's search results clustering for grouping news.
Instead of the default Carrot2 algorithm that ships with Solr they use a
commercial one, but Carrot2 should give you decent clusters too. Here's an
example clustering result:

http://imagebin.org/238001

Staszek

--
Stanislaw Osinski
http://carrotsearch.com
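
For anyone who wants to try search results clustering from a script, here is
a minimal Python sketch of building a clustering request and reading the
clusters out of the JSON response. The parameter names (clustering,
clustering.results, carrot.title, carrot.snippet) are those of the
Carrot2-based clustering component as I understand it for Solr releases of
this era; later versions may differ, so check the documentation for yours.
The handler URL, the field names "title" and "content", and the sample
response are assumptions made up for illustration.

```python
from urllib.parse import urlencode


def clustering_url(base_url, query, rows=100):
    """Build a URL asking Solr to cluster the top `rows` hits for `query`."""
    params = {
        "q": query,
        "rows": rows,                  # only the returned rows are clustered
        "wt": "json",
        "clustering": "true",          # enable the clustering component
        "clustering.results": "true",  # cluster the search results
        "carrot.title": "title",       # assumed name of the title field
        "carrot.snippet": "content",   # assumed name of the body field
    }
    return base_url + "?" + urlencode(params)


def summarize_clusters(response):
    """Return (label, document count) pairs from a parsed JSON response."""
    return [(", ".join(cluster["labels"]), len(cluster["docs"]))
            for cluster in response.get("clusters", [])]


# Hypothetical handler URL; adjust it to your own Solr setup.
print(clustering_url("http://localhost:8983/solr/clustering", "election"))

# Made-up response fragment showing the shape of the "clusters" section.
sample = {
    "clusters": [
        {"labels": ["Budget Vote"], "score": 12.5,
         "docs": ["news-1", "news-2"]},
        {"labels": ["Other Topics"], "score": 0.0, "docs": ["news-3"]},
    ]
}
print(summarize_clusters(sample))
```

Because clustering runs on the returned rows only, the rows parameter
controls how many documents take part in it.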

On Fri, Nov 30, 2012 at 4:44 PM, Jorge Luis Betancourt Gonzalez <
jlbetancourt@uci.cu> wrote:

> Hi all:
>
> I'm thinking of using Nutch combined with Solr to index some news sites on
> an intranet, and I was wondering how effective the clustering component
> would be for clustering the search results. Are there any success stories
> of using the Solr clustering component for news clustering? Is there any
> existing solution for clustering/classification at index time?
>
> Greetings!