You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Stanislaw Osinski <st...@osinski.name> on 2011/05/16 18:38:42 UTC

Re: assit with the Clustering component in Solr/Lucene

>
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
>> designed to allow one document to appear in more than one cluster, which
>> actually does make sense in many scenarios. There's no easy way to force
>> them to produce hard clusterings because this would require a complete
>> change in the way the algorithms work. If you need each document to belong
>> to exactly one cluster, you'd have to post-process the clusters to remove
>> the redundant document assignments.
>>
>
> On the second thought, I have a simple implementation of k-means clustering
> that could do hard clustering for you. It's not available yet, it will most
> probably be part of the next major release of Carrot2 (the package that does
> the clustering). Please watch this issue
> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
>

Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x,
so you can use the bisecting k-means clustering algorithm
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which
will produce non-overlapping clusters for you. The downside of this simple
implementation of k-means is that, for the time being, it produces one-word
cluster labels rather than phrases as Lingo and STC.

Cheers,

S.

Re: assit with the Clustering component in Solr/Lucene

Posted by ra...@thomsonreuters.com.
Thanks much Stan,


Ramdev

On May 16, 2011, at 11:38 AM, Stanislaw Osinski wrote:


			Both of the clustering algorithms that ship with Solr (Lingo and STC) are designed to allow one document to appear in more than one cluster, which actually does make sense in many scenarios. There's no easy way to force them to produce hard clusterings because this would require a complete change in the way the algorithms work. If you need each document to belong to exactly one cluster, you'd have to post-process the clusters to remove the redundant document assignments.
			


		On the second thought, I have a simple implementation of k-means clustering that could do hard clustering for you. It's not available yet, it will most probably be part of the next major release of Carrot2 (the package that does the clustering). Please watch this issue http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
		


	Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x, so you can use the bisecting k-means clustering algorithm (org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which will produce non-overlapping clusters for you. The downside of this simple implementation of k-means is that, for the time being, it produces one-word cluster labels rather than phrases as Lingo and STC.

	Cheers,

	S.