Posted to solr-user@lucene.apache.org by ra...@thomsonreuters.com on 2011/03/30 18:15:24 UTC

assit with the Clustering component in Solr/Lucene

Hi:
  I recently added the Clustering component to Solr and updated the requestHandler accordingly (in solrconfig.xml).
Snippet of the config for the clustering component:

  <searchComponent
    name="clusteringComponent"
    enable="${solr.clustering.enabled:false}"
    class="org.apache.solr.handler.clustering.ClusteringComponent" >
    <!-- Declare an engine -->
    <lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">default</str>
      <!-- 
           Class name of Carrot2 clustering algorithm. Currently available algorithms are:
           
           * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
           * org.carrot2.clustering.stc.STCClusteringAlgorithm
           
           See http://project.carrot2.org/algorithms.html for the algorithm's characteristics.
        -->
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <!-- 
           Overriding values for Carrot2 default algorithm attributes. For a description
           of all available attributes, see: http://download.carrot2.org/stable/manual/#chapter.components.
           Use attribute key as name attribute of str elements below. These can be further
           overridden for individual requests by specifying attribute key as request
           parameter name and attribute value as parameter value.
        -->
      <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
    </lst>
    <lst name="engine">
      <str name="name">stc</str>
      <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
    </lst>
  </searchComponent>

Snippet of the config for the requestHandler:
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <!--
       <int name="rows">10</int>
       <str name="fl">*</str>
       <str name="version">2.1</str>
        -->
       <bool name="clustering">true</bool>
       <str name="clustering.engine">default</str>
       <bool name="clustering.results">true</bool>
       <!-- The title field -->
       <str name="carrot.title">headline</str>
       <str name="carrot.url">pi</str>
       <!-- The field to cluster on -->
       <str name="carrot.snippet">headline</str>
       <!-- produce summaries -->
       <bool name="carrot.produceSummary">true</bool>
       <!-- the maximum number of labels per cluster -->
       <!--<int name="carrot.numDescriptions">5</int>-->
       <!-- produce sub clusters -->
       <bool name="carrot.outputSubClusters">false</bool>
     </lst>
    <arr name="last-components">
      <str>clusteringComponent</str>
    </arr>
  </requestHandler>


When I perform a search, the cluster section of the Solr results is not quite
consistent: there are two documents that are each reported in two different clusters.

Are there parameters that can be set to prevent this from happening?


Thanks much

Ramdev

Re: assit with the Clustering component in Solr/Lucene

Posted by Stanislaw Osinski <st...@osinski.name>.

Thanks for the confirmation, I'll take a look at the issue.

S.

On Thu, Mar 31, 2011 at 17:24, <ra...@thomsonreuters.com> wrote:

> That did make a difference, I now see the exact number of cluster i see
> from the workbench.
> I am of course interested in why the config changes did not have much
> effect. However, I am happy that by adding the threshold to my request URL
> produces the desired results
>
> let me know if I can do any more tests and I will do so. Thanks much
>
> Ramdev
>
>
>
> On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:
>
>
>>      I added the parameter as you suggested.
>> (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent
>> section that describes the Clustering module
>> Changing the value of the parameter  did not have any effect on my search
>> results.
>>
>> However, when I used the Carrot2 workbench, I could see the effect of
>> changing the value. (from 6 clusters it went down to 2 clusters)
>>
>
> Interesting... Can you, for the sake of debugging, append
> &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?
>
> S.
>
>
>

Re: assit with the Clustering component in Solr/Lucene

Posted by ra...@thomsonreuters.com.

That did make a difference; I now see exactly the number of clusters I see in the workbench.
I am of course interested in why the config changes did not have much effect. However, I am happy that adding the threshold to my request URL produces the desired results.

Let me know if I can run any more tests and I will do so. Thanks much.

Ramdev



On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:



		     I added the parameter as you suggested. (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent section that describes the Clustering module
		Changing the value of the parameter  did not have any effect on my search results.

		However, when I used the Carrot2 workbench, I could see the effect of changing the value. (from 6 clusters it went down to 2 clusters)


	Interesting... Can you, for the sake of debugging, append &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?
	
	S.

Re: assit with the Clustering component in Solr/Lucene

Posted by Stanislaw Osinski <st...@osinski.name>.

>      I added the parameter as you suggested.
> (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent
> section that describes the Clustering module
> Changing the value of the parameter  did not have any effect on my search
> results.
>
> However, when I used the Carrot2 workbench, I could see the effect of
> changing the value. (from 6 clusters it went down to 2 clusters)
>

Interesting... Can you, for the sake of debugging, append
&LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?
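
For reference, a minimal sketch of how such a request URL could be assembled
in Python. The host, path and query below are hypothetical placeholders; only
the attribute name and value come from this thread.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, overrides):
    """Return a select URL with Carrot2 attribute overrides as request parameters."""
    params = {"q": query, "wt": "json"}
    # Each override becomes an ordinary request parameter, e.g.
    # LingoClusteringAlgorithm.clusterMergingThreshold=0.0
    params.update(overrides)
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/select",  # hypothetical host and handler path
    "headline:oil",                       # hypothetical query
    {"LingoClusteringAlgorithm.clusterMergingThreshold": "0.0"},
)
print(url)
```

Passing the attribute per request like this is only a debugging aid; it does
not explain why the same value set in the searchComponent had no effect.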

S.

Re: assit with the Clustering component in Solr/Lucene

Posted by ra...@thomsonreuters.com.

Hi Staszek:
     I added the parameter you suggested (LingoClusteringAlgorithm.clusterMergingThreshold) to the searchComponent section that configures the clustering module.
Changing the value of the parameter did not have any effect on my search results.

However, when I used the Carrot2 workbench, I could see the effect of changing the value (it went from 6 clusters down to 2).

Here is the XML snippet for the searchComponent:

  <searchComponent
    name="clusteringComponent"
    enable="${solr.clustering.enabled:false}"
    class="org.apache.solr.handler.clustering.ClusteringComponent" >
    <!-- Declare an engine -->
    <lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">default</str>
      <!-- 
           Class name of Carrot2 clustering algorithm. Currently available algorithms are:
           
           * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
           * org.carrot2.clustering.stc.STCClusteringAlgorithm
           
           See http://project.carrot2.org/algorithms.html for the algorithm's characteristics.
        -->
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <!-- 
           Overriding values for Carrot2 default algorithm attributes. For a description
           of all available attributes, see: http://download.carrot2.org/stable/manual/#chapter.components.
           Use attribute key as name attribute of str elements below. These can be further
           overridden for individual requests by specifying attribute key as request
           parameter name and attribute value as parameter value.
        -->
      <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
      <str name="LingoClusteringAlgorithm.clusterMergingThreshold">0.0</str>
    </lst>
  </searchComponent>


I would appreciate any insights into this behavior. 

Thanks

Ramdev


On Mar 30, 2011, at 11:51 AM, Stanislaw Osinski wrote:


	Hi Ramdev,
	
	Both of the clustering algorithms that ship with Solr (Lingo and STC) are designed to allow one document to appear in more than one cluster, which actually does make sense in many scenarios. There's no easy way to force them to produce hard clusterings because this would require a complete change in the way the algorithms work. If you need each document to belong to exactly one cluster, you'd have to post-process the clusters to remove the redundant document assignments. Alternatively, in case of the Lingo algorithm, you can try lowering the "LingoClusteringAlgorithm.clusterMergingThreshold" to some value in the range of 0.2--0.5. If you do that, clusters containing overlapping documents will get merged. For more information about this attribute, see here: http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold.
	
	Cheers,
	
	Staszek
	
	
	On Wed, Mar 30, 2011 at 18:21, Markus Jelsma <ma...@openindex.io> wrote:
	

		Yes, you can set engine specific parameters. Check the comments in your
		snippety.
		

		> Hi:
		>   I recently included the CLustering component into Solr and updated the
		> requestHandler accordingly (in solrconfig.xml). Snippet of the Config for
		> the CLuserting:
		>
		>   <searchComponent
		>     name="clusteringComponent"
		>     enable="${solr.clustering.enabled:false}"
		>     class="org.apache.solr.handler.clustering.ClusteringComponent" >
		>     <!-- Declare an engine -->
		>     <lst name="engine">
		>       <!-- The name, only one can be named "default" -->
		>       <str name="name">default</str>
		>       <!--
		>            Class name of Carrot2 clustering algorithm. Currently available
		> algorithms are:
		>
		>            * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
		>            * org.carrot2.clustering.stc.STCClusteringAlgorithm
		>
		>            See http://project.carrot2.org/algorithms.html for the
		> algorithm's characteristics. -->
		>       <str
		> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgori
		> thm</str> <!--
		>            Overriding values for Carrot2 default algorithm attributes. For
		> a description of all available attributes, see:
		> http://download.carrot2.org/stable/manual/#chapter.components. Use
		> attribute key as name attribute of str elements below. These can be
		> further overridden for individual requests by specifying attribute key as
		> request parameter name and attribute value as parameter value.
		>         -->
		>       <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
		>     </lst>
		>     <lst name="engine">
		>       <str name="name">stc</str>
		>       <str
		> name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm<
		> /str> </lst>
		>   </searchComponent>
		>
		> snippet of the Config for requestHandler
		>   <requestHandler name="standard" class="solr.SearchHandler"
		> default="true"> <!-- default values for query parameters -->
		>      <lst name="defaults">
		>        <str name="echoParams">explicit</str>
		>        <!--
		>        <int name="rows">10</int>
		>        <str name="fl">*</str>
		>        <str name="version">2.1</str>
		>         -->
		>        <bool name="clustering">true</bool>
		>        <str name="clustering.engine">default</str>
		>        <bool name="clustering.results">true</bool>
		>        <!-- The title field -->
		>        <str name="carrot.title">headline</str>
		>        <str name="carrot.url">pi</str>
		>        <!-- The field to cluster on -->
		>        <str name="carrot.snippet">headline</str>
		>        <!-- produce summaries -->
		>        <bool name="carrot.produceSummary">true</bool>
		>        <!-- the maximum number of labels per cluster -->
		>        <!--<int name="carrot.numDescriptions">5</int>-->
		>        <!-- produce sub clusters -->
		>        <bool name="carrot.outputSubClusters">false</bool>
		>      </lst>
		>     <arr name="last-components">
		>       <str>clusteringComponent</str>
		>     </arr>
		>   </requestHandler>
		>
		>
		> When I perform a search, I see that the Cluster section within the Solr
		> results shows me results that are not quite consistent. There are two
		> documents that are reported in two different documents
		>
		> Are there parameters that can be set that will prevent this from happening
		> ?
		>
		>
		> Thanks much
		>
		> Ramdev

Re: assit with the Clustering component in Solr/Lucene

Posted by ra...@thomsonreuters.com.

Thanks much Stan,


Ramdev

On May 16, 2011, at 11:38 AM, Stanislaw Osinski wrote:


			Both of the clustering algorithms that ship with Solr (Lingo and STC) are designed to allow one document to appear in more than one cluster, which actually does make sense in many scenarios. There's no easy way to force them to produce hard clusterings because this would require a complete change in the way the algorithms work. If you need each document to belong to exactly one cluster, you'd have to post-process the clusters to remove the redundant document assignments.
			


		On the second thought, I have a simple implementation of k-means clustering that could do hard clustering for you. It's not available yet, it will most probably be part of the next major release of Carrot2 (the package that does the clustering). Please watch this issue http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
		


	Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x, so you can use the bisecting k-means clustering algorithm (org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which will produce non-overlapping clusters for you. The downside of this simple implementation of k-means is that, for the time being, it produces one-word cluster labels rather than phrases as Lingo and STC.

	Cheers,

	S.

Re: assit with the Clustering component in Solr/Lucene

Posted by Stanislaw Osinski <st...@osinski.name>.

>
>> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
>> designed to allow one document to appear in more than one cluster, which
>> actually does make sense in many scenarios. There's no easy way to force
>> them to produce hard clusterings because this would require a complete
>> change in the way the algorithms work. If you need each document to belong
>> to exactly one cluster, you'd have to post-process the clusters to remove
>> the redundant document assignments.
>>
>
> On the second thought, I have a simple implementation of k-means clustering
> that could do hard clustering for you. It's not available yet, it will most
> probably be part of the next major release of Carrot2 (the package that does
> the clustering). Please watch this issue
> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
>

Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x,
so you can use the bisecting k-means clustering algorithm
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which
will produce non-overlapping clusters for you. The downside of this simple
implementation of k-means is that, for the time being, it produces one-word
cluster labels rather than phrases, as Lingo and STC do.
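
A quick way to check whether a given engine really returns non-overlapping
clusters is to count how many clusters each document appears in. The sketch
below assumes the clusters have already been parsed into a list of dicts with
"labels" and "docs" keys (an assumption about the response shape); the sample
data is made up for illustration.

```python
from collections import Counter


def overlapping_docs(clusters):
    """Return the sorted ids of documents that appear in more than one cluster."""
    # set() guards against a document being listed twice within one cluster.
    counts = Counter(doc for c in clusters for doc in set(c["docs"]))
    return sorted(doc for doc, n in counts.items() if n > 1)


# Made-up sample: "d2" is assigned to two clusters.
sample = [
    {"labels": ["Oil"], "docs": ["d1", "d2"]},
    {"labels": ["Exports"], "docs": ["d2", "d3"]},
]
print(overlapping_docs(sample))
```

An empty list means the clustering is hard, i.e. every document belongs to at
most one cluster.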

Cheers,

S.

Re: assit with the Clustering component in Solr/Lucene

Posted by Stanislaw Osinski <st...@osinski.name>.

> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
> designed to allow one document to appear in more than one cluster, which
> actually does make sense in many scenarios. There's no easy way to force
> them to produce hard clusterings because this would require a complete
> change in the way the algorithms work. If you need each document to belong
> to exactly one cluster, you'd have to post-process the clusters to remove
> the redundant document assignments.
>

On second thought, I have a simple implementation of k-means clustering
that could do hard clustering for you. It's not available yet; it will most
probably be part of the next major release of Carrot2 (the package that does
the clustering). Please watch this issue
http://issues.carrot2.org/browse/CARROT-791 to get updates on this.

Cheers,

S.

Re: assit with the Clustering component in Solr/Lucene

Posted by Stanislaw Osinski <st...@osinski.name>.

Hi Ramdev,

Both of the clustering algorithms that ship with Solr (Lingo and STC) are
designed to allow one document to appear in more than one cluster, which
actually does make sense in many scenarios. There's no easy way to force
them to produce hard clusterings because this would require a complete
change in the way the algorithms work. If you need each document to belong
to exactly one cluster, you'd have to post-process the clusters to remove
the redundant document assignments. Alternatively, in the case of the Lingo
algorithm, you can try lowering the
"LingoClusteringAlgorithm.clusterMergingThreshold" attribute to some value in the
range of 0.2 to 0.5. If you do that, clusters containing overlapping documents
will get merged. For more information about this attribute, see here:
http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold
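
The post-processing route could look roughly like the sketch below. It assumes
the clusters have been parsed into a list of dicts with "labels" and "docs"
keys (an assumption about the response shape), and it uses a deliberately
simple policy chosen here for illustration: a document stays in the first
cluster that lists it and is dropped from later ones. The sample data is
made up.

```python
def to_hard_clustering(clusters):
    """Keep each document only in the first cluster that lists it."""
    seen = set()
    result = []
    for cluster in clusters:
        kept = []
        for doc in cluster["docs"]:
            if doc not in seen:
                seen.add(doc)
                kept.append(doc)
        # Drop clusters that end up empty after removing duplicates.
        if kept:
            result.append({"labels": cluster["labels"], "docs": kept})
    return result


# Made-up sample: "d2" and "d3" are each assigned to two clusters.
soft = [
    {"labels": ["Oil Prices"], "docs": ["d1", "d2", "d3"]},
    {"labels": ["Crude Exports"], "docs": ["d2", "d4"]},
    {"labels": ["Other"], "docs": ["d3"]},
]
hard = to_hard_clustering(soft)
print(hard)
```

"First cluster wins" is only reasonable if the earlier clusters are the ones
you trust most; other policies (for example, keeping the document in its
smallest cluster) are equally possible.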

Cheers,

Staszek

On Wed, Mar 30, 2011 at 18:21, Markus Jelsma <ma...@openindex.io> wrote:

> Yes, you can set engine specific parameters. Check the comments in your
> snippety.
>
> > Hi:
> >   I recently included the CLustering component into Solr and updated the
> > requestHandler accordingly (in solrconfig.xml). Snippet of the Config for
> > the CLuserting:
> >
> >   <searchComponent
> >     name="clusteringComponent"
> >     enable="${solr.clustering.enabled:false}"
> >     class="org.apache.solr.handler.clustering.ClusteringComponent" >
> >     <!-- Declare an engine -->
> >     <lst name="engine">
> >       <!-- The name, only one can be named "default" -->
> >       <str name="name">default</str>
> >       <!--
> >            Class name of Carrot2 clustering algorithm. Currently
> available
> > algorithms are:
> >
> >            * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
> >            * org.carrot2.clustering.stc.STCClusteringAlgorithm
> >
> >            See http://project.carrot2.org/algorithms.html for the
> > algorithm's characteristics. -->
> >       <str
> >
> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgori
> > thm</str> <!--
> >            Overriding values for Carrot2 default algorithm attributes.
> For
> > a description of all available attributes, see:
> > http://download.carrot2.org/stable/manual/#chapter.components. Use
> > attribute key as name attribute of str elements below. These can be
> > further overridden for individual requests by specifying attribute key as
> > request parameter name and attribute value as parameter value.
> >         -->
> >       <str
> name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
> >     </lst>
> >     <lst name="engine">
> >       <str name="name">stc</str>
> >       <str
> >
> name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm<
> > /str> </lst>
> >   </searchComponent>
> >
> > snippet of the Config for requestHandler
> >   <requestHandler name="standard" class="solr.SearchHandler"
> > default="true"> <!-- default values for query parameters -->
> >      <lst name="defaults">
> >        <str name="echoParams">explicit</str>
> >        <!--
> >        <int name="rows">10</int>
> >        <str name="fl">*</str>
> >        <str name="version">2.1</str>
> >         -->
> >        <bool name="clustering">true</bool>
> >        <str name="clustering.engine">default</str>
> >        <bool name="clustering.results">true</bool>
> >        <!-- The title field -->
> >        <str name="carrot.title">headline</str>
> >        <str name="carrot.url">pi</str>
> >        <!-- The field to cluster on -->
> >        <str name="carrot.snippet">headline</str>
> >        <!-- produce summaries -->
> >        <bool name="carrot.produceSummary">true</bool>
> >        <!-- the maximum number of labels per cluster -->
> >        <!--<int name="carrot.numDescriptions">5</int>-->
> >        <!-- produce sub clusters -->
> >        <bool name="carrot.outputSubClusters">false</bool>
> >      </lst>
> >     <arr name="last-components">
> >       <str>clusteringComponent</str>
> >     </arr>
> >   </requestHandler>
> >
> >
> > When I perform a search, I see that the Cluster section within the Solr
> > results shows me results that are not quite consistent. There are two
> > documents that are reported in two different documents
> >
> > Are there parameters that can be set that will prevent this from
> happening
> > ?
> >
> >
> > Thanks much
> >
> > Ramdev
>

Re: assit with the Clustering component in Solr/Lucene

Posted by Markus Jelsma <ma...@openindex.io>.

Yes, you can set engine-specific parameters. Check the comments in your
snippet.

> Hi:
>   I recently included the CLustering component into Solr and updated the
> requestHandler accordingly (in solrconfig.xml). Snippet of the Config for
> the CLuserting:
> 
>   <searchComponent
>     name="clusteringComponent"
>     enable="${solr.clustering.enabled:false}"
>     class="org.apache.solr.handler.clustering.ClusteringComponent" >
>     <!-- Declare an engine -->
>     <lst name="engine">
>       <!-- The name, only one can be named "default" -->
>       <str name="name">default</str>
>       <!--
>            Class name of Carrot2 clustering algorithm. Currently available
> algorithms are:
> 
>            * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
>            * org.carrot2.clustering.stc.STCClusteringAlgorithm
> 
>            See http://project.carrot2.org/algorithms.html for the
> algorithm's characteristics. -->
>       <str
> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgori
> thm</str> <!--
>            Overriding values for Carrot2 default algorithm attributes. For
> a description of all available attributes, see:
> http://download.carrot2.org/stable/manual/#chapter.components. Use
> attribute key as name attribute of str elements below. These can be
> further overridden for individual requests by specifying attribute key as
> request parameter name and attribute value as parameter value.
>         -->
>       <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>     </lst>
>     <lst name="engine">
>       <str name="name">stc</str>
>       <str
> name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm<
> /str> </lst>
>   </searchComponent>
> 
> snippet of the Config for requestHandler
>   <requestHandler name="standard" class="solr.SearchHandler"
> default="true"> <!-- default values for query parameters -->
>      <lst name="defaults">
>        <str name="echoParams">explicit</str>
>        <!--
>        <int name="rows">10</int>
>        <str name="fl">*</str>
>        <str name="version">2.1</str>
>         -->
>        <bool name="clustering">true</bool>
>        <str name="clustering.engine">default</str>
>        <bool name="clustering.results">true</bool>
>        <!-- The title field -->
>        <str name="carrot.title">headline</str>
>        <str name="carrot.url">pi</str>
>        <!-- The field to cluster on -->
>        <str name="carrot.snippet">headline</str>
>        <!-- produce summaries -->
>        <bool name="carrot.produceSummary">true</bool>
>        <!-- the maximum number of labels per cluster -->
>        <!--<int name="carrot.numDescriptions">5</int>-->
>        <!-- produce sub clusters -->
>        <bool name="carrot.outputSubClusters">false</bool>
>      </lst>
>     <arr name="last-components">
>       <str>clusteringComponent</str>
>     </arr>
>   </requestHandler>
> 
> 
> When I perform a search, I see that the Cluster section within the Solr
> results shows me results that are not quite consistent. There are two
> documents that are reported in two different documents
> 
> Are there parameters that can be set that will prevent this from happening
> ?
> 
> 
> Thanks much
> 
> Ramdev