Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/01/02 22:47:03 UTC
[Solr Wiki] Update of "ClusteringComponent" by GrantIngersoll

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ClusteringComponent" page has been changed by GrantIngersoll.
http://wiki.apache.org/solr/ClusteringComponent?action=diff&rev1=33&rev2=34

--------------------------------------------------

  NOTE: This code is marked as experimental, and the APIs and responses are subject to change in future releases. See https://issues.apache.org/jira/browse/SOLR-769 for discussions around the development of this feature.
  
  = Introduction =
- 
  This component can cluster both search results and documents.  In case you're wondering what clustering is good for, think of it as a quick way to summarize a whole bunch of results/documents, or as a way to group together like results/documents.
  
  See http://en.wikipedia.org/wiki/Data_clustering for more background, as well as links to further reading.
  
  = Clustering Component =
- 
  The clustering component uses a pluggable design, so any clustering engine can be implemented and plugged in.
  
  The !ClusteringComponent is responsible for taking in the request, identifying the clustering engine to be used (a !SolrClusteringEngine implementation), and then delegating the work to that engine.  Once the engine is done, the results are added to the response.
@@ -19, +17 @@

  The !ClusteringComponent currently does not support distributed processing.
  
  == Installation ==
- 
  The !ClusteringComponent is in the contrib area of Solr.  Due to some dependencies on LGPL libraries for the Carrot2 implementation, we cannot package a complete binary solution (with all the dependencies).  To get the Carrot2 solution, you will need to download these libraries.  To do this, on the command line in the contrib/clustering directory, run {{{ant get-libraries}}}.  This will create a downloads directory under the lib directory for the downloaded jars.
  
  == Quick Start ==
+ Once you have downloaded the library dependencies, you can run the example using the following commands:
  
- Once you have downloaded the library dependencies, you can run the example using the following commands:
  {{{
  $ cd example
  $ java -Dsolr.clustering.enabled=true -jar start.jar
  }}}
- 
  This is the same as the main Solr example and uses the same index, but it enables the clustering component and a SearchHandler configured to use that component.
  
  In a different window, add some docs using the post tool in the exampledocs directory (if you haven't already).
+ 
  {{{
  $ cd example/exampledocs
  $ ./post.sh *.xml
  }}}
  Now try a query using the handler configured for clustering (it is configured with clustering=true as a default param):
+ 
  {{{
  http://localhost:8983/solr/clustering?q=*:*&rows=10
  }}}
  This should yield results that include cluster information at the bottom of the response, like:
+ 
  {{{
  <arr name="clusters">
   <lst>
    <arr name="labels">
- 	<str>DDR</str>
+         <str>DDR</str>
    </arr>
    <arr name="docs">
- 	<str>TWINX2048-3200PRO</str>
+         <str>TWINX2048-3200PRO</str>
- 	<str>VS1GB400C3</str>
+         <str>VS1GB400C3</str>
- 	<str>VDBDB1A16</str>
+         <str>VDBDB1A16</str>
    </arr>
   </lst>
   <lst>
    <arr name="labels">
- 	<str>Car Power Adapter</str>
+         <str>Car Power Adapter</str>
    </arr>
    <arr name="docs">
- 	<str>F8V7067-APL-KIT</str>
+         <str>F8V7067-APL-KIT</str>
- 	<str>IW-02</str>
+         <str>IW-02</str>
    </arr>
   </lst>
   <lst>
    <arr name="labels">
- 	<str>Hard Drive</str>
+         <str>Hard Drive</str>
    </arr>
    <arr name="docs">
- 	<str>SP2514N</str>
+         <str>SP2514N</str>
- 	<str>6H500F0</str>
+         <str>6H500F0</str>
    </arr>
   </lst>
   <lst>
  [...]
  }}}
- 
  Clusters produced by Carrot2 group the results into different product categories: DDR (memory), Car Power Adapter, Display, Hard Drive. Notice that, depending on the quality of input documents, some clusters may not make much sense.
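
  If you want to consume the cluster information programmatically, the response shown above can be read with any XML parser. The following is only a minimal sketch in Python, not part of Solr: the {{{parse_clusters}}} helper and the {{{SAMPLE}}} text are hypothetical, the sample is trimmed from the response above, and the enclosing {{{<response>}}} element is an assumption about how the full XML response is wrapped.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example response above. The <response> wrapper is an
# assumption about the enclosing element of the full Solr XML response.
SAMPLE = """<response>
<arr name="clusters">
 <lst>
  <arr name="labels"><str>DDR</str></arr>
  <arr name="docs">
   <str>TWINX2048-3200PRO</str>
   <str>VS1GB400C3</str>
   <str>VDBDB1A16</str>
  </arr>
 </lst>
 <lst>
  <arr name="labels"><str>Hard Drive</str></arr>
  <arr name="docs">
   <str>SP2514N</str>
   <str>6H500F0</str>
  </arr>
 </lst>
</arr>
</response>"""


def parse_clusters(xml_text):
    """Return a list of (labels, doc_ids) tuples, one per cluster.

    Hypothetical helper for illustration; it only relies on the
    <arr name="clusters"> layout shown in the example response.
    """
    root = ET.fromstring(xml_text)
    clusters = []
    for lst in root.findall(".//arr[@name='clusters']/lst"):
        labels = [s.text for s in lst.findall("arr[@name='labels']/str")]
        docs = [s.text for s in lst.findall("arr[@name='docs']/str")]
        clusters.append((labels, docs))
    return clusters


if __name__ == "__main__":
    for labels, docs in parse_clusters(SAMPLE):
        print(", ".join(labels), "->", docs)
```

  Because the response format is marked as experimental, a parser like this may need adjusting in future releases.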
  
+ == Configuration ==
+ The !ClusteringComponent gets added just like any other !SearchComponent.  Just declare it in the solrconfig.xml, as in:
  
- == Configuration ==
- 
- The !ClusteringComponent gets added just like any other !SearchComponent.  Just declare it in the solrconfig.xml, as in:
  {{{
  <searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
    <lst name="engine">
@@ -90, +87 @@

    </lst>
  </searchComponent>
  }}}
- 
  = Search Results Clustering =
- 
  == Carrot2 Clustering ==
- 
  Carrot2 is a scalable, BSD licensed search results clustering engine.  It can cluster many different types of search results, including Y!, Google, etc.  Our implementation, naturally, clusters Solr results.
  
  Carrot2 is best suited for clustering small-to-medium collections of short documents. While Carrot2 may work for longer documents, processing times may be too long to meet on-line clustering requirements.
  
  See http://project.carrot2.org
  
+ === Parameters ===
+ 
+  * carrot.algorithm - The engine to use as configured in the !SearchComponent.
+  * carrot.title - The title field name to use.
+  * carrot.url - The url field name. 
+  * carrot.snippet - The snippet field name.
+  * carrot.produceSummary - If true, then the snippet field (if no snippet field, then the title field) will be highlighted and the highlighted text will be used for the snippet.
+  * carrot.numDescriptions - The maximum number of labels to produce.
+  * carrot.outputSubClusters - If true, generate subclusters.
+  * carrot.fragSize - <!>Solr1.5<!> The frag size to use for highlighting when produceSummary is true.  If not specified, the default highlighting fragsize (hl.fragsize) will be used.  If that isn't specified either, 100 is used.
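
  As a rough illustration of how the parameters above fit together, the sketch below builds a query URL that carries a few of them. It assumes the parameters can be passed per request as well as set as handler defaults. The {{{clustering_url}}} helper is hypothetical, and the field names {{{name}}} and {{{id}}} are taken from the example schema.

```python
from urllib.parse import urlencode


def clustering_url(base, query, rows=10, **carrot):
    """Build a request URL with carrot.* parameters appended.

    Hypothetical helper: keyword arguments such as title="name" become
    carrot.title=name. Booleans are rendered as lowercase true/false.
    """
    params = [("q", query), ("rows", rows)]
    for key, value in sorted(carrot.items()):
        if isinstance(value, bool):
            value = str(value).lower()
        params.append(("carrot." + key, value))
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # The handler path matches the Quick Start example; the field mapping
    # (name, id) is an assumption based on the example schema.
    print(clustering_url(
        "http://localhost:8983/solr/clustering",
        "*:*",
        title="name",
        url="id",
        produceSummary=True,
        numDescriptions=5,
    ))
```

  Setting the same values in the handler's defaults section keeps request URLs short, so per-request parameters are mostly useful for experimenting.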
+ 
+ === Config ===
+ 
  The configuration (solrconfig.xml) looks like:
+ 
  {{{
  <searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
    <!-- Declare an engine -->
    <lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">default</str>
-     <!-- 
+     <!--
           Class name of Carrot2 clustering algorithm. Currently available algorithms are:
-          
+ 
           * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
           * org.carrot2.clustering.stc.STCClusteringAlgorithm
-          
+ 
           See http://project.carrot2.org/algorithms.html for the algorithm's characteristics.
        -->
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
-     <!-- 
+     <!--
           Overriding values for Carrot2 default algorithm attributes. For a description
           of all available attributes, see: http://download.carrot2.org/stable/manual/#chapter.components.
           Use attribute key as name attribute of str elements below. These can be further
@@ -128, +136 @@

    </lst>
  </searchComponent>
  }}}
+ And the Standard !ReqHandler looks like:
  
- And the Standard !ReqHandler looks like:
  {{{
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <!-- default values for query parameters -->
       <lst name="defaults">
         <str name="echoParams">explicit</str>
-        <!-- 
+        <!--
         <int name="rows">10</int>
         <str name="fl">*</str>
         <str name="version">2.1</str>
@@ -161, +169 @@

      </arr>
    </requestHandler>
  }}}
- 
  The thing to note here is the mapping of Solr fields (name, id, etc.) to the title, snippet and url inputs that Carrot2 needs. Clustering takes into account the text of the title and snippet.
  
- 
  == Tuning Carrot2 clustering ==
- 
  The easiest way to tune Carrot2 clustering for your specific data is to use a dedicated Carrot2 tool called Document Clustering Workbench.
  
   1. [[http://project.carrot2.org/download.html|Download Carrot2 Document Clustering Workbench]] for your platform.
-  2. [[http://download.carrot2.org/head/manual/#section.getting-started.solr|Attach]] your Solr instance as a document source in the Workbench.
+  1. [[http://download.carrot2.org/head/manual/#section.getting-started.solr|Attach]] your Solr instance as a document source in the Workbench.
-  3. [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words|Fine tune stop words]], [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps|stop labels]] and possibly [[http://download.carrot2.org/head/manual/#section.component.lingo|other attributes]] of the clustering algorithms to suit your needs.
+  1. [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words|Fine tune stop words]], [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps|stop labels]] and possibly [[http://download.carrot2.org/head/manual/#section.component.lingo|other attributes]] of the clustering algorithms to suit your needs.
-  4. To transfer the modified `stopwords.*` and `stoplabels.*` files to your Solr instance, simply make the modified files accessible in the classpath. If you're using the Solr example scripts, try putting the files in the `example/resources` folder (Jetty starter from `start.jar` adds all files from that folder to the classpath). Alternatively, you can overwrite the corresponding `stopwords.*` and `stoplabels.*` files directly in `carrot2-mini-*.jar`.
+  1. To transfer the modified `stopwords.*` and `stoplabels.*` files to your Solr instance, simply make the modified files accessible in the classpath. If you're using the Solr example scripts, try putting the files in the `example/resources` folder (Jetty starter from `start.jar` adds all files from that folder to the classpath). Alternatively, you can overwrite the corresponding `stopwords.*` and `stoplabels.*` files directly in `carrot2-mini-*.jar`.
- 
  
  = Document Clustering =
- 
  <!> THIS IS NOT FULLY IMPLEMENTED YET.
  
  The Document Clustering implementation is designed to cluster whole documents across a collection.  This can be done as an offline task.  Once the clustering is done, the clusters can be retrieved.