Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/17 03:47:47 UTC

[Solr Wiki] Update of "FieldCollapsing" by YonikSeeley

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "FieldCollapsing" page has been changed by YonikSeeley.
The comment on this change is: doc search result grouping / field collapsing.
http://wiki.apache.org/solr/FieldCollapsing?action=diff&rev1=17&rev2=18

--------------------------------------------------

  <!> [[Solr4.0]]
  
+ = Result Grouping / Field Collapsing =
  <<TableOfContents>>
  
- /!\ This page refers to functionality from [[https://issues.apache.org/jira/browse/SOLR-236|SOLR-236]]. It is not yet available in trunk.
+ = Introduction =
+ Field Collapsing and Result Grouping are different ways to think about the same Solr feature.
  
- = Introduction =
+ Field Collapsing collapses a group of results with the same field value down to a single entry (or a fixed number of entries).  For example, most search engines, such as Google, collapse on site so that only one or two entries are shown, along with a link to see more results from that site.  Field collapsing can also be used to suppress duplicate documents.
+ 
+ Result Grouping groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups.  One example is a search at BestBuy for a common term such as DVD, which shows the top 3 results for each category ("TVs & Video", "Movies", "Computers", etc.).
+ 
+ = Quick Start =
+ If you haven't already, get a recent nightly build of [[Solr4.0]], start the example server, and index the example data as shown in the [[http://lucene.apache.org/solr/tutorial.html|Solr tutorial]].
+ 
+ Now send a query request to Solr and turn on result grouping.  We'll first try grouping on the manufacturer name (the manu_exact field). <!> You can currently only group on single-valued fields!
+ 
+ [[http://localhost:8983/solr/select?wt=json&indent=true&q=solr%20memory&fl=id,name&group=true&group.field=manu_exact]]
+ 
+ And the grouped response is returned:
  
  {{{
+ [...]
+   "grouped":{
+     "manu_exact":{
+       "matches":9,
+       "groups":[{
+           "groupValue":"Apache Software Foundation",
+           "doclist":{"numFound":1,"start":0,"docs":[
+               {
+                 "id":"SOLR1000",
+                 "name":"Solr, the Enterprise Search Server"}]
+           }},
+         {
+           "groupValue":"Corsair Microsystems Inc.",
+           "doclist":{"numFound":4,"start":0,"docs":[
+               {
+                 "id":"VS1GB400C3",
+                 "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"}]
+           }},
+         {
+           "groupValue":"A-DATA Technology Inc.",
+           "doclist":{"numFound":2,"start":0,"docs":[
+               {
+                 "id":"VDBDB1A16",
+                 "name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"}]
+           }},
+ [...]
- "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
- }}}
- From [[http://www.fastsearch.com/glossary.aspx?m=48&amid=299|fast search]] (TODO: this link is broken, fix it)
- 
- This topic was discussed a while ago: http://www.nabble.com/result-grouping--tf2910425.html#a8131895
- 
- = Setup =
- The easiest way to configure field collapsing is by overriding the query component. This can be achieved by adding the following xml in your solrconfig.xml:
- 
- {{{
- <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent" />
- }}}
- That is all, now you can have field collapse enabled searches. The CollapseComponents extends from the QueryComponent, so a normal search is still possible.
- 
- If you wish to use both the QueryComponent and the CollapseComponent along side each other then you need to configure a little bit more in your solrconfig.xml. First, register the collapse searchComponent like this:
- 
- {{{
-   <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />
- }}}
- Then reference that search component in a custom search handler. For example, you could modify the standard request handler to look like this:
- 
- {{{
-   <requestHandler name="standard" class="solr.SearchHandler" default="true">
-     <!-- default values for query parameters -->
-      <lst name="defaults">
-        <str name="echoParams">explicit</str>
- 
-      </lst>
-      <arr name="components">
-         <str>collapse</str>
-         <str>facet</str>
-         <str>highlight</str>
-         <str>debug</str>
-      </arr>
-   </requestHandler>
- }}}
- Note that we have not included "query" in the list of component; the collapse handler implements query functionality itself.
- 
- In the latest patch it is possible to configure caching for the field collapsing execution. There are memory issues with this cache.
- It is therefore recommended to keep this cache small (e.g. with size 20) or to disable this cache. How big the cache should be depends on your environment.
- 
- This is an extra cache in addition
- to the already existing caches. It caches the result of the collapse logic and configured collapse collectors.
- The following xml configuration can be placed inside the solrconfig.xml as child of the config element.
- {{{
-   <fieldCollapsing>
- 
-   	<fieldCollapseCache
-       class="solr.FastLRUCache"
-       size="512"
-       initialSize="512"
-       autowarmCount="128"/>
- 
-   </fieldCollapsing>
  }}}
  
- If the field collapse cache is not configured then the field collapse logic will not be cached. 
+ The response indicates that there are 9 total matches to our query.
+ For each unique value of group.field (manufacturer names in this example), a doclist with the top scoring document is returned.  The doclist also gives the total number of matches in that group as "numFound".  The groups themselves are also sorted by the score of the top document within each group.
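
As a rough illustration of the response layout described above, the following Python sketch walks a grouped response and lists each group's value, its "numFound", and the id of its top document. The JSON literal is a trimmed copy of the excerpt shown earlier (only the three groups visible there), and the helper name `summarize_groups` is hypothetical, not part of Solr.

```python
import json

# Trimmed copy of the grouped response excerpt shown above. Only the three
# groups visible in the excerpt are included; a real response has more.
SAMPLE_RESPONSE = """
{"grouped": {"manu_exact": {"matches": 9, "groups": [
  {"groupValue": "Apache Software Foundation",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "SOLR1000", "name": "Solr, the Enterprise Search Server"}]}},
  {"groupValue": "Corsair Microsystems Inc.",
   "doclist": {"numFound": 4, "start": 0, "docs": [
     {"id": "VS1GB400C3", "name": "CORSAIR ValueSelect 1GB"}]}},
  {"groupValue": "A-DATA Technology Inc.",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "VDBDB1A16", "name": "A-DATA V-Series 1GB"}]}}
]}}}
"""


def summarize_groups(response, field):
    """Return (total matches, [(groupValue, numFound, top doc id), ...]).

    Hypothetical helper: it only assumes the "grouped" layout shown in
    the excerpt above.
    """
    grouped = response["grouped"][field]
    rows = []
    for group in grouped["groups"]:
        doclist = group["doclist"]
        docs = doclist["docs"]
        # The first document in each doclist is the group's top document.
        top_id = docs[0]["id"] if docs else None
        rows.append((group["groupValue"], doclist["numFound"], top_id))
    return grouped["matches"], rows


matches, rows = summarize_groups(json.loads(SAMPLE_RESPONSE), "manu_exact")
print("total matches:", matches)
for value, num_found, top_id in rows:
    print(value, num_found, top_id)
```

Note that "matches" counts documents matching the query overall, while each group's "numFound" counts the matches inside that group only.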
  
  <<Anchor(parameters)>>
  = Request Parameters =
- ||'''param''' ||'''description''' ||
+ ||'''param name''' ||'''param value'''||'''description''' ||
+ ||group||true/false||if true, turn on result grouping||
+ ||group.field||[fieldname]||Group based on the unique values of a field.  The field must currently be single-valued and must be either indexed, or be another field type that has a value source and works in a function query - such as [[http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html|ExternalFileField]]||
+ ||group.func||[function query]||Group based on the unique values of a function query.||
+ ||rows||[number]||The number of groups to return. Defaults to 10.||
+ ||group.limit||[number]||The number of results (documents) to return for each group.  Defaults to 1.||
+ ||sort||[sortspec]||How to sort the groups relative to each other.  For example, {{{sort=popularity desc}}} will cause the groups to be sorted by the popularity of the highest ranking (first) document in each group.  Defaults to "score desc".||
+ ||group.sort||[sortspec]||How to sort documents within a single group.  Defaults to the same value as the {{{sort}}} parameter.||
- ||collapse.type ||normal/adjacent -- does this collapse all documents or just the ones that are next to each other.  Defaults to normal ||
- ||collapse.field ||Which field to collapse. If this field is not specified then field collapsing is not enabled and falls back to the QueryComponent to do a search. ||
- ||collapse.facet ||before/after -- apply faceting before or after collapsing.  Defaults to after ||
- ||collapse.max ||Deprecated use collapse.threshold instead. This parameter is removed in the latest patch. ||
- ||collapse.threshold ||The number of documents with the same value for collapse.field after which collapsing kicks in. The default value is one. ||
- ||collapse.maxdocs ||Maximum number of documents to process during field collapsing. This parameter defaults to one greater than the largest document number. ||
- ||collapse.info.doc ||Return collapse count for each document? Defaults to true ||
- ||collapse.info.count ||Return collapse count for each field value? Defaults to true ||
- ||collapse.includeCollapsedDocs.fl ||Parameter indicating to return the collapsed documents in the response and what fields to return in comma separated manner. A value * indicates that all fields will be returned ||
- ||collapse.debug ||Whether to include collapse debug information ||
- ||collapse.aggregate ||Execute aggregate functions on the collapsed documents. The parameter expects the functions in the following format: function_name(field_name) [, function_name(field_name)]. For example: sum(stock), avg(weight). Currently four functions are available: min(...), max(...), sum(...), avg(...). The functionality is available from the patch added at 2009-10-25 10:13 PM. ||
  
+ Notes:
+  * Distributed search support for result grouping has not yet been implemented.
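
The request parameters listed above can be combined into a query URL roughly as follows. This is a minimal Python sketch: the helper `build_grouping_url` and its defaults are hypothetical, and the base URL assumes the example server from the Quick Start.

```python
from urllib.parse import urlencode


def build_grouping_url(base_url, query, field, rows=10, group_limit=1,
                       sort=None, group_sort=None):
    """Build a select URL with result grouping turned on.

    Hypothetical helper; parameter names follow the table above.
    """
    params = {
        "q": query,
        "wt": "json",
        "group": "true",             # turn on result grouping
        "group.field": field,        # must be a single-valued field
        "rows": rows,                # number of groups to return
        "group.limit": group_limit,  # documents returned per group
    }
    if sort is not None:
        params["sort"] = sort              # orders groups against each other
    if group_sort is not None:
        params["group.sort"] = group_sort  # orders documents inside a group
    return base_url + "?" + urlencode(params)


url = build_grouping_url("http://localhost:8983/solr/select",
                         "solr memory", "manu_exact", group_limit=3)
print(url)
```

Leaving `sort` and `group.sort` out of the request relies on the defaults described in the table: groups are ordered by "score desc", and documents within a group follow the same order as `sort`.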
- <<Anchor(examples)>>
- = Examples =
- Using the example data:
  
- Collapse all documents using 'manu_exact' and 'normal' collapse type:
- http://localhost:8983/solr/select/?q=*:*&collapse.field=manu_exact&collapse.threshold=1&collapse.type=normal
- {{{
- <lst name="collapse_counts">
-     <str name="field">manu_exact</str>
-     <lst name="results">
-         <lst name="F8V7067-APL-KIT">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">Belkin</str>
-         </lst>
-         <lst name="TWINX2048-3200PRO">
-             <int name="collapseCount">3</int>
-             <str name="fieldValue">Corsair Microsystems Inc.</str>
-         </lst>
-         <lst name="VDBDB1A16">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">A-DATA Technology Inc.</str>
-         </lst>
-         <lst name="0579B002">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">Canon Inc.</str>
-         </lst>
-         <lst name="SOLR1000">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">Apache Software Foundation</str>
-         </lst>
-     </lst>
- </lst>
- }}}
- 
- Collapse all documents using 'manu_exact' and 'adjacent' collapse type:
- http://localhost:8983/solr/select/?q=*:*&collapse.field=manu_exact&collapse.threshold=1&collapse.type=adjacent
- 
- {{{
- <lst name="collapse_counts">
-     <str name="field">manu_exact</str>
-     <lst name="results">
-         <lst name="F8V7067-APL-KIT">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">Belkin</str>
-         </lst>
-         <lst name="TWINX2048-3200PRO">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">Corsair Microsystems Inc.</str>
-         </lst>
-         <lst name="TWINX2048-3200PRO-payload">
-             <int name="collapseCount">1</int>
-             <str name="fieldValue">Corsair Microsystems Inc.</str>
-         </lst>
-     </lst>
- </lst>
- }}}
- 
- The response is centred around collapse groups. A collapse group represents documents that were collapsed during the search. A collapse group is identified by the most relevant document of that collapse group, which is the document that did not get collapsed and remained present in the search result. So the ids like 233238 are from documents that are also present in the search result.
- 
- = Distributed field collapsing =
- In a distributed environment, field collapsing is supported in a limited manner. While indexing, you must make sure that the documents of a collapse group are not scattered across different shards. Documents of a collapse group must reside on the same shard; failing to ensure this will corrupt your search results. A distributed search with collapsing requires no extra parameters to be sent with the request. For example, the following request is sufficient: http://localhost:8080/solr/select/?q=solr&collapse.field=my_field&shards=localhost:55527/solr,localhost:55529/solr
- 
- = Other resources =
- Some other resources regarding to field collapsing:
- 
-  * [[http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/|Result grouping / field collapsing with Solr]]
- 
- If anyone has links about this topic feel free to add it.
-